1 00:00:00,085 --> 00:00:04,948 Today we're talking about AI again, but more specifically about the 2 00:00:04,998 --> 00:00:11,269 training data sets for generative AI: where it comes from, what some 3 00:00:11,269 --> 00:00:13,722 of the legal issues are, what is it? 4 00:00:13,932 --> 00:00:17,802 When we think about ChatGPT and other AI tools, I know I talk 5 00:00:17,812 --> 00:00:19,222 about ChatGPT all the time. 6 00:00:20,097 --> 00:00:21,247 Frankly, it's the one that I use, 7 00:00:21,247 --> 00:00:24,226 and so the one I'm most familiar with, but this applies to 8 00:00:24,276 --> 00:00:25,966 all generative AI platforms. 9 00:00:26,211 --> 00:00:30,083 You hear about the vast amounts of data that they utilize. 10 00:00:30,394 --> 00:00:34,993 And as you can imagine, training data plays a crucial role in the 11 00:00:34,993 --> 00:00:36,413 development and the efficiency, 12 00:00:36,553 --> 00:00:43,336 effectiveness of generative AI platforms, but where does all of that data come from? 13 00:00:43,483 --> 00:00:46,806 And I know you have lots of questions about that because you're worried 14 00:00:46,806 --> 00:00:48,976 that it is coming from your website. 15 00:00:49,193 --> 00:00:52,053 So let's start with: what is training data? 16 00:00:52,263 --> 00:00:56,153 Training data is the backbone of any machine learning 17 00:00:56,209 --> 00:00:58,838 project, which is what generative AI is. 18 00:00:59,116 --> 00:01:04,824 It consists of large sets of information that's used to teach an algorithm how to 19 00:01:04,824 --> 00:01:07,821 recognize patterns and make predictions. 20 00:01:08,053 --> 00:01:10,491 That's how it is creative, 21 00:01:10,521 --> 00:01:10,801 i.e., 22 00:01:10,811 --> 00:01:11,721 generative. 23 00:01:11,959 --> 00:01:17,738 And so you put in this vast amount of data and it's labeled in certain ways.
24 00:01:17,969 --> 00:01:23,021 I don't know how it does this, but it learns the patterns and then it 25 00:01:23,021 --> 00:01:28,971 can make informed predictions and create new content based on that. 26 00:01:29,221 --> 00:01:34,518 So given the scale of modern AI requirements, the data sets 27 00:01:34,518 --> 00:01:40,348 are absolutely enormous, often encompassing billions of parameters, 28 00:01:40,564 --> 00:01:41,694 and that, of course, will 29 00:01:41,869 --> 00:01:45,024 change depending on the size and complexity of the 30 00:01:45,034 --> 00:01:47,261 model that is being trained. 31 00:01:47,484 --> 00:01:51,884 So the primary sources of training data, or I should say, traditionally, 32 00:01:52,026 --> 00:01:57,408 the training data for platforms like OpenAI was 33 00:01:57,418 --> 00:02:00,518 scraped from the internet for free. 34 00:02:00,518 --> 00:02:06,168 And that was used to train the first generative AI models like ChatGPT, and 35 00:02:06,168 --> 00:02:10,916 they've done a pretty good job, I'd say, of learning to mimic human creativity. 36 00:02:11,316 --> 00:02:16,073 They, of course, thought, believed, and I think they're still sticking to this story, that 37 00:02:16,083 --> 00:02:22,909 it was legal and ethical for them to do so, relying on some prior cases that say you 38 00:02:22,939 --> 00:02:28,761 can use publicly available information so long as it's transformative, 39 00:02:28,811 --> 00:02:31,488 essentially making a fair use argument. 40 00:02:31,488 --> 00:02:35,360 I'm not going to go into the fair use argument, but that is the basis of 41 00:02:35,360 --> 00:02:37,300 why they thought they could do this. 42 00:02:37,620 --> 00:02:41,760 As you probably know, there have been a number of high-profile 43 00:02:41,760 --> 00:02:43,580 lawsuits about their use. 44 00:02:43,770 --> 00:02:46,090 So we will see, and that has not been resolved yet.
45 00:02:46,257 --> 00:02:50,987 And so we will see if their reasoning and their defenses hold up. 46 00:02:51,180 --> 00:02:56,380 So, to discuss a few of the ways that they do get training data: 47 00:02:56,572 --> 00:02:58,652 web scraping, which we've already talked about. 48 00:02:58,852 --> 00:03:03,882 So there would be crawlers they send out that scour the internet. 49 00:03:04,069 --> 00:03:08,175 It should only be scouring for things that are publicly available, 50 00:03:08,175 --> 00:03:10,105 that are not behind a paywall. 51 00:03:10,319 --> 00:03:15,754 However, well, you can ask the crawler, I'm assuming, to go behind the 52 00:03:15,754 --> 00:03:21,184 paywall, which obviously would be a breach of the terms and conditions of a site if 53 00:03:21,184 --> 00:03:26,884 you go around their paywall, and also, you know, even if there is no paywall, many 54 00:03:26,884 --> 00:03:30,714 sites will have terms and conditions that say you're not allowed to use crawlers. 55 00:03:30,975 --> 00:03:36,295 If you don't comply with those terms and conditions, then you're also, obviously, 56 00:03:36,416 --> 00:03:40,196 breaching those terms and conditions. As well, when they're 57 00:03:40,386 --> 00:03:44,356 scraping that data off, many times, if not always (because, like, we 58 00:03:44,356 --> 00:03:47,616 can't really quite see what's in the black box of that training data), 59 00:03:47,803 --> 00:03:54,336 they're taking off any copyright notices, and it is a violation of the Copyright 60 00:03:54,396 --> 00:03:56,226 Act to take off copyright notices. 61 00:03:56,706 --> 00:03:59,446 So there are a number of issues involved with 62 00:03:59,516 --> 00:04:00,610 web scraping. 63 00:04:00,820 --> 00:04:03,503 That obviously is falling into disfavor. 64 00:04:03,743 --> 00:04:07,600 So what is replacing that? Licensed data sets.
65 00:04:07,890 --> 00:04:11,550 Very large data sets that are licensed from entities that 66 00:04:11,600 --> 00:04:14,120 own large amounts of data. 67 00:04:14,290 --> 00:04:18,483 I read this regarding this new path forward: 68 00:04:18,693 --> 00:04:20,630 There is a rush right now 69 00:04:20,776 --> 00:04:25,226 to go for copyright holders that have private collections of stuff that is 70 00:04:25,226 --> 00:04:29,916 not available to be scraped. So this is from a lawyer who is advising content 71 00:04:29,946 --> 00:04:35,213 owners on deals worth tens of millions of dollars apiece to license archives 72 00:04:35,253 --> 00:04:38,023 of photos, movies, and books for AI 73 00:04:38,583 --> 00:04:39,133 training. 74 00:04:39,288 --> 00:04:43,713 Reuters spoke to more than 30 people with knowledge of AI data deals, 75 00:04:43,975 --> 00:04:47,910 including current and former executives of companies involved, lawyers, and 76 00:04:47,910 --> 00:04:53,241 consultants, to provide the first in-depth exploration of this fledgling market, 77 00:04:53,241 --> 00:04:57,203 detailing the types of content that's being bought, the prices that they're 78 00:04:57,293 --> 00:05:03,170 getting, and any emerging concerns that come from harvesting this type of 79 00:05:03,170 --> 00:05:09,240 data, even if it's licensed, because of the personal data risks that go along 80 00:05:09,421 --> 00:05:14,400 with harvesting large amounts of data, where harvesting the personal data of the human 81 00:05:14,400 --> 00:05:18,706 that it belongs to is done without the knowledge or consent of that person. 82 00:05:18,956 --> 00:05:20,456 Who are these huge licensees? 83 00:05:20,635 --> 00:05:21,628 There are several of them. 84 00:05:21,821 --> 00:05:27,952 We have tech companies who have been quietly buying content 85 00:05:27,962 --> 00:05:34,488 that is behind locked paywalls and behind login screens from companies 86 00:05:34,578 --> 00:05:37,995 like Instacart, Meta, Microsoft,
87 00:05:38,261 --> 00:05:39,683 X, and Zoom. 88 00:05:39,924 --> 00:05:46,347 And so this might be some long-forgotten chat logs or long-forgotten photos 89 00:05:46,591 --> 00:05:50,340 from old apps that are being licensed. 90 00:05:50,857 --> 00:05:55,343 Tumblr's parent company Automattic said last month, and I'm recording 91 00:05:55,343 --> 00:05:57,737 this in April 2024, right, 92 00:05:57,940 --> 00:06:01,457 that it was sharing content with select AI companies. 93 00:06:01,730 --> 00:06:06,670 And in February, that'd be 2024, Reuters reported Reddit struck a deal with 94 00:06:06,750 --> 00:06:11,640 Google to make its content available for training the latter's AI models. 95 00:06:11,957 --> 00:06:15,967 Of course there's going to be some customer blowback. 96 00:06:16,173 --> 00:06:21,585 While this type of licensed content is accelerating, there will probably be 97 00:06:21,595 --> 00:06:26,865 some amendments still to it because, yes, Meta goes in and it changes its terms 98 00:06:26,865 --> 00:06:32,904 of use, but does anybody read the terms of use of Meta or of X or of Zoom even? 99 00:06:33,091 --> 00:06:33,561 So 100 00:06:33,716 --> 00:06:36,726 they're going in and changing their terms and conditions without anyone, 101 00:06:36,726 --> 00:06:39,896 kind of, without it saying in bright red letters, "Hey, we're going to be 102 00:06:39,896 --> 00:06:42,522 selling your data now for AI training." 103 00:06:42,756 --> 00:06:44,406 We'll see what comes from that. 104 00:06:44,601 --> 00:06:44,811 All right. 105 00:06:44,811 --> 00:06:50,207 Then there are archives that are owned, such as the Associated Press 106 00:06:50,467 --> 00:06:54,207 and Getty Images, or, say, aggregators. 107 00:06:54,227 --> 00:06:55,637 They don't own all those images. 108 00:06:55,924 --> 00:07:00,114 And so you can go to them and license their entire archives. 109 00:07:00,433 --> 00:07:04,467 And that provides a great amount of data for your data sets.
110 00:07:04,703 --> 00:07:10,001 Universities and research institutions are also owners or controllers of 111 00:07:10,011 --> 00:07:14,911 vast amounts of data that can be licensed all in one fell swoop. 112 00:07:15,110 --> 00:07:19,924 And then there are some nonprofit organizations that want to encourage 113 00:07:20,040 --> 00:07:24,992 the use of AI. Just as we've had other types of nonprofits in the 114 00:07:24,992 --> 00:07:28,356 past, such as Creative Commons, who want to help people get more 115 00:07:28,366 --> 00:07:31,019 access to copyrightable materials, 116 00:07:31,239 --> 00:07:34,066 now there are some who feel the same way about 117 00:07:34,259 --> 00:07:36,769 making AI data more accessible. 118 00:07:36,934 --> 00:07:41,692 For instance, the nonprofit Allen Institute for AI released a data set 119 00:07:41,844 --> 00:07:47,791 of 3 trillion tokens from a diverse mix of web content, academic publications, 120 00:07:47,841 --> 00:07:50,731 code, books, and encyclopedic materials. 121 00:07:50,957 --> 00:07:56,737 Now, another source is synthetic data, and this is a new one to me, but it 122 00:07:56,834 --> 00:08:00,254 really points to how powerful AI can be. 123 00:08:00,457 --> 00:08:05,521 So synthetic data generation means that you use one generative AI 124 00:08:05,541 --> 00:08:08,217 tool to create synthetic data. 125 00:08:08,429 --> 00:08:12,947 And then you use that data, that synthetic data, to train another 126 00:08:12,982 --> 00:08:14,456 generative AI tool. 127 00:08:14,639 --> 00:08:18,506 So let's say you're developing a customer service AI model. 128 00:08:18,696 --> 00:08:23,819 You could use another generative AI tool to create fictional customers 129 00:08:24,182 --> 00:08:26,779 and situations and interactions. 130 00:08:27,089 --> 00:08:31,032 And then you can use those fictional customer situations and 131 00:08:31,152 --> 00:08:36,766 interactions as the training data for your public-facing AI model.
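The synthetic-data idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular vendor does it: a seeded random template filler stands in for the "first" generative AI tool, and every name in it (FIRST_NAMES, ISSUES, RESOLUTIONS, make_synthetic_interaction, make_synthetic_dataset) is hypothetical. The point it tries to show is that the fictional interactions are built without reading any real customer record.

```python
import random

# Hypothetical vocabularies; a real pipeline would likely use a generative
# model here instead of fixed lists. None of these values come from real customers.
FIRST_NAMES = ["Avery", "Jordan", "Riley", "Morgan"]
ISSUES = ["a late delivery", "a double charge", "a damaged item", "a login problem"]
RESOLUTIONS = ["issue a refund", "resend the order", "reset the account",
               "escalate to a specialist"]


def make_synthetic_interaction(rng):
    """Build one fictional customer interaction from the vocabularies above."""
    name = rng.choice(FIRST_NAMES)
    issue = rng.choice(ISSUES)
    resolution = rng.choice(RESOLUTIONS)
    return {
        "customer": name,
        "message": f"Hi, this is {name}. I am contacting you about {issue}.",
        "agent_reply": f"Sorry about that, {name}. We will {resolution}.",
    }


def make_synthetic_dataset(n, seed=0):
    """Return n fictional interactions; the seed makes the output repeatable."""
    rng = random.Random(seed)
    return [make_synthetic_interaction(rng) for _ in range(n)]


# These rows, not real customer records, would be handed to the second model as training data.
dataset = make_synthetic_dataset(5)
for row in dataset:
    print(row["message"], "->", row["agent_reply"])
```

A fixed seed is used so the same fictional set can be regenerated and audited; in practice the generation step and the review of its output would both need far more care than this sketch suggests.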
132 00:08:36,929 --> 00:08:42,789 So that way you're not at risk of exposing private information. 133 00:08:42,799 --> 00:08:46,456 Instead of directly putting your customer information into your AI 134 00:08:46,666 --> 00:08:51,156 tool, you first kind of anonymize it using one generative AI tool. 135 00:08:51,379 --> 00:08:55,396 And it's not enough just to de-identify it, because there could be customer 136 00:08:55,396 --> 00:08:59,336 situations that are so specific that they could only point to one person. 137 00:08:59,521 --> 00:09:00,199 It's possible. 138 00:09:00,377 --> 00:09:03,112 So you also have to make up perhaps new situations, new 139 00:09:03,112 --> 00:09:04,352 backgrounds, things like that. 140 00:09:04,554 --> 00:09:09,329 But then you can use that as your fictional customer for your 141 00:09:09,491 --> 00:09:14,636 AI-governed customer service model, and then use that to train it to help provide 142 00:09:14,636 --> 00:09:16,596 customer service on an AI basis. 143 00:09:17,136 --> 00:09:19,836 So we will see this with hospitals and banks as well 144 00:09:19,969 --> 00:09:21,806 that have sensitive information. 145 00:09:21,836 --> 00:09:25,196 Obviously they cannot use their customers' sensitive information 146 00:09:25,419 --> 00:09:30,549 as training data, but they do want to have access to what is really kind of part 147 00:09:30,549 --> 00:09:35,973 of doing business these days, of having some sort of AI-based training systems. 148 00:09:36,233 --> 00:09:40,700 And then, of course, not last and not least, is the data 149 00:09:40,700 --> 00:09:42,420 that comes from you and me. 150 00:09:42,666 --> 00:09:43,906 So, what
151 00:09:44,126 --> 00:09:49,745 does that mean? When we are using AI platforms, when 152 00:09:49,755 --> 00:09:56,261 we input our prompts, if we put in something that we've written and ask 153 00:09:56,321 --> 00:10:01,605 it to create a summary of it, if we put in a transcript from something 154 00:10:01,645 --> 00:10:06,761 and ask it to create show notes, everything that we put into that 155 00:10:06,938 --> 00:10:11,611 has the potential to become training data for that platform. 156 00:10:11,818 --> 00:10:15,755 And so if we are doing that, we need to be aware of the 157 00:10:15,755 --> 00:10:18,235 terms of use of that platform. 158 00:10:18,458 --> 00:10:22,688 Most of them will tell you that it can be part of the training data. 159 00:10:22,881 --> 00:10:28,896 And it might also end up being an output for someone who puts in a query, 160 00:10:28,896 --> 00:10:32,726 a prompt that what you put in is a perfect answer for. You just don't know. 161 00:10:32,916 --> 00:10:37,896 And so we need to be careful about what we are putting in as prompts 162 00:10:37,896 --> 00:10:41,750 or as the input for whatever AI platform you're using. 163 00:10:41,946 --> 00:10:44,673 Make sure you are aware of their terms and conditions. 164 00:10:44,893 --> 00:10:48,270 Do not use any confidential information in there, 165 00:10:48,463 --> 00:10:50,393 whether it's yours or your client's. 166 00:10:50,580 --> 00:10:53,736 So make sure that you're really aware of that. 167 00:10:54,053 --> 00:11:00,306 Some platforms, I'm thinking in particular of DocuSign, do use AI. 168 00:11:00,366 --> 00:11:05,350 And obviously, when you're using DocuSign, there are legal agreements that 169 00:11:05,350 --> 00:11:09,550 are going in there that have identifiable information of the parties, commercial 170 00:11:09,550 --> 00:11:11,850 terms, and things like that in there.
171 00:11:12,050 --> 00:11:18,636 And so DocuSign said that they strip out any identifying data from that, so 172 00:11:18,636 --> 00:11:23,040 they do use the agreements for training data, but they do strip 173 00:11:23,100 --> 00:11:25,140 out identifying information from it. 174 00:11:25,356 --> 00:11:27,133 So, things to be aware of. 175 00:11:27,353 --> 00:11:31,653 In summary, the legal issues, I think we've covered, but just to sum them 176 00:11:31,653 --> 00:11:37,468 up, there are the copyright issues of putting data into the database. 177 00:11:37,780 --> 00:11:41,203 I believe it was last week I talked about the copyrightability 178 00:11:41,203 --> 00:11:43,213 issues of the output. 179 00:11:43,410 --> 00:11:46,450 So now I'm talking about the copyright issues with the input: 180 00:11:46,601 --> 00:11:52,856 whether or not the AI platform or you have the right to add 181 00:11:52,970 --> 00:11:56,831 information to the training data set, whether or not that 182 00:11:56,851 --> 00:11:58,481 is a copyright infringement. 183 00:11:58,531 --> 00:12:00,891 Is that fair use of that data? 184 00:12:01,023 --> 00:12:06,316 One of the issues on the copyright side is that sometimes the output will literally 185 00:12:06,336 --> 00:12:13,150 be an exact replica of what went in, and it's hard to make a fair use argument when 186 00:12:13,350 --> 00:12:18,420 a verbatim paragraph, in the case of the New York Times, which is the basis 187 00:12:18,420 --> 00:12:23,560 of their lawsuit against OpenAI, a verbatim paragraph comes out as the output. 188 00:12:23,740 --> 00:12:25,000 Where's the fair use there? 189 00:12:25,196 --> 00:12:27,123 Same with Getty Images. 190 00:12:27,415 --> 00:12:33,143 They've had exact replicas of their images come out of an AI platform. 191 00:12:33,383 --> 00:12:35,273 So that's obviously an issue. 192 00:12:35,483 --> 00:12:38,243 In addition to copyright issues, we have privacy concerns.
193 00:12:38,463 --> 00:12:43,423 Maybe real images of people; there are instances of real images of people 194 00:12:43,460 --> 00:12:48,480 coming out, most certainly. Private photos that are from somebody's old 195 00:12:48,480 --> 00:12:51,623 Facebook or old blog posts, old journals. 196 00:12:51,633 --> 00:12:55,530 Think about it: original blogs were kind of like journals, right? 197 00:12:55,530 --> 00:12:58,666 And people would use them as a journal, and they're probably 198 00:12:58,666 --> 00:12:59,856 hanging around somewhere. 199 00:12:59,856 --> 00:13:02,706 Think about, I mean, I'm thinking about a blog, I guess it was, 200 00:13:02,786 --> 00:13:04,816 at the time, that I started. 201 00:13:04,816 --> 00:13:07,560 I mean, it didn't last very long and no one ever saw it. 202 00:13:07,760 --> 00:13:08,921 But it's still somewhere. 203 00:13:08,921 --> 00:13:11,741 Like, I don't know if I could find it today, but it's still out there and 204 00:13:11,968 --> 00:13:13,828 somebody's web crawler could find it. 205 00:13:13,848 --> 00:13:16,200 I don't think they'd be very interested, but it's there. 206 00:13:16,396 --> 00:13:18,950 And so we do have the privacy concerns. 207 00:13:19,206 --> 00:13:25,470 And then we have the contract breach: if we are using, say, a client's 208 00:13:25,470 --> 00:13:30,693 confidential information, and we're entering it into an AI chatbot, 209 00:13:30,938 --> 00:13:34,465 and it has the potential to be shared, 210 00:13:34,555 --> 00:13:38,265 we are breaching our contractual obligations to our clients 211 00:13:38,295 --> 00:13:44,571 if we're doing that without permission, even if the contract is silent on the specifics 212 00:13:44,611 --> 00:13:49,245 of whether or not you can use AI, and some contracts are being explicit about it. 213 00:13:49,435 --> 00:13:50,771 But even if it's silent,
214 00:13:50,958 --> 00:13:54,423 and you are obligated to use the client's information, keep it 215 00:13:54,423 --> 00:13:58,663 confidential, and only share it under very specific circumstances, putting 216 00:13:58,663 --> 00:14:03,060 it into an AI platform is probably not one of those permitted uses. 217 00:14:03,383 --> 00:14:05,963 So you do have issues there as well. 218 00:14:06,180 --> 00:14:06,576 All right. 219 00:14:06,666 --> 00:14:11,693 So that is what I wanted to cover today regarding AI training data. 220 00:14:11,866 --> 00:14:14,516 As you know, this is a fast-moving 221 00:14:14,618 --> 00:14:17,908 matter. You know, who knows what will come next week. 222 00:14:17,908 --> 00:14:22,471 I'll try to keep you up to date, but always feel free to connect with me and 223 00:14:22,471 --> 00:14:23,901 let me know what your questions are. 224 00:14:23,901 --> 00:14:26,141 I'm always happy to answer them. 225 00:14:26,368 --> 00:14:26,955 Thanks again. 226 00:14:26,975 --> 00:14:29,055 And don't forget: IP is fuel.