1
00:00:00,085 --> 00:00:04,948
Today we're talking about AI again,
but more specifically about the

2
00:00:04,998 --> 00:00:11,269
training data sets for generative
AI, where it comes from, what some

3
00:00:11,269 --> 00:00:13,722
of the legal issues are, and what it is.

4
00:00:13,932 --> 00:00:17,802
When we think about ChatGPT
and other AI tools, I know I talk

5
00:00:17,812 --> 00:00:19,222
about ChatGPT all the time.

6
00:00:20,097 --> 00:00:21,247
Frankly, it's the one that I use.

7
00:00:21,247 --> 00:00:24,226
And so it's the one I'm most familiar
with, but this applies to

8
00:00:24,276 --> 00:00:25,966
all generative AI platforms.

9
00:00:26,211 --> 00:00:30,083
You hear about the vast amounts
of data that they utilize.

10
00:00:30,394 --> 00:00:34,993
And as you can imagine, training
data plays a crucial role in the

11
00:00:34,993 --> 00:00:36,413
development, efficiency, and

12
00:00:36,553 --> 00:00:43,336
effectiveness of generative AI platforms,
but where does all of that data come from?

13
00:00:43,483 --> 00:00:46,806
And I know you have lots of questions
about that because you're worried

14
00:00:46,806 --> 00:00:48,976
about whether it is coming from your website.

15
00:00:49,193 --> 00:00:52,053
So let's start with what is training data?

16
00:00:52,263 --> 00:00:56,153
Training data is the backbone
of any machine learning

17
00:00:56,209 --> 00:00:58,838
project, which is what generative AI is.

18
00:00:59,116 --> 00:01:04,824
It consists of large sets of information
that are used to teach an algorithm how to

19
00:01:04,824 --> 00:01:07,821
recognize patterns and make predictions.

20
00:01:08,053 --> 00:01:10,491
That's how it is creative,

21
00:01:10,521 --> 00:01:10,801
i.e.,

22
00:01:10,811 --> 00:01:11,721
generative.

23
00:01:11,959 --> 00:01:17,738
And so you put in this vast amount of
data and it's labeled in certain ways.

24
00:01:17,969 --> 00:01:23,021
I don't know how it does this, but
it learns the patterns and then it

25
00:01:23,021 --> 00:01:28,971
can make informed predictions and
create new content based on that.

26
00:01:29,221 --> 00:01:34,518
So given the scale of modern
AI requirements, the data sets

27
00:01:34,518 --> 00:01:40,348
are absolutely enormous, often
encompassing billions of parameters.

28
00:01:40,564 --> 00:01:41,694
And that, of course, will

29
00:01:41,869 --> 00:01:45,024
change depending on the
size and complexity of the

30
00:01:45,034 --> 00:01:47,261
model that is being trained.

31
00:01:47,484 --> 00:01:51,884
So the primary sources of training
data, or I should say, traditionally,

32
00:01:52,026 --> 00:01:57,408
the sources of training data
for platforms like OpenAI were

33
00:01:57,418 --> 00:02:00,518
scraped from the internet for free.

34
00:02:00,518 --> 00:02:06,168
And that was used to train the first
generative AI models like ChatGPT, and

35
00:02:06,168 --> 00:02:10,916
they've done a pretty good job, I'd say,
of learning to mimic human creativity.

36
00:02:11,316 --> 00:02:16,073
They, of course, thought, believed, and I think
they're still sticking to this story, that

37
00:02:16,083 --> 00:02:22,909
it was legal and ethical for them to do
so, relying on some prior cases saying that you

38
00:02:22,939 --> 00:02:28,761
can use publicly available information,
so long as it's transformative,

39
00:02:28,811 --> 00:02:31,488
essentially making a fair use argument.

40
00:02:31,488 --> 00:02:35,360
I'm not going to go into the fair use
argument, but that is the basis of

41
00:02:35,360 --> 00:02:37,300
why they thought they could do this.

42
00:02:37,620 --> 00:02:41,760
As you probably know, there have
been a number of high-profile

43
00:02:41,760 --> 00:02:43,580
lawsuits about their use.

44
00:02:43,770 --> 00:02:46,090
So we will see. This
has not been resolved yet.

45
00:02:46,257 --> 00:02:50,987
And so we will see if their
reasoning and their defenses hold up.

46
00:02:51,180 --> 00:02:56,380
So, to discuss a few of the ways
that they do get training data:

47
00:02:56,572 --> 00:02:58,652
Web scraping, which we've
already talked about.

48
00:02:58,852 --> 00:03:03,882
So there would be crawlers that they
send out to scour the internet.

49
00:03:04,069 --> 00:03:08,175
It should only be scouring for
things that are publicly available,

50
00:03:08,175 --> 00:03:10,105
that are not behind a paywall.

51
00:03:10,319 --> 00:03:15,754
However, well, you can ask the
crawler, I'm assuming, to go behind the

52
00:03:15,754 --> 00:03:21,184
paywall, which obviously would be a breach
of the terms and conditions of a site if

53
00:03:21,184 --> 00:03:26,884
you go around their paywall. And also, you
know, even if there is no paywall, many

54
00:03:26,884 --> 00:03:30,714
sites will have terms and conditions that
say you're not allowed to use crawlers.

55
00:03:30,975 --> 00:03:36,295
If you don't comply with those terms and
conditions, then you're also, obviously,

56
00:03:36,416 --> 00:03:40,196
breaching the terms and conditions
of that site. In addition, when they're

57
00:03:40,386 --> 00:03:44,356
scraping that data, many times,
if not always (because we

58
00:03:44,356 --> 00:03:47,616
can't really quite see what's in
the black box of that training data),

59
00:03:47,803 --> 00:03:54,336
they're taking off any copyright notices,
and it is a violation of the Copyright

60
00:03:54,396 --> 00:03:56,226
Act to take off copyright notices.

61
00:03:56,706 --> 00:03:59,446
So there's a number of
issues involved with

62
00:03:59,516 --> 00:04:00,610
web scraping.

63
00:04:00,820 --> 00:04:03,503
That obviously is falling into disfavor.

64
00:04:03,743 --> 00:04:07,600
So what is replacing
that? Licensed data sets.

65
00:04:07,890 --> 00:04:11,550
Very large data sets that are
licensed from entities that

66
00:04:11,600 --> 00:04:14,120
own large amounts of data.

67
00:04:14,290 --> 00:04:18,483
I read this regarding
this new path forward:

68
00:04:18,693 --> 00:04:20,630
There is a rush right now

69
00:04:20,776 --> 00:04:25,226
to go for copyright holders that have
private collections of stuff that is

70
00:04:25,226 --> 00:04:29,916
not available to be scraped. This is
from a lawyer who is advising content

71
00:04:29,946 --> 00:04:35,213
owners on deals worth tens of millions
of dollars apiece to license archives

72
00:04:35,253 --> 00:04:38,023
of photos, movies, and books for AI

73
00:04:38,583 --> 00:04:39,133
training.

74
00:04:39,288 --> 00:04:43,713
Reuters spoke to more than 30 people
with knowledge of AI data deals,

75
00:04:43,975 --> 00:04:47,910
including current and former executives
of companies involved, lawyers, and

76
00:04:47,910 --> 00:04:53,241
consultants, to provide the first in-depth
exploration of this fledgling market and

77
00:04:53,241 --> 00:04:57,203
detailing the types of content that's
being bought, the prices that they're

78
00:04:57,293 --> 00:05:03,170
getting, and any emerging concerns
that come from harvesting this type of

79
00:05:03,170 --> 00:05:09,240
data, even if it's licensed, because of
the personal data risks that go along

80
00:05:09,421 --> 00:05:14,400
with harvesting large amounts of data
where the personal data of the human

81
00:05:14,400 --> 00:05:18,706
it belongs to is used without the
knowledge or consent of that person.

82
00:05:18,956 --> 00:05:20,456
Who are these huge licensees?

83
00:05:20,635 --> 00:05:21,628
There's a number of them.

84
00:05:21,821 --> 00:05:27,952
We have tech companies who have
been quietly buying content

85
00:05:27,962 --> 00:05:34,488
that is behind locked paywalls and
behind login screens from companies

86
00:05:34,578 --> 00:05:37,995
like Instacart, Meta, Microsoft,

87
00:05:38,261 --> 00:05:39,683
X, and Zoom.

88
00:05:39,924 --> 00:05:46,347
And so this might be some long-forgotten
chat logs or long-forgotten photos

89
00:05:46,591 --> 00:05:50,340
from old apps that are being licensed.

90
00:05:50,857 --> 00:05:55,343
Tumblr's parent company, Automattic,
said last month (and I'm recording

91
00:05:55,343 --> 00:05:57,737
this in April 2024, right?)

92
00:05:57,940 --> 00:06:01,457
that it was sharing content
with select AI companies.

93
00:06:01,730 --> 00:06:06,670
And in February, that'd be 2024, Reuters
reported Reddit struck a deal with

94
00:06:06,750 --> 00:06:11,640
Google to make its content available
for training the latter's AI models.

95
00:06:11,957 --> 00:06:15,967
Of course there's going to
be some customer blowback.

96
00:06:16,173 --> 00:06:21,585
While this type of licensed content is
accelerating, there will probably be

97
00:06:21,595 --> 00:06:26,865
some amendments still to it because, yes,
Meta goes in and changes its terms

98
00:06:26,865 --> 00:06:32,904
of use, but does anybody read the terms
of use of Meta or of X or of Zoom, even?

99
00:06:33,091 --> 00:06:33,561
So

100
00:06:33,716 --> 00:06:36,726
they're going in and changing their
terms and conditions without anyone

101
00:06:36,726 --> 00:06:39,896
kind of without it saying in bright
red letters, Hey, we're going to be

102
00:06:39,896 --> 00:06:42,522
selling your data now for AI training.

103
00:06:42,756 --> 00:06:44,406
We'll see what comes from that.

104
00:06:44,601 --> 00:06:44,811
All right.

105
00:06:44,811 --> 00:06:50,207
Then there are archives that are
owned by entities such as the Associated Press

106
00:06:50,467 --> 00:06:54,207
and Getty Images, or, I should say, aggregators.

107
00:06:54,227 --> 00:06:55,637
They don't own all those images.

108
00:06:55,924 --> 00:07:00,114
And so you can go to them and
license their entire archives.

109
00:07:00,433 --> 00:07:04,467
And that provides a great amount
of data for your data sets.

110
00:07:04,703 --> 00:07:10,001
Universities and research institutions
are also owners or controllers of

111
00:07:10,011 --> 00:07:14,911
vast amounts of data that can be
licensed all in one fell swoop.

112
00:07:15,110 --> 00:07:19,924
And then there are some nonprofit
organizations that want to encourage

113
00:07:20,040 --> 00:07:24,992
the use of AI, just as we've had
other types of nonprofits in the

114
00:07:24,992 --> 00:07:28,356
past, such as Creative Commons,
who want to help people get more

115
00:07:28,366 --> 00:07:31,019
access to copyrightable materials.

116
00:07:31,239 --> 00:07:34,066
Now there are some who
feel the same way about

117
00:07:34,259 --> 00:07:36,769
making AI data more accessible.

118
00:07:36,934 --> 00:07:41,692
For instance, the nonprofit Allen
Institute for AI released a data set

119
00:07:41,844 --> 00:07:47,791
of 3 trillion tokens from a diverse mix
of web content, academic publications,

120
00:07:47,841 --> 00:07:50,731
code, books, and encyclopedic materials.

121
00:07:50,957 --> 00:07:56,737
Now, another source is synthetic data.
Now, this is a new one to me, but it

122
00:07:56,834 --> 00:08:00,254
really points to how powerful AI can be.

123
00:08:00,457 --> 00:08:05,521
So synthetic data generation means
that you use one generative AI

124
00:08:05,541 --> 00:08:08,217
tool to create synthetic data.

125
00:08:08,429 --> 00:08:12,947
And then you use that data, that
synthetic data, to train another

126
00:08:12,982 --> 00:08:14,456
generative AI tool.

127
00:08:14,639 --> 00:08:18,506
So let's say you're developing
a customer service AI model.

128
00:08:18,696 --> 00:08:23,819
You could use another generative AI
tool to create fictional customers

129
00:08:24,182 --> 00:08:26,779
and situations and interactions.

130
00:08:27,089 --> 00:08:31,032
And then you can use those
fictional customer situations and

131
00:08:31,152 --> 00:08:36,766
interactions as the training data
for your public-facing AI model.

132
00:08:36,929 --> 00:08:42,789
So that way you're not at risk
of exposing private information,

133
00:08:42,799 --> 00:08:46,456
as you would be if you were to directly put
your customer information into your AI

134
00:08:46,666 --> 00:08:51,156
tool. First, you kind of anonymize
it using one generative AI tool.

135
00:08:51,379 --> 00:08:55,396
And it's not enough just to de-identify
it because there could be customer

136
00:08:55,396 --> 00:08:59,336
situations that are so specific that
you could only point to one person.

137
00:08:59,521 --> 00:09:00,199
It's possible.

138
00:09:00,377 --> 00:09:03,112
So you also have to make up
perhaps new situations, new

139
00:09:03,112 --> 00:09:04,352
backgrounds, things like that.

140
00:09:04,554 --> 00:09:09,329
But then you can use that as your
fictional customer for your

141
00:09:09,491 --> 00:09:14,636
AI-governed customer service model, to then
use that to train to help provide

142
00:09:14,636 --> 00:09:16,596
customer service on an AI basis.

143
00:09:17,136 --> 00:09:19,836
So we will see this with
hospitals and banks as well

144
00:09:19,969 --> 00:09:21,806
that have sensitive information.

145
00:09:21,836 --> 00:09:25,196
Obviously they cannot use their
customers' sensitive information

146
00:09:25,419 --> 00:09:30,549
as training data, but they do want to
have access to what is really kind of part

147
00:09:30,549 --> 00:09:35,973
of doing business these days: having
some sort of AI-based training systems.

148
00:09:36,233 --> 00:09:40,700
And then, of course, last
but not least, is the data

149
00:09:40,700 --> 00:09:42,420
that comes from you and me.

150
00:09:42,666 --> 00:09:43,906
So, what

151
00:09:44,126 --> 00:09:49,745
does that mean? When we are using
generative AI platforms, when

152
00:09:49,755 --> 00:09:56,261
we input our prompts, if we put in
something that we've written and ask

153
00:09:56,321 --> 00:10:01,605
it to create a summary of it, if we
put in a transcript from something

154
00:10:01,645 --> 00:10:06,761
and ask it to create show notes,
everything that we put into that

155
00:10:06,938 --> 00:10:11,611
has the potential to become
training data for that platform.

156
00:10:11,818 --> 00:10:15,755
And so if we are doing that,
we need to be aware of the

157
00:10:15,755 --> 00:10:18,235
terms of use of that platform.

158
00:10:18,458 --> 00:10:22,688
Most of them will tell you that it
can be part of the training data.

159
00:10:22,881 --> 00:10:28,896
And it might also end up being an
output for someone who puts in a query,

160
00:10:28,896 --> 00:10:32,726
a prompt, that what you put in is a
perfect answer for. You just don't know.

161
00:10:32,916 --> 00:10:37,896
And so we need to be careful about
what we are putting in as prompts

162
00:10:37,896 --> 00:10:41,750
or as the input for whatever
AI platform you're using.

163
00:10:41,946 --> 00:10:44,673
Make sure you are aware of
their terms and conditions.

164
00:10:44,893 --> 00:10:48,270
Do not use any confidential
information in there,

165
00:10:48,463 --> 00:10:50,393
whether it's yours or your clients'.

166
00:10:50,580 --> 00:10:53,736
So make sure that you're
really aware of that.

167
00:10:54,053 --> 00:11:00,306
Some platforms, I'm thinking
in particular of DocuSign, do use AI.

168
00:11:00,366 --> 00:11:05,350
And obviously, when you're using
DocuSign, there are legal agreements that

169
00:11:05,350 --> 00:11:09,550
are going in there that have identifiable
information of the parties, commercial

170
00:11:09,550 --> 00:11:11,850
terms, and things like that in there.

171
00:11:12,050 --> 00:11:18,636
And so DocuSign said that they strip
out any identifying data from that, so

172
00:11:18,636 --> 00:11:23,040
that they do use the agreements for
training data, but that they do strip

173
00:11:23,100 --> 00:11:25,140
out identifying information from it.

174
00:11:25,356 --> 00:11:27,133
So things to be aware of.

175
00:11:27,353 --> 00:11:31,653
In summary, the legal issues I think
we've covered, but just to sum them

176
00:11:31,653 --> 00:11:37,468
up, there are the copyright issues
of putting data into the database.

177
00:11:37,780 --> 00:11:41,203
I believe it was last week
I talked about the

178
00:11:41,203 --> 00:11:43,213
copyrightability issues of the output.

179
00:11:43,410 --> 00:11:46,450
So now I'm talking about the
copyright issues with the input,

180
00:11:46,601 --> 00:11:52,856
whether or not the AI platform
or you have the right to add

181
00:11:52,970 --> 00:11:56,831
information to the training
data set, whether or not that

182
00:11:56,851 --> 00:11:58,481
is a copyright infringement.

183
00:11:58,531 --> 00:12:00,891
Is that fair use of that data?

184
00:12:01,023 --> 00:12:06,316
One of the issues on the copyright side
is that sometimes the output will literally

185
00:12:06,336 --> 00:12:13,150
be an exact replica of what went in, and
it's hard to make a fair use argument when

186
00:12:13,350 --> 00:12:18,420
verbatim paragraphs, in the case
of the New York Times, which is the basis

187
00:12:18,420 --> 00:12:23,560
of their lawsuit against OpenAI, verbatim
paragraphs, come out as the output.

188
00:12:23,740 --> 00:12:25,000
Where's the fair use there?

189
00:12:25,196 --> 00:12:27,123
Same with Getty Images.

190
00:12:27,415 --> 00:12:33,143
They've had exact replicas of their
images come out of an AI platform.

191
00:12:33,383 --> 00:12:35,273
So that's obviously an issue.

192
00:12:35,483 --> 00:12:38,243
In addition to copyright issues,
we have privacy concerns.

193
00:12:38,463 --> 00:12:43,423
Maybe real images of people, where there
are instances of real images of people

194
00:12:43,460 --> 00:12:48,480
coming out. Most certainly, private
photos that are from somebody's old

195
00:12:48,480 --> 00:12:51,623
Facebook or old blog posts, old journals.

196
00:12:51,633 --> 00:12:55,530
Think about it: original blogs
were kind of like journals, right?

197
00:12:55,530 --> 00:12:58,666
And people would use them as
a journal and they're probably

198
00:12:58,666 --> 00:12:59,856
hanging around somewhere.

199
00:12:59,856 --> 00:13:02,706
Think about, I mean, I'm thinking
about a blog, I guess it was

200
00:13:02,786 --> 00:13:04,816
at the time, that I started.

201
00:13:04,816 --> 00:13:07,560
I mean, it didn't last very
long and no one ever saw it.

202
00:13:07,760 --> 00:13:08,921
But it's still somewhere.

203
00:13:08,921 --> 00:13:11,741
Like, I don't know if I could find it
today, but it's still out there and

204
00:13:11,968 --> 00:13:13,828
somebody's web crawler could find it.

205
00:13:13,848 --> 00:13:16,200
I don't think they'd be very
interested, but it's there.

206
00:13:16,396 --> 00:13:18,950
And so we do have the privacy concerns.

207
00:13:19,206 --> 00:13:25,470
And then we have the contract breach
issue: if we are using, say, a client's

208
00:13:25,470 --> 00:13:30,693
confidential information and we're
entering it into an AI chatbot,

209
00:13:30,938 --> 00:13:34,465
and it has the potential to be shared,

210
00:13:34,555 --> 00:13:38,265
we are breaching our contractual
obligations to our clients

211
00:13:38,295 --> 00:13:44,571
if we're doing that without permission,
even if the contract is silent on the specifics

212
00:13:44,611 --> 00:13:49,245
of whether or not you can use AI. And some
contracts are being explicit about it.

213
00:13:49,435 --> 00:13:50,771
But even if it's silent,

214
00:13:50,958 --> 00:13:54,423
if you are obligated to keep the
client's information

215
00:13:54,423 --> 00:13:58,663
confidential and only share it under
very specific circumstances, putting

216
00:13:58,663 --> 00:14:03,060
it into an AI platform is probably
not one of those permitted uses.

217
00:14:03,383 --> 00:14:05,963
So you do have issues there as well.

218
00:14:06,180 --> 00:14:06,576
All right.

219
00:14:06,666 --> 00:14:11,693
So that is what I wanted to cover
today regarding AI training data.

220
00:14:11,866 --> 00:14:14,516
As you know, this is a fast-moving

221
00:14:14,618 --> 00:14:17,908
matter. You know, who knows
what will come next week?

222
00:14:17,908 --> 00:14:22,471
I'll try to keep you up to date, but
always feel free to connect with me and

223
00:14:22,471 --> 00:14:23,901
let me know what your questions are.

224
00:14:23,901 --> 00:14:26,141
I'm always happy to answer them.

225
00:14:26,368 --> 00:14:26,955
Thanks again.

226
00:14:26,975 --> 00:14:29,055
And don't forget IP is fuel.