1
00:00:01,720 --> 00:00:09,080
I'm Miko Pawlikowski,
and this is HockeyStick.

2
00:00:09,100 --> 00:00:13,200
LLMs, or Large Language Models,
are taking the world by storm.

3
00:00:13,279 --> 00:00:18,069
This breakthrough artificial intelligence
technology promises to fundamentally

4
00:00:18,069 --> 00:00:20,140
reshape the way we work with computers.

5
00:00:20,299 --> 00:00:23,700
Over the last year, we've witnessed
its Hockey Stick moment, and as

6
00:00:23,700 --> 00:00:28,485
of early 2024, we're firmly in
the Cambrian explosion phase.

7
00:00:28,645 --> 00:00:32,765
Today, we're taking a deep dive into how
these models went from humble beginnings to

8
00:00:32,765 --> 00:00:35,505
making people scared of imminent Skynet.

9
00:00:35,595 --> 00:00:39,235
I'm joined by two experts, Chris
Brousseau, staff machine learning

10
00:00:39,235 --> 00:00:43,655
engineer at JP Morgan, and Matthew
Sharp, MLOps engineer at LTK, the

11
00:00:43,864 --> 00:00:49,115
authors of "Production LLMs" currently
available in early access at manning.com.

12
00:00:49,265 --> 00:00:52,685
In this conversation, we'll cover
the intricacies of human language

13
00:00:52,685 --> 00:00:54,335
and how machines can understand it.

14
00:00:54,425 --> 00:00:58,325
We'll give you the vocab to sound smart at the
next family gathering, and discuss the

15
00:00:58,325 --> 00:01:02,975
various mathematical ideas and models
ultimately leading to LLMs, as well as

16
00:01:02,975 --> 00:01:05,735
some noteworthy examples beyond ChatGPT.

17
00:01:05,855 --> 00:01:08,315
Welcome to this episode and please enjoy.

18
00:01:08,416 --> 00:01:09,256
Where should we start?

19
00:01:09,571 --> 00:01:11,431
How did you guys meet?

20
00:01:11,481 --> 00:01:13,631
We happen to both live in Utah, and we

21
00:01:13,631 --> 00:01:16,101
actually met at a meetup.

22
00:01:16,121 --> 00:01:19,881
It was an MLOps meetup; that
was the primary one where we met.

23
00:01:20,981 --> 00:01:25,331
It happens once a month and we'd get
together, and so that's our origin story.

24
00:01:25,411 --> 00:01:28,531
We became friends through there,
started helping each other with

25
00:01:28,581 --> 00:01:32,321
content creation: Chris was starting
a YouTube channel, I write on

26
00:01:32,381 --> 00:01:35,131
LinkedIn, just giving each other
feedback and helping each other out.

27
00:01:35,131 --> 00:01:37,961
It was especially helpful because
I was trying to figure out how

28
00:01:38,001 --> 00:01:42,351
best to present a lot of the
material that's in our book now.

29
00:01:42,921 --> 00:01:45,001
How do you explain a transformer model?

30
00:01:45,071 --> 00:01:49,251
And Matt was fantastic about helping
me find my voice on YouTube.

31
00:01:49,301 --> 00:01:54,431
Okay, so going from meeting
someone at a meetup, to committing

32
00:01:54,431 --> 00:01:57,901
to spending a couple of years
working on a book with someone:

33
00:01:58,231 --> 00:01:59,531
that's a little bit of a difference.

34
00:01:59,901 --> 00:02:01,911
Was there any particular
moment where it just clicked?

35
00:02:01,951 --> 00:02:03,611
"Oh, we need to write a book".

36
00:02:03,841 --> 00:02:05,751
How did you come up with the idea?

37
00:02:05,801 --> 00:02:10,541
I was approached, and I would
love to write a book, but I don't

38
00:02:10,551 --> 00:02:12,361
know a lot about that process.

39
00:02:12,491 --> 00:02:15,681
And obviously, I didn't really
have an authorship voice.

40
00:02:15,851 --> 00:02:18,381
I am not experienced in content creation.

41
00:02:19,061 --> 00:02:23,001
And while I was going through the
process of talking with some different

42
00:02:23,001 --> 00:02:28,651
publishers, Matt approached me and
said: "Hey, I was a technical reviewer

43
00:02:28,681 --> 00:02:32,751
on Fundamentals of Data Engineering
by Joe Reis and Matt Housley."

44
00:02:33,641 --> 00:02:39,681
And so he had experience and he had
subject matter expertise, and he was

45
00:02:39,681 --> 00:02:42,331
giving me some advice and I said,
"You know what, why don't you just

46
00:02:42,331 --> 00:02:47,491
come on as a coauthor? You obviously
could help a lot here, and I need

47
00:02:47,491 --> 00:02:49,971
it, so let's just do it together".

48
00:02:50,031 --> 00:02:54,871
Yeah, I think that it worked out really
well because Chris has that background in

49
00:02:54,881 --> 00:02:59,211
linguistics, he understands the natural
language processing side better than

50
00:02:59,331 --> 00:03:04,751
anyone else I've met in person, and I
was coming more from the MLOps side,

51
00:03:04,751 --> 00:03:06,201
how do we actually deploy these things?

52
00:03:06,201 --> 00:03:13,481
And so I think it's really rounded out our
book better than anything else I'm seeing

53
00:03:13,551 --> 00:03:15,721
out there that you could buy and read.

54
00:03:15,721 --> 00:03:19,271
Getting that diverse perspective,
I think, really helps our book out.

55
00:03:19,816 --> 00:03:24,246
I was very excited when you said 'yes'
to coming onto this because since last

56
00:03:24,246 --> 00:03:30,434
year I think in most people's minds
sometime early last year with ChatGPT.

57
00:03:30,914 --> 00:03:34,944
All of a sudden, everybody started
talking about large language

58
00:03:34,964 --> 00:03:40,254
models, and some people started
worrying about impending doom and

59
00:03:40,274 --> 00:03:42,414
robot apocalypse, and all of that.

60
00:03:43,224 --> 00:03:47,324
But from a perspective of someone
who's worked with that for the best

61
00:03:47,324 --> 00:03:49,754
part of a decade now, I'm wondering:

62
00:03:50,259 --> 00:03:54,839
what was the point when you realized
that these LLMs, they're really onto

63
00:03:54,839 --> 00:04:01,689
something and they're moving from a
demo to an actual legitimate technology

64
00:04:01,689 --> 00:04:02,959
that's going to change things.

65
00:04:02,999 --> 00:04:06,289
What was the hockey stick moment for LLMs?

66
00:04:06,326 --> 00:04:07,006
Oh, boy.

67
00:04:07,056 --> 00:04:11,276
For me, without a doubt,
that was the release of T5.

68
00:04:12,021 --> 00:04:18,091
And looking at Google's paper about the
text-to-text transformer, that set really

69
00:04:18,131 --> 00:04:20,821
the groundwork for prompting, right?

70
00:04:20,831 --> 00:04:25,631
They had a whole bunch of different
tasks that you didn't have to change

71
00:04:25,701 --> 00:04:28,531
anything other than some statement

72
00:04:29,191 --> 00:04:32,931
for the model to do that task,
and then a colon and then whatever

73
00:04:32,931 --> 00:04:34,611
your input was going to be anyway.

74
00:04:34,611 --> 00:04:36,811
That was groundbreaking to me.

75
00:04:36,811 --> 00:04:39,281
I had been messing around with GPT-2.

76
00:04:39,301 --> 00:04:41,811
I'd been playing with that and
trying to shoehorn it into a

77
00:04:41,811 --> 00:04:43,271
product where I was working.

78
00:04:43,541 --> 00:04:49,791
T5 did everything that we were trying
to do with GPT-2, and it was incredibly

79
00:04:49,791 --> 00:04:54,371
flexible, it was easy to fine-tune, and
for me, that was the hockey stick moment

80
00:04:54,371 --> 00:04:56,871
that "oh wow, no, they're really cooking".

81
00:04:56,871 --> 00:04:57,731
When was that?

82
00:04:57,732 --> 00:05:00,994
For anybody who hasn't heard of

83
00:05:01,049 --> 00:05:01,139
T5?

84
00:05:01,139 --> 00:05:04,927
I think it was 2019. Yeah, "Exploring
the Limits of Transfer Learning

85
00:05:04,927 --> 00:05:08,817
with a Unified Text-to-Text
Transformer" was October 2019.

86
00:05:08,877 --> 00:05:10,177
It came out in October.

87
00:05:10,197 --> 00:05:13,537
I think I picked it up in
November-December of 2019.

88
00:05:13,964 --> 00:05:18,934
Yeah, I think for my hockey stick
moment, I was in the industry,

89
00:05:18,944 --> 00:05:23,604
I'd been paying attention, obviously,
GPT-2 coming around, T5, etc.

90
00:05:23,654 --> 00:05:30,444
But I wasn't really seeing the adoption
that someone who's working in MLOps

91
00:05:30,714 --> 00:05:35,024
cares more about. I was seeing that these
models can do really cool things,

92
00:05:35,024 --> 00:05:36,854
but people weren't caring about them.

93
00:05:36,944 --> 00:05:40,774
Sam Altman even said it was
like, "we didn't think GPT-3

94
00:05:40,794 --> 00:05:42,624
would be that big of a success.

95
00:05:42,624 --> 00:05:44,974
We thought that would happen once GPT-4 came out."

96
00:05:45,714 --> 00:05:49,304
But I just remember January 2023.

97
00:05:50,024 --> 00:05:51,784
ChatGPT's been out a month.

98
00:05:52,024 --> 00:05:53,704
It's still essentially in beta.

99
00:05:53,784 --> 00:05:57,674
They just released it to get feedback
and to start collecting data

100
00:05:57,674 --> 00:05:59,204
to start improving their model.

101
00:05:59,734 --> 00:06:01,144
But it blew up, right?

102
00:06:01,174 --> 00:06:07,634
I just remember being at a church
function and this guy sitting

103
00:06:07,634 --> 00:06:12,224
across the table from me who has
no idea anything about AI, right?

104
00:06:12,244 --> 00:06:17,324
I was stuck at this table for an hour
and all he could talk about was GPT-3.

105
00:06:17,684 --> 00:06:19,164
He was obsessed with it.

106
00:06:19,564 --> 00:06:20,704
I'm like, oh, wow.

107
00:06:21,364 --> 00:06:26,984
Even people who don't know anything
about machine learning or AI or the

108
00:06:26,984 --> 00:06:32,474
industry were really going gung
ho. And his wife was an English teacher.

109
00:06:32,964 --> 00:06:36,564
She was really scared of it and was
like, "how are we gonna help kids

110
00:06:36,564 --> 00:06:42,234
learn how to write and read when
they can just go online and now cheat

111
00:06:42,234 --> 00:06:43,424
and write these things and stuff?"

112
00:06:44,104 --> 00:06:47,544
The very beginning of what
everyone's had conversations about now.

113
00:06:47,594 --> 00:06:54,284
But he talked about how his
brother-in-law owned a website that made fake

114
00:06:54,284 --> 00:06:58,854
articles, you can think like The Onion. And
so once it came out, in that month, like

115
00:06:58,854 --> 00:07:04,869
I said, ChatGPT still wasn't a product
yet, and anyone who's been following

116
00:07:04,869 --> 00:07:08,749
it knows a lot of those demos just
shut down and then never came back up.

117
00:07:08,859 --> 00:07:14,189
His brother-in-law ended up firing
like a hundred writers because he's

118
00:07:14,189 --> 00:07:19,739
like: "Oh, ChatGPT can make these funny
fake articles and we're good, right?"

119
00:07:19,779 --> 00:07:24,099
That was my hockey stick moment
of "okay, we really are changing

120
00:07:24,149 --> 00:07:28,049
when some random guy at church is
talking about it all the time".

121
00:07:28,774 --> 00:07:29,974
Yeah, I love that example.

122
00:07:30,004 --> 00:07:34,364
But even for people who are in tech
who weren't directly following that

123
00:07:34,364 --> 00:07:36,554
very closely, that was a scary moment.

124
00:07:36,564 --> 00:07:42,364
I remember when I first used a copilot,
I was like, what, it just does that.

125
00:07:42,574 --> 00:07:45,454
And three times out of four,
it would actually work.

126
00:07:45,724 --> 00:07:46,804
That was a scary moment.

127
00:07:46,854 --> 00:07:51,654
It reverberated through a lot of
levels of society, including our own.

128
00:07:51,884 --> 00:07:57,504
And, I think in many ways, technology
and writing code might be the easiest

129
00:07:57,514 --> 00:07:59,714
use case for these kinds of models, right?

130
00:07:59,714 --> 00:08:00,554
Do you agree with that?

131
00:08:00,554 --> 00:08:04,844
I don't know if I completely agree
with it, because code is incredibly

132
00:08:04,844 --> 00:08:06,844
syntactically dependent, right?

133
00:08:06,914 --> 00:08:11,594
Every developer who's worked with
JavaScript or C++ and then moves

134
00:08:11,594 --> 00:08:13,514
to Python, they feel it, right?

135
00:08:13,534 --> 00:08:16,774
That's one of the biggest complaints
is "I hate Python syntax".

136
00:08:16,814 --> 00:08:21,444
"I hate that whitespace matters." It's
a little bit more complex than just

137
00:08:21,444 --> 00:08:25,394
repeating whatever natural language
happened, but you're absolutely right

138
00:08:25,414 --> 00:08:28,344
that is one of the best use cases so far.

139
00:08:29,281 --> 00:08:33,631
Is that because it's better structured than
just spoken language, or are there any

140
00:08:33,681 --> 00:08:37,541
other reasons that make it so well
suited for that particular application?

141
00:08:37,591 --> 00:08:40,551
Programming languages are
not real languages, right?

142
00:08:40,551 --> 00:08:44,571
One of the things that makes them
simultaneously very well- and ill-suited

143
00:08:44,591 --> 00:08:49,471
for it is how much gets repeated.
You use the exact same words,

144
00:08:49,756 --> 00:08:54,706
the exact same tokens to define every
function that you make, but then the

145
00:08:54,706 --> 00:08:57,226
function's name can be whatever you want.

146
00:08:57,996 --> 00:09:01,326
And so using the exact
same tokens is awesome.

147
00:09:01,326 --> 00:09:03,666
That provides landmarks
for the probability as it's

148
00:09:03,666 --> 00:09:04,746
going through all of this.

149
00:09:05,156 --> 00:09:09,041
But then that input to just say
whatever you want and put it in camel

150
00:09:09,041 --> 00:09:13,331
case or snake case or whatever, tons
of different formatting for functions.

151
00:09:14,751 --> 00:09:16,761
It makes it a little bit more difficult,

152
00:09:17,146 --> 00:09:19,206
especially while you're
trying to tokenize that.

153
00:09:19,396 --> 00:09:24,876
One of the big benefits with code is
the amount of data we have around code.

154
00:09:24,876 --> 00:09:26,546
Lots of people are writing code.

155
00:09:26,716 --> 00:09:31,106
They all have very similar ideas
of what they're trying to do, of

156
00:09:31,106 --> 00:09:33,711
what they're trying to architect,
of what they're trying to design.

157
00:09:33,711 --> 00:09:38,181
And so we're not necessarily
worrying about hallucinations or

158
00:09:38,181 --> 00:09:42,071
fake news or people disagreeing
or other things like that.

159
00:09:42,111 --> 00:09:45,851
There's just a lot of data that
all agrees with each other and

160
00:09:45,851 --> 00:09:47,231
pushes in the same direction.

161
00:09:47,471 --> 00:09:48,251
It makes it good.

162
00:09:48,341 --> 00:09:53,281
There are obviously some negatives to just
assuming some of these LLMs writing

163
00:09:53,281 --> 00:09:57,811
code are going to do things well, but
I think Chris highlighted that already.

164
00:09:58,619 --> 00:10:02,029
It's actually really similar
to how regular languages work.

165
00:10:02,129 --> 00:10:06,559
If we have more Python data, like Matt's
saying, it's going to do better at Python.

166
00:10:07,019 --> 00:10:11,709
And that can create a little bit of a
positive feedback loop with LLMs, where

167
00:10:11,709 --> 00:10:15,889
a lot of people want to get into Python,
and they're very good at it, but then

168
00:10:15,889 --> 00:10:21,279
when you look at emerging languages like
Mojo, for example, it's really difficult

169
00:10:21,289 --> 00:10:25,979
to find that data and so LLMs are worse
at it, similar to natural languages

170
00:10:25,979 --> 00:10:30,479
that have a lower number of speakers,
a lower presence on the internet.

171
00:10:31,686 --> 00:10:37,096
So is the solution to use an LLM to
generate a lot of Mojo and make it

172
00:10:37,096 --> 00:10:39,326
a significant percentage of GitHub?

173
00:10:41,269 --> 00:10:42,309
That'd be fun, dude.

174
00:10:42,599 --> 00:10:46,919
I think there are some problems
with synthetic data that can lead

175
00:10:46,919 --> 00:10:48,379
to stuff like model collapse.

176
00:10:48,569 --> 00:10:51,199
I don't know if we're going to see
that in the code space, though.

177
00:10:51,539 --> 00:10:53,549
I think we could see
that in natural language.

178
00:10:53,749 --> 00:10:55,869
So that might be a valid solution.

179
00:10:56,716 --> 00:10:57,076
Okay.

180
00:10:57,126 --> 00:11:03,276
The date is 13 February, the
day before Valentine's Day 2024.

181
00:11:03,276 --> 00:11:05,126
I'm going to ask you
for a wild prediction.

182
00:11:05,146 --> 00:11:06,616
Where do you see that going?

183
00:11:06,766 --> 00:11:12,816
Should all kinds of, or maybe any subset
of, programmers who produce code as a

184
00:11:12,816 --> 00:11:16,246
job, should they at least start worrying?

185
00:11:16,736 --> 00:11:21,566
Is that something that's going to
decrease the pool of available jobs?

186
00:11:22,864 --> 00:11:26,844
No, I don't think it's really
going to impact the amount of work.

187
00:11:27,754 --> 00:11:32,584
I just think about my job, and even when
I'm in very technical roles, and I'm

188
00:11:32,584 --> 00:11:38,304
spending 50% of my time on the keyboard,
still, it feels like a majority of the

189
00:11:38,304 --> 00:11:42,744
work is still just communicating with
stakeholders, understanding exactly what

190
00:11:42,744 --> 00:11:48,254
the problems are, technical writing,
design docs, really understanding at

191
00:11:48,254 --> 00:11:50,234
a high level, what you want to build.

192
00:11:50,284 --> 00:11:53,434
To be fair, programmers have
been automating the 'writing the

193
00:11:53,434 --> 00:11:55,794
code' portion forever, right?

194
00:11:55,874 --> 00:11:56,794
From the beginning.

195
00:11:57,651 --> 00:12:02,401
Yeah, with massive amounts of
scripts and configs that they use.

196
00:12:02,401 --> 00:12:06,651
And that's why they love
Vim or Emacs still, right?

197
00:12:06,651 --> 00:12:08,961
It's because they have
it configured just right.

198
00:12:08,961 --> 00:12:12,691
And they can move really quickly,
because it provides a lot of that

199
00:12:12,741 --> 00:12:16,521
automation for them already, but
this is just helping junior engineers

200
00:12:16,521 --> 00:12:21,041
already have all that configuration
and set up really quickly, right?

201
00:12:21,091 --> 00:12:27,291
It mostly will just make our jobs a little
bit easier. It doesn't remove the need to

202
00:12:27,301 --> 00:12:31,701
really understand the engineering aspect,
the architecture aspect, the design

203
00:12:31,721 --> 00:12:34,441
aspect that still is involved with coding.

204
00:12:35,139 --> 00:12:35,689
Oh, yeah.

205
00:12:36,329 --> 00:12:39,909
This is why we love comparing
LLMs to the printing press.

206
00:12:40,259 --> 00:12:42,158
That's Johannes Gutenberg.

207
00:12:42,159 --> 00:12:44,919
Because did that destroy
the writing industry?

208
00:12:44,969 --> 00:12:49,379
All it did was it destroyed the
monopoly that certain organizations

209
00:12:49,519 --> 00:12:50,919
had on publishing books.

210
00:12:51,489 --> 00:12:55,329
Before you had to get a scribe and you
had to pay the scribe and you had to

211
00:12:55,329 --> 00:12:59,669
have access to scribes. You couldn't
just walk up to a printing press and

212
00:13:00,039 --> 00:13:02,669
hit it and then, boom, you have a book.

213
00:13:02,749 --> 00:13:04,949
You have to have knowledge.
You have to have an idea.

214
00:13:05,159 --> 00:13:09,319
The printing press just gives
you a lower barrier to entry,

215
00:13:10,114 --> 00:13:11,964
which is what we love, right?

216
00:13:12,354 --> 00:13:16,914
For coding, I think Matt is exactly
right, that it's a lower barrier to

217
00:13:16,924 --> 00:13:21,204
entry for junior engineers to be able
to produce significantly better work.

218
00:13:21,424 --> 00:13:26,104
And in some ways it actually accelerates
it, because when you copy and paste what

219
00:13:26,104 --> 00:13:30,654
an LLM gave you and it doesn't work,
you have to go figure it out, right?

220
00:13:31,004 --> 00:13:35,384
Along with the junior engineers, it also
helps speed up senior engineers, and

221
00:13:35,684 --> 00:13:37,304
staff engineers and principal engineers.

222
00:13:37,364 --> 00:13:42,194
It's good, and it lowers the barrier for
the entire industry. We like that.

223
00:13:43,361 --> 00:13:43,701
Yeah.

224
00:13:43,806 --> 00:13:47,606
I've lately been spending lots of
time writing chapter 10 of our book,

225
00:13:47,606 --> 00:13:51,981
and in chapter 10, we actually go
through a project, where we help you

226
00:13:51,981 --> 00:13:58,471
build your own copilot and we build
the VS Code extension to get it in.

227
00:13:58,481 --> 00:14:03,751
If you want to be running your own LLM
on your own computer with your own data,

228
00:14:04,241 --> 00:14:05,941
so that way, you can get your own things.

229
00:14:06,651 --> 00:14:08,301
We walk through all the steps to do that.

230
00:14:08,331 --> 00:14:12,001
And in some aspects, it's
interesting, 'cause sometimes

231
00:14:12,591 --> 00:14:15,341
adding an extra feature
made the model work, right?

232
00:14:15,411 --> 00:14:18,121
There's still just so
much to learn about it.

233
00:14:18,171 --> 00:14:20,221
Ultimately, it comes
down to your data, right?

234
00:14:20,621 --> 00:14:22,281
How good is your coding data?

235
00:14:22,381 --> 00:14:24,541
That's really how well the
copilot works, right?

236
00:14:24,591 --> 00:14:28,741
SQL is one of the most repetitive
of all of the programming languages.

237
00:14:29,106 --> 00:14:34,036
But true skill with SQL does
not involve being good at SQL.

238
00:14:34,036 --> 00:14:36,496
It involves knowing the data, right?

239
00:14:36,506 --> 00:14:41,956
It's knowing which tables to query, how to
merge them, how window functions work, all of

240
00:14:41,956 --> 00:14:47,176
that stuff, knowing exactly what you need
to be looking at is the true skill in SQL.

241
00:14:47,776 --> 00:14:51,726
And we're hopefully getting to
a point where we can help the

242
00:14:51,726 --> 00:14:54,911
model know the data, right?

243
00:14:54,911 --> 00:14:59,101
We can give it some sort of context for
the data that it's going to be looking

244
00:14:59,101 --> 00:15:01,371
at, so that it can generate good SQL.

245
00:15:01,421 --> 00:15:02,271
That's a really good point.

246
00:15:02,281 --> 00:15:06,411
I've actually had lots of mentees who are
trying to learn SQL for the first time.

247
00:15:07,081 --> 00:15:13,051
I said: "just use ChatGPT." Generating
SQL is actually something that it's really

248
00:15:13,051 --> 00:15:18,111
good at. You don't need GPT-4, like
even GPT-3, like even GPT-2; it's not

249
00:15:18,111 --> 00:15:20,841
hard to generate really good SQL syntax.

250
00:15:20,841 --> 00:15:24,321
'Cause it's so simple; it follows
a very similar structure.

251
00:15:24,821 --> 00:15:28,741
But ultimately, you can have it write
the SQL, but you're going to have to

252
00:15:28,741 --> 00:15:33,361
go back and figure out how to connect
all the pieces and understand your

253
00:15:33,361 --> 00:15:35,081
database and understand your data.

254
00:15:35,181 --> 00:15:38,471
That's a perfect example:
understanding how to write the

255
00:15:38,471 --> 00:15:39,761
code is only half the problem.

256
00:15:39,761 --> 00:15:43,021
Understanding how to integrate
it is really the bigger problem.

257
00:15:43,021 --> 00:15:47,331
What's the most terrible use
case that people are currently

258
00:15:47,331 --> 00:15:48,961
trying to use LLMs for?

259
00:15:49,091 --> 00:15:55,471
What does an LLM in general, or LLMs,
what do they suck at the most?

260
00:15:56,591 --> 00:16:03,251
I'm going to say they suck at
sequence prediction, which sounds so off,

261
00:16:03,816 --> 00:16:07,876
because that's what they're made for,
but one of the things that I'm seeing

262
00:16:07,876 --> 00:16:12,946
people do, is try and automate entire
workflows with LLMs, and they're trying

263
00:16:12,946 --> 00:16:18,786
to get the LLM to just do the whole
workflow, and they suck at that.

264
00:16:18,966 --> 00:16:20,836
They need all of this stuff to help them.

265
00:16:20,836 --> 00:16:26,486
They need tools, they need RAG, they
need specific fine-tuning landmarks

266
00:16:26,516 --> 00:16:30,846
and they need few-shot prompting,
they need all sorts of stuff to make

267
00:16:30,846 --> 00:16:34,246
it work, and then it's still up in
the air about whether or not it will

268
00:16:34,246 --> 00:16:36,066
do the right task in the right order.

269
00:16:36,878 --> 00:16:39,858
Yeah, I was thinking, I don't
know how much I'm seeing this.

270
00:16:39,858 --> 00:16:45,468
But, three months, six months ago, I
was hearing a hundred horror stories

271
00:16:45,468 --> 00:16:51,118
about, essentially CEOs being like,
"we need LLMs," and like they're magic,

272
00:16:51,118 --> 00:16:55,798
they can do anything. And so it didn't
matter what the problem was, "oh, we need

273
00:16:55,808 --> 00:16:59,708
to do outlier detection using LLMs".

274
00:17:00,148 --> 00:17:00,638
No,

275
00:17:00,741 --> 00:17:02,021
use stats for that.

276
00:17:02,548 --> 00:17:05,058
Yeah, outlier detection is
really a statistical problem.

277
00:17:05,058 --> 00:17:07,308
It's really a data and math problem.

278
00:17:07,348 --> 00:17:09,508
LLMs are good at natural language.

279
00:17:09,928 --> 00:17:13,658
And so when we can solve a problem
using words and communication,

280
00:17:13,908 --> 00:17:15,288
that's when LLMs can get in.

281
00:17:15,288 --> 00:17:21,068
But problems like outlier detection
or weather prediction or these

282
00:17:21,068 --> 00:17:22,878
other things, we have algorithms.

283
00:17:22,878 --> 00:17:25,781
Stock market prediction,
Super Bowl prediction,

284
00:17:25,941 --> 00:17:30,111
All these things, we have
better ways to make predictions.

285
00:17:30,486 --> 00:17:31,966
And it's called math, right?

286
00:17:32,056 --> 00:17:35,906
Fourier transforms, other machine learning
algorithms, other things like that.

287
00:17:37,106 --> 00:17:41,006
LLMs are not good at doing those
things, cause we don't talk

288
00:17:41,016 --> 00:17:43,266
about them in natural language.

289
00:17:43,396 --> 00:17:47,086
We've invented other languages
like math just to describe them.

290
00:17:47,556 --> 00:17:48,726
And that's why they're not good.

291
00:17:48,736 --> 00:17:53,896
We can make tools; you can build
functions for an LLM to use to do Fourier

292
00:17:53,896 --> 00:17:56,156
transforms and whatever else, right?

293
00:17:57,276 --> 00:18:02,026
But getting the LLM to know that it
needs to do that is really difficult.

294
00:18:02,026 --> 00:18:07,226
Probably just as difficult as
explaining what the Fourier transform

295
00:18:07,256 --> 00:18:11,356
is to an LLM within your training data
to get it to be able to replicate it.

296
00:18:11,856 --> 00:18:16,416
This is one thing that makes it
almost miraculous when stuff does

297
00:18:16,416 --> 00:18:20,666
work, and that's that feeling that
we're chasing right now, and that's

298
00:18:20,756 --> 00:18:24,656
the replicability that we're trying
to help people get to in our book.

299
00:18:25,016 --> 00:18:28,096
How do you actually do it, and how
do you make sure that your scope

300
00:18:28,096 --> 00:18:31,756
is small enough, that it will work
repeatedly and you can build a

301
00:18:31,756 --> 00:18:33,586
product off of it? That's difficult.

302
00:18:33,636 --> 00:18:34,906
I'm a big fan of chess.

303
00:18:35,156 --> 00:18:41,461
And, since ChatGPT came out, lots of
people have been making memes, or just

304
00:18:41,461 --> 00:18:47,481
like: "Hey, I'll play ChatGPT in chess",
and ChatGPT can play chess because we

305
00:18:47,481 --> 00:18:48,961
can talk about it in language, right?

306
00:18:48,971 --> 00:18:54,276
Like e4, move the pawn, or
knight to g6, whatever it is.

307
00:18:54,886 --> 00:18:58,896
We have language for it,
but ChatGPT has no idea.

308
00:18:58,946 --> 00:19:04,216
It has no idea of the model behind
those letter-number combinations.

309
00:19:04,226 --> 00:19:06,956
All it knows is that there's
certain things it can do, right?

310
00:19:07,396 --> 00:19:11,746
It writes words, and so when they
do this in these videos or

311
00:19:11,766 --> 00:19:14,676
memes, they just let ChatGPT
do whatever it says, right?

312
00:19:14,676 --> 00:19:18,556
It just magically creates a knight out
of nowhere, and magically will take

313
00:19:18,556 --> 00:19:23,586
its own pieces as it moves its pieces
around, it's always pretty funny.

314
00:19:23,596 --> 00:19:27,216
And even though it's cheating the entire
way, it almost always loses, right?

315
00:19:27,216 --> 00:19:31,061
'Cause it doesn't have an understanding
of chess, like it doesn't

316
00:19:31,091 --> 00:19:32,721
have that model underneath it.

317
00:19:33,771 --> 00:19:36,831
Sure, we can talk about it in
language, but not really, right?

318
00:19:36,831 --> 00:19:42,571
So we still have better ways to
play chess: AlphaZero, et cetera.

319
00:19:43,001 --> 00:19:46,851
Stockfish, like there are engines out
there that play chess really well.

320
00:19:46,931 --> 00:19:51,661
And we don't need to make LLMs good at
chess, but that's a very good example

321
00:19:51,671 --> 00:19:53,231
of one of the things it's not good at.

322
00:19:53,801 --> 00:19:59,901
I've seen someone on Twitter who
said "I'm gonna give an LLM $1000 or

323
00:19:59,901 --> 00:20:04,751
whatever initial amount, and I'm
gonna ask it how to best invest it."

324
00:20:04,781 --> 00:20:06,321
I didn't follow where it went.

325
00:20:06,341 --> 00:20:09,621
But I think a lot of
people had the same idea.

326
00:20:09,621 --> 00:20:11,381
This is some kind of genius system.

327
00:20:11,421 --> 00:20:16,031
I'm just gonna be its flesh and
bones agent in the real world

328
00:20:16,896 --> 00:20:18,076
and hope for the best.

329
00:20:18,276 --> 00:20:20,286
So I think that kind of goes
back to your chess thing.

330
00:20:20,286 --> 00:20:24,846
So excuse me for that, but
I have to ask you about AGI,

331
00:20:25,866 --> 00:20:27,976
Artificial General Intelligence.

332
00:20:28,046 --> 00:20:30,746
Any chance for that
happening anytime soon?

333
00:20:31,076 --> 00:20:32,006
What's your prediction?

334
00:20:32,006 --> 00:20:33,836
not with our current systems.

335
00:20:33,896 --> 00:20:38,826
No, I don't think AGI is ever
going to come out of quadratic

336
00:20:38,836 --> 00:20:42,126
equations, like not a single chance.

337
00:20:43,166 --> 00:20:47,966
Maybe if there are better drop-in
sub-quadratic replacements, stuff

338
00:20:47,966 --> 00:20:49,886
like Hyena. I've tested that out.

339
00:20:49,886 --> 00:20:51,076
I think it's really cool.

340
00:20:51,386 --> 00:20:55,696
But, the fact that attention,
the query key value attention,

341
00:20:55,746 --> 00:20:58,496
ultimately generates complex numbers.

342
00:20:58,556 --> 00:21:03,326
I think that is a little too
much for AGI at the moment.

343
00:21:03,326 --> 00:21:07,206
So you're not one of those people
who secretly hope that OpenAI has

344
00:21:07,206 --> 00:21:08,886
something they're gonna release soon.

345
00:21:08,936 --> 00:21:10,936
I don't think they have it, right?

346
00:21:10,936 --> 00:21:12,566
I'll be hopeful, sure.

347
00:21:12,566 --> 00:21:14,186
If it comes out, that's great.

348
00:21:14,196 --> 00:21:16,216
Yeah, I'm of the same mind as Chris.

349
00:21:16,216 --> 00:21:17,486
I hope they keep pursuing it.

350
00:21:17,546 --> 00:21:21,346
We've gotten major breakthroughs
from what they pursued.

351
00:21:21,396 --> 00:21:25,756
It's very possible AGI will happen in
my lifetime; I'm still pretty young. We

352
00:21:25,756 --> 00:21:30,276
keep on making advances really quickly,
but are we relatively close to it?

353
00:21:30,276 --> 00:21:30,996
Probably not.

354
00:21:31,106 --> 00:21:31,436
No.

355
00:21:31,436 --> 00:21:37,056
Oh, the thing about progress, though,
is that it's very rarely linear. It

356
00:21:37,506 --> 00:21:39,506
tends to have a very weird curve.

357
00:21:39,506 --> 00:21:43,846
So that's why all the predictions are so
funny, but hey, I had to ask you anyway.

358
00:21:43,846 --> 00:21:45,836
No, I think it's a great question.

359
00:21:46,998 --> 00:21:52,328
Okay, let's delve a little bit
into a portion of your book.

360
00:21:52,428 --> 00:21:56,598
It's basically describing the
two options that you have today.

361
00:21:56,908 --> 00:22:01,668
You can either go and pay some
money to OpenAI, maybe Google, or

362
00:22:01,678 --> 00:22:03,508
somebody else, or you can build,

363
00:22:03,508 --> 00:22:05,108
So you've got buy versus build.

364
00:22:05,798 --> 00:22:10,978
Could you talk to me a little bit
about how someone would decide

365
00:22:11,008 --> 00:22:15,353
about this as of February 13, 2024?

366
00:22:16,233 --> 00:22:19,373
What are the things to consider,
and what are the weights that

367
00:22:19,373 --> 00:22:21,643
you would put in, and biases?

368
00:22:21,693 --> 00:22:25,663
The basic consideration is
just your use case, right?

369
00:22:25,703 --> 00:22:30,343
If you just want to test something out,
you're a student and you don't have a

370
00:22:30,343 --> 00:22:34,923
lot of budget, and you want something
up and running so that you have LLM

371
00:22:34,923 --> 00:22:43,113
experience, I would say just shell
out for that: ChatGPT Plus, or buy Anthropic,

372
00:22:43,123 --> 00:22:48,221
or Google Bard has a fantastic API,
or I guess Gemini now. Just do it.

373
00:22:48,311 --> 00:22:49,701
It's not that big of a thing.

374
00:22:49,741 --> 00:22:54,581
If your product that you're trying
to ship is inconsequential and you

375
00:22:54,581 --> 00:22:57,891
don't need it to be right every
time, you just want to sprinkle the

376
00:22:57,901 --> 00:23:00,341
AI pixie dust on it, just buy it.

377
00:23:00,581 --> 00:23:04,631
If your use case goes deeper than that,
though, if you want to be able to build

378
00:23:04,631 --> 00:23:08,921
your own, if you need to make sure that
it says the right things all the time,

379
00:23:09,341 --> 00:23:13,411
if you need it to behave a little bit
more deterministically. There have been

380
00:23:13,431 --> 00:23:17,451
probably a thousand case studies in the
last year of people building products on

381
00:23:17,451 --> 00:23:24,491
top of ChatGPT and then OpenAI rolling
out an update that changes how

382
00:23:24,491 --> 00:23:29,031
ChatGPT behaves, and they don't have any
way to measure all of the different

383
00:23:29,181 --> 00:23:30,621
ways that it will change it, right?

384
00:23:30,621 --> 00:23:35,961
There are 175 billion parameters in
GPT-3 alone; they don't know it's going

385
00:23:35,961 --> 00:23:37,661
to break your program down the line.

386
00:23:37,931 --> 00:23:41,311
They're just going to update it for
what they consider to be better.

387
00:23:41,871 --> 00:23:45,791
And those programs break constantly.

388
00:23:46,391 --> 00:23:47,741
That doesn't mean you can't fix them.

389
00:23:47,741 --> 00:23:51,721
It's just a much bigger problem of
maintenance than I think a lot of

390
00:23:51,721 --> 00:23:53,811
people are expecting going into it.

391
00:23:54,361 --> 00:23:57,241
So if you want to have to
maintain it less, build your own.

392
00:23:57,241 --> 00:24:01,931
Yeah, I think the other aspect is
that you want that control, right?

393
00:24:01,961 --> 00:24:07,451
There are lots of examples of companies
who essentially built a small shell

394
00:24:07,471 --> 00:24:11,251
around ChatGPT that did something unique.

395
00:24:11,771 --> 00:24:15,871
And then, months down the
line, now ChatGPT just does

396
00:24:15,871 --> 00:24:17,261
that out of the gate, right?

397
00:24:17,361 --> 00:24:20,291
Their value proposition
just completely disappeared.

398
00:24:20,321 --> 00:24:22,501
And that's because they didn't
have control over the model.

399
00:24:22,521 --> 00:24:27,691
They didn't have control over what
it did. It's just interesting, right?

400
00:24:27,711 --> 00:24:30,111
Because I say these things and
things have changed over time.

401
00:24:30,111 --> 00:24:33,451
But when ChatGPT first came out, it
was free, it was a demo, and they were

402
00:24:33,451 --> 00:24:35,171
specifically doing it to collect data.

403
00:24:35,771 --> 00:24:38,571
And that's what they did: they used the
collected data to improve their models.

404
00:24:39,201 --> 00:24:41,691
And that's what they continued
to do for a while, right?

405
00:24:41,701 --> 00:24:43,121
Oh no, they're back.

406
00:24:43,591 --> 00:24:45,621
It's their terms of service, right?

407
00:24:45,921 --> 00:24:49,271
If you want them to save your
chat, so that you can return to

408
00:24:49,271 --> 00:24:52,841
it and ask more questions, they
get to train off of your data.

409
00:24:53,321 --> 00:24:57,441
So if you want to put anything
private or sensitive in there, like

410
00:24:57,451 --> 00:24:59,461
it's over, you've just leaked it.

411
00:24:59,461 --> 00:25:02,321
They're back and forth about what data
they're collecting, what data they're

412
00:25:02,321 --> 00:25:07,991
not collecting, and if you're with an
enterprise customer, like maybe you

413
00:25:07,991 --> 00:25:13,131
can make certain rules and things like
that, and oftentimes they won't, it's

414
00:25:13,131 --> 00:25:17,421
a minefield, for how people are using
it, and so it's just something important

415
00:25:17,431 --> 00:25:23,566
to take into consideration. If your
LLM is doing something magical,

416
00:25:23,576 --> 00:25:27,476
that's really core to your business,
that is really driving customers.

417
00:25:28,286 --> 00:25:29,396
You want to control that.

418
00:25:29,636 --> 00:25:35,226
You want to make sure that the model
is working exactly as intended.

419
00:25:35,316 --> 00:25:38,966
You're not getting random updates
that break your application.

420
00:25:39,416 --> 00:25:45,096
You're also controlling the data flow,
you're making sure that you're not

421
00:25:45,306 --> 00:25:48,606
accidentally training your competitor's
model, and other things like that.

422
00:25:48,606 --> 00:25:52,496
And there are just lots of aspects
where it's just important to

423
00:25:53,636 --> 00:25:55,026
make sure that you own it.

424
00:25:55,366 --> 00:25:59,321
And, no, that's not necessarily
everyone's concern, right?

425
00:25:59,411 --> 00:26:02,701
If you're a student or you're just doing
some side project or anything, there's

426
00:26:02,701 --> 00:26:07,031
lots of APIs out there that are very
cheap that can get you up and running,

427
00:26:07,091 --> 00:26:11,391
There are literally hundreds of
Hugging Face Spaces that are free APIs.

428
00:26:11,611 --> 00:26:14,061
They have LLMs running behind
them and you can just hit

429
00:26:14,061 --> 00:26:15,891
them whenever you want, right?

430
00:26:16,493 --> 00:26:18,843
Unless you're queuing behind
a thousand other people.

431
00:26:18,843 --> 00:26:19,843
Yeah, exactly.

432
00:26:19,843 --> 00:26:25,023
I liked the example you gave in the book.
I think people at Latitude, the Dungeons

433
00:26:25,193 --> 00:26:29,563
& Dragons people would agree with a lot of
what you're saying now, but can you tell

434
00:26:29,563 --> 00:26:32,013
the story of what happened with them?

435
00:26:32,753 --> 00:26:37,483
Latitude is a local company
that was here in Utah.

436
00:26:37,543 --> 00:26:40,693
It was put together by two guys from BYU.

437
00:26:41,093 --> 00:26:43,903
GPT-2 came out several years ago.

438
00:26:43,903 --> 00:26:45,933
They're like, "Oh, this is mind-boggling.

439
00:26:46,243 --> 00:26:48,283
Let's build a game off of it!"

440
00:26:48,813 --> 00:26:52,063
And what they came up with was
like a dungeon crawler, a

441
00:26:52,063 --> 00:26:56,373
text-based game. It was really neat,
because it would just generate an

442
00:26:56,413 --> 00:26:57,833
infinite amount of opportunities.

443
00:26:57,833 --> 00:27:00,443
And so it created this
'choose your own adventure'.

444
00:27:01,503 --> 00:27:05,373
It got relatively big in the space,
and lots of people enjoyed playing it.

445
00:27:05,463 --> 00:27:11,523
Things were going really well, and then
OpenAI's GPT-3 came out. They offered it to

446
00:27:11,523 --> 00:27:15,803
them: "Hey, we have this new model,
it's a lot better, why don't you try it?"

447
00:27:15,853 --> 00:27:19,353
They played around with it, and "Oh yeah,
this is, it's much more descriptive,

448
00:27:19,353 --> 00:27:23,878
it's much more interesting, it's really
great". There was a lot of excitement

449
00:27:23,878 --> 00:27:29,338
around it. However, it turned out that
the model itself had a propensity

450
00:27:29,338 --> 00:27:34,638
to generate smut, and it got really
concerning. People would write, like,

451
00:27:34,638 --> 00:27:39,428
"I'm an eight year old girl", and then
the model would complete it saying

452
00:27:39,438 --> 00:27:41,168
"....and I'm wearing a skimpy outfit",

453
00:27:41,168 --> 00:27:44,418
And oh, whoa, like the player didn't want
that, but like the model generated it.

454
00:27:45,038 --> 00:27:50,108
There was this big feud between OpenAI
and Latitude about creating filters.

455
00:27:50,598 --> 00:27:53,338
"Hey, we don't want
your players doing that.

456
00:27:53,348 --> 00:27:54,518
We don't like that".

457
00:27:54,528 --> 00:27:58,048
And Latitude said, "Okay, we'll create
some filters" and things like that.

458
00:27:58,048 --> 00:28:00,428
And it devolved really quickly.

459
00:28:00,448 --> 00:28:03,568
Latitude being very much a startup,
not necessarily knowing everything

460
00:28:03,568 --> 00:28:08,668
they were doing, they built a very
shaky filtering system, and then

461
00:28:09,078 --> 00:28:10,678
OpenAI said, "That's not good enough".

462
00:28:10,678 --> 00:28:13,908
So then they started banning players,
and so eventually we got to this

463
00:28:13,918 --> 00:28:18,668
territory where players, paying
customers, would be playing a game, the

464
00:28:18,668 --> 00:28:23,028
model would randomly generate something
that the filtering system didn't

465
00:28:23,028 --> 00:28:24,618
like, and then they would get banned.

466
00:28:24,618 --> 00:28:29,178
'Cause it's like the game just did it itself.

467
00:28:30,238 --> 00:28:33,668
It was a very complicated time,
and there was lots of back and

468
00:28:33,668 --> 00:28:38,488
forth between Latitude, who's
a small company, and OpenAI.

469
00:28:38,488 --> 00:28:45,408
There's lots of 'he said, they said'
going on, but ultimately, it's just this

470
00:28:46,108 --> 00:28:53,868
position where Latitude had this game
that was completely dependent on OpenAI's

471
00:28:53,898 --> 00:29:01,268
model to generate good output, and it
really caused a lot of drama between

472
00:29:01,268 --> 00:29:07,878
the players and Latitude, and OpenAI in
the background. That is a critical

473
00:29:07,968 --> 00:29:13,868
example of an LLM being very critical to their
business. If they owned it, then they

474
00:29:13,868 --> 00:29:17,708
could have controlled it, they could have
made sure that from the model aspect,

475
00:29:17,708 --> 00:29:21,538
they could have trained the model to make
sure it didn't do any of those things.

476
00:29:22,068 --> 00:29:26,118
And then they would never need to
play the little blame game, right?

477
00:29:26,128 --> 00:29:27,518
Nobody likes to play that game.

478
00:29:27,638 --> 00:29:31,218
That is, whose fault is it that
the model is generating bad stuff?

479
00:29:31,248 --> 00:29:33,618
Is it the player who's prompting it?

480
00:29:33,848 --> 00:29:38,868
Is it Latitude who has some systems
for tokenizing and preparing player

481
00:29:38,868 --> 00:29:40,718
output before it goes to OpenAI?

482
00:29:41,048 --> 00:29:43,898
Is it OpenAI because their
model is generating that?

483
00:29:43,928 --> 00:29:48,333
Is it Latitude for post-processing
the content from OpenAI before

484
00:29:48,333 --> 00:29:49,163
they serve it to the player?

485
00:29:49,223 --> 00:29:51,603
I don't even know if it
really matters who's to blame.

486
00:29:51,703 --> 00:29:53,403
It's just a sucky game to play.

487
00:29:53,453 --> 00:29:58,693
And that's like the ultimate example
of why you might want to consider

488
00:29:58,693 --> 00:30:04,403
build versus buy: if you buy from any
provider, we're picking on OpenAI here,

489
00:30:04,403 --> 00:30:08,983
because they're a big player, but you buy
from Anthropic, you buy from the guys down

490
00:30:08,983 --> 00:30:13,113
the street, the startup that just barely
came up and they're offering for half

491
00:30:13,113 --> 00:30:15,883
the price or whatever. Buy from anybody,

492
00:30:15,903 --> 00:30:18,363
and you will eventually have
to play that blame game.

493
00:30:18,363 --> 00:30:22,883
We had another example in there of some
lawyers who generated cases that didn't

494
00:30:22,883 --> 00:30:30,773
exist. They asked ChatGPT about cases
and it came up with a perfect response.

495
00:30:31,433 --> 00:30:32,683
A little too perfect.

496
00:30:32,693 --> 00:30:34,833
It hallucinated stuff that didn't exist.

497
00:30:34,833 --> 00:30:37,583
And is it ChatGPT's fault?

498
00:30:37,603 --> 00:30:41,083
Is it OpenAI's fault for
allowing their model to make

499
00:30:41,083 --> 00:30:43,063
stuff up and behave dishonestly?

500
00:30:43,583 --> 00:30:46,023
Or is it the lawyers'
fault for not checking it?

501
00:30:46,043 --> 00:30:46,763
Who cares?

502
00:30:46,763 --> 00:30:49,453
The problem is that it's not locked down.

503
00:30:49,453 --> 00:30:50,823
It's non-deterministic.

504
00:30:50,823 --> 00:30:56,653
Yeah, in a way, as I was reading
the chapter on that, it makes

505
00:30:56,653 --> 00:31:03,308
me think of using a machine
to maybe do some farm work.

506
00:31:03,318 --> 00:31:06,648
Let's say that you're plowing
a field and you're using a

507
00:31:06,658 --> 00:31:08,628
horse versus a machine, right?

508
00:31:08,628 --> 00:31:11,388
A machine might break,
but in a predictable way.

509
00:31:11,428 --> 00:31:14,118
And if you've got a mechanic
around, they'll come and fix it.

510
00:31:14,178 --> 00:31:18,578
A horse can get scared, or it has
a bad day, or it can be moody.

511
00:31:19,478 --> 00:31:21,538
And it can come up with something new.

512
00:31:21,638 --> 00:31:23,778
So you always have to
be careful with that.

513
00:31:23,908 --> 00:31:28,608
Is that an accurate feeling for someone
who's working with these LLMs day-to-day?

514
00:31:29,578 --> 00:31:31,608
You work with some kind of animal?

515
00:31:32,388 --> 00:31:36,268
One of the most annoying things is
even if you set the seed of it, so

516
00:31:36,268 --> 00:31:40,138
the random generator is going to
be the same every single time, you

517
00:31:40,138 --> 00:31:43,598
can still give it the same prompt
and get something different out.

518
00:31:43,648 --> 00:31:49,298
The truly awesome thing about LLMs is
the number of non-linear activations

519
00:31:49,348 --> 00:31:51,718
that are going through the model, right?

520
00:31:52,138 --> 00:31:57,578
It's creating incredible, non-linear
jumps throughout that dimensional

521
00:31:57,578 --> 00:31:59,328
space that the embeddings are in.

522
00:31:59,998 --> 00:32:01,438
You just can't really predict it.

523
00:32:01,438 --> 00:32:02,798
It is a little bit like an animal.

524
00:32:05,585 --> 00:32:08,895
The fact that, like, we can
prompt engineer at all,

525
00:32:09,480 --> 00:32:11,480
it's a little bit telling
of where we are, right?

526
00:32:11,480 --> 00:32:15,980
'Cause, like, with prompt engineering, you
can change the spaces, the white space

527
00:32:15,980 --> 00:32:19,870
inside of your prompt and it can end up
giving you a completely different result.

528
00:32:19,970 --> 00:32:24,290
We're still in a very interesting
area, where we're trying to create

529
00:32:24,300 --> 00:32:28,960
better ways to communicate with the
LLM and get predictable outputs.

530
00:32:28,970 --> 00:32:31,670
But the fact that we
can do that at all

531
00:32:32,070 --> 00:32:33,380
is a bit of a miracle, right?

532
00:32:33,480 --> 00:32:34,720
You can't do that with a human.

533
00:32:35,350 --> 00:32:39,000
A human isn't going to be tricked
into saying something different.

534
00:32:39,020 --> 00:32:41,120
Humans are tricked all the
time, but not necessarily in the

535
00:32:41,120 --> 00:32:42,710
same way that we do with LLMs.

536
00:32:42,710 --> 00:32:46,665
It's a very interesting world we are
in, and a lot of people are having

537
00:32:46,665 --> 00:32:49,145
that horse versus machine experience.

538
00:32:49,195 --> 00:32:51,775
Let's talk about the cost a little bit.

539
00:32:51,955 --> 00:32:57,345
You mentioned that it's super cheap to
pay some big company to use their thing.

540
00:32:57,845 --> 00:33:02,685
Let's focus for a minute on the cost
of actually building your own LLM.

541
00:33:02,745 --> 00:33:06,275
If I wanted to build one of
these foundational models,

542
00:33:06,365 --> 00:33:13,305
Let's say that I take one of those
75TB corpora from the internet and I'm

543
00:33:13,305 --> 00:33:16,975
feeling particularly GPU poor that day.

544
00:33:17,245 --> 00:33:22,965
How much money do I need to have in my
little piggy bank to get something useful?

545
00:33:24,075 --> 00:33:25,265
That's difficult, man.

546
00:33:26,765 --> 00:33:31,195
Because you're either
paying for a GPU, right?

547
00:33:31,215 --> 00:33:36,455
Or a suite of GPUs in order to
parallelize it so that you can ingest

548
00:33:36,455 --> 00:33:38,285
that over a short period of time.

549
00:33:38,785 --> 00:33:43,835
Or technically with a lot of this stuff,
you can load it onto a [GeForce] 3090.

550
00:33:43,835 --> 00:33:50,885
I've done this personally. You can train
in FP16; you can train up to about 13

551
00:33:50,895 --> 00:33:55,835
billion parameters pretty effectively,
and pretty cheaply, on a 3090.

552
00:33:56,555 --> 00:33:59,655
You have to be a little bit smart about
your data loading, you have to make

553
00:33:59,655 --> 00:34:03,295
sure you're streaming stuff. You have to
pay for the data storage anyway. It's

554
00:34:03,335 --> 00:34:07,385
incredibly slow. You have to do gradient
checkpointing; you have to do, like,

555
00:34:07,395 --> 00:34:13,365
gradient accumulation steps, which slow
down the training even more. I trained a

556
00:34:13,715 --> 00:34:19,195
little bit bigger than that; it was about
a 20 billion parameter model on my 3090,

557
00:34:19,665 --> 00:34:27,295
but what I don't generally talk about is
that it took a year of just running to do that.

558
00:34:27,395 --> 00:34:32,275
It was horrendous, and that all
culminated in a company giving me a

559
00:34:32,275 --> 00:34:36,785
cease and desist, so I couldn't even
release it. So you're either paying

560
00:34:37,125 --> 00:34:42,765
a lot of money, hundreds of thousands of
dollars in order to get something quick.

561
00:34:42,895 --> 00:34:47,465
Especially with 75TB of text or
more, grab your own data, get

562
00:34:47,475 --> 00:34:51,655
more data, and you're paying to
store and to process all of that.

563
00:34:51,905 --> 00:34:53,595
And that costs tons of money.

564
00:34:53,855 --> 00:35:00,750
Or you are not paying the money, but it
takes a really long time and makes all

565
00:35:00,750 --> 00:35:04,400
of your shareholders really frustrated
because you're ruining go to market.

566
00:35:04,410 --> 00:35:06,190
You're taking too long.

567
00:35:06,210 --> 00:35:10,390
You're not going to be the first
in the space. It's a huge trade-off.

568
00:35:11,208 --> 00:35:15,278
As with many things, you
can trade time or money, and

569
00:35:15,448 --> 00:35:17,108
training an LLM is very similar.

570
00:35:17,158 --> 00:35:23,338
I think they estimated it for the huge models
that we see, like ChatGPT.

571
00:35:23,408 --> 00:35:27,798
You're probably paying somewhere
around, what was it, like half a million?

572
00:35:27,808 --> 00:35:31,958
I think they say. And that's just
for the training; we're not even

573
00:35:31,958 --> 00:35:37,108
talking about all the experts
you have to pay and buy in order

574
00:35:37,250 --> 00:35:38,130
data curation,

575
00:35:38,200 --> 00:35:38,460
man.

576
00:35:39,388 --> 00:35:42,478
On the very far end, on the expensive side.

577
00:35:42,528 --> 00:35:45,998
It gets really expensive really quickly
to train these models, just because of

578
00:35:46,048 --> 00:35:51,538
buying enough GPUs in order to parallelize
this to do it within a reasonable time, and

579
00:35:51,548 --> 00:35:55,528
just the sheer volume of data you have to
run through to train all the parameters.

580
00:35:55,528 --> 00:36:00,958
It gets really expensive, but on
the other end there's lots of good

581
00:36:00,958 --> 00:36:04,928
open source models that have done
that main pre-training already.

582
00:36:04,988 --> 00:36:10,138
And so you can grab one of those,
you can train it with something like

583
00:36:10,138 --> 00:36:16,868
LoRA, where you only need a handful
of samples and maybe like 10 minutes,

584
00:36:16,898 --> 00:36:21,888
if that, and you can train it on a
very simple GPU and you have something

585
00:36:21,888 --> 00:36:26,443
fine-tuned for what you need, and
getting under $200 is very reasonable.

586
00:36:26,593 --> 00:36:27,978
$150, $20.

587
00:36:28,803 --> 00:36:33,143
It's very possible to train
these models with certain

588
00:36:33,143 --> 00:36:34,433
methods to get what you need.

589
00:36:35,753 --> 00:36:42,193
So does it mean that in a kind of natural,
almost biological-like evolution, we're

590
00:36:42,193 --> 00:36:48,173
going to end up with a few primary models
that a lot of the different models branch

591
00:36:48,223 --> 00:36:51,153
off of, instead of reinventing the wheel?

592
00:36:51,713 --> 00:36:53,263
That's where we're at currently.

593
00:36:53,648 --> 00:37:00,238
I hope that it doesn't stay that
way, because I really enjoy seeing

594
00:37:00,238 --> 00:37:03,898
new people create new models for
new use cases and all this stuff.

595
00:37:04,428 --> 00:37:10,313
So I hope it doesn't stay that way, but I
do see a lot of value in creating industry

596
00:37:10,313 --> 00:37:15,293
standards, at least around how you are
actually writing the binary files, how

597
00:37:15,293 --> 00:37:17,063
are the weights actually being stored?

598
00:37:17,073 --> 00:37:19,083
What do the different layers look like?

599
00:37:19,153 --> 00:37:22,943
I think that standardizing what the
model looks like so that you can load

600
00:37:22,943 --> 00:37:25,653
it as flexibly as possible is awesome.

601
00:37:27,183 --> 00:37:32,363
I would like to see more open source
models, which is funny considering

602
00:37:32,383 --> 00:37:37,483
there are thousands of open source
fine-tuned versions and hundreds

603
00:37:37,553 --> 00:37:42,563
of open source foundational models
on the Hugging Face Hub right now.

604
00:37:42,943 --> 00:37:44,033
I want more, right?

605
00:37:44,113 --> 00:37:44,983
I'm greedy, man.

606
00:37:46,785 --> 00:37:51,355
To me, it sounds like basically every
week there is another one that's better

607
00:37:51,355 --> 00:37:56,805
at something and if you look at the
Hugging Face LLM leaderboard, it's

608
00:37:56,805 --> 00:38:04,465
changing by the hour, literally, and it
looks like a gold rush in many ways, but

609
00:38:04,655 --> 00:38:08,355
I like this gold rush much better than
the crypto one a couple of years ago.

610
00:38:09,558 --> 00:38:12,198
Yeah, man, there's a lot higher
chance that you'll come out

611
00:38:12,198 --> 00:38:17,468
of this gold rush with a great
product than with the crypto one.

612
00:38:17,518 --> 00:38:22,108
Yeah, there's a lot there, and just
to summarize that into one sentence,

613
00:38:22,608 --> 00:38:29,908
you can probably fine-tune even a
gigantic model for around $200 to $500.

614
00:38:32,233 --> 00:38:33,693
And you can go even lower than that

615
00:38:33,713 --> 00:38:37,993
if you are smart about how you're
doing it, versus training from scratch,

616
00:38:38,013 --> 00:38:42,883
which either is going to take an
inordinate amount of time or will cost

617
00:38:43,533 --> 00:38:45,373
thousands and thousands of dollars.

618
00:38:46,530 --> 00:38:50,530
So I'm willing to bet money that a lot
of our listeners are going to pause

619
00:38:50,550 --> 00:38:52,590
this now and start Googling furiously:

620
00:38:52,590 --> 00:38:55,410
"How do I fine-tune a model?"

621
00:38:56,000 --> 00:38:58,790
Where would you point them
as a good starting point?

622
00:38:58,910 --> 00:39:04,275
Any particular paper, any particular
company, anything that's a

623
00:39:04,275 --> 00:39:06,435
good place to start with that?

624
00:39:07,303 --> 00:39:09,663
A bit selfishly, I would
say you should buy our book.

625
00:39:09,713 --> 00:39:16,863
We talk about probably the main ways
to train in chapter 5 of our book.

626
00:39:16,893 --> 00:39:18,113
I was going to say that, but I

627
00:39:18,113 --> 00:39:19,943
was going to say it last, right?

628
00:39:19,963 --> 00:39:21,223
Cause we do go over it.

629
00:39:21,523 --> 00:39:26,473
The book is primarily about production
environments, but you can't really

630
00:39:26,473 --> 00:39:29,518
put a model in production if you
don't know how to work with it.

631
00:39:29,518 --> 00:39:31,348
So we have stuff on fine-tuning.

632
00:39:31,348 --> 00:39:34,328
We have stuff on parameter-efficient
fine-tuning, on

633
00:39:34,328 --> 00:39:35,908
low-rank adaptation, the whole deal.

634
00:39:36,438 --> 00:39:41,298
YouTube is actually probably one of
your best resources right now, because

635
00:39:41,348 --> 00:39:48,108
it has amazing content creators that
show you how to do it in whatever

636
00:39:48,118 --> 00:39:49,658
format you're comfortable in.

637
00:39:49,658 --> 00:39:55,448
So if you're a C++ developer, there are
YouTube videos on how to fine-tune a model

638
00:39:55,498 --> 00:39:59,238
and create a LoRA using llama.cpp, right?

639
00:39:59,238 --> 00:40:01,118
It's not even all that difficult.

640
00:40:01,118 --> 00:40:05,258
You just have to convert a model into
GGUF format and boom, you're there.

641
00:40:05,258 --> 00:40:06,668
You can do it on a CPU.

642
00:40:06,718 --> 00:40:10,463
It'll take a long time, but you
can do it in whatever quantization

643
00:40:10,463 --> 00:40:11,363
you want and everything.

644
00:40:12,053 --> 00:40:17,363
YouTube will meet you where you're at. If
you want to learn something a little bit

645
00:40:17,363 --> 00:40:21,563
more industry-standard so that you could
potentially get employment in this area,

646
00:40:22,173 --> 00:40:27,798
PyTorch has amazing documentation,
fantastic tutorials and they're one of the

647
00:40:27,808 --> 00:40:32,098
best at really making it feel like you're
playing with, let's say "big boy Legos"

648
00:40:32,728 --> 00:40:39,928
You're, like, building the model using their
little Lego pieces. Pretty cool. If you need

649
00:40:39,928 --> 00:40:42,193
something a bit more high-level than that,

650
00:40:42,573 --> 00:40:48,613
Hugging Face, I think, is the industry
standard for working in between a whole

651
00:40:48,613 --> 00:40:52,583
bunch of different frameworks, whether
that's PyTorch or TensorFlow or whatever

652
00:40:52,583 --> 00:40:54,593
other framework you're working with, ONNX.

653
00:40:55,233 --> 00:40:59,193
Hugging Face has abstracted away a
lot of the difficulty of setting

654
00:40:59,213 --> 00:41:04,003
up models for fine-tuning, 'cause in
PyTorch you have to build out the

655
00:41:04,003 --> 00:41:07,623
exact model architecture just to load
the weights and then fine-tune it.

656
00:41:08,113 --> 00:41:10,973
Hugging Face already has
the class built for you.

657
00:41:13,188 --> 00:41:16,078
I would point to those. If you
need more explanation,

658
00:41:16,078 --> 00:41:17,798
Coursera is a fantastic place.

659
00:41:17,798 --> 00:41:21,688
DeepLearning.AI, on Coursera
and on their own site, like,

660
00:41:22,178 --> 00:41:24,418
that's Andrew Ng's education stuff.

661
00:41:24,428 --> 00:41:28,368
That's where I got my start with
machine learning: Andrew Ng's

662
00:41:28,388 --> 00:41:30,738
machine learning course on Coursera.

663
00:41:30,748 --> 00:41:31,778
It was awesome.

664
00:41:31,808 --> 00:41:32,608
Fantastic.

665
00:41:32,658 --> 00:41:36,578
Jeremy Howard is also amazing in
that area of creating content for

666
00:41:36,578 --> 00:41:40,388
people starting out and learning
from beginner to advanced level.

667
00:41:40,498 --> 00:41:42,198
He's at fast.ai.

668
00:41:42,198 --> 00:41:44,188
Yeah, I strongly recommend all of those.

669
00:41:46,200 --> 00:41:47,030
and your book.

670
00:41:48,358 --> 00:41:51,288
Yeah, we ingested a lot of those
in order to write the book,

671
00:41:52,438 --> 00:41:58,618
Our book is a very nice high-level
overview of the key things you want

672
00:41:58,618 --> 00:42:02,978
to be looking at and like different
methodologies from training from

673
00:42:02,978 --> 00:42:05,658
scratch to basic fine-tuning to

674
00:42:06,913 --> 00:42:11,323
model distillation to LoRA
and PEFT and things like that.

675
00:42:11,373 --> 00:42:15,253
We definitely give a high-level overview,
we give code samples and show you that.

676
00:42:15,303 --> 00:42:19,203
But ultimately, if you really
wanted to get into it, yeah, there

677
00:42:19,223 --> 00:42:21,243
are other resources out there.

678
00:42:21,263 --> 00:42:25,383
I know Manning has another
book coming out, specifically

679
00:42:25,393 --> 00:42:28,153
around all about training LLMs.

680
00:42:28,203 --> 00:42:30,563
There are definitely other
places you can go, but

681
00:42:30,913 --> 00:42:34,093
if you're looking for the quick,
summarized version of all of

682
00:42:34,093 --> 00:42:37,033
these things, our book is actually
a really good resource for it.

683
00:42:37,083 --> 00:42:41,313
One other thing that I like about
your book is the part where you

684
00:42:41,363 --> 00:42:46,833
build up the different breakthrough
moments throughout the world of

685
00:42:46,833 --> 00:42:52,043
mathematics that ultimately led
to 'Attention Is All You Need', and,

686
00:42:52,173 --> 00:42:54,853
what is it, seven years later now,

687
00:42:54,943 --> 00:42:56,623
the gold rush that we're observing.

688
00:42:56,653 --> 00:43:00,923
But just before we jump into that,
there is a little bit of vocabulary

689
00:43:00,983 --> 00:43:05,653
that one needs to have in order
to basically talk or even read

690
00:43:05,663 --> 00:43:07,833
a lot of these papers. Could you

691
00:43:08,753 --> 00:43:11,843
talk us through that vocabulary briefly?

692
00:43:11,873 --> 00:43:16,813
I'm talking about phonetics, syntax,
semantics, pragmatics, morphology, that

693
00:43:16,833 --> 00:43:22,743
until I read your book actually made me
think mostly of blood tests and semiotics.

694
00:43:23,503 --> 00:43:27,603
Could you give us like the MVP version
of what you need to know about these

695
00:43:27,623 --> 00:43:30,203
things to be able to read papers?

696
00:43:30,203 --> 00:43:31,473
Oh, absolutely.

697
00:43:31,573 --> 00:43:34,653
Matt has been learning a lot of this
too, he might be better at it than me.

698
00:43:34,653 --> 00:43:36,303
I will throw other jargon into it.

699
00:43:36,623 --> 00:43:40,503
Writing this book with Chris over the
last year has been mind-opening for me.

700
00:43:40,563 --> 00:43:43,753
Until you can understand these words,
like you were saying, it's really

701
00:43:43,753 --> 00:43:48,833
hard to dive into the deep end. But
we go over them in our book just because

702
00:43:48,923 --> 00:43:53,993
we do find it so valuable. It really
helped me understand very quickly,

703
00:43:54,053 --> 00:43:56,303
"Oh, this is what my LLMs are good at.

704
00:43:56,303 --> 00:43:59,523
This is what LLMs are not", and that was
one of the first things we started with

705
00:43:59,553 --> 00:44:04,998
But the first one, semantics, that is just
like the structure of words, how things

706
00:44:04,998 --> 00:44:07,348
go, whether or not it sounds correct.

707
00:44:07,838 --> 00:44:09,528
that is what LLMs are really good at.

708
00:44:09,538 --> 00:44:14,098
They're really good at making sure like
the semantics of words align really well.

709
00:44:14,148 --> 00:44:19,378
But after that, you've got pragmatics,
which is what LLMs have no idea about.

710
00:44:19,428 --> 00:44:22,618
That is all the information around it

711
00:44:23,083 --> 00:44:24,663
that isn't said, right?

712
00:44:24,663 --> 00:44:30,713
So when you say, 'I'm going to find the
eggs the Easter Bunny left', right?

713
00:44:30,733 --> 00:44:33,653
You have to understand what
Easter is, what the Easter

714
00:44:33,653 --> 00:44:35,833
Bunny is, why a bunny has eggs.

715
00:44:35,873 --> 00:44:38,563
there's a lot of context around
it that you have to understand,

716
00:44:38,983 --> 00:44:40,823
and that's all pragmatics.

717
00:44:40,903 --> 00:44:42,553
it's information that isn't said.

718
00:44:43,673 --> 00:44:45,273
And that's what LLMs generally lack.

719
00:44:45,353 --> 00:44:47,223
Actually, I'm gonna, I'm
gonna jump in here real quick.

720
00:44:47,293 --> 00:44:51,043
Miko, did you like the Veľká noc
example that I gave in there?

721
00:44:51,480 --> 00:44:53,020
Yeah, I thought it was

722
00:44:53,043 --> 00:44:53,443
Yeah.

723
00:44:53,563 --> 00:44:54,393
Was that pretty good?

724
00:44:55,663 --> 00:44:59,653
I just wanted to ask because I
remember experiencing that in Slovakia.

725
00:44:59,663 --> 00:45:05,293
Like I lived there for years and that
was a hugely beneficial portion to me

726
00:45:05,373 --> 00:45:11,126
to help figure out that 'no, tons of
people have tons of ways of looking at

727
00:45:11,126 --> 00:45:12,903
things', and LLMs don't know about it.

728
00:45:13,288 --> 00:45:16,628
you would have to explain every bit
of it to them in order to get them

729
00:45:16,628 --> 00:45:18,038
to understand the same things as you.

730
00:45:18,368 --> 00:45:19,268
Anyway, sorry, Matt.

731
00:45:20,185 --> 00:45:24,855
I find like those two words in general,
semantics and pragmatics, understanding

732
00:45:24,855 --> 00:45:29,715
those is going to get you significantly
farther in just understanding

733
00:45:29,715 --> 00:45:31,475
how LLMs work, what they're doing.

734
00:45:31,985 --> 00:45:34,955
there's obviously a lot of
other words that we talk about,

735
00:45:35,005 --> 00:45:36,305
like morphology and stuff.

736
00:45:36,305 --> 00:45:39,965
And I'll hand it off to Chris to talk
about what he wants to add to there.

737
00:45:40,558 --> 00:45:41,578
I would agree with Matt.

738
00:45:41,858 --> 00:45:45,518
Just understanding semantics and
pragmatics would get you probably 60%

739
00:45:45,518 --> 00:45:49,718
of the way there, and you could read
new papers that come out and immediately

740
00:45:49,718 --> 00:45:52,288
see like where are they amazing?

741
00:45:52,368 --> 00:45:53,538
Where are they failing?

742
00:45:53,608 --> 00:45:58,778
I end up using the relationship
between those two, just the literal

743
00:45:58,818 --> 00:46:00,628
encoded meaning of your words.

744
00:46:00,678 --> 00:46:04,988
if I say, "I'm married to my
ex-wife", there's immediately,

745
00:46:05,018 --> 00:46:06,588
boom, semantic problem there.

746
00:46:06,898 --> 00:46:08,608
How can I be married to my ex-wife?

747
00:46:09,258 --> 00:46:11,058
The words don't agree with each other.

748
00:46:12,268 --> 00:46:16,048
Versus, exactly as Matt was saying, if
we talk about Easter, if we talk about

749
00:46:16,048 --> 00:46:20,288
traditions, if we talk about rituals
that people have, just like the stuff

750
00:46:20,288 --> 00:46:24,548
that you say, if you ask someone in
Slovakia, they're going to respond to you.

751
00:46:24,958 --> 00:46:25,818
That's normal.

752
00:46:25,988 --> 00:46:27,508
it's a question, they respond.

753
00:46:27,628 --> 00:46:34,298
LLMs don't have that, and you have to have
them ingest tons and tons of data in order

754
00:46:34,298 --> 00:46:37,538
to even get as far as giving a response.

755
00:46:38,158 --> 00:46:42,138
the other ones that we can
think about, syntax, I would say

756
00:46:42,138 --> 00:46:44,238
that syntax is largely solved

757
00:46:44,438 --> 00:46:49,868
at this point. Syntax is your structure
around the words, like what order do

758
00:46:49,868 --> 00:46:51,838
the words go in for them to be correct?

759
00:46:52,058 --> 00:46:56,818
Is it 'I go to the store' or is it 'I
to the store go'? All of that stuff.

760
00:46:57,238 --> 00:46:58,178
That's syntax.

761
00:46:58,178 --> 00:47:02,138
It's the structure that holds your
sentences, your utterances together.

762
00:47:02,138 --> 00:47:07,218
Morphology is delving into something that
I consider to be very important in LLMs.

763
00:47:08,043 --> 00:47:11,053
I'm not going to say the most important,
cause I think that's still semantics.

764
00:47:11,113 --> 00:47:12,283
There's a lot of work there.

765
00:47:14,053 --> 00:47:17,603
but morphology would
be how words are built.

766
00:47:18,223 --> 00:47:21,703
What are the fundamental units
of meaning, the morphemes? Do those

767
00:47:21,703 --> 00:47:23,633
even exist? That sort of stuff.

768
00:47:23,703 --> 00:47:25,963
and we don't have to delve
really deep into that.

769
00:47:25,973 --> 00:47:30,313
That's largely solved by
tokenization, but we can see

770
00:47:30,363 --> 00:47:33,523
with newer models that come
out that it really matters.

771
00:47:33,533 --> 00:47:39,583
You have much smaller models that have
more novel tokenization, more novel

772
00:47:39,613 --> 00:47:43,503
morphology that end up outperforming
larger models on tasks that they

773
00:47:43,503 --> 00:47:45,103
didn't even train on all that much.

774
00:47:45,793 --> 00:47:47,843
if we can put it all
together really quick.

775
00:47:48,153 --> 00:47:50,093
The model solves syntax.

776
00:47:50,793 --> 00:47:55,533
Embeddings try to solve semantics,
but semantics is difficult,

777
00:47:55,643 --> 00:47:57,023
and so they're not perfect.

778
00:47:57,393 --> 00:48:01,743
Pragmatics is stuff like RAG, your
Retrieval Augmented Generation, and

779
00:48:02,163 --> 00:48:07,343
having repeated sequences within your
training data, it gives it landmarks, it's

780
00:48:07,343 --> 00:48:09,933
context around the syntax and semantics.

781
00:48:10,203 --> 00:48:16,068
Morphology is your tokenization, which,
if I would give that an example, your

782
00:48:16,068 --> 00:48:22,258
tokenization provides your model with
stuff that it sees. It changes text

783
00:48:22,268 --> 00:48:24,458
into what the model actually sees.

784
00:48:25,648 --> 00:48:28,658
And your embedding strategy
is moot if you don't have it.

785
00:48:28,658 --> 00:48:31,968
Just your morphology gives your model
glasses, if you want to call it that.

786
00:48:32,448 --> 00:48:34,998
And then phonetics is the one
that we haven't even talked about.

787
00:48:35,338 --> 00:48:41,188
Phonetics is the reason why we are doing a
podcast and we're talking instead of just

788
00:48:41,188 --> 00:48:42,988
texting each other or emailing each other.

789
00:48:42,988 --> 00:48:46,388
Can you imagine trying to ingest
a podcast that's just emails?

790
00:48:46,998 --> 00:48:47,978
It's horrendous.

791
00:48:48,643 --> 00:48:54,563
And it's because there's so much richness
and depth in meaning in the language that

792
00:48:54,573 --> 00:48:59,143
is just lost when you strip it of its
phonetic, I'm going to call it a medium.

793
00:48:59,673 --> 00:49:04,133
And that can lead people to think that
it has to do with sound, that's the

794
00:49:04,133 --> 00:49:08,513
most common modality for people, but
sign language has phonetics, they have

795
00:49:08,513 --> 00:49:11,323
particular places where they make signs.

796
00:49:11,343 --> 00:49:15,053
They have particular ways that they do
them to inflect and express more emotion.

797
00:49:15,413 --> 00:49:19,563
Their phonetics exists even
outside of the verbal modality.

798
00:49:20,233 --> 00:49:25,033
that's important because that's where I
see the most improvements coming to LLMs

799
00:49:25,073 --> 00:49:27,623
in the future: being able to process

800
00:49:28,498 --> 00:49:32,708
phonetic information without
having to convert it into text

801
00:49:32,808 --> 00:49:36,288
or process phonetic information
and compare it against the text.

802
00:49:36,288 --> 00:49:39,068
that can be incredibly helpful
for your model's understanding.

803
00:49:39,068 --> 00:49:43,078
those are the five features
of language that we break

804
00:49:43,078 --> 00:49:44,578
things down into in the book.

805
00:49:44,598 --> 00:49:46,388
And they're largely agreed upon.

806
00:49:46,388 --> 00:49:49,738
There are some other linguistic features
that are incredibly important, stuff like

807
00:49:49,738 --> 00:49:52,488
dialogue, that we haven't even covered.

808
00:49:52,988 --> 00:49:53,748
beyond that.

809
00:49:53,813 --> 00:49:55,883
Yeah, we can talk about semiotics too.

810
00:49:55,923 --> 00:50:01,933
That's Charles Sanders Peirce, a smart
dude from the 1800s who just created a lot

811
00:50:01,933 --> 00:50:07,253
of structure and organization. We dive
into that very lightly in the book.

812
00:50:07,303 --> 00:50:11,593
I don't think that you need a grounding
in semiotics in order to improve

813
00:50:11,593 --> 00:50:13,323
your ability to interact with LLMs.

814
00:50:13,878 --> 00:50:18,178
But it is helpful for organizing
all of these other concepts.

815
00:50:18,218 --> 00:50:22,668
how do we create a mental map for
how stuff needs to be processed

816
00:50:22,668 --> 00:50:24,308
within a machine learning pipeline?

817
00:50:24,668 --> 00:50:28,648
How do we make sure that we're not mixing
things up and inadvertently destroying

818
00:50:28,688 --> 00:50:30,738
our model's ability to see things, right?

819
00:50:30,758 --> 00:50:36,248
If we put embeddings before
tokenization, it breaks your process.

820
00:50:36,468 --> 00:50:40,458
it's helpful for organizing things and
it's also helpful for understanding

821
00:50:40,468 --> 00:50:45,028
how conversation happens and how I
say something and it moves through

822
00:50:45,028 --> 00:50:47,078
your mind to create an interpretation.

823
00:50:47,078 --> 00:50:50,788
That's by far the most
theoretical, out-there concept that

824
00:50:50,788 --> 00:50:52,098
we get into in the whole book.

825
00:50:53,720 --> 00:50:59,440
And together you came up with this
language definition as being, as a

826
00:50:59,450 --> 00:51:05,760
concept, "an abstraction of feelings and
thoughts that occur to us in our heads".

827
00:51:05,840 --> 00:51:08,750
And I'll be honest, I
initially thought it sucked.

828
00:51:09,165 --> 00:51:12,525
because it's a little bit,
it's a little bit wishy washy.

829
00:51:13,145 --> 00:51:14,835
I wanted something a bit more concrete.

830
00:51:14,855 --> 00:51:18,455
But then, as I looked up all the other
definitions in different contexts, I

831
00:51:18,455 --> 00:51:23,795
was like, okay, I clearly cannot come
up with anything better than that.

832
00:51:23,795 --> 00:51:27,555
So I think I'm ready to yield
now and say that this is actually

833
00:51:27,555 --> 00:51:29,275
capturing it pretty well.

834
00:51:29,985 --> 00:51:35,060
Putting 'abstraction' in it also
sounds vaguely techie, so that helps.

835
00:51:35,610 --> 00:51:37,590
How did you come up with that definition?

836
00:51:38,868 --> 00:51:39,388
I didn't.

837
00:51:39,728 --> 00:51:41,708
I would love to take credit for that.

838
00:51:41,718 --> 00:51:45,718
No, that definition has been around
for a long time within the linguistic

839
00:51:45,728 --> 00:51:51,298
community, and one of the best examples
of why it really works is babies, right?

840
00:51:51,678 --> 00:51:56,708
Babies have no idea how to express their
thoughts, but somehow they get it across.

841
00:51:56,988 --> 00:52:03,128
When a baby is happy, we can tell. When
a baby is crying, we can infer that

842
00:52:03,128 --> 00:52:07,508
it needs something. Babies are able to
communicate without language, meaning

843
00:52:07,508 --> 00:52:13,008
that language is something that we
created to shorten the conversation.

844
00:52:13,348 --> 00:52:17,118
The reason I called it an abstraction
is we have abstract ideas.

845
00:52:17,128 --> 00:52:22,478
You probably come up to a situation where
you're feeling something, and you don't

846
00:52:22,488 --> 00:52:24,718
know the words to really express it.

847
00:52:24,968 --> 00:52:29,908
I think that's a pretty universal
human adult thing that has happened

848
00:52:29,908 --> 00:52:30,868
at least once in your life.

849
00:52:30,868 --> 00:52:34,768
That's happened to me a bunch of
times, and it really illustrates that

850
00:52:34,768 --> 00:52:40,118
"Oh man, the language that we use
is actually describing 'what's in

851
00:52:40,118 --> 00:52:42,168
here'; it isn't 'what's in here'."

852
00:52:42,738 --> 00:52:43,938
it's a hard concept.

853
00:52:44,368 --> 00:52:48,748
Once you get there though, it really helps
with LLMs, because you realize that the

854
00:52:48,748 --> 00:52:50,588
language that we're using is a crutch.

855
00:52:51,128 --> 00:52:54,458
And that's all that the LLMs
have in the first place.

856
00:52:54,958 --> 00:52:57,828
And so this is another thing
that goes towards the miraculous

857
00:52:57,828 --> 00:52:59,338
nature of them working at all:

858
00:52:59,818 --> 00:53:04,698
they're dealing with an abstraction
of an abstraction, at least,

859
00:53:04,858 --> 00:53:06,938
in order to communicate with us.

860
00:53:06,988 --> 00:53:08,918
So let's say that I buy that.

861
00:53:09,018 --> 00:53:13,948
My first question would be, going back
to your baby example, isn't what the

862
00:53:13,948 --> 00:53:17,148
baby's doing some form of a language?

863
00:53:17,408 --> 00:53:18,298
what's the line

864
00:53:18,345 --> 00:53:19,005
I'd like it to

865
00:53:19,068 --> 00:53:19,958
what is and what

866
00:53:20,058 --> 00:53:20,398
isn't?

867
00:53:20,398 --> 00:53:23,258
what's the line between, a
language and communication?

868
00:53:23,858 --> 00:53:24,608
I like that.

869
00:53:24,618 --> 00:53:27,678
That's a question that a lot
of people, I bet, have, and it'll

870
00:53:27,678 --> 00:53:28,448
probably go in the appendix.

871
00:53:28,898 --> 00:53:32,478
We'll probably talk about this in an
appendix for curious readers. So the line

872
00:53:32,478 --> 00:53:36,948
between just straight-up communication
and a language is the ability to talk...

873
00:53:37,008 --> 00:53:40,558
There are a lot, but one of my
favorite ones is the ability to talk about

874
00:53:40,558 --> 00:53:42,728
something that is not physically present.

875
00:53:42,968 --> 00:53:44,578
Bees have communication.

876
00:53:44,798 --> 00:53:46,588
Gibbons have communication.

877
00:53:46,658 --> 00:53:48,058
Babies have communication.

878
00:53:48,498 --> 00:53:53,448
Babies, though, are unable to express
any ideas about stuff that is not

879
00:53:53,698 --> 00:53:58,588
physically present. You can't talk
to a baby about theoretical physics.

880
00:53:58,718 --> 00:54:00,888
I mean you can, but what
are you gonna get back?

881
00:54:01,798 --> 00:54:06,988
You can talk to a baby about
my Star Wars posters, right?

882
00:54:07,008 --> 00:54:11,118
I can point at them because they're
right there, but if I'm in a different

883
00:54:11,118 --> 00:54:15,198
room, baby's not gonna be able to
talk to me about them. And that's

884
00:54:15,198 --> 00:54:16,778
the difference. It's one of them.

885
00:54:16,808 --> 00:54:19,928
That's the one that I'd like to
highlight, though: the fact that

886
00:54:20,558 --> 00:54:26,118
we can speak about things that are not
physically right here with us, that we

887
00:54:26,118 --> 00:54:30,158
can point at, that's the distinction
between communication and language,

888
00:54:30,178 --> 00:54:31,598
because babies are communicating.

889
00:54:33,758 --> 00:54:37,528
But once they get to that point,
it really deepens the interaction

890
00:54:37,528 --> 00:54:39,128
that you're able to have with them.

891
00:54:39,178 --> 00:54:44,628
So now, equipped, with all that
knowledge, I'm gonna try to prompt

892
00:54:44,648 --> 00:54:47,438
engineer you and give you this prompt.

893
00:54:47,528 --> 00:54:53,028
I'm a five-year-old baby that has
language now, and who's very curious

894
00:54:53,068 --> 00:55:00,733
about understanding how we got from bag of
words, counting frequencies all the way to

895
00:55:00,833 --> 00:55:07,653
LLMs and ChatGPT and people worrying about
the Terminator actually coming into life.

896
00:55:08,603 --> 00:55:12,723
Could you walk me through the
high-level ideas that were important

897
00:55:13,603 --> 00:55:15,513
in the build-up to what we're seeing today?

898
00:55:16,956 --> 00:55:22,686
The bag of words is really easy to
think about, especially if you keep

899
00:55:22,686 --> 00:55:24,886
your tokenization incredibly easy.

900
00:55:25,226 --> 00:55:27,996
Sorry, this is, I'm already
out of five-year-old territory.

901
00:55:28,936 --> 00:55:31,156
You just count words.

902
00:55:32,626 --> 00:55:35,556
If I take that sentence,
"you", "just", "count", "words".

903
00:55:35,596 --> 00:55:37,166
Each of those has a count of one.

904
00:55:37,976 --> 00:55:41,496
If I add another sentence,
"I like Star Wars".

905
00:55:41,746 --> 00:55:44,416
All of those still have
a count of just one.

906
00:55:44,856 --> 00:55:48,036
And then if I add another,
"do you like Star Wars?"

907
00:55:48,386 --> 00:55:51,346
'You' and 'like' and 'Star' and 'Wars' all go up to two.

908
00:55:52,976 --> 00:55:53,456
That's it.

909
00:55:53,456 --> 00:55:55,156
That's a bag of words model.
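
The counting described above can be sketched in a few lines of Python. This is only an illustration, assuming the simplest possible tokenization (lowercase, strip the question mark, split on spaces); the helper name `bag_of_words` is made up for this example.

```python
from collections import Counter


def bag_of_words(sentences):
    """Count how often each word appears across all sentences."""
    counts = Counter()
    for sentence in sentences:
        # Deliberately naive tokenization: lowercase, drop "?", split on spaces.
        counts.update(sentence.lower().replace("?", "").split())
    return counts


counts = bag_of_words([
    "You just count words",
    "I like Star Wars",
    "Do you like Star Wars?",
])

# 'you', 'like', 'star' and 'wars' end up at 2; every other word stays at 1.
print(counts.most_common())
```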

910
00:55:56,513 --> 00:55:57,763
why is it important?

911
00:55:57,813 --> 00:55:58,373
what can it

912
00:55:58,383 --> 00:55:58,693
do?

913
00:56:00,506 --> 00:56:06,366
I think that bag of words is the
first model that we really have

914
00:56:06,556 --> 00:56:08,706
to explain being data-driven.

915
00:56:09,321 --> 00:56:10,961
It's just keeping track of things.

916
00:56:11,051 --> 00:56:16,551
if you look at a bag of words model
for your workouts, it's just how

917
00:56:16,561 --> 00:56:18,511
often do you do certain things?

918
00:56:18,551 --> 00:56:22,941
how often are you doing a bicep workout
versus doing a pectoral workout?

919
00:56:22,961 --> 00:56:25,421
How often are you doing which thing?

920
00:56:25,521 --> 00:56:27,091
it's just being data driven.

921
00:56:27,181 --> 00:56:29,371
It's the first step, right?

922
00:56:29,751 --> 00:56:31,461
You're not looking at any features.

923
00:56:31,551 --> 00:56:35,071
You're not really caring about how these
things interact with each other.

924
00:56:35,151 --> 00:56:36,301
You're just keeping track

925
00:56:37,738 --> 00:56:41,508
So I guess with that information from
your example, I can guess whether

926
00:56:41,508 --> 00:56:46,298
you are skipping leg days, and I
can see what's important to you.

927
00:56:47,008 --> 00:56:51,238
Or, if I'm counting words in

928
00:56:51,238 --> 00:56:51,408
U.S.

929
00:56:51,408 --> 00:56:57,318
presidents' speeches, I can say, like you
described in your book, whether it's a

930
00:56:57,318 --> 00:57:02,228
wartime or a peacetime president, and
what they really try to get across.

931
00:57:02,781 --> 00:57:08,641
This is something that you can use
for anything. You count in soccer

932
00:57:09,261 --> 00:57:13,501
which players make goals and how often:
that is a bag of words model.

933
00:57:13,501 --> 00:57:14,681
You're not tracking words.

934
00:57:14,801 --> 00:57:18,021
It's a bag of goals or it's
a bag of, whatever else.

935
00:57:20,028 --> 00:57:21,708
So what's the next step from there?

936
00:57:21,708 --> 00:57:26,308
Bag of words was really monumental just
because it's so simple, but it's so

937
00:57:26,308 --> 00:57:30,568
powerful, because, you know, the words you use when
you're describing sports are very different

938
00:57:30,568 --> 00:57:35,508
from the words you use describing politics.
And so just picking up on certain words

939
00:57:35,508 --> 00:57:40,728
and their counts helps us understand
the overall subject of what it is.

940
00:57:40,758 --> 00:57:45,748
But it really lacked any sort
of structure, because the order

941
00:57:45,748 --> 00:57:47,828
of words also matters, right?

942
00:57:47,828 --> 00:57:54,298
So 'the cat in the hat' versus 'the cat's
hat': they both have the word 'cat', they

943
00:57:54,298 --> 00:57:58,373
both have 'hat', but mean different things
because of the order of the words, and

944
00:57:58,373 --> 00:58:01,613
so that kind of led to, n-gram models.

945
00:58:01,663 --> 00:58:06,383
Instead of just simple words, we
would also take n-grams, which are

946
00:58:06,383 --> 00:58:11,533
n words in a certain order,
and we would start cataloging those.

947
00:58:11,583 --> 00:58:14,973
And so, more than just
words, we're getting n-grams.

948
00:58:15,423 --> 00:58:20,473
And that is improving our understanding
of the language because now we

949
00:58:20,633 --> 00:58:22,573
have embedded some syntax in it.

950
00:58:22,573 --> 00:58:29,173
We understand some ordering of words and
that's able to improve our categorization.
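
To make the point about word order concrete, here is a small sketch that extracts n-grams and compares two sentences. The example sentences are my own, not from the conversation, and the `ngrams` helper is a hypothetical name: both sentences have the same bag of words, but their bigrams differ.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in the text, in order."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


first, second = "dog bites man", "man bites dog"

# Same words, so a plain bag of words cannot tell the sentences apart.
same_bag = Counter(ngrams(first, 1)) == Counter(ngrams(second, 1))

# Bigrams keep some of the ordering, so the two sentences now look different.
same_bigrams = Counter(ngrams(first, 2)) == Counter(ngrams(second, 2))

print(same_bag, same_bigrams)
print(ngrams(first, 2))
```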

951
00:58:30,343 --> 00:58:35,983
However, from there, we're not
really able to make any predictions

952
00:58:35,983 --> 00:58:39,423
of what next word's about to come up
or anything like that. When it comes

953
00:58:39,423 --> 00:58:42,653
to bag of words or n-grams, they're
really more for categorization.

954
00:58:43,253 --> 00:58:45,873
And so that kind of led
to Bayesian techniques

955
00:58:47,453 --> 00:58:50,183
and so not to really go deeply

956
00:58:50,183 --> 00:58:52,743
into Bayesian statistics, but

957
00:58:52,816 --> 00:58:53,046
Yeah.

958
00:58:53,046 --> 00:58:53,836
I'm sorry.

959
00:58:53,846 --> 00:58:55,766
Sorry to all Bayesian fanboys.

960
00:58:55,766 --> 00:58:59,506
We're going to go about as deep
into this as we did to pragmatics.

961
00:59:00,263 --> 00:59:05,243
it's just you know, based off of the
priors of the words that came before we

962
00:59:05,243 --> 00:59:11,343
can then predict the next word to come
up and so if every single time after

963
00:59:11,373 --> 00:59:16,963
in text we saw 'I am a man' then it's
going to predict that the next word is

964
00:59:16,963 --> 00:59:22,163
man instead of other words that easily
could have come up like woman or girl

965
00:59:22,163 --> 00:59:25,123
or boy or cook or professional athlete.

966
00:59:25,173 --> 00:59:28,513
Certain things that could come up
are gonna be a lot rarer, like 'I am an

967
00:59:28,523 --> 00:59:33,473
astronaut'. A lot fewer people have
been astronauts in order to say that, so

968
00:59:33,883 --> 00:59:37,693
it's gonna have a very low probability
of being the next word predicted. But

969
00:59:37,723 --> 00:59:41,563
it gives us this opportunity to look
at what is the next word predicted.
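
As a rough illustration of predicting the next word from what came before, the sketch below estimates the probability of each follower by plain relative-frequency counting. That is a simplification rather than a full Bayesian treatment, and the tiny corpus and the `next_word_probs` helper are invented for the example.

```python
from collections import Counter


def next_word_probs(corpus, context):
    """Estimate P(next word | context) from how often each word follows it."""
    context_words = tuple(context.lower().split())
    n = len(context_words)
    followers = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words) - n):
            if tuple(words[i:i + n]) == context_words:
                followers[words[i + n]] += 1
    total = sum(followers.values())
    if total == 0:
        return {}  # the context never appeared, so there is nothing to predict
    return {word: count / total for word, count in followers.items()}


# Made-up corpus: 'man' follows 'I am a' three times, 'cook' once.
corpus = ["I am a man", "I am a man", "I am a man", "I am a cook"]
probs = next_word_probs(corpus, "I am a")
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

Words that never follow the context in the corpus simply get no probability here; real systems smooth those counts so rare continuations are unlikely rather than impossible.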

970
00:59:42,123 --> 00:59:47,148
From there, we move on to what's
called Markov chains. We're swinging

971
00:59:47,158 --> 00:59:52,588
back towards the n-gram model, but it
gives us a bit of prediction of what comes next.

972
00:59:53,028 --> 01:00:00,278
I actually really love Markov chains
because they provide very fast

973
01:00:00,338 --> 01:00:07,028
predictive text. Markov chains are
essentially what's been fueling

974
01:00:07,028 --> 01:00:12,138
the predictive text for Google
search and things like that. It has been the

975
01:00:12,138 --> 01:00:15,248
technology that's really been leading
that charge for a really long time.

976
01:00:15,298 --> 01:00:20,138
and it's just a very basic way
that we're using n-grams now to

977
01:00:20,168 --> 01:00:22,498
make predictions of the future.

978
01:00:22,678 --> 01:00:23,768
You can think about it there,

979
01:00:24,053 --> 01:00:26,543
that is obviously I'm

980
01:00:27,223 --> 01:00:28,073
reducing it.

981
01:00:28,093 --> 01:00:31,743
that's not exactly how it works,
but it's a bag of n-grams where you

982
01:00:31,743 --> 01:00:36,713
take a state, at each point in a
sequence, and look at all the times

983
01:00:36,723 --> 01:00:40,633
that previous things have occurred in that
sequence, and then from that you can

984
01:00:40,643 --> 01:00:42,863
model probability about what comes next.

985
01:00:42,913 --> 01:00:48,863
Instead of just looking at each
n-gram by itself, you give it state.

986
01:00:49,663 --> 01:00:51,213
and it's a bag of n-grams.

987
01:00:51,233 --> 01:00:52,053
It's really fun.

988
01:00:52,213 --> 01:00:54,753
It's a probabilistic bag of n-grams.

989
01:00:55,743 --> 01:00:56,893
That's how the chains work.
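
A first-order Markov chain over words can be sketched as follows, assuming the state is just the current word. The function names, the toy sentences and the fixed random seed are all choices made for this illustration. Storing every observed follower in a list means that sampling uniformly from the list reproduces the observed frequencies.

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each word (the state) to the list of words seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, start, max_words=10, seed=0):
    """Walk the chain from `start`, sampling each next word by its frequency."""
    rng = random.Random(seed)
    word, output = start, [start]
    # Stop when the current word was never followed by anything in training.
    while word in chain and len(output) < max_words:
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)


chain = build_chain(["I like Star Wars", "I like cats"])
print(generate(chain, "i"))
```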

990
01:00:57,738 --> 01:01:01,758
One of my favorite parts, and I like
that you kept track of this quote

991
01:01:01,768 --> 01:01:05,608
here, that Markov models represent
the first comprehensive attempt to

992
01:01:05,608 --> 01:01:09,788
actually model language, which is
funny, because Markov was not trying

993
01:01:09,788 --> 01:01:13,908
to model language initially, he
was just trying to win an argument.

994
01:01:14,578 --> 01:01:18,978
And he eventually used it to,
he looked at distributions in

995
01:01:19,018 --> 01:01:20,498
particular Russian authors.

996
01:01:20,528 --> 01:01:25,598
He looked at distributions in,
Russian government official speeches.

997
01:01:25,638 --> 01:01:29,988
he knew what he had and he believed
in it, and I love that, what a

998
01:01:29,988 --> 01:01:32,418
great piece of history anyway.

999
01:01:32,418 --> 01:01:33,728
Continuous bag of words

1000
01:01:34,368 --> 01:01:39,688
is where we start essentially taking
the logic of a Markov chain where,

1001
01:01:39,738 --> 01:01:45,828
"oh, if we keep track of where things
appear and how often they appear there,

1002
01:01:45,888 --> 01:01:52,348
then it helps us be able to model
for what could appear next", right?

1003
01:01:52,438 --> 01:01:56,903
And this is the first moment where
we're really coming full circle all

1004
01:01:56,903 --> 01:02:01,133
together and going right back to
bag of words and just adding context

1005
01:02:01,143 --> 01:02:04,103
for position and adding context.

1006
01:02:05,203 --> 01:02:10,563
from the context of the bag of words,
the literal counting of things, we're

1007
01:02:10,563 --> 01:02:12,023
able to create embeddings, right?

1008
01:02:12,073 --> 01:02:15,523
I don't know if a lot of people
are aware, but bag of words

1009
01:02:15,533 --> 01:02:17,753
is how Word2vec came to be.

1010
01:02:18,318 --> 01:02:25,798
Word2vec was huge in, I think, 2015,
2016, and it stayed huge, Gensim

1011
01:02:25,808 --> 01:02:29,828
is still one of the most downloaded
natural language processing libraries

1012
01:02:29,828 --> 01:02:33,138
in Python for Word2vec and for GloVe.

1013
01:02:33,678 --> 01:02:36,808
Continuous bag of words, just
adding that one little thing,

1014
01:02:37,143 --> 01:02:40,773
adds all this context so
that we can create embeddings.

1015
01:02:41,013 --> 01:02:44,833
We can create vectors that
we can compare between words.

1016
01:02:44,923 --> 01:02:50,083
This all comes from the logic
of... I forgot that dude's name.

1017
01:02:50,813 --> 01:02:55,353
Tell me the company that a word keeps,
and I'll tell you what that word means.

1018
01:02:55,353 --> 01:02:57,293
Just what's around the word

1019
01:02:57,603 --> 01:03:01,798
influences its meaning, which
goes directly against a lot

1020
01:03:01,798 --> 01:03:05,468
of previous linguists' thought
that syntax and semantics are

1021
01:03:05,498 --> 01:03:07,208
absolutely not related at all.

1022
01:03:07,218 --> 01:03:12,168
That's one of the big things from Chomsky,
the 'colorless green ideas sleep furiously'

1023
01:03:12,308 --> 01:03:14,458
nonsense. There's some semblance to it.

1024
01:03:14,488 --> 01:03:18,608
There's some sense to it and taking
advantage of that with continuous

1025
01:03:18,608 --> 01:03:20,748
bag of words, we can create,

1026
01:03:21,113 --> 01:03:23,663
like I said, these vectors
that we can then compare, and

1027
01:03:23,663 --> 01:03:25,423
that's really interesting.

1028
01:03:25,433 --> 01:03:30,453
that is what fuels LLMs now, is
this exact same continuous bag

1029
01:03:30,453 --> 01:03:31,653
of words modeling technique.

1030
01:03:32,003 --> 01:03:36,383
It's been built upon a little bit, but
that bag of words is still fundamental

1031
01:03:36,383 --> 01:03:38,003
to how embeddings are created.

1032
01:03:38,193 --> 01:03:45,298
Bag of words and positionality and,
like we can get into, the RoPE scaling,

1033
01:03:45,308 --> 01:03:51,078
all of these rotational plugins that
you can use to get longer sequences

1034
01:03:51,358 --> 01:03:54,148
embedded correctly, or at least better.

1035
01:03:54,568 --> 01:03:57,288
that's one of the hard things when
we're talking about language modeling

1036
01:03:57,288 --> 01:03:58,828
is what is good and what is better.

1037
01:03:59,478 --> 01:04:02,388
a lot of people like to appeal
to, this is how humans do it.

1038
01:04:02,548 --> 01:04:05,428
I don't know if humans are incredibly
efficient when we do it, but.

1039
01:04:06,258 --> 01:04:07,978
Like it's fine.

1040
01:04:08,088 --> 01:04:11,258
then we get into the 1960s, the very first

1041
01:04:11,748 --> 01:04:12,598
perceptrons,

1042
01:04:12,825 --> 01:04:16,565
Before we go there, can we
spend a little longer on what

1043
01:04:16,565 --> 01:04:18,335
the embeddings actually are?

1044
01:04:18,375 --> 01:04:23,065
You mentioned Word2vec, you mentioned
the words 'vectors' and 'embeddings', but for

1045
01:04:23,065 --> 01:04:27,905
somebody listening to us from the start,
it's probably not clear what that is.

1046
01:04:27,925 --> 01:04:29,125
Can we delve in a little bit?

1047
01:04:30,158 --> 01:04:31,218
Yeah, absolutely.

1048
01:04:31,218 --> 01:04:36,088
So embeddings are the vectors
that come out of models like

1049
01:04:36,098 --> 01:04:37,498
continuous bag of words.

1050
01:04:37,938 --> 01:04:42,618
when you look at a modern machine learning
pipeline, there are multiple models that

1051
01:04:42,618 --> 01:04:46,628
you go through, and we just abstract all

1051
01:04:42,618 --> 01:04:46,628
of it and call it 'the model', just one model.
When you look at GPT-3, ChatGPT, it has
a model that they call a byte pair

1053
01:04:54,748 --> 01:04:56,938
encoding model to do its tokenization.

1054
01:04:56,968 --> 01:04:59,598
And then it has a model to do embeddings.

1055
01:05:00,228 --> 01:05:03,998
that model is fundamentally
a continuous bag of words.

1056
01:05:03,998 --> 01:05:07,448
It's built on top of it a little bit
with, like I said, keeping track of

1057
01:05:07,753 --> 01:05:11,573
not just how many times a word
occurs, but how many times a word

1058
01:05:11,573 --> 01:05:13,653
occurs in particular positions.

1059
01:05:13,653 --> 01:05:19,053
and then on top of that, it
keeps track of the, flip.

1060
01:05:19,053 --> 01:05:24,833
It's either an odd or an even position
within a sentence and it assigns

1061
01:05:24,833 --> 01:05:29,243
it cosine or sine based on whether
it's an odd or an even position.

1062
01:05:29,243 --> 01:05:35,043
in order to try to insert some of that
meaning back into it, that was taken out

1063
01:05:35,173 --> 01:05:41,243
from the tokenization, cause tokenization
is just assign each token a number in

1064
01:05:41,243 --> 01:05:45,863
a dictionary, and you have a way to
get all words into that dictionary, and

1065
01:05:45,863 --> 01:05:47,373
then come back out of that dictionary.

1066
01:05:47,383 --> 01:05:49,223
So it takes all of the meaning out of it.

1067
01:05:49,233 --> 01:05:50,353
It's just one number.
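
[Editor's sketch] The "each token is just a number in a dictionary" idea can be illustrated with a toy word-level tokenizer. This is only a sketch under stated assumptions: the helper names (`build_vocab`, `encode`, `decode`) and the tiny corpus are invented for illustration, and it splits on whitespace rather than using byte pair encoding.

```python
# Toy word-level tokenizer: every token becomes an integer ID in a dictionary.
# Assumption: whitespace splitting; real GPT models use byte pair encoding
# over subword units instead.

def build_vocab(corpus):
    """Assign each distinct token the next free integer ID."""
    vocab = {}
    for sentence in corpus:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    """Words in, numbers out."""
    return [vocab[token] for token in text.lower().split()]


def decode(ids, vocab):
    """Numbers in, words back out."""
    id_to_token = {i: t for t, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)


corpus = ["the cat in the hat", "the king and the queen"]
vocab = build_vocab(corpus)
ids = encode("the queen in the hat", vocab)
print(ids)
print(decode(ids, vocab))
```

The IDs carry no meaning by themselves: nothing about the numbers says that two words are related, which is the gap embeddings try to fill.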

1068
01:05:51,333 --> 01:05:56,153
The embeddings attempt to put some of the
meaning back into it using positionality,

1069
01:05:56,163 --> 01:05:58,743
using continuous language modeling

1070
01:05:58,743 --> 01:05:59,373
techniques.

1071
01:06:00,143 --> 01:06:05,813
embeddings really simply, they're not
perfect, they're just an approximation

1072
01:06:05,863 --> 01:06:11,588
of that meaning, and because we are
able to put it into a vectorized

1073
01:06:11,598 --> 01:06:14,508
space, we're able to take these
words, put them in a vectorized space.

1074
01:06:14,508 --> 01:06:20,088
We can start doing things that start
to make sense and start to make us feel

1075
01:06:20,088 --> 01:06:21,918
like we're headed in the right direction.

1076
01:06:21,968 --> 01:06:25,698
the classic example is, when we
first discovered embeddings, we

1077
01:06:25,698 --> 01:06:29,748
took the embedding of 'king',
we subtracted 'man' from it.

1078
01:06:30,328 --> 01:06:35,218
We then added the embedding of
'woman', and the closest

1079
01:06:35,528 --> 01:06:41,088
embedding to that was 'queen'. With that, we start
that, we start to get this vectorized

1080
01:06:41,098 --> 01:06:42,598
space that starts to make sense.

1081
01:06:42,598 --> 01:06:46,408
We start to, these words start to have
connection to each other and they start

1082
01:06:46,408 --> 01:06:49,328
to make semantic sense to us as humans.
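
[Editor's sketch] The 'king' minus 'man' plus 'woman' arithmetic can be tried on hand-made vectors. Everything here is an assumption for illustration: the three-dimensional vectors are invented, not trained embeddings, and the helper names are hypothetical. Real embeddings have hundreds of dimensions and the analogy only holds approximately.

```python
import math

# Hand-made 3-dimensional "embeddings", purely illustrative (not trained).
# The dimensions loosely stand for (royalty, maleness, femaleness).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# king - man + woman, computed element by element.
target = [k - m + w for k, m, w in zip(
    embeddings["king"], embeddings["man"], embeddings["woman"])]

# As is common practice, the three query words themselves are excluded.
candidates = {w: v for w, v in embeddings.items()
              if w not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)
```

With these toy vectors the nearest remaining word is 'queen'. Other combinations of words can land somewhere that makes no sense, which is the approximation caveat raised in the conversation.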

1083
01:06:50,558 --> 01:06:52,698
however, embeddings are still
an approximation, right?

1084
01:06:52,698 --> 01:06:56,578
So if you were to do that with kind of
every combination, it's interesting,

1085
01:06:56,588 --> 01:07:02,078
what do you get when you start, taking
words, That don't necessarily make any

1086
01:07:02,078 --> 01:07:04,548
sense, like adding or
subtracting them together.

1087
01:07:04,548 --> 01:07:05,068
what do you get

1088
01:07:05,171 --> 01:07:07,991
a good quintessential example of
that is you take the vector for

1089
01:07:07,991 --> 01:07:11,451
'king', you subtract the vector
for 'wolf', and you add the

1090
01:07:11,451 --> 01:07:14,291
vector for 'prince', and you
get the vector for 'village'.

1091
01:07:14,871 --> 01:07:16,111
Or at least pretty close to it.

1092
01:07:16,601 --> 01:07:17,791
That doesn't make any sense,

1093
01:07:17,951 --> 01:07:22,831
there's still lots of, okay, these
are starting to add meaning, not

1094
01:07:22,831 --> 01:07:27,631
always, but sometimes, like it's an
approximation and embeddings ultimately.

1095
01:07:28,206 --> 01:07:30,806
it's something we're constantly
trying to learn and improve

1096
01:07:30,983 --> 01:07:36,508
If your listeners are wondering how to
keep up in this space, like embeddings are

1097
01:07:36,508 --> 01:07:41,588
probably the number one thing to keep
track of. OpenAI recently released logic

1098
01:07:41,598 --> 01:07:46,128
for being able to change the size of
embeddings, to me, like being pretty

1099
01:07:46,128 --> 01:07:47,868
deep into this, it feels groundbreaking.

1100
01:07:48,438 --> 01:07:52,688
Because normally you have to structure
these vectors so that they're all the same

1101
01:07:52,688 --> 01:07:59,318
size and each point within that vector
represents meaning negative or positive

1102
01:07:59,348 --> 01:08:05,148
and it's very structured and not malleable
and so the idea that you could take

1103
01:08:05,278 --> 01:08:10,343
all of your embedding space and change the
size of it at your whim is just amazing.

1104
01:08:10,843 --> 01:08:14,723
that's one of the things that I see as a
huge groundbreaking piece of technology

1105
01:08:14,773 --> 01:08:17,113
that OpenAI is continuing to lead in.

1106
01:08:17,163 --> 01:08:20,563
yeah, and if you're ever in doubt
for oh man, is this paper important?

1107
01:08:20,903 --> 01:08:25,053
If it's about embeddings and doing really
cool things with embeddings, probably.

1108
01:08:27,033 --> 01:08:31,253
I think the one question for anybody
to like picture that, so what's

1109
01:08:31,253 --> 01:08:33,953
the dimension of all these vectors?

1110
01:08:33,983 --> 01:08:36,173
Is that the entire vocabulary?

1111
01:08:36,943 --> 01:08:39,253
Are there different techniques?

1112
01:08:39,253 --> 01:08:45,003
yeah, currently the, number
one, dimensionality that is an

1113
01:08:45,023 --> 01:08:47,613
unspoken industry standard is 768.

1114
01:08:47,623 --> 01:08:50,873
that's a number that pretty much
every NLP practitioner knows.

1115
01:08:51,223 --> 01:08:55,543
like the reason OpenAI's embeddings
initially were like really cool

1116
01:08:55,663 --> 01:08:59,193
and they thought they were super
dense is they were, what, 536, or

1117
01:08:59,193 --> 01:09:04,653
1536, which is 768 doubled, right?

1118
01:09:04,663 --> 01:09:08,463
You're gonna see multiples of
768 all over the place here.

1119
01:09:09,323 --> 01:09:13,573
And that's not because that number
is super significant, that's just

1120
01:09:13,593 --> 01:09:17,693
the first embedding space that we
found that tended to work better

1121
01:09:17,713 --> 01:09:18,553
than the others.

1122
01:09:19,375 --> 01:09:19,975
So that's the

1123
01:09:19,975 --> 01:09:23,365
more art than science part of this

1124
01:09:24,070 --> 01:09:24,290
for

1125
01:09:24,403 --> 01:09:26,093
It's the brute force testing.

1126
01:09:26,173 --> 01:09:33,613
Yeah, before going through and
testing, 767, 766, 765 and landed on

1127
01:09:33,613 --> 01:09:37,763
that one and it worked, that's the
best one that we've found so far.

1128
01:09:38,243 --> 01:09:42,993
Even the doubled embeddings from
OpenAI offer a marginal improvement

1129
01:09:43,003 --> 01:09:44,473
in that understanding space.

1130
01:09:45,230 --> 01:09:49,950
I think we can move on to
the multilayer perceptrons.

1131
01:09:51,088 --> 01:09:51,428
Okay.

1132
01:09:51,958 --> 01:09:56,038
a perceptron is essentially just
a linear transformation of data.

1133
01:09:56,078 --> 01:10:02,098
If you look at it from a statistical
standpoint, if you have three things

1134
01:10:02,178 --> 01:10:07,723
about something, You can just add
those things together and you get

1135
01:10:08,003 --> 01:10:10,333
a description of that thing, right?

1136
01:10:10,333 --> 01:10:17,163
Just summing them and, that's like
abstracting it a little bit much,

1137
01:10:17,453 --> 01:10:20,303
especially if machine learning
practitioners are listening to that.

1138
01:10:20,303 --> 01:10:23,843
Like we can do linear
transformations.

1139
01:10:24,413 --> 01:10:28,563
that's like the easiest way to think
about it for me is you perform one

1140
01:10:28,903 --> 01:10:33,053
action on a group of features
and you get something out of it.

1141
01:10:33,473 --> 01:10:35,263
That's not, by itself,

1142
01:10:36,413 --> 01:10:37,413
really helpful.

1143
01:10:37,463 --> 01:10:43,073
once you get into having multiple
layers of the, this is the MLP, the

1144
01:10:43,073 --> 01:10:47,213
multi layer perceptron, once you get
into multiple layers where you are

1145
01:10:47,543 --> 01:10:51,753
adding these transformations together,
and in between those layers you have

1146
01:10:51,763 --> 01:10:56,773
non linear activation functions so
that you can, create, you can create

1147
01:10:56,803 --> 01:11:02,463
nonlinear relationships between
sets of linear transformations.

1148
01:11:02,823 --> 01:11:04,873
You can get into really cool spaces.
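
[Editor's sketch] Why the non-linear activation between layers matters can be shown with a two-layer network on XOR. This is a minimal sketch under stated assumptions: the weights are hand-picked rather than trained, ReLU is used as the activation, and the function names are invented for illustration.

```python
def linear(inputs, weights, bias):
    """A layer of perceptron-style units: weighted sum plus bias per unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Non-linear activation: negative values are clipped to zero."""
    return [max(0.0, v) for v in values]


# Hand-picked (not trained) weights for a 2-input, 2-hidden, 1-output network.
W1 = [[1.0, 1.0], [1.0, 1.0]]
B1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]
B2 = [0.0]


def mlp(x):
    """Two linear layers with a ReLU in between."""
    hidden = relu(linear(x, W1, B1))
    return linear(hidden, W2, B2)[0]


def stacked_linear_only(x):
    """Same weights, no activation: collapses to one linear function."""
    return linear(linear(x, W1, B1), W2, B2)[0]


points = [[0, 0], [0, 1], [1, 0], [1, 1]]
for x in points:
    print(x, mlp(x), stacked_linear_only(x))
```

With the ReLU in place the network reproduces XOR on the four inputs. Without it, the two layers reduce to the single linear function 2 - x1 - x2, which cannot represent XOR no matter how many layers are stacked.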

1149
01:11:04,953 --> 01:11:10,223
And one of the first things that any
machine learning practitioner learns,

1150
01:11:10,793 --> 01:11:16,293
at least in a lot of the cases that
I've talked to is that just adding

1151
01:11:16,293 --> 01:11:18,113
more layers does not make it better.

1152
01:11:18,113 --> 01:11:22,283
In fact, the cool part is finding
the minimum number of layers that

1153
01:11:22,283 --> 01:11:26,573
you need in order to model the
relationship between two points.

1154
01:11:26,673 --> 01:11:30,183
that's a little bit abstract, I think
the quintessential example is like

1155
01:11:30,363 --> 01:11:33,073
detecting which type of iris flower

1156
01:11:33,783 --> 01:11:37,513
it is from an image. We don't
necessarily know how many features

1157
01:11:37,513 --> 01:11:43,443
there are, but we can vectorize the
entire picture of an iris flower.

1158
01:11:43,483 --> 01:11:48,043
And then we can discover that the,
I think minimum number of layers is

1159
01:11:48,043 --> 01:11:53,493
like five in order to go through and
actually get really good accuracy on

1160
01:11:53,493 --> 01:11:56,223
detecting which iris flower it is.

1161
01:11:57,513 --> 01:12:02,018
yeah, multi layered perceptrons
are the feed-forward networks.

1162
01:12:02,038 --> 01:12:05,808
Those are the basis of everything that
comes after it whether it's recurrent

1163
01:12:05,818 --> 01:12:12,458
or even Transformers have feed forward
networks inside them and that's the basis

1164
01:12:12,458 --> 01:12:13,128
of it right there.

1165
01:12:13,785 --> 01:12:18,955
How do you choose the sizes and
is it all just trial and error

1166
01:12:18,975 --> 01:12:23,335
as well for the number of layers,
the sizes of the hidden layers?

1167
01:12:23,995 --> 01:12:24,325
Are there

1168
01:12:24,498 --> 01:12:24,898
Not any

1169
01:12:24,905 --> 01:12:25,835
rules that always

1170
01:12:25,835 --> 01:12:26,315
work?

1171
01:12:28,588 --> 01:12:34,008
Yeah, so going through a feed forward
network and this comes from trial and

1172
01:12:34,008 --> 01:12:38,128
error, it comes from a lot of people
trying different stuff, but generally

1173
01:12:38,128 --> 01:12:43,928
you have your initial dimensionality
could be something like 768, right?

1174
01:12:43,928 --> 01:12:45,598
Your initial hidden layer.

1175
01:12:45,608 --> 01:12:47,288
that's a good number for it.

1176
01:12:47,318 --> 01:12:51,018
That's an embedding dimension that we're
familiar with, but then we want the

1177
01:12:51,018 --> 01:12:52,728
next hidden layer to be double that.

1178
01:12:52,808 --> 01:12:56,538
And then we want to go smaller
and smaller until we hit our

1179
01:12:56,538 --> 01:12:58,508
final output classification layer.

1180
01:12:58,508 --> 01:13:01,658
So we want to have a
big jump and then small.

1181
01:13:02,028 --> 01:13:08,138
The way to think about that theoretically is
you want to model the number of features

1182
01:13:08,168 --> 01:13:13,248
that you are looking for, and then you
want to model double that, which is just

1183
01:13:13,248 --> 01:13:17,208
a good way of saying all the features
that we might not know about that we

1184
01:13:17,208 --> 01:13:18,728
might not even be keeping track of.

1185
01:13:18,728 --> 01:13:21,288
Let's see if the model can
figure them out mathematically.

1186
01:13:21,648 --> 01:13:24,003
And then we want to narrow it down.

1187
01:13:24,333 --> 01:13:25,183
Narrow it down.

1188
01:13:25,183 --> 01:13:28,933
Narrow it down until we get to our
actual classification, which in language

1189
01:13:28,953 --> 01:13:31,433
modeling is what is the next word, right?
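
[Editor's sketch] The "double it, then narrow it down" heuristic can be written as a small helper. This is an assumption-laden illustration, not a rule: the function name, the halving step, and the example sizes are all invented here, and real architectures are tuned experimentally.

```python
def hidden_layer_sizes(input_dim, output_dim, expansion=2):
    """Widen once by `expansion`, then halve until reaching the output size.

    Assumption: halving at each step is just one plausible way to narrow.
    """
    sizes = [input_dim, input_dim * expansion]
    width = sizes[-1] // 2
    while width > output_dim:
        sizes.append(width)
        width //= 2
    sizes.append(output_dim)
    return sizes


# Example: a 768-dimensional input narrowed to a 3-class output.
sizes = hidden_layer_sizes(768, 3)
print(sizes)
```

The first hidden layer is twice the input width, and every layer after it is narrower than the one before, ending at the number of classes.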

1190
01:13:31,483 --> 01:13:31,993
Got it.

1191
01:13:33,103 --> 01:13:37,543
So double it and then boil it down to
the size that you're actually looking

1192
01:13:37,543 --> 01:13:40,723
for across a bunch of layers and hope for

1193
01:13:40,723 --> 01:13:41,283
the best.

1194
01:13:42,143 --> 01:13:42,633
Okay.

1195
01:13:42,746 --> 01:13:46,556
and that's why when OpenAI doubled
the embedding layers, it was a

1196
01:13:46,556 --> 01:13:50,046
marginal improvement, but it's
predictable because that's normal.

1197
01:13:50,636 --> 01:13:51,326
People do that.

1198
01:13:52,383 --> 01:13:58,353
Are there any particular, well known
kind of configurations of these neural

1199
01:13:58,353 --> 01:14:02,183
networks that just work for a bunch
of problems that, something that

1200
01:14:02,203 --> 01:14:06,323
you keep seeing over and over, or
is it more custom for every problem

1201
01:14:06,763 --> 01:14:09,493
you just follow the heuristics
that you just described?

1202
01:14:09,543 --> 01:14:14,843
as far as model architecture, no,
it's basically the heuristics that I

1203
01:14:14,843 --> 01:14:21,133
described, and then people will experiment
and tune them and find that, oh man,

1204
01:14:21,143 --> 01:14:26,023
statistically, If this layer of the
model is bigger, then it works better,

1205
01:14:26,413 --> 01:14:28,043
but it follows that general structure.

1206
01:14:28,063 --> 01:14:32,493
I think, one of the papers that I
would point to for this is

1207
01:14:33,133 --> 01:14:39,433
ULMFiT, where it's basically
a methodology for fine tuning.

1208
01:14:40,003 --> 01:14:45,473
But it experiments with gradual
unfreezing of layers where when you're

1209
01:14:45,473 --> 01:14:50,253
training, you will start with only
the very last classification layer and

1210
01:14:50,283 --> 01:14:51,823
everything else is exactly the same.

1211
01:14:51,853 --> 01:14:53,483
And you only train that one.

1212
01:14:53,513 --> 01:14:58,933
And then you unfreeze, unfreeze, and
test each layer as you're training.

1213
01:14:58,953 --> 01:15:03,273
And that tends to help things like
even now that is abstracted within

1214
01:15:03,613 --> 01:15:05,063
the Hugging Face Trainer class.

1215
01:15:05,093 --> 01:15:07,413
And that's abstracted
within pretty much every

1216
01:15:07,653 --> 01:15:11,193
model.fit methodology because it works.
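
[Editor's sketch] Gradual unfreezing, as described above, can be expressed as a schedule of which layers are trainable at each stage. This is a framework-free illustration with invented layer names. In a real framework such as PyTorch, one would typically toggle each parameter's `requires_grad` flag according to a schedule like this.

```python
def gradual_unfreezing_schedule(layer_names):
    """Yield, stage by stage, the layers that are trainable.

    The first stage trains only the last layer. Each later stage unfreezes
    one more layer, working backwards from the output.
    """
    for stage in range(1, len(layer_names) + 1):
        yield layer_names[-stage:]


# Hypothetical layer names, ordered from input to output.
layers = ["embedding", "hidden_1", "hidden_2", "classifier"]
schedule = list(gradual_unfreezing_schedule(layers))
for stage, trainable in enumerate(schedule):
    print(stage, trainable)
```

Everything outside the listed layers stays frozen at that stage, so the earliest layers are only touched last.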

1217
01:15:13,420 --> 01:15:13,820
Awesome.

1218
01:15:14,810 --> 01:15:16,500
What's next in our journey?

1219
01:15:16,520 --> 01:15:19,640
probably just the fact
that multilayer perceptrons

1220
01:15:19,650 --> 01:15:22,450
struggle with sequences, right?

1221
01:15:22,660 --> 01:15:26,690
even if you try to embed things
and try and keep some of that

1222
01:15:26,690 --> 01:15:30,105
positional encoding within your
embeddings, they struggle to model.

1223
01:15:30,525 --> 01:15:33,625
Multiple things where the
order of them matters, right?

1224
01:15:34,175 --> 01:15:38,225
which, in language, the
order matters sometimes, right?

1225
01:15:38,445 --> 01:15:43,295
Sometimes it's normal to say gibberish
and knowing when is, which is extremely

1226
01:15:43,295 --> 01:15:48,450
difficult and to solve that, I don't
know if we need to necessarily go

1227
01:15:48,450 --> 01:15:51,780
into recurrent neural networks, but
we definitely need to talk about

1228
01:15:51,820 --> 01:15:56,090
LSTMs, long short-term memory networks,
which are recurrent neural networks

1229
01:15:56,090 --> 01:16:01,270
to, start with, but they added some
really important things, which, for

1230
01:16:01,270 --> 01:16:03,710
example, when I'm talking, you are

1231
01:16:04,090 --> 01:16:08,550
kind of consciously predicting what I
might be saying, you can hear what I'm

1232
01:16:08,550 --> 01:16:12,410
saying and you're trying to figure it
out as it goes on to understand it.

1233
01:16:12,410 --> 01:16:13,560
we call that active listening.

1234
01:16:13,560 --> 01:16:14,370
that's what happens.

1235
01:16:14,710 --> 01:16:19,050
Long short-term memory networks model that a
little bit in that they take the sequences

1236
01:16:19,580 --> 01:16:24,250
and they allow the model to try to
predict both going forwards and backwards.

1237
01:16:25,020 --> 01:16:26,930
instead of just doing the one way.

1238
01:16:26,940 --> 01:16:30,060
So that bidirectionality it's
computationally expensive.

1239
01:16:30,060 --> 01:16:33,770
It takes a lot longer, which is why
I think these are not used as much

1240
01:16:33,770 --> 01:16:38,220
anymore, but it's really novel and it
did help a lot in predicting sequences.

1241
01:16:38,220 --> 01:16:40,410
it was phenomenal for language modeling.

1242
01:16:40,410 --> 01:16:44,030
beyond that, they like
solving the attention.

1243
01:16:44,300 --> 01:16:49,080
Within LSTMs, like when attention came
out, adding attention to whatever you

1244
01:16:49,080 --> 01:16:56,750
were doing was phenomenal where it added
an extra layer of non linearity when it

1245
01:16:56,750 --> 01:17:00,930
was going through and trying to search
for what word might come next, it not

1246
01:17:00,930 --> 01:17:04,620
only had all the modeling that we've
already talked about, it also had the

1247
01:17:04,620 --> 01:17:10,070
ability to search now and search for not
that exact thing, but something similar.

1248
01:17:11,225 --> 01:17:16,145
And, that just exploded in popularity
because it works, it was phenomenal.

1249
01:17:16,155 --> 01:17:22,045
However, the difficulty with long
short-term memory networks is they're computationally

1250
01:17:22,045 --> 01:17:27,275
expensive, they're slow, it's a lot of
math that you have to do in order to

1251
01:17:27,275 --> 01:17:33,600
get through every single layer of it,
let alone trying to predict and stream

1252
01:17:33,600 --> 01:17:38,520
those predictions in a sequence, you're
going at one token per 30 seconds.

1253
01:17:38,580 --> 01:17:42,230
And that's difficult for having
models that are the same size

1254
01:17:42,310 --> 01:17:43,900
as transformers, for example.

1255
01:17:45,140 --> 01:17:49,960
so yeah, it was a lot of really
cool stuff that helped us solve

1256
01:17:49,990 --> 01:17:53,400
basically how to get to the next step.

1257
01:17:53,460 --> 01:17:56,380
It was just computationally
expensive and slow.

1258
01:17:56,430 --> 01:18:00,360
basically, not very practical
in use, but important.

1259
01:18:01,070 --> 01:18:05,150
talking about practicality, I think
it's great that it's accurate, right?

1260
01:18:05,550 --> 01:18:07,680
I think accuracy is incredibly practical.

1261
01:18:07,990 --> 01:18:12,820
I don't think that from a customer
experience that's practical, right?

1262
01:18:12,900 --> 01:18:16,110
Customers don't like waiting a long
time for the right answer because

1263
01:18:16,110 --> 01:18:18,720
they might be able to find the right
answer in that amount of time anyway.

1264
01:18:18,720 --> 01:18:21,250
and then from there, do
we jump to the attention?

1265
01:18:22,020 --> 01:18:27,140
at this point, we've gone through
the history of, the field modeling

1266
01:18:27,140 --> 01:18:31,840
language, building up and we
finally reached attention, right?

1267
01:18:32,480 --> 01:18:36,000
And attention is, the backbone
of transformers, which is

1268
01:18:36,000 --> 01:18:37,820
what LLMs are built off of.

1269
01:18:37,860 --> 01:18:40,570
And, attention just adds a non linearity.

1270
01:18:41,040 --> 01:18:45,360
And it was just a breakthrough and
how we're able to connect the words,

1271
01:18:45,390 --> 01:18:49,750
so attention really quickly is
just, creating these dictionaries,

1272
01:18:49,750 --> 01:18:55,330
key values of, every word to every
other word in the token space.

1273
01:18:55,480 --> 01:18:57,470
and then it's able to query it.

1274
01:18:57,470 --> 01:19:00,320
for each other word, we're able to build the

1275
01:19:00,320 --> 01:19:03,400
importance of the other words
that are important to it.

1276
01:19:03,440 --> 01:19:09,180
And it's in a quadratic space, so it's
much more than a linear space, but

1277
01:19:09,190 --> 01:19:14,780
it's a reasonable amount of time, to
compute these kind of dictionaries,

1278
01:19:14,780 --> 01:19:18,530
the key values, and then query them
and understand the importance of other

1279
01:19:18,530 --> 01:19:23,450
words It's the backbone of what all
these, different models are doing.

1280
01:19:23,470 --> 01:19:27,520
and even as Chris mentioned, like
we could inject attention into

1281
01:19:27,580 --> 01:19:33,660
these previous, RNNs, LSTMs, et
cetera, but, it was the backbone

1282
01:19:33,660 --> 01:19:35,700
of building the transformer model,

1283
01:19:35,750 --> 01:19:39,400
which, came out, in the catchy
paper, "attention is all you need".

1284
01:19:40,050 --> 01:19:41,510
where essentially all they use,

1285
01:19:41,513 --> 01:19:42,723
a meme, right?

1286
01:19:43,103 --> 01:19:45,413
That we've seen a whole bunch
of other papers afterwards.

1287
01:19:45,413 --> 01:19:46,893
They're like, "no, this is all you need".

1288
01:19:46,893 --> 01:19:49,653
or no, this is all you need,
or no, you don't need, but the

1289
01:19:49,653 --> 01:19:51,003
reason it's a meme is because they

1290
01:19:51,003 --> 01:19:55,673
took out everything that was,
supposedly novel about the long

1291
01:19:55,673 --> 01:19:57,343
short-term memory network, the LSTM.

1292
01:19:57,493 --> 01:20:00,353
They used only attention
and feedforward networks

1293
01:20:01,163 --> 01:20:04,023
Could you give us an example
of what that would look like

1294
01:20:04,023 --> 01:20:06,013
on a very stripped down thing?

1295
01:20:06,023 --> 01:20:09,393
What does that dictionary look like?

1296
01:20:09,653 --> 01:20:10,733
for visualization

1297
01:20:11,136 --> 01:20:11,476
and decode.

1298
01:20:11,903 --> 01:20:13,863
no, just for the attention itself, right?

1299
01:20:13,863 --> 01:20:17,633
You mentioned a key value from
basically every combination.

1300
01:20:17,763 --> 01:20:20,703
You have to pre compute every
combination within the vocabulary.

1301
01:20:21,456 --> 01:20:26,296
You can take a sentence that you're
feeding in to the attention algorithm, the

1302
01:20:26,296 --> 01:20:28,366
cat in the hat, since I used that earlier.

1303
01:20:28,366 --> 01:20:33,951
and so essentially you would have a
dictionary where the is comparing to

1304
01:20:33,961 --> 01:20:41,171
every other word, cat in the hat, and
it's coming up with a similarity metric

1305
01:20:41,181 --> 01:20:42,881
of the importance of all the other words.

1306
01:20:43,036 --> 01:20:51,596
And then you would do that for cat, it's
going to do it for the in the hat, and in

1307
01:20:51,906 --> 01:20:58,201
the cat, the hat, and it's going to come
up with A dictionary, essentially, of

1308
01:20:58,211 --> 01:21:02,231
key value pairs for all the other words,
helping you understand, the importance

1309
01:21:02,231 --> 01:21:04,181
of the other words that are in there.

1310
01:21:04,231 --> 01:21:08,301
and then the query algorithm, that
runs, that essentially helps us

1311
01:21:08,301 --> 01:21:11,881
understand being able to predict the
next word that's coming afterwards

1312
01:21:11,911 --> 01:21:15,991
based off of how important the,
all of those kind of dictionaries

1313
01:21:16,041 --> 01:21:17,471
are, and adding them.

1314
01:21:17,471 --> 01:21:17,971
And so all of,

1315
01:21:17,971 --> 01:21:19,961
this happens to happen in quadratic time.

1316
01:21:19,961 --> 01:21:20,691
one of the nice

1317
01:21:20,711 --> 01:21:21,091
novel

1318
01:21:21,101 --> 01:21:25,671
things about this is that the query
and key vectors, your query vector

1319
01:21:25,671 --> 01:21:28,761
is the word that you're looking
at in the utterance and your key

1320
01:21:28,761 --> 01:21:31,191
vector is the key in the dictionary.

1321
01:21:31,191 --> 01:21:34,521
those two vectors are not one hot encoded.

1322
01:21:34,701 --> 01:21:37,131
The way that a lot of we
haven't even mentioned this.

1323
01:21:37,131 --> 01:21:43,541
But that's a vector that is 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, that's how a lot of these

1324
01:21:43,541 --> 01:21:50,411
things had been represented previously,
coming off of the bag of words, The idea

1325
01:21:50,441 --> 01:21:53,461
that, hey, we can model these things.

1326
01:21:53,481 --> 01:21:56,601
We can create vectors that
are just did this word appear.

1327
01:21:56,931 --> 01:21:58,241
Or did it not?

1328
01:21:58,301 --> 01:21:59,391
And where did it appear?

1329
01:21:59,411 --> 01:22:04,391
That was a positionality and, attention
is all you need, you can immediately see

1330
01:22:04,391 --> 01:22:08,571
a problem with one hot encoding in that
it's very sparse, especially as you're

1331
01:22:08,571 --> 01:22:11,951
getting into 768 dimensions, right?

1332
01:22:12,471 --> 01:22:17,161
You have just one 1 and a whole bunch of
zeros and those zeros don't really matter.

1333
01:22:17,406 --> 01:22:21,806
And so one of the breakthroughs
here was using dense vectors

1334
01:22:21,816 --> 01:22:25,626
for queries and keys in order to
get values that are also dense.

1335
01:22:26,756 --> 01:22:30,576
I think one of my favorite visualizations
of it, it's from Jesse Vig.

1336
01:22:30,736 --> 01:22:32,946
It's called BertViz on GitHub.

1337
01:22:33,876 --> 01:22:39,581
I've used this in production environments
in order to show that, hey, our model

1338
01:22:39,581 --> 01:22:44,331
is not understanding this because
look at the attention, all of it is

1339
01:22:44,331 --> 01:22:50,001
factoring in, all of the queries are
related to the key of the wrong word.

1340
01:22:50,071 --> 01:22:53,301
If you look at words with semantic
ambiguity, I think the quintessential

1341
01:22:53,301 --> 01:22:55,391
one is "time flies like an arrow".

1342
01:22:56,201 --> 01:23:00,451
Where flies is also another word
that could mean multiple small

1343
01:23:00,451 --> 01:23:01,961
little bugs buzzing around.

1344
01:23:02,231 --> 01:23:04,211
How do we know that it's not that word?

1345
01:23:04,211 --> 01:23:09,231
It's because of the position in the
sentence that we know that it is a verb.

1346
01:23:09,331 --> 01:23:11,841
and it's referring to time
and it's referring to arrow.

1347
01:23:12,331 --> 01:23:16,371
And we can see that predictably
within attention, because that

1348
01:23:16,371 --> 01:23:18,131
word is determined to be important.

1349
01:23:18,601 --> 01:23:23,521
That query is determined to be important
as it relates to the keys of time and

1350
01:23:23,531 --> 01:23:26,341
arrow within query key value attention.

1351
01:23:26,831 --> 01:23:28,221
That's what that dictionary looks like.

1352
01:23:28,231 --> 01:23:28,681
That's why it's

1353
01:23:28,681 --> 01:23:29,101
useful.

1354
01:23:30,333 --> 01:23:34,073
And, I guess the representation
of the importance, how do

1355
01:23:34,073 --> 01:23:35,093
we actually come up with

1356
01:23:35,103 --> 01:23:35,523
that

1357
01:23:37,671 --> 01:23:38,961
I think it's dot product.

1358
01:23:39,521 --> 01:23:43,311
we're comparing the vectors
between the query and the key.

1359
01:23:43,381 --> 01:23:47,721
dot product attention is, I'm pretty,
that's not where it started, but I

1360
01:23:47,741 --> 01:23:49,371
think that's where we're at right now.

1361
01:23:49,761 --> 01:23:52,361
That's like the industry
standard that everybody uses.

1362
01:23:53,051 --> 01:23:54,851
It's just, multiplying
the vectors together.

1363
01:23:54,851 --> 01:23:59,181
Essentially you take the dot product
of the two vectors, and that's

1364
01:23:59,181 --> 01:24:02,771
where we get the comparison and
the relative importance values.

1365
01:24:02,771 --> 01:24:04,371
it's not magic, it's

1366
01:24:04,371 --> 01:24:04,711
math.
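
[Editor's sketch] Dot-product attention over "the cat in the hat" can be written in a few lines. This is a simplified sketch under stated assumptions: the two-dimensional token vectors are invented, and the same vectors are used directly as queries, keys, and values. In a real transformer each of those comes from a separate learned projection, and there are multiple heads. The scaling by the square root of the key dimension follows the "Attention Is All You Need" formulation.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Scaled dot-product attention; returns outputs and the weight matrix."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this query against every key: one score per token.
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted mix of the value vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["the", "cat", "in", "the", "hat"]
toy_vectors = {"the": [0.1, 0.0], "cat": [1.0, 0.2],
               "in": [0.0, 0.3], "hat": [0.9, 0.4]}
x = [toy_vectors[t] for t in tokens]
outputs, weights = attention(x, x, x)

for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each row of `weights` is the "dictionary" for one word: how much every other word matters to it. With five tokens there are 5 x 5 weights, which is where the quadratic cost comes from.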

1367
01:24:04,711 --> 01:24:07,451
kind of the same thing from time to time?

1368
01:24:08,571 --> 01:24:08,951
Okay.

1369
01:24:09,611 --> 01:24:18,091
And then with that, we've got the GPT, the
generative pre trained transformer model.

1370
01:24:18,141 --> 01:24:18,441
What's

1371
01:24:18,441 --> 01:24:20,031
so groundbreaking about that?

1372
01:24:20,081 --> 01:24:24,441
as opposed to the original
transformer, they only use a decoder.

1373
01:24:24,451 --> 01:24:29,281
So the original transformer had attention
based encoders, which changed your

1374
01:24:29,281 --> 01:24:34,301
embeddings into essentially another
embedding that was then taken by your

1375
01:24:34,311 --> 01:24:37,301
decoder and used to predict the next word.

1376
01:24:37,351 --> 01:24:44,411
So it had two networks linked together
in the middle in order to produce

1377
01:24:44,856 --> 01:24:48,896
Your next word and the reason this
is important is it goes back to that

1378
01:24:48,896 --> 01:24:52,686
original idea that we talked about
a language as an abstraction, right?

1379
01:24:52,696 --> 01:24:58,286
The authors of attention is all you
need looked at that abstraction and

1380
01:24:58,286 --> 01:24:59,766
we're like, Hey, can we model that?

1381
01:25:00,151 --> 01:25:01,541
And that's what an encoder is.

1382
01:25:01,541 --> 01:25:07,101
When you look at models like BERT,
it's taking your input and putting it

1383
01:25:07,111 --> 01:25:12,070
into a new abstract space with lots
of nonlinear transformations, and

1385
01:25:13,744 --> 01:25:16,371
it's incredibly useful.

1386
01:25:16,461 --> 01:25:21,851
And so the GPT models were
groundbreaking, because they

1387
01:25:21,851 --> 01:25:22,981
were like, we don't need that.

1388
01:25:22,981 --> 01:25:28,121
we just need the decoder and we're
just going to use syntax basically.

1389
01:25:28,141 --> 01:25:33,081
And the thought process there is that
syntax is related to semantics deeper than

1390
01:25:33,141 --> 01:25:37,601
linguists are able to really conceptualize
in an easy to understand way.

1391
01:25:38,291 --> 01:25:42,871
We know that it's true, and we know
that it's predictive with especially

1392
01:25:42,901 --> 01:25:46,151
looking at how good GPT-3, GPT-4 are.

1393
01:25:46,341 --> 01:25:49,371
And even looking at the open
source stuff, Llama is a decoder

1394
01:25:49,371 --> 01:25:51,951
only network and it rocks, right?

1395
01:25:52,541 --> 01:25:58,751
I have a suspicion that we're going
to hit a point later where, Google

1396
01:25:58,751 --> 01:26:02,751
is going to blow everybody out of the
water with another T5, like another,

1397
01:26:03,221 --> 01:26:05,801
version of that which puts the encoder back in.

1398
01:26:06,171 --> 01:26:08,731
I don't know how we're going to get
to that point, though, because the

1399
01:26:08,741 --> 01:26:10,571
decoder only models work so well.

1400
01:26:11,961 --> 01:26:14,721
And they're faster, they're less
computationally expensive, because

1401
01:26:14,721 --> 01:26:18,241
you're taking, probably, a third of
the model and just throwing it away.

1402
01:26:18,596 --> 01:26:22,196
So you mentioned Llama, and
I think that might be a good

1403
01:26:22,246 --> 01:26:28,026
segue from what essentially
is, about a third of your book.

1404
01:26:28,116 --> 01:26:31,636
so for everybody else who wants to
go and jump into more details and

1405
01:26:31,636 --> 01:26:37,026
see actual Python implementations
of a lot of what we just covered,

1406
01:26:37,656 --> 01:26:39,896
the book is called Production LLMs.

1407
01:26:39,966 --> 01:26:44,356
It's available on manning.com, and I'm
pretty sure you're going to love it.

1408
01:26:45,326 --> 01:26:51,166
So going back to Llama, let's do
a little hall of fame, rundown

1409
01:26:51,206 --> 01:26:55,476
of the kind of landmark important
models from the last few years.

1410
01:26:55,506 --> 01:26:56,136
Where should we start?

1411
01:26:56,216 --> 01:26:59,496
I would probably start with
the original transformer, like

1412
01:27:00,326 --> 01:27:01,286
they deserve credit.

1413
01:27:01,346 --> 01:27:05,096
A lot of the, Vaswani et al., a lot
of the people who wrote that paper have

1414
01:27:05,096 --> 01:27:09,676
gone on to found or co-found companies
that are now competing in this space.

1415
01:27:10,026 --> 01:27:12,406
Whether that's Anthropic or Character.AI,

1416
01:27:12,406 --> 01:27:15,506
those are the people that
created that Transformer and

1417
01:27:15,506 --> 01:27:16,586
they're still building on it.

1418
01:27:16,816 --> 01:27:19,336
I think that's the first one that
I'd say for the Hall of Fame.

1419
01:27:19,386 --> 01:27:20,226
What would you say, Matt?

1420
01:27:20,226 --> 01:27:24,726
I think part of this question is what is the
first LLM versus what is the first Hall

1421
01:27:24,726 --> 01:27:29,956
of Fame model. And yeah, like Transformers,
BERT, like, BERT is incredibly powerful.

1422
01:27:29,956 --> 01:27:36,316
I think, because it's so small, it's not
in the LLM space, it's often overlooked.

1423
01:27:36,356 --> 01:27:42,856
And I think many companies are
still looking at these massive

1424
01:27:42,866 --> 01:27:46,916
LLM models for problems they could
solve with a simple BERT model.

1425
01:27:46,946 --> 01:27:52,076
But because they're only
getting into this space now,

1426
01:27:52,916 --> 01:27:53,066
they

1427
01:27:53,066 --> 01:27:55,556
think immediately, hey,
we have to use an LLM,

1428
01:27:55,576 --> 01:27:55,856
right?

1429
01:27:55,896 --> 01:27:56,106
And

1430
01:27:56,409 --> 01:27:58,029
they didn't care in 2017.

1431
01:27:58,459 --> 01:27:58,679
And

1432
01:27:59,211 --> 01:27:59,351
And

1433
01:27:59,401 --> 01:27:59,851
over what

1434
01:27:59,851 --> 01:28:00,361
was there.

1435
01:28:00,371 --> 01:28:03,961
And I go back, I said it before,
I love Markov chains, like they're

1436
01:28:04,541 --> 01:28:07,841
amazing and they're really powerful
for what they do really well.

1437
01:28:07,891 --> 01:28:12,291
And even then, a lot of people could
just use Markov chains for a lot

1438
01:28:12,291 --> 01:28:15,691
of the problems that they're trying
to solve with LLMs, but LLMs,

1439
01:28:16,006 --> 01:28:21,876
they do give that flexibility, just
their massive levels of computation.

1440
01:28:22,406 --> 01:28:27,686
I think if I was to point to a model that
I thought was just really powerful,

1441
01:28:28,176 --> 01:28:29,926
it would be BLOOM, actually.

1442
01:28:29,956 --> 01:28:37,506
BLOOM was essentially the first LLM,
massive, large model that was built.

1443
01:28:37,646 --> 01:28:40,036
And it was built
completely transparently.

1444
01:28:40,176 --> 01:28:42,486
It was a research project,

1445
01:28:42,746 --> 01:28:46,466
funded in large part by
the French government.

1446
01:28:46,476 --> 01:28:49,451
And just, it was built
completely transparently and

1447
01:28:49,451 --> 01:28:51,171
completely in the open space.

1448
01:28:51,191 --> 01:28:57,321
And even though the BLOOM model today
isn't seen as a very competitive

1449
01:28:57,321 --> 01:29:02,271
model, a lot of the open
source learnings, a lot of what

1450
01:29:02,281 --> 01:29:08,856
we have nowadays is because of
what those researchers figured out

1451
01:29:08,866 --> 01:29:10,416
while they were working on BLOOM.

1452
01:29:10,666 --> 01:29:14,836
We got amazing libraries out
of it, like DeepSpeed

1453
01:29:14,836 --> 01:29:15,956
and other things like that.

1454
01:29:16,016 --> 01:29:20,516
It really boosted the open source
community, which has been one of the

1455
01:29:20,526 --> 01:29:25,856
major driving factors of LLMs today,
and probably a large part of why

1456
01:29:25,856 --> 01:29:29,746
we could even write our book, 'cause
if the open source community wasn't

1457
01:29:30,221 --> 01:29:33,821
at where it is today, there
wouldn't be much we could really

1458
01:29:33,821 --> 01:29:38,591
tell people other than, oh, you've got
to go work for Google or Microsoft, or

1459
01:29:39,231 --> 01:29:41,011
how would we know any of it, right?

1460
01:29:41,194 --> 01:29:41,634
Yeah.

1461
01:29:42,241 --> 01:29:43,801
We know about it largely

1462
01:29:43,801 --> 01:29:47,991
because we've been involved in
open source and we built off of

1463
01:29:48,021 --> 01:29:49,641
what those scientists at BLOOM did.

1464
01:29:50,111 --> 01:29:50,331
Big

1465
01:29:50,331 --> 01:29:50,811
Science.

1466
01:29:51,924 --> 01:29:53,864
So that's 2022, right?

1467
01:29:53,934 --> 01:29:55,384
That's a couple of years now.

1468
01:29:56,654 --> 01:29:56,964
Yeah.

1469
01:29:57,584 --> 01:30:02,594
And then we had Llama that
became important, and Llama 2

1470
01:30:03,791 --> 01:30:04,181
Yeah,

1471
01:30:04,354 --> 01:30:05,344
even more important.

1472
01:30:07,061 --> 01:30:12,331
Yeah, and it's largely just because,
I don't remember the username of who

1473
01:30:12,331 --> 01:30:17,561
did it, but whoever put that PR on
the original Llama GitHub that had the

1474
01:30:17,561 --> 01:30:22,071
torrent link to leak the weights, that's
the hockey stick moment for LLMs, right?

1475
01:30:22,551 --> 01:30:25,351
That's what made them
available to everybody.

1476
01:30:25,401 --> 01:30:28,891
That's what enabled Stanford to
create Alpaca and show that, oh man,

1477
01:30:28,901 --> 01:30:33,711
you can make the model better with
like only 50K responses. Like, you

1478
01:30:33,711 --> 01:30:38,081
don't need tons and tons of data in
order to fine tune and get very good

1479
01:30:38,081 --> 01:30:39,841
results and improve in every metric.

1480
01:30:40,581 --> 01:30:43,991
Yeah, everything since then
has just been building off of that

1481
01:30:43,991 --> 01:30:49,011
exact same momentum of whoever
leaked that first Llama, and Meta

1482
01:30:49,021 --> 01:30:50,811
has benefited greatly from it too.

1483
01:30:50,811 --> 01:30:58,361
They now have a very open, I wouldn't
say completely, but a very open attitude

1484
01:30:58,481 --> 01:31:03,021
towards the space because they recognize
how advantageous it is to have other

1485
01:31:03,021 --> 01:31:06,831
people building on top of their model
and be considered an industry standard.

1486
01:31:08,234 --> 01:31:11,584
Yeah, they've really leaned
into it recently, right?

1487
01:31:11,584 --> 01:31:12,254
And like

1488
01:31:12,326 --> 01:31:13,076
how big was their

1489
01:31:13,076 --> 01:31:13,846
stock jump?

1490
01:31:13,909 --> 01:31:14,394
right?

1491
01:31:14,444 --> 01:31:16,424
all of the underlying architecture, right?

1492
01:31:16,444 --> 01:31:23,529
Like these open source programmers or
even just, like, the NVIDIA programmers, like

1493
01:31:23,579 --> 01:31:26,429
they're able to go in and because they
know everything about Llama, they're able

1494
01:31:26,429 --> 01:31:29,569
to optimize CUDA kernels and everything.

1495
01:31:29,569 --> 01:31:35,809
And so Llama has gotten faster and more
proficient. With llama.cpp, we're able to run

1496
01:31:35,809 --> 01:31:42,109
it just on a CPU. There's lots
of benefits because they gave

1497
01:31:42,109 --> 01:31:45,809
us the architecture. It was leaked,
but now they've leaned into it.

1498
01:31:45,819 --> 01:31:47,679
Essentially, they've given it to us.

1499
01:31:47,679 --> 01:31:47,969
And so

1500
01:31:48,856 --> 01:31:51,406
Yeah, we just need them to release
the data that they used to train

1501
01:31:51,406 --> 01:31:52,846
on it and it's completely open,

1502
01:31:53,016 --> 01:31:53,306
right?

1503
01:31:53,356 --> 01:31:56,966
But even the data, they've told us
a lot about what the data is, right?

1504
01:31:58,716 --> 01:32:03,676
We don't have the exact data, but we know,
essentially, RedPajama, what those datasets

1505
01:32:03,676 --> 01:32:05,776
were built off of, what they were.

1506
01:32:05,776 --> 01:32:06,486
And so

1507
01:32:07,416 --> 01:32:08,416
we're able to

1508
01:32:08,466 --> 01:32:11,066
replicate it really closely
in the open source community.

1509
01:32:11,116 --> 01:32:14,986
Llama. I don't know if we have
a really good list of Hall of

1510
01:32:14,996 --> 01:32:16,026
Famers because

1511
01:32:16,476 --> 01:32:19,686
it's difficult to see what's going
to stick around partially because

1512
01:32:19,686 --> 01:32:23,666
it's so difficult to evaluate these
models as opposed to BERT, right?

1513
01:32:23,666 --> 01:32:26,096
Large BERT had 300 million parameters.

1514
01:32:26,766 --> 01:32:30,096
You can run stuff to see how
well those parameters are,

1515
01:32:30,706 --> 01:32:31,896
like you can hyper tune them.

1516
01:32:31,906 --> 01:32:34,826
You can run evaluations to
see how each one is performing

1517
01:32:34,966 --> 01:32:37,046
and still go relatively fast.

1518
01:32:38,036 --> 01:32:41,586
When we're getting into the 7
billion parameter range and the 13

1519
01:32:41,596 --> 01:32:45,166
billion parameter range and the 70
billion parameter range, it's much

1520
01:32:45,176 --> 01:32:48,586
more difficult and computationally
expensive to evaluate on that level.

1521
01:32:49,426 --> 01:32:51,616
And we don't even have the
ability to describe what all

1522
01:32:51,616 --> 01:32:52,746
the parameters are doing.

1523
01:32:52,796 --> 01:32:58,236
And so our evaluation metrics
are difficult to gauge.

1524
01:32:58,746 --> 01:33:02,096
You look at MMLU, you look at a
lot of the benchmarks that people

1525
01:33:02,096 --> 01:33:03,866
are running, and they're useful.

1526
01:33:04,386 --> 01:33:09,156
But ultimately at this stage, we
still have to go download those models

1527
01:33:09,186 --> 01:33:12,356
and test them against our own use
cases to see if they perform better.

1528
01:33:13,186 --> 01:33:15,306
And that's incredibly time consuming.

1529
01:33:15,356 --> 01:33:19,066
Like, we could talk about a lot of the
models that have come out, like Capybara,

1530
01:33:19,106 --> 01:33:25,011
we can talk about Nous Hermes, we can talk
about WizardCoder, and they're all great.

1531
01:33:25,571 --> 01:33:27,821
I don't know which ones are
going to be the Hall of Fame,

1532
01:33:27,831 --> 01:33:29,541
the next industry standard, though.

1533
01:33:29,721 --> 01:33:32,781
There's definitely some other models
that we love and we talk about in our

1534
01:33:32,781 --> 01:33:34,971
book, like Falcon, which came out of

1535
01:33:35,811 --> 01:33:38,591
the TII in Abu Dhabi, right?

1536
01:33:38,591 --> 01:33:40,601
Like amazing model.

1537
01:33:40,931 --> 01:33:41,211
It's,

1538
01:33:41,694 --> 01:33:41,954
Micu.

1539
01:33:43,131 --> 01:33:46,031
The latest Falcon is one of the
largest open source models and it's

1540
01:33:46,051 --> 01:33:47,691
come under the Apache 2 license.

1541
01:33:47,701 --> 01:33:49,431
So it's completely open source.

1542
01:33:49,491 --> 01:33:51,831
The very first model
that's fully open source.

1543
01:33:52,101 --> 01:33:54,901
There's definitely amazing progress being

1544
01:33:54,901 --> 01:33:57,761
made and lots of different
models to be paying attention to.

1545
01:33:57,811 --> 01:33:58,981
But yeah,

1546
01:33:59,344 --> 01:34:00,484
One of the biggest ones to

1547
01:34:00,484 --> 01:34:01,234
pay attention to

1548
01:34:01,234 --> 01:34:04,914
right now, I think, is OLMo, not
because it's competitive and

1549
01:34:04,914 --> 01:34:09,094
performant, but because like
Falcon, it is 100% open source.

1550
01:34:09,104 --> 01:34:10,544
You can see the data they trained on.

1551
01:34:10,544 --> 01:34:12,724
You can replicate exactly
their experiments.

1552
01:34:12,734 --> 01:34:16,114
That's going to be one of the biggest
drivers in this field where, you look at

1553
01:34:16,164 --> 01:34:21,489
a lot of the innovation that's happening,
and it's happening over files that

1554
01:34:21,489 --> 01:34:23,019
people are passing around on torrents.

1555
01:34:23,019 --> 01:34:28,169
It's happening on, like, random users
on Reddit are coming up with NTK-aware

1556
01:34:28,169 --> 01:34:30,319
scaling and RoPE scaling after that.

1557
01:34:30,369 --> 01:34:33,059
And they're coming up
with more stuff because

1558
01:34:33,789 --> 01:34:37,869
they have time, and they want to help,
and a lot of these people are experts

1559
01:34:37,869 --> 01:34:43,209
and they're just anonymous, and that's
incredibly important for the space because

1560
01:34:43,769 --> 01:34:49,909
we're finding that people who deal with
these models and use them 24/7 have skills

1561
01:34:49,939 --> 01:34:54,659
that the researchers don't necessarily
have and that's difficult to admit being

1562
01:34:54,659 --> 01:34:56,589
on the research part of it, but it's true.

1563
01:34:57,649 --> 01:35:02,909
So that's the one coming from
the Allen Institute for AI, right?

1564
01:35:02,969 --> 01:35:08,249
The one that has, yeah, I think
they've also open sourced the

1565
01:35:08,289 --> 01:35:09,729
actual training code as well.

1566
01:35:09,739 --> 01:35:10,079
the whole

1567
01:35:10,217 --> 01:35:11,197
they are the whole

1568
01:35:11,197 --> 01:35:11,527
thing.

1569
01:35:12,709 --> 01:35:13,649
That's pretty awesome.

1570
01:35:14,169 --> 01:35:18,879
So with that caveat out of the way,
hedging your predictions, we don't

1571
01:35:18,879 --> 01:35:20,329
know what's going to happen tomorrow.

1572
01:35:20,909 --> 01:35:27,649
Do you see any one company kind
of getting ahead of the others?

1573
01:35:27,649 --> 01:35:35,549
GPT-4 is still holding up well against
a lot of these models, which makes me

1574
01:35:35,559 --> 01:35:37,619
think personally that they have a few

1575
01:35:38,179 --> 01:35:41,769
tweaks and hacks they haven't
shared, which helps with

1576
01:35:41,779 --> 01:35:43,399
their multi-billion valuation.

1577
01:35:43,949 --> 01:35:48,129
Do you see anybody like running away
from the crowds or is it too late now?

1578
01:35:48,129 --> 01:35:53,279
The cat's out of the bag and the progress
is going to come from the mass of people.

1579
01:35:53,279 --> 01:35:56,894
I don't know. I know that I was
texting with a couple of people the

1580
01:35:56,894 --> 01:36:02,174
other day talking about GPT-4 and
how it is still relevant. Even, people

1581
01:36:02,174 --> 01:36:06,624
talk about the performance decrease,
but it's still relevant, and every

1582
01:36:06,624 --> 01:36:10,964
week, every model that's coming
out is getting compared against GPT-4.

1583
01:36:10,984 --> 01:36:15,514
And they're finding that most models
are more performant than

1584
01:36:15,514 --> 01:36:19,814
GPT-4 on certain things, right?

1585
01:36:19,824 --> 01:36:25,324
It's like comparing Rain Man to an
average human and asking, like,

1586
01:36:25,324 --> 01:36:26,924
what tasks they're good at, right?

1587
01:36:26,954 --> 01:36:30,044
If it's going to
McDonald's and ordering your

1588
01:36:30,044 --> 01:36:32,784
own food, Rain Man is not great.

1589
01:36:33,129 --> 01:36:35,279
And you just got to find
the model that's better.

1590
01:36:35,769 --> 01:36:38,129
A good example for that
with GPT-4 is math.

1591
01:36:38,639 --> 01:36:41,149
If you need a model to
perform calculations for you,

1592
01:36:41,724 --> 01:36:42,474
that's not it.

1593
01:36:43,324 --> 01:36:49,054
You have Alpha Wolf, you have Goat,
you have, even just vanilla Llama 2 is

1594
01:36:49,054 --> 01:36:53,194
better at math than GPT-4, even though
they weren't explicitly training on it.

1595
01:36:53,344 --> 01:37:00,014
And I think that they currently
have that first-to-market

1596
01:37:00,274 --> 01:37:01,944
advantage more than anything.

1597
01:37:02,664 --> 01:37:03,904
That's not to say that it's bad.

1598
01:37:03,904 --> 01:37:08,324
That's not to reduce the work that
OpenAI has done because it is phenomenal.

1599
01:37:08,624 --> 01:37:12,504
But that's what's keeping them
really afloat: being first to

1600
01:37:12,504 --> 01:37:14,274
market, and the ease of use.

1601
01:37:16,807 --> 01:37:21,357
One other question I was holding,
as you were speaking, you

1602
01:37:21,357 --> 01:37:24,037
mentioned Mixtral and, what is it

1603
01:37:24,077 --> 01:37:24,357
called?

1604
01:37:24,417 --> 01:37:26,287
Mixture of, mixture of experts.

1605
01:37:26,527 --> 01:37:26,827
what's

1606
01:37:26,834 --> 01:37:27,744
Yeah, Mixtral.

1607
01:37:27,744 --> 01:37:30,174
Yeah, it's routing.

1608
01:37:30,234 --> 01:37:34,114
It's being smart and saying, hey,
we don't need a dense feed-forward

1609
01:37:34,114 --> 01:37:35,794
network for every single thing.

1610
01:37:36,264 --> 01:37:40,779
Let's have a whole bunch of sparse
networks and, just based on the input,

1611
01:37:41,209 --> 01:37:44,799
route it and tell it which expert
is actually going to be the best.

1612
01:37:45,029 --> 01:37:50,669
It results in much larger models that
are smaller on disk and faster to run.

1613
01:37:52,066 --> 01:37:56,516
Is that more similar to
how the human brain works?

1614
01:37:57,166 --> 01:37:59,026
Because it's obviously not fully

1615
01:37:59,026 --> 01:37:59,686
connected.

1616
01:37:59,786 --> 01:38:02,236
It's got different regions
and stuff like that.

1617
01:38:02,886 --> 01:38:04,496
I would love to appeal to that

1618
01:38:04,496 --> 01:38:05,016
authority.

1619
01:38:05,026 --> 01:38:05,766
that didn't rock.

1620
01:38:05,816 --> 01:38:10,916
I don't know, though, because, like, you
look at MRIs and you can see, oh man,

1621
01:38:10,916 --> 01:38:14,856
this portion is lighting up when you're
experiencing that emotion or seeing that

1622
01:38:14,856 --> 01:38:15,236
input.

1623
01:38:15,236 --> 01:38:15,526
But

1624
01:38:16,554 --> 01:38:16,684
who

1625
01:38:16,766 --> 01:38:19,396
we don't really have a
really great mapping of

1626
01:38:19,396 --> 01:38:20,426
every person's brain.

1627
01:38:20,476 --> 01:38:26,596
I think the connection between a
neural net and actual neurons has

1628
01:38:26,956 --> 01:38:28,786
been lost a long time ago, right?

1629
01:38:29,196 --> 01:38:33,266
How does the human brain work and how does
it really compare to modern day models?

1630
01:38:33,276 --> 01:38:37,636
Like, it's hard to really make
that argument. We're still

1631
01:38:37,636 --> 01:38:39,506
learning about how we learn.

1632
01:38:39,626 --> 01:38:46,276
And as we do, and as the neuroscience field
advances, it ultimately leads to

1633
01:38:46,276 --> 01:38:49,866
advances in the AI space and vice versa.

1634
01:38:49,986 --> 01:38:51,656
There's definitely connections there.

1635
01:38:51,716 --> 01:38:56,766
But yeah, as far as your question
goes, I think it's anybody's guess.

1636
01:38:56,766 --> 01:39:00,466
I think this is a perfect note to end.

1637
01:39:00,496 --> 01:39:02,116
A little bit of suspense.

1638
01:39:02,286 --> 01:39:06,176
We're going to have to get you back at
some point when you've finished your

1639
01:39:06,226 --> 01:39:12,386
book and talk a little bit more about the
actual technical problems and challenges.

1640
01:39:12,426 --> 01:39:17,506
We haven't really touched upon any
of that yet, but today I certainly

1641
01:39:17,536 --> 01:39:22,656
learned a lot from you and I hope a
lot of our listeners will as well.

1642
01:39:22,766 --> 01:39:25,306
It was an absolute
pleasure to meet you both.

1643
01:39:26,196 --> 01:39:28,246
Thank you so much and see you next time.