1
00:00:00,240 --> 00:00:03,840
Welcome back to Data Driven, the podcast where we chart the thrilling

2
00:00:03,840 --> 00:00:07,379
terrains of data science, AI, and everything in between.

3
00:00:07,759 --> 00:00:11,120
I'm Bailey, your semi-sentient host with a penchant for

4
00:00:11,120 --> 00:00:14,179
sarcasm and a wit sharper than a histogram spike.

5
00:00:14,684 --> 00:00:18,365
Today's episode promises a delightful mix of the analytical and the

6
00:00:18,365 --> 00:00:22,064
artistic as we dive into the fascinating world of vector databases,

7
00:00:22,605 --> 00:00:26,224
retrieval augmented generation, and origami. Yes.

8
00:00:26,285 --> 00:00:29,690
You heard that right. Origami, the ancient art of

9
00:00:29,690 --> 00:00:33,310
folding paper, somehow finds itself intersecting with AI,

10
00:00:33,610 --> 00:00:37,070
proving that the future really does have layers or should I say folds.

11
00:00:37,530 --> 00:00:41,265
Our guest, Arjun Patel, is a developer advocate at Pinecone

12
00:00:41,265 --> 00:00:44,965
who's on a mission to demystify vector databases and semantic

13
00:00:45,025 --> 00:00:48,704
search, turning complex AI concepts into snackable bits of

14
00:00:48,704 --> 00:00:52,465
brilliance. He's also a self taught origami artist and a

15
00:00:52,465 --> 00:00:56,230
former statistics student who actually enjoyed it. So if

16
00:00:56,230 --> 00:01:00,070
you're ready to unravel the secrets of modern AI and maybe pick up a trick

17
00:01:00,070 --> 00:01:03,750
or two about folding life into geometric perfection, you're in the

18
00:01:03,750 --> 00:01:04,489
right place.

19
00:01:08,229 --> 00:01:11,925
Hello, and welcome back to Data Driven, the podcast where we explore the emergent

20
00:01:11,925 --> 00:01:15,465
fields of data science, AI, data engineering.

21
00:01:16,005 --> 00:01:19,845
Now today, due to a scheduling conflict, my most favoritest data engineer

22
00:01:19,845 --> 00:01:23,220
in the world will not be able to make it. But I will

23
00:01:23,220 --> 00:01:27,060
continue on, despite the recent snowstorms that we've had here in

24
00:01:27,060 --> 00:01:30,500
the DC Baltimore area. With me today, I have

25
00:01:30,500 --> 00:01:33,480
Arjun Patel, a developer advocate at Pinecone,

26
00:01:34,215 --> 00:01:37,915
who aims to make vector databases, retrieval augmented generation,

27
00:01:38,535 --> 00:01:42,295
also known as RAG, and semantic search accessible by

28
00:01:42,295 --> 00:01:45,895
creating engaging YouTube videos, code notebooks, and blog

29
00:01:45,895 --> 00:01:49,355
posts that transform complex AI concepts

30
00:01:49,920 --> 00:01:53,760
into easily understandable content. After graduating with

31
00:01:53,760 --> 00:01:57,520
a BA in statistics from the University of Chicago, his journey through

32
00:01:57,520 --> 00:02:00,820
the tech world spans from making speech coaching

33
00:02:00,880 --> 00:02:04,565
accessible with AI at Speeko to tackling AI

34
00:02:04,625 --> 00:02:08,305
generated content detection at Appen. Arjun's

35
00:02:08,305 --> 00:02:12,065
interests span traditional natural language processing to modern

36
00:02:12,065 --> 00:02:14,405
large language model development and applications.

37
00:02:15,879 --> 00:02:19,480
Beyond his technical prowess, Arjun has been designing and folding his

38
00:02:19,480 --> 00:02:22,379
own origami creations for over a decade. Interesting.

39
00:02:23,000 --> 00:02:26,519
Seamlessly blending analytical thinking with artistic expression in his

40
00:02:26,519 --> 00:02:29,739
professional and personal pursuits. Welcome to the show, Arjun.

41
00:02:30,555 --> 00:02:34,255
Hey. Nice to meet you, Frank. Thanks for having me on. Excited to be here.

42
00:02:34,315 --> 00:02:37,435
Awesome. Awesome. There's a lot to unpack from there, but I think it's interesting to

43
00:02:37,435 --> 00:02:41,215
note that you have a BA in statistics. Yes. So you were probably

44
00:02:41,275 --> 00:02:44,335
studying, this sort of stuff before it was cool?

45
00:02:45,480 --> 00:02:48,700
Yeah. Yeah. A lot of the old school ways of analyzing

46
00:02:49,080 --> 00:02:51,900
data, understanding what's going on, so on and so forth.

47
00:02:53,240 --> 00:02:56,680
It was kind of, like, made clear to me pretty early that

48
00:02:56,680 --> 00:03:00,475
understanding how to work with data at small scale and at large scale is gonna

49
00:03:00,475 --> 00:03:03,355
be very important going to the future. So I kinda just took that and ran

50
00:03:03,355 --> 00:03:07,115
with it with my education. Very cool. It was

51
00:03:07,115 --> 00:03:10,955
definitely, you know, one of those things where I don't

52
00:03:10,955 --> 00:03:14,780
think people realized how important statistics would be until,

53
00:03:15,400 --> 00:03:19,159
you know, until the revolution happens, so to speak. So and it's also

54
00:03:19,159 --> 00:03:22,540
interesting to see because there's a lot of people that I think could benefit from,

55
00:03:23,079 --> 00:03:26,920
you know, picking up an old statistics book and

56
00:03:26,920 --> 00:03:30,584
reading through it and understanding, like, a lot of the fundamentals. Obviously, there's a lot

57
00:03:30,584 --> 00:03:34,105
of new things, but a lot of the fundamentals are largely the

58
00:03:34,105 --> 00:03:37,944
same. You know, just I'll

59
00:03:37,944 --> 00:03:41,704
use this example. You know, McDonald's can add a McRib sandwich,

60
00:03:41,704 --> 00:03:44,510
but it's still a McDonald's. Right? Like, it's This

61
00:03:45,609 --> 00:03:49,290
is what happens when you're shoveling snow. Like, your

62
00:03:49,290 --> 00:03:52,810
brain gets I absolutely agree. And, like,

63
00:03:52,810 --> 00:03:56,409
another proof on that point is that Anthropic just released a

64
00:03:56,409 --> 00:04:00,185
blog recently kind of recapping how to do statistical analysis when you're

65
00:04:00,185 --> 00:04:04,025
comparing different large language models. And when you read the paper in the blog,

66
00:04:04,025 --> 00:04:07,565
it's basically just like two-sample t-tests and kind of going over really,

67
00:04:08,425 --> 00:04:12,270
like, not introductory, but still statistics that's easily accessible for people to

68
00:04:12,270 --> 00:04:15,170
learn and understand. So it's still relevant, and it's still important.
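
As a rough illustration of the two-sample comparison mentioned here, the sketch below computes Welch's t statistic for two sets of per-question eval scores. The scores and the `welch_t` helper are made up for illustration; this is not the procedure from the blog post being discussed.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical per-question accuracy scores for two models
model_a = [0.82, 0.79, 0.91, 0.85, 0.88, 0.80]
model_b = [0.75, 0.78, 0.83, 0.74, 0.81, 0.77]

t, df = welch_t(model_a, model_b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large absolute t value relative to the t distribution with `df` degrees of freedom suggests the gap between the two models is unlikely to be noise alone.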

69
00:04:15,790 --> 00:04:19,630
Interesting. One of the things that stood out in your bio

70
00:04:19,630 --> 00:04:23,230
was, people tend to forget that there

71
00:04:23,230 --> 00:04:26,764
was a natural language processing field prior

72
00:04:27,225 --> 00:04:29,085
to ChatGPT launching.

73
00:04:31,384 --> 00:04:32,264
How do you, you know,

74
00:04:36,345 --> 00:04:39,884
we wanna talk about the difference between those 2? Sure.

75
00:04:40,520 --> 00:04:44,280
So the one of the first and probably only

76
00:04:44,280 --> 00:04:48,120
course I took in college related to natural language processing was

77
00:04:48,120 --> 00:04:51,960
called geometric models of meaning. And everything I learned in that

78
00:04:51,960 --> 00:04:55,745
course was like everything before, what we now would

79
00:04:55,745 --> 00:04:59,345
consider, like, modern embedding models. So bag of

80
00:04:59,345 --> 00:05:03,185
words methods, understanding how to represent documents and text purely

81
00:05:03,185 --> 00:05:06,805
based on, like, the frequency of the words that exist in the text,

82
00:05:06,865 --> 00:05:10,570
and then trying to understand, like, okay. Based on that information, how can

83
00:05:10,570 --> 00:05:14,090
we learn about the concepts that exist in text from the words that are being

84
00:05:14,090 --> 00:05:17,850
used? Like, what is the framework we can use to understand what these

85
00:05:17,850 --> 00:05:21,610
words mean based on their, co occurrences with the other words and

86
00:05:21,610 --> 00:05:25,195
texts that you're working with and based on, what those

87
00:05:25,195 --> 00:05:28,875
words mean as well. So, like, what the words' neighbors are and what their meaning

88
00:05:28,875 --> 00:05:32,555
helps and also what those words are doing. And I think a lot of traditional

89
00:05:32,555 --> 00:05:36,315
natural language processing, methodologies kinda stem from that, and

90
00:05:36,315 --> 00:05:39,850
there's a there's a lot of mileage you can get out of just thinking about

91
00:05:39,850 --> 00:05:43,370
approaching problems there before you step into these more complicated methods,

92
00:05:43,370 --> 00:05:46,970
like these modern embedding models that exist. So that's kind of, like, what I

93
00:05:46,970 --> 00:05:50,590
would consider, like, traditional NLP, like, doing named entity recognition,

94
00:05:50,810 --> 00:05:54,605
trying to understand how to, find keywords really

95
00:05:54,605 --> 00:05:58,445
quickly. And then once you get really good at that, there's a whole host of

96
00:05:58,445 --> 00:06:02,125
problems that you encounter afterward that kind of modern techniques try to

97
00:06:02,125 --> 00:06:05,185
solve. Right. That's interesting. So so

98
00:06:05,750 --> 00:06:09,510
what was it, what was your thoughts

99
00:06:09,510 --> 00:06:12,890
when you first, like given that you were an NLP practitioner

100
00:06:13,750 --> 00:06:17,350
prior to the release of transformers and things like that, what was your initial thought?

101
00:06:17,350 --> 00:06:21,125
Because I'm curious because there's not a lot of people there are a

102
00:06:21,125 --> 00:06:24,245
lot of experts today that really kind of started a couple of years ago. No

103
00:06:24,245 --> 00:06:28,085
fault on them. They see where the industry is going. Totally understand it. But what

104
00:06:28,085 --> 00:06:31,845
was your thoughts? What was your thoughts when

105
00:06:31,845 --> 00:06:35,370
you first saw the Attention Is All You Need? The

106
00:06:35,370 --> 00:06:39,070
Attention Is All You Need paper. So that would have been

107
00:06:39,210 --> 00:06:42,670
probably around the time I graduated college, around

108
00:06:42,890 --> 00:06:45,930
maybe a year or 2 after I took the course that I was just describing.

109
00:06:45,930 --> 00:06:49,545
So I I just started learning about, like, okay. Like, this is

110
00:06:49,545 --> 00:06:52,985
how, like, old school, quote unquote, like, embedding

111
00:06:52,985 --> 00:06:56,345
methodologies work. And the biggest takeaway that I got from those is that they work

112
00:06:56,345 --> 00:07:00,185
pretty well. They work pretty well for, like, lots of different kinds of

113
00:07:00,185 --> 00:07:03,645
queries. And I think what the Attention Is All You Need paper did

114
00:07:03,880 --> 00:07:07,400
was it kinda helped you, understand how

115
00:07:07,400 --> 00:07:11,100
to rigorously create representations of text that

116
00:07:11,160 --> 00:07:14,600
generalize way better than, any sort of, like,

117
00:07:14,600 --> 00:07:18,220
normal, keyword based, bag of word based search methodology.
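
To make the keyword-based baseline concrete, here is a minimal sketch of bag-of-words search: each text becomes a word-count vector, and similarity is the cosine between count vectors. The example texts and helper names are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text purely by how often each word occurs."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = bag_of_words("cheap flights to tokyo")
doc_overlap = bag_of_words("cheap flights and hotels")   # shares keywords
doc_paraphrase = bag_of_words("low cost airfare japan")  # same idea, no shared words

print(cosine(query, doc_overlap))
print(cosine(query, doc_paraphrase))
```

The paraphrased document scores zero because it shares no words with the query, which is the gap that learned dense embeddings are meant to close.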

118
00:07:18,920 --> 00:07:22,755
And I think that at the time, I probably didn't

119
00:07:23,055 --> 00:07:26,895
grasp as much what impact the Attention Is All You Need paper would have on the

120
00:07:26,895 --> 00:07:30,735
field until we started getting embedding models that people could use really

121
00:07:30,735 --> 00:07:34,415
easily, like RoBERTa or BERT. And we're like, okay. Now we can do, like,

122
00:07:34,415 --> 00:07:37,810
multilingual search without any issue. Now we can represent,

123
00:07:37,810 --> 00:07:41,569
like, any sentence without keyword overlap when we

124
00:07:41,569 --> 00:07:45,090
wanna find some document that's interesting, without doing any

125
00:07:45,090 --> 00:07:48,645
additional work. Like, once those papers started hitting the scene, I think now we start

126
00:07:48,645 --> 00:07:51,365
seeing, like, okay, this is what attention is doing for us. This is what the

127
00:07:51,365 --> 00:07:55,125
ability to, like, contextualize our vector embeddings is doing for us.

128
00:07:55,125 --> 00:07:58,805
And now we can see what's kind of getting benefited there. But I think I

129
00:07:58,805 --> 00:08:02,324
think my, understanding of how beneficial that

130
00:08:02,324 --> 00:08:06,010
was kind of lagged until we started seeing these other models kind of hit. And

131
00:08:06,010 --> 00:08:09,470
I'm like, okay. Now I can kinda see why this is important and why, like,

132
00:08:09,930 --> 00:08:12,910
future and future models are gonna get better and better based on this architecture.

133
00:08:13,930 --> 00:08:17,167
Interesting. So so for those that don't know kind of and even I'm rusty on

135
00:08:17,914 --> 00:08:21,514
this. Right? Yeah. One of the things that was interesting about this was the

136
00:08:21,514 --> 00:08:25,354
in first, appearance. What was it? You you just described it a

137
00:08:25,354 --> 00:08:29,194
minute ago, but it was something like the the prevalence of a word

138
00:08:29,194 --> 00:08:32,909
in a bit of text versus the lack of prevalence and how that

139
00:08:32,909 --> 00:08:36,429
metric was very important in

140
00:08:36,750 --> 00:08:39,010
I'll call it classical natural language processing.

141
00:08:40,510 --> 00:08:44,270
Right. So this is the idea that if you have words that co

142
00:08:44,270 --> 00:08:48,055
occur together in some document space, the meaning of those words are gonna be

143
00:08:48,055 --> 00:08:51,655
more similar than words that don't co occur in some other given document

144
00:08:51,655 --> 00:08:55,255
space. This is rooted in something called the

145
00:08:55,255 --> 00:08:58,870
distributional hypothesis, which is basically this idea and the other

146
00:08:58,870 --> 00:09:02,470
idea that, concepts cluster in in this type of

147
00:09:02,470 --> 00:09:06,230
space. So what what does that mean actually? Right? So if you have the word

148
00:09:06,230 --> 00:09:09,990
like hot dog, it's probably gonna be seen in a corpus that's

149
00:09:09,990 --> 00:09:13,704
near other food related words than it would be if you picked some

150
00:09:13,704 --> 00:09:17,305
other word like space or moon. And there's something we can

151
00:09:17,305 --> 00:09:20,824
learn from that relationship to infer the meaning of what that word

152
00:09:20,824 --> 00:09:24,345
is and how we can use that meaning of that word to learn about what

153
00:09:24,345 --> 00:09:27,850
other words are doing. So this is kind of, like, the theoretical

154
00:09:28,070 --> 00:09:31,850
basis of, like, why we can represent words geometrically,

155
00:09:32,630 --> 00:09:35,990
with a little bit of hand waving. But that's kind of the core idea.
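
A small sketch of the distributional idea described here: count which words appear in the same sentence, then compare the neighbors of two words. The toy corpus and the `shared_neighbors` helper are assumptions for illustration, not a real embedding method.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Tiny made-up corpus: three food sentences, two space sentences
corpus = [
    "hot dog with mustard and fries",
    "burger and fries with ketchup",
    "hot dog and burger stand",
    "rocket launch to the moon",
    "the moon orbits in space",
]

# cooc[w1][w2] = number of sentences in which w1 and w2 both appear
cooc = defaultdict(Counter)
for sentence in corpus:
    words = set(sentence.split())
    for w1, w2 in combinations(sorted(words), 2):
        cooc[w1][w2] += 1
        cooc[w2][w1] += 1

def shared_neighbors(a, b):
    """Words that co-occur with both a and b somewhere in the corpus."""
    return set(cooc[a]) & set(cooc[b])

print(shared_neighbors("dog", "burger"))  # food words share several neighbors
print(shared_neighbors("dog", "moon"))    # food and space words share none here
```

Words that keep similar company end up with overlapping neighbor sets, which is the intuition behind placing them close together in a geometric space.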

156
00:09:35,990 --> 00:09:39,575
And attention kind of takes this a little further by allowing the

157
00:09:39,575 --> 00:09:43,255
representation of these tokens or words to be altered based

158
00:09:43,255 --> 00:09:47,014
on the words that occur in a given sentence. So you might have a

159
00:09:47,014 --> 00:09:50,855
word like does, like, does this mean something?

160
00:09:50,855 --> 00:09:54,370
You might say something like that. Or you might say, I saw some

161
00:09:54,370 --> 00:09:58,050
does in the forest. Both spelled exactly the same, but have

162
00:09:58,050 --> 00:10:01,490
completely different meanings based on their context. And if you used a

163
00:10:01,490 --> 00:10:05,170
traditional, maybe, bag of words model where you're just counting the

164
00:10:05,170 --> 00:10:08,995
words that occur in a given document and kind of creating a representation of what

165
00:10:08,995 --> 00:10:12,435
that document looks like based on the words that are composed in there, you're gonna

166
00:10:12,435 --> 00:10:16,135
overlap and conflict with the meaning of the word

167
00:10:16,195 --> 00:10:19,875
does and does because they're spelled exactly the same. They might look

168
00:10:19,875 --> 00:10:23,360
exactly the same with this type of representation. But if you have a way of

169
00:10:23,360 --> 00:10:27,200
informing what that word means with its context, which is what attention

170
00:10:27,200 --> 00:10:30,560
allows us to do, then you can completely change how that's being

171
00:10:30,560 --> 00:10:34,400
represented in your downstream system, which allows you to do interesting things

172
00:10:34,400 --> 00:10:38,045
with search. So that's kind of, like, the biggest benefit that's coming out of

173
00:10:38,045 --> 00:10:41,805
that type of methodology, and that kinda enables what is now known as

174
00:10:41,805 --> 00:10:45,485
semantic search and retrieval augmented generation and so on and so forth. I was gonna

175
00:10:45,485 --> 00:10:49,165
say, that sounds very it's almost like it was, like, the old pre

176
00:10:50,090 --> 00:10:53,770
that era, the vectorization of this and the distance in

177
00:10:53,770 --> 00:10:57,530
that vector in that geometric space. I guess

178
00:10:57,530 --> 00:11:00,730
we've been doing that for a lot longer than most people realize in in a

179
00:11:00,730 --> 00:11:03,070
sense. Yeah. I mean,

180
00:11:04,375 --> 00:11:07,895
looking through, indexes or document stores with some sort of

181
00:11:07,895 --> 00:11:11,275
vectorization has been,

182
00:11:12,214 --> 00:11:16,055
something that people have done, except instead of being dense vectors, which is, like,

183
00:11:16,055 --> 00:11:19,880
you have some fixed size representation that isn't necessarily interpretable

184
00:11:19,940 --> 00:11:23,620
to the human eye for some given query or document, it would

185
00:11:23,620 --> 00:11:27,380
be, like, the size of your vocabulary. So you think of, like, Wikipedia. You

186
00:11:27,380 --> 00:11:31,060
can find, like, every unique word on Wikipedia, and, like, that is gonna be how

187
00:11:31,060 --> 00:11:34,135
big your vector's gonna be. And every time you have a new document come in,

188
00:11:34,135 --> 00:11:37,975
a new article, somebody's kind of, like, wrote up and published to Wikipedia, like, you're

189
00:11:37,975 --> 00:11:41,654
representing that in terms of its vocabulary. But now instead of doing that, we

190
00:11:41,654 --> 00:11:45,495
have, like, this magical fixed sized box that allows us

191
00:11:45,495 --> 00:11:48,795
to represent chunks of text in a way that is

192
00:11:49,160 --> 00:11:52,840
extremely fascinating and abstract. And every time I think about it, it just, like, blows

193
00:11:52,840 --> 00:11:56,120
my mind, but that's kind of, like, the main kind of difference is the way

194
00:11:56,120 --> 00:11:59,560
we're representing that information and how compact that is and

195
00:11:59,560 --> 00:12:03,195
generalizable it has become. Yeah. That is, like, it it's almost

196
00:12:03,195 --> 00:12:06,955
like you're, you know correct me if I'm wrong, but, you know,

197
00:12:06,955 --> 00:12:10,475
creating these vectors, these large vector databases, right, with, you

198
00:12:10,475 --> 00:12:14,235
know, 10, 12,000 dimensions, right, of how these words

199
00:12:14,235 --> 00:12:16,255
are measured in relationship to others.

200
00:12:17,860 --> 00:12:21,540
It's almost as a consequence of training a large language

201
00:12:21,540 --> 00:12:24,740
model, you create a knowledge graph. Is that is that true? Is that really the

202
00:12:24,740 --> 00:12:28,500
case where, you know, like, you know, dog is most likely to be

203
00:12:28,500 --> 00:12:32,305
next to, you know, the word pet, you know, or

204
00:12:32,305 --> 00:12:35,904
it has the same distance. Is that I'm not

205
00:12:35,904 --> 00:12:39,585
explaining it right. No. No. No. You're you're on you're on the right track exactly.

206
00:12:39,585 --> 00:12:42,805
And I think this is, like, one of the most fascinating qualities

207
00:12:43,105 --> 00:12:46,730
of even, like, what people would consider, like, older

208
00:12:46,790 --> 00:12:50,470
embedding models is this idea that you can take, like, a training task that

209
00:12:50,470 --> 00:12:54,250
seems completely unrelated to the quality that you want in a downstream model,

210
00:12:54,630 --> 00:12:58,455
and it turns out that that actually achieves that quality. So, what you were referring

211
00:12:58,455 --> 00:13:02,214
to, Frank, is this idea that you might have, like, a sentence. You

212
00:13:02,214 --> 00:13:05,975
might have, like, I took my dog out on a walk, and you might say,

213
00:13:05,975 --> 00:13:09,575
okay. I'm gonna remove the word, walk, and I'm gonna have

214
00:13:09,735 --> 00:13:13,560
I'm gonna train some model that tries to predict what that word

215
00:13:13,560 --> 00:13:17,240
where I removed was. This is masked language modeling, which is this idea that you're

216
00:13:17,240 --> 00:13:20,040
kind of getting at of, like, okay, what are the words and how are they

217
00:13:20,040 --> 00:13:23,240
in relation to the other words in that sentence? And it turns out that if

218
00:13:23,240 --> 00:13:26,920
you, like, do this with, like, hundreds of thousands or millions of sentences and

219
00:13:26,920 --> 00:13:30,685
words, in some corpus that is somewhat representative of

220
00:13:30,685 --> 00:13:34,525
how people, use human language, you can

221
00:13:34,525 --> 00:13:37,565
you will get really good at this task, number 1, because you're training the

222
00:13:37,565 --> 00:13:41,405
model on that task exactly. But if you are training a neural

223
00:13:41,405 --> 00:13:45,040
network on that model, some intermediate layer representation

224
00:13:45,180 --> 00:13:48,940
in that model so somewhere in that set of matrix

225
00:13:48,940 --> 00:13:52,540
multiplications where you're turning this input sentence into some fixed size

226
00:13:52,540 --> 00:13:55,920
vector representation is gonna be a good representation

227
00:13:56,300 --> 00:13:59,685
of what that word or that token or that sentence is going to be.
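
The masking setup described here can be sketched as follows. This only builds the (masked sentence, hidden word) training pairs; it does not train a model. Masking every word in turn is a simplification assumed for illustration (real setups typically mask a random subset of tokens), and `masked_examples` and the `[MASK]` string are illustrative names.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Yield (masked sentence, hidden word) pairs, masking one word at a time."""
    words = sentence.split()
    for i, target in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        yield " ".join(masked), target

# Each pair is one training example: the model sees the masked text
# and is asked to predict the hidden word from its context.
for masked, target in masked_examples("I took my dog out on a walk"):
    print(f"{masked!r} -> {target!r}")
```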

228
00:14:00,465 --> 00:14:03,605
And the fact that that works is not intuitive. Right?

229
00:14:04,145 --> 00:14:07,185
The the fact that that works has been shown empirically, and it turns out that

230
00:14:07,185 --> 00:14:10,405
we can kind of do that and kind of have these models work really well.

231
00:14:10,490 --> 00:14:13,690
And nowadays, in addition to kind of doing that, which is what we would consider

232
00:14:13,690 --> 00:14:17,370
pretraining on some large corpus, we now fine tune those

233
00:14:17,370 --> 00:14:21,050
embedding models on specific tasks that are important to us

234
00:14:21,050 --> 00:14:24,765
for retrieval. Like, okay, we have this query or question we're

235
00:14:24,765 --> 00:14:28,365
asking. We have the set of documents that might answer this question or might

236
00:14:28,365 --> 00:14:31,965
not. We want a model that makes it so that the query's embedding and the

237
00:14:31,965 --> 00:14:35,565
document relevance embeddings are in the same vector space. So you're on the right track.

238
00:14:35,565 --> 00:14:39,240
That's, like, basically how these models are able to learn these things. I don't know

239
00:14:39,240 --> 00:14:43,080
if I would call them, graph representation, maybe a little bit

240
00:14:43,080 --> 00:14:46,920
of, being pedantic on, like, use of words there because that can

241
00:14:46,920 --> 00:14:50,300
be a little bit, different how how you're organizing that information.

242
00:14:50,760 --> 00:14:54,065
But you can make the argument that the way that these large language models are

243
00:14:54,065 --> 00:14:57,825
representing information is a compressed form of, like, the giant dataset that they're

244
00:14:57,825 --> 00:15:01,585
trained on. And we don't actually know exactly, like, where that

245
00:15:01,585 --> 00:15:05,185
information lies inside that neural network. There's some research that's,

246
00:15:05,185 --> 00:15:08,430
like, trying to get at answering that question, but you could, for the sake of

247
00:15:08,430 --> 00:15:12,190
argument, be like, yeah. There's probably, like, a dog

248
00:15:12,190 --> 00:15:15,949
node somewhere in this neural network that knows a ton about dogs, and that's how

249
00:15:15,949 --> 00:15:18,910
we're able to kind of learn this information. That is the stuff that we don't

250
00:15:18,910 --> 00:15:22,665
exactly know. Interesting. Because, there was a really good

251
00:15:22,665 --> 00:15:26,025
video by 3Blue1Brown, which you probably are I love that

252
00:15:26,025 --> 00:15:29,705
channel. Where he gives examples where, you know, famous historical

253
00:15:29,705 --> 00:15:33,325
leaders from Britain have the same distance

254
00:15:33,385 --> 00:15:36,910
from you change the country to Italy

255
00:15:37,290 --> 00:15:41,050
or the United States have the same kind of distance. So you can kind

256
00:15:41,050 --> 00:15:44,730
of infer I'm not saying that the AI it

257
00:15:44,730 --> 00:15:48,190
almost seems like this knowledge graph is also is also a byproduct

258
00:15:48,330 --> 00:15:51,405
of building this out. Like, there's some

259
00:15:51,865 --> 00:15:55,625
type of encoding or semantic, I guess, is this is really what it is. Right?

260
00:15:55,625 --> 00:15:59,465
Like, that that you get with it. And, I wanna get

261
00:15:59,465 --> 00:16:03,240
your thoughts because yesterday, I I caught the part the

262
00:16:03,240 --> 00:16:06,780
first half of the Jensen Huang keynote at CES,

263
00:16:07,320 --> 00:16:10,520
which this you know, we're recording this on January 8th. Right? And one of the

264
00:16:10,520 --> 00:16:13,880
things that the video starts off with is, you know, the idea

265
00:16:13,880 --> 00:16:17,560
that tokens are kind of fundamental elements of

266
00:16:17,560 --> 00:16:21,295
knowledge. And I did a live stream where I'm like, well, I never really thought

267
00:16:21,295 --> 00:16:24,654
about it this way. Right? They're they're building blocks of knowledge or the pixels, if

268
00:16:24,654 --> 00:16:28,415
you will, of knowledge. And I wanted to get your

269
00:16:28,415 --> 00:16:32,115
thoughts on that because, like, that kind of blew my mind and maybe I'm simple.

270
00:16:32,130 --> 00:16:35,410
I don't know. Maybe I'm not. But it all it seems like we've been kinda

271
00:16:35,410 --> 00:16:38,850
dancing around this idea where and now NVIDIA is really

272
00:16:38,850 --> 00:16:42,690
fully, you know, going all in on this, the idea that, you know,

273
00:16:42,690 --> 00:16:46,530
these are not, this isn't an AI system. It's a token factory

274
00:16:46,530 --> 00:16:49,795
or a token score. What are your what are your thoughts on that? I'm curious.

275
00:16:50,495 --> 00:16:54,334
So when I started learning about how, like, tokenization works

276
00:16:54,334 --> 00:16:57,855
and how we're able to kind of, like, basically build these

277
00:16:57,855 --> 00:17:00,595
models without having massive, massive vocabularies,

278
00:17:01,740 --> 00:17:05,520
it is pretty

279
00:17:05,660 --> 00:17:08,560
interesting to be, like, okay. Like, maybe maybe there's some,

280
00:17:10,140 --> 00:17:13,900
abstract notion of information that each token has that

281
00:17:13,900 --> 00:17:17,735
is being that is what the model is learning during training time. And then

282
00:17:17,735 --> 00:17:21,515
we're just combining these sets of information in order to kind of, like, understand

283
00:17:21,815 --> 00:17:24,855
what words mean or what documents mean, so on and so forth. Because when you

284
00:17:24,855 --> 00:17:28,695
look at how, tokenizers work and the size of the number of

285
00:17:28,695 --> 00:17:31,835
tokens for, like, maybe the English language or maybe, like, a really multilingual

286
00:17:32,100 --> 00:17:35,780
model like RoBERTa or multilingual-e5-large, they're a lot

287
00:17:35,780 --> 00:17:39,400
less than you would expect. Like, it's on the order of, like, maybe 100,000,

288
00:17:39,940 --> 00:17:42,840
200,000, 300,000 tokens.

289
00:17:43,539 --> 00:17:47,115
So it is kind of

290
00:17:47,115 --> 00:17:50,794
odd to think about whether those tokens

291
00:17:50,794 --> 00:17:54,554
themselves hold information that's readily interpretable for us. But I

292
00:17:54,554 --> 00:17:57,990
think that we've gotten so far with using

293
00:17:57,990 --> 00:18:01,830
systems that are just combining, the operations on top of

294
00:18:01,830 --> 00:18:05,270
these tokens in order to retrieve the information that these systems have learned, that there's

295
00:18:05,270 --> 00:18:08,950
definitely something important there. And I would love to, like, know

296
00:18:08,950 --> 00:18:12,535
exactly, like, what is happening when we're able to do that. The the

297
00:18:12,535 --> 00:18:16,375
heuristic that I like to use is, large

298
00:18:16,375 --> 00:18:20,215
language models are generally reflections of the training datasets that they've been trained on,

299
00:18:20,215 --> 00:18:23,735
and they're basically creating, like, really efficient indexes over that

300
00:18:23,735 --> 00:18:27,440
information. And sometimes those indices hallucinate. And the reason

301
00:18:27,440 --> 00:18:31,280
why is because when we ask, quote, unquote,

302
00:18:31,520 --> 00:18:35,360
a question to a large language model or query a large language model, we

303
00:18:35,360 --> 00:18:39,059
are kind of conditioning that model, on a probability

304
00:18:39,200 --> 00:18:42,875
space where every token being generated after is

305
00:18:42,875 --> 00:18:46,475
likely to exist given the query or the context or whatever we're passing to

306
00:18:46,475 --> 00:18:50,235
it. And once you think about it that way, then it just feels like

307
00:18:50,235 --> 00:18:53,820
instead of thinking about what each of the tokens are doing, you're kind of just

308
00:18:54,059 --> 00:18:57,820
querying what the model has been trained on and what it will tell you

309
00:18:57,820 --> 00:19:01,039
based on what it, quote unquote, learned or knows.
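
The idea of conditioning on what came before can be sketched with a toy bigram model: next-token probabilities are just normalized counts from the training text. Real language models condition on far more context with a neural network; the training text and helper below are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Made-up training text; the "model" can only reflect what is in here
training_text = (
    "there once was a unicorn in the forest . "
    "there once was a dragon in the cave . "
    "there once was a unicorn in the meadow ."
)

# counts[prev][nxt] = how often nxt followed prev in the training text
counts = defaultdict(Counter)
tokens = training_text.split()
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Probability of each next token, conditioned on the previous token."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(next_token_probs("a"))
```

Asking for the token after "a" returns only continuations that appeared in the training text, weighted by how often they appeared, which mirrors the heuristic of querying what the model was trained on.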

310
00:19:01,580 --> 00:19:04,460
And then you can kind of run with that metaphor a lot and build systems

311
00:19:04,460 --> 00:19:08,255
on top of that. That seems much more actionable than thinking about,

312
00:19:08,255 --> 00:19:11,135
like, what each of the tokens are doing individually. Does that kinda make sense? No.

313
00:19:11,135 --> 00:19:13,455
That makes a lot of sense. I think the whole gestalt of it is what

314
00:19:13,455 --> 00:19:16,895
really makes it magical. Right? Like Yeah. You know, you can you

315
00:19:16,895 --> 00:19:20,580
can obviously, I I don't this is not this is not, like, the newest iPhone

316
00:19:20,580 --> 00:19:23,960
or whatever. But, you know, if you go through the the text auto complete,

317
00:19:24,740 --> 00:19:28,340
you can maybe make a sentence that sounds like

318
00:19:28,340 --> 00:19:32,155
something you would write. But much beyond that, it starts getting weird. And

319
00:19:32,155 --> 00:19:35,535
early generative AI was very much like that, particularly the images.

320
00:19:35,915 --> 00:19:39,455
Well, you know Don't like, yes. A 100%

321
00:19:39,515 --> 00:19:43,115
understand. I started learning about generative, text

322
00:19:43,115 --> 00:19:46,840
generation before we had instruction fine-tuned models. So are you

323
00:19:46,840 --> 00:19:50,520
familiar with, like, the concept of instruction fine tuning, Frank? I think I am,

324
00:19:50,520 --> 00:19:53,880
but IBM slash Red Hat defines it one way. I would like to get

325
00:19:53,880 --> 00:19:57,720
your opinion. Yeah. So, this is the idea that

326
00:19:57,720 --> 00:20:01,195
you can train or fine tune large language models to follow

327
00:20:01,195 --> 00:20:04,955
instructions to complete tasks. So, before we had,

328
00:20:04,955 --> 00:20:08,715
like, models that could that we could just, like, ask questions of and just, like,

329
00:20:08,715 --> 00:20:12,495
receive answers directly, you had to craft text

330
00:20:13,110 --> 00:20:16,710
that would increase the probability that the document that you want to

331
00:20:16,710 --> 00:20:20,470
generate would happen. So if you wanted a story about, like, unicorns or something,

332
00:20:20,470 --> 00:20:24,070
you would have to start your query to the LLM as there

333
00:20:24,070 --> 00:20:27,190
once was, like, a set of unicorns living in the forest. Blah blah blah blah.

334
00:20:27,190 --> 00:20:30,415
And then it would just, like, complete the sentence, just like a fancy version of autocomplete.

335
00:20:30,875 --> 00:20:34,555
Right. And that that's kind of, like, what we used to have, and that was

336
00:20:34,555 --> 00:20:37,995
pretty hard to work with. And then once researchers kinda cracked, like, wait a second.

337
00:20:37,995 --> 00:20:41,755
We can create a dataset of, like, instruction pairs and, like, document

338
00:20:41,755 --> 00:20:45,080
sets and fine tune models on them. And it turns out now we can just,

339
00:20:45,080 --> 00:20:48,920
like, ask models to do things, and they will do them. Whether or not

340
00:20:48,920 --> 00:20:52,200
those are correct is kind of the next part of the story. But getting to

341
00:20:52,200 --> 00:20:55,180
that point, it was, like, pretty interesting and pretty significant.

342
00:20:56,115 --> 00:20:59,575
Interesting. Interesting. When I think of

343
00:20:59,955 --> 00:21:03,635
fine tuning, I think of I think of

344
00:21:03,635 --> 00:21:07,235
primarily InstructLab, where you basically kinda have a

345
00:21:07,235 --> 00:21:10,740
LoRA layer on top of the base LLM doing

346
00:21:10,740 --> 00:21:14,420
that. Is that the same thing? Or is it kind of slightly

347
00:21:14,580 --> 00:21:18,260
it sounds like it's slightly nuanced. So the nuance there

348
00:21:18,260 --> 00:21:22,100
is that, one, the methodology that I'm

349
00:21:22,100 --> 00:21:25,945
describing is mostly dataset driven. So you have, like, your original LLM,

350
00:21:26,005 --> 00:21:29,845
and then you have, like, a new dataset that allows the LLM to learn a

351
00:21:29,845 --> 00:21:33,605
specific task. Or in this case, like, a generalized form of tasks,

352
00:21:33,605 --> 00:21:37,390
which is you have instruction, answer, user query,

353
00:21:37,390 --> 00:21:41,150
give it an instruction. Whereas in your case, you're kind of, like, adding another layer

354
00:21:41,150 --> 00:21:44,830
to the LLM and, like, forcing the LLM to learn all the new

355
00:21:44,830 --> 00:21:48,210
methodology inside that layer in order to accomplish a specific

356
00:21:48,270 --> 00:21:52,115
task. So that's kind of like what fine tuning ends up doing. So the other

357
00:21:52,115 --> 00:21:55,475
way there's multiple ways to do this, it seems. Right? Like, there's that way

358
00:21:55,475 --> 00:21:58,775
we add the layer, but there's also kind of I hate the term prompt engineering

359
00:21:58,915 --> 00:22:02,755
because it's just so overblown. But, like, giving it

360
00:22:02,755 --> 00:22:06,559
more context and samples. And now that the token context

361
00:22:06,559 --> 00:22:10,320
window is large enough that you don't have to be well, if you wanna

362
00:22:10,320 --> 00:22:12,799
save money, you have to be very mindful of that. But if you're running it

363
00:22:12,799 --> 00:22:16,480
locally, like, doesn't really matter. Well, you could give it an example of

364
00:22:16,639 --> 00:22:19,905
let's just say you had I'm trying to think of a short story or a

365
00:22:19,905 --> 00:22:23,365
novel. I don't know. Let's pretend,

366
00:22:23,905 --> 00:22:27,745
Moby Dick was only 100 pages. Right? I

367
00:22:27,745 --> 00:22:30,785
could give it that as part of the prompt. Let's say write a sequel

368
00:22:30,785 --> 00:22:34,580
to this book based on what happens in this one. Is that what you're talking

369
00:22:34,580 --> 00:22:38,260
about? Where you're kinda giving an example as part of the prompt? Or is there

370
00:22:38,260 --> 00:22:41,779
some and not part of the layer? Or some combination thereof? Or was some third

371
00:22:41,779 --> 00:22:45,525
thing entirely? So this would be like, what

372
00:22:45,525 --> 00:22:49,365
you're describing is more like few shot learning, which is you gave kind of an

373
00:22:49,365 --> 00:22:53,045
example, and then you're, like, okay. Like, given these examples, can you do this other

374
00:22:53,045 --> 00:22:56,885
task that I've described on this unseen example? What I'm describing is

375
00:22:56,885 --> 00:23:00,289
kind of, like, slightly before that. So, like, before we had the ability to, like,

376
00:23:00,289 --> 00:23:03,750
give models examples, we had to, like, give them we had to

377
00:23:03,809 --> 00:23:07,570
create the ability to follow instructions. And then once you have the ability to

378
00:23:07,570 --> 00:23:11,155
follow instructions, you can be like, okay. Here are the instructions. Here's

379
00:23:11,155 --> 00:23:14,615
examples of correctly completing the instruction, now do the instruction.

380
00:23:14,995 --> 00:23:18,355
And that is the reason why that happens in that order is

381
00:23:18,355 --> 00:23:21,795
because first, you have, like, just, like, sequence completion, like,

382
00:23:21,795 --> 00:23:25,395
autocomplete. Then you have, like, okay, given this

383
00:23:25,395 --> 00:23:29,120
task given this set of instructions, just follow the instruction instead of,

384
00:23:29,120 --> 00:23:32,800
like, trying to do autocomplete. And then you have, okay, now you know how to

385
00:23:32,800 --> 00:23:36,560
follow instructions. I'm gonna give you a few data points in order to

386
00:23:36,560 --> 00:23:40,160
learn a new task. Now do this new task. So you're kind of,

387
00:23:40,160 --> 00:23:43,955
like, moving from a situation where you need tons and tons

388
00:23:43,955 --> 00:23:47,635
of data just to get the sequence completion. And then you need

389
00:23:47,635 --> 00:23:51,095
a smaller set of data to, like, get the capability to follow instructions.

390
00:23:51,555 --> 00:23:55,320
And then you need a very, very, very small amount of data, like,

391
00:23:55,320 --> 00:23:59,160
maybe 3 points or 10 examples or 15 examples to complete kind of, like,

392
00:23:59,160 --> 00:24:02,760
a new task. So there's a lot of kind of nuance in, like, how

393
00:24:02,760 --> 00:24:06,120
modern LLMs are being used and how they're kind of trained and fine tuned, so

394
00:24:06,120 --> 00:24:09,559
on and so forth. And I think there's a lot of, like,

395
00:24:09,559 --> 00:24:13,135
importance in, like, learning what happened kind of

396
00:24:13,135 --> 00:24:16,975
before because the advancements have happened so quickly. It can be really hard to kind

397
00:24:16,975 --> 00:24:20,815
of differentiate, or, like, oh, why do models perform like this? Why

398
00:24:20,815 --> 00:24:24,429
do things kind of happen like that? And even though, prompt

399
00:24:24,429 --> 00:24:28,190
engineering has kind of, like, let's say, traveled through the

400
00:24:28,190 --> 00:24:31,230
hype cycle where people were, like, really excited about it, and then were, like, this

401
00:24:31,230 --> 00:24:34,830
is not actually that interesting. Right. What's interesting is that,

402
00:24:34,990 --> 00:24:38,755
to build a good RAG system, or retrieval augmented generation system,

403
00:24:38,895 --> 00:24:42,195
you really need to be good at prompt engineering in a sense

404
00:24:42,415 --> 00:24:45,635
because you're assembling the correct context for this model

405
00:24:45,855 --> 00:24:49,230
to answer some downstream question. And it's not

406
00:24:49,230 --> 00:24:52,910
intuitive how to assemble that context. So understanding, like, how these

407
00:24:52,910 --> 00:24:56,750
models are trained, like, whether they can follow instructions, how good they are at

408
00:24:56,750 --> 00:25:00,555
doing so, how many examples of information they need in order to accomplish some task

409
00:25:00,875 --> 00:25:04,395
really affects how you build that knowledge base in order to help the

410
00:25:04,395 --> 00:25:07,535
model do some sort of new thing. Interesting.

411
00:25:09,435 --> 00:25:12,895
So RAG is obviously all the rage now.

412
00:25:13,240 --> 00:25:17,080
Yep. But there's also a relatively new because this

413
00:25:17,080 --> 00:25:20,840
space changes rapidly. Like, I mean, I took 2 weeks off in December, and

414
00:25:20,840 --> 00:25:24,380
I feel completely disconnected from the cutting edge, you know.

415
00:25:25,000 --> 00:25:28,655
Because when I was watching the keynote from CES, and I'm like, wow. That's

416
00:25:28,655 --> 00:25:32,095
really cool. And I was texting, you know, slacking with a coworker, and he goes,

417
00:25:32,095 --> 00:25:35,455
oh, no. This is a retread of their, like, last keynote they did. Like

418
00:25:35,935 --> 00:25:39,775
and I'm like, okay. Wow. Blink and you missed

419
00:25:39,775 --> 00:25:43,220
something. So what

420
00:25:43,220 --> 00:25:46,980
you're describing the fine tuning, is that really what RAFT is, where the

421
00:25:46,980 --> 00:25:50,820
idea that you have kind of retrieval augmented fine tuning, which I think is what

422
00:25:50,820 --> 00:25:54,595
the acronym stands for. Is that not I'm

423
00:25:54,595 --> 00:25:58,315
not familiar with how RAFT works. So I don't wanna, like, kind of venture

424
00:25:58,315 --> 00:26:01,875
a guess without knowing what it is. But do you remember, like, what context

425
00:26:01,875 --> 00:26:04,695
you encountered this in? Basically, it's the idea that

426
00:26:06,290 --> 00:26:10,049
it's the idea that you can fine tune the results. Sounds very

427
00:26:10,049 --> 00:26:13,190
similar to what you're doing, and I haven't read the paper in a while.

428
00:26:14,850 --> 00:26:17,745
Back when I was a Microsoft MVP, like, you know,

429
00:26:18,625 --> 00:26:22,465
they had a Microsoft Research had the thing for their calls, and they

430
00:26:22,465 --> 00:26:26,005
were all raving about it. The paper had just come out and things like that.

431
00:26:26,625 --> 00:26:29,925
It's the idea that you can kind of give it pretrained examples.

432
00:26:30,630 --> 00:26:33,910
You start with a base LLM, and you give it pretrained examples, and then

433
00:26:33,910 --> 00:26:37,750
you add on top of just the retrieval

434
00:26:37,750 --> 00:26:41,350
augmented portion of it. It's very similar, not to

435
00:26:41,350 --> 00:26:43,990
plug my you know, for my day job. I work at Red Hat. That's why

436
00:26:43,990 --> 00:26:47,695
there's a fedora there. We have a product called RHEL

437
00:26:47,695 --> 00:26:51,135
AI, which is based on an upstream open source project called InstructLab.

438
00:26:51,135 --> 00:26:54,815
And it's a similar idea in that you

439
00:26:54,815 --> 00:26:58,035
basically give it a set of data.

440
00:26:58,580 --> 00:27:01,780
And then there's a little more to it because there's a

441
00:27:01,780 --> 00:27:05,400
teacher model and synthetic data generation.

442
00:27:05,860 --> 00:27:09,160
So you can start with a modest document set.

443
00:27:10,180 --> 00:27:13,875
And based on how the questions and answers that you

444
00:27:13,875 --> 00:27:15,174
form and the,

445
00:27:17,875 --> 00:27:21,015
the taxonomy that you attach to it, it will

446
00:27:21,715 --> 00:27:25,255
create a LoRA layer on top of an existing LLM.

447
00:27:26,120 --> 00:27:29,960
And it could be that it's not quite exactly the same as

448
00:27:29,960 --> 00:27:33,320
RAFT, but it's definitely in the same direction. Same thing as, like, BERT, ELMo,

449
00:27:33,320 --> 00:27:36,540
and, you know, RoBERTa, which, I think

450
00:27:37,400 --> 00:27:40,645
I think I understand. So it's kind of like you so the I think the

451
00:27:40,645 --> 00:27:44,325
problem that it might be addressing is kind of just really similar to the problem that

452
00:27:44,325 --> 00:27:48,085
traditional RAG tries to address, except in a more kind of deliberate fashion

453
00:27:48,245 --> 00:27:51,925
Exactly. Yeah. Where you have some document store internally. Like, let's say we

454
00:27:51,925 --> 00:27:55,465
both work at some company, and we have a giant customer support document store.

455
00:27:55,710 --> 00:27:59,150
You take some LLM off the shelf. It's not necessarily gonna know the

456
00:27:59,150 --> 00:28:02,909
contents of your internal kind of documents. So how can you get

457
00:28:02,909 --> 00:28:06,669
it to, like, successfully help answer tickets or triage tickets that

458
00:28:06,669 --> 00:28:10,365
you're trying to build, so that you can answer, like, most difficult tickets and

459
00:28:10,365 --> 00:28:13,965
kind of work toward that. In this situation, maybe you

460
00:28:13,965 --> 00:28:17,405
want to, inject some of the knowledge of

461
00:28:17,405 --> 00:28:21,005
the documents in addition to having the

462
00:28:21,005 --> 00:28:24,760
model being able to search over the document store. So maybe, like, the what this

463
00:28:24,760 --> 00:28:28,280
LoRA layer is doing is, like, absorbing Yeah. Some of the knowledge from the

464
00:28:28,280 --> 00:28:31,500
document store so that you can kind of more

465
00:28:32,120 --> 00:28:35,815
efficiently query the database and so

466
00:28:35,815 --> 00:28:39,195
that you don't have to, like, query it all the time. The only,

467
00:28:39,655 --> 00:28:43,255
issue, quote, unquote, I'd have with that method is that you'd have to, like, keep

468
00:28:43,255 --> 00:28:47,015
that updated from time to time, and that's, like, not that's nontrivial. Whereas

469
00:28:47,015 --> 00:28:50,679
if you just do, like, traditional RAG, you just need to

470
00:28:50,679 --> 00:28:54,200
update your vector store, and then you can just have the model

471
00:28:54,200 --> 00:28:57,559
query that new information when you need to. But, you know, it's always best to

472
00:28:57,559 --> 00:29:01,019
use whatever solution works best for your given use case.

473
00:29:01,455 --> 00:29:04,895
And experimenting with different use cases is always really important. But I imagine that's, like,

474
00:29:04,895 --> 00:29:08,195
kind of what that is trying to address, which is the That is basically it.

475
00:29:08,255 --> 00:29:11,375
The I, you know, I don't wanna go down that rabbit hole of that. But

476
00:29:11,375 --> 00:29:15,150
but, basically, the idea is that, if

477
00:29:15,150 --> 00:29:18,590
you train an LLM or you have a layer on top of an

478
00:29:18,590 --> 00:29:22,270
LLM that not only does retrieval from a source document

479
00:29:22,270 --> 00:29:25,950
store. Right? I think that's a pretty set pattern. But it also has a

480
00:29:25,950 --> 00:29:29,170
better understanding of your business, your industry, the jargon.

481
00:29:29,644 --> 00:29:33,404
Right. Right. Blah blah blah. Right? The idea is that the retrieval success

482
00:29:33,404 --> 00:29:37,184
rate will be higher. Now we're not publishing the numbers yet,

483
00:29:37,485 --> 00:29:41,085
but the research is still ongoing. But basically, it's a

484
00:29:41,085 --> 00:29:44,799
pretty substantial from what I've seen well, I haven't

485
00:29:44,799 --> 00:29:47,679
seen the actual numbers yet, but from what I've been told those numbers are by

486
00:29:47,679 --> 00:29:51,380
the researcher, that it is a it is a substantial improvement

487
00:29:51,440 --> 00:29:55,065
that the juice is worth the squeeze in that regard.

488
00:29:55,784 --> 00:29:59,544
You're not and it's also computationally, you're not quite training the

489
00:29:59,544 --> 00:30:03,225
whole thing again. You're just kinda putting a new Instagram filter, so to

490
00:30:03,225 --> 00:30:06,664
speak, together on top of the base. So it definitely

491
00:30:06,664 --> 00:30:10,230
does it definitely does some things. Now when we get the hard

492
00:30:10,230 --> 00:30:13,990
numbers, then, you know, I mean, I can

493
00:30:13,990 --> 00:30:17,669
say them publicly, then I think we'll know what

494
00:30:17,669 --> 00:30:21,450
the squeeze to juice ratio is.

495
00:30:22,325 --> 00:30:26,085
But, I can confidently say publicly now, like, there's a there

496
00:30:26,085 --> 00:30:29,685
there. Yeah. And, you know, we'll have those numbers soon

497
00:30:29,685 --> 00:30:33,445
enough. But it's it's interesting because you're right. I mean, this paper

498
00:30:33,445 --> 00:30:37,110
came out in 2019. Right? There was just an

499
00:30:37,110 --> 00:30:40,810
explosion of these different mechanisms. You mentioned BERT. You mentioned RoBERTa.

500
00:30:41,030 --> 00:30:44,490
Fun fact, my wife's name is Roberta. So that was kind of fun.

501
00:30:45,110 --> 00:30:48,950
There was ELMo. There was ERNIE. There was a whole Sesame

502
00:30:48,950 --> 00:30:52,545
Street themed zoo of model

503
00:30:52,545 --> 00:30:56,304
types. That seems to have kind of that branching out of

504
00:30:56,304 --> 00:31:00,145
those different directions has seemed to have stalled, and we're going into more of

505
00:31:00,145 --> 00:31:03,985
these retrieval augmented generation systems. So for those who because

506
00:31:03,985 --> 00:31:07,590
not all of our listeners know exactly what retrieval

507
00:31:07,590 --> 00:31:11,110
augmented systems are. Could you give kind of a

508
00:31:11,110 --> 00:31:14,890
level 200 elevator explanation? Sure.

509
00:31:15,190 --> 00:31:18,970
So, when you speak to a modern chatbot,

510
00:31:19,635 --> 00:31:23,395
what's happening is that they've learned information through their pretraining

511
00:31:23,395 --> 00:31:27,095
processes, on a large corpus of basically the entire Internet,

512
00:31:27,715 --> 00:31:31,410
and are generating information based on the query that you're passing in.

513
00:31:31,970 --> 00:31:35,110
The problem that often occurs is that

514
00:31:35,570 --> 00:31:39,330
these AI models might err, and the error could

515
00:31:39,330 --> 00:31:42,770
be making information up that doesn't

516
00:31:42,770 --> 00:31:46,425
exist. For example, if a model is trained before a period of time,

517
00:31:46,425 --> 00:31:49,065
like, it might not know about that period of time, which happens more

518
00:31:49,065 --> 00:31:52,665
often than you think. The information could be false, untruthful, or it could

519
00:31:52,665 --> 00:31:56,490
just be incorrect in a way that's not, like, bad, but still not

520
00:31:56,490 --> 00:32:00,170
helpful. And the reason for this is the way that these

521
00:32:00,170 --> 00:32:03,850
models are accessing that information. The idea behind retrieval

522
00:32:03,850 --> 00:32:07,370
augmented generation is that instead of having the model try

523
00:32:07,370 --> 00:32:10,745
to generate the correct document or the correct

524
00:32:10,745 --> 00:32:14,365
response given its pretraining process, you instead

525
00:32:14,424 --> 00:32:18,265
add factual content to the query that you're asking

526
00:32:18,265 --> 00:32:22,025
the model for. You first search for that content, which is where

527
00:32:22,025 --> 00:32:25,740
the retrieval part comes, and then you augment the generation of what that

528
00:32:25,740 --> 00:32:29,260
model is going to create based on that content, hence

529
00:32:29,260 --> 00:32:33,020
retrieval augmented generation. There's usually a querying

530
00:32:33,020 --> 00:32:36,620
step. So you take in a user query, you hit it against some sort

531
00:32:36,620 --> 00:32:39,895
of database, usually a vector database. In our case, it could be Pinecone.

532
00:32:40,355 --> 00:32:43,895
You find a set of relevant documents. You pass that to the generating LLM.

533
00:32:44,435 --> 00:32:47,795
The generating LLM uses those documents to generate a final

534
00:32:47,795 --> 00:32:50,835
response. And it turns out that if you do this, you can reduce the rate of

535
00:32:50,835 --> 00:32:54,679
hallucinations. And that makes sense because if the model was given true

536
00:32:54,679 --> 00:32:58,360
information and then conditioned its generation on that information, it

537
00:32:58,360 --> 00:33:01,799
follows that the probability of generating information that is

538
00:33:01,799 --> 00:33:05,415
correct could be higher. That's a good

539
00:33:05,415 --> 00:33:09,174
explanation. So you're basically giving it a

540
00:33:09,174 --> 00:33:12,215
crash course in what documents you care about. Right? Like

541
00:33:12,934 --> 00:33:16,695
Exactly. Interesting. And that's a good segue

542
00:33:16,695 --> 00:33:20,200
because you work for Pinecone. So tell me about Pinecone. What is Pinecone?

543
00:33:20,980 --> 00:33:24,740
Yeah. So Pinecone is a knowledge layer for AI. It's

544
00:33:24,740 --> 00:33:28,020
kind of like the way we like to describe it. The main product that

545
00:33:28,020 --> 00:33:31,640
we provide is a vector database. So this is a way of storing

546
00:33:31,780 --> 00:33:35,275
information, information that has been vectorized, in a really

547
00:33:35,275 --> 00:33:39,054
efficient manner. And it turns out that if you have the ability to store information

548
00:33:39,115 --> 00:33:42,875
in this manner, you can search against it really quickly, with

549
00:33:42,875 --> 00:33:46,715
low latency, and find the things that you need to find, which is really interesting for

550
00:33:46,715 --> 00:33:50,360
these types of semantic search and RAG systems. Pinecone has a few other

551
00:33:50,360 --> 00:33:54,039
offerings now that kind of help people build these systems a lot easier. There's

552
00:33:54,039 --> 00:33:57,640
Pinecone Inference, which lets you embed data in order to do that querying

553
00:33:57,640 --> 00:34:01,135
step. Pinecone Assistant, which lets you just build a RAG

554
00:34:01,135 --> 00:34:04,915
system immediately just by upserting documents into our vector database,

555
00:34:06,095 --> 00:34:09,695
so on and so forth. But the reason why, like, you

556
00:34:09,695 --> 00:34:13,455
need a vector database is because of all of these advances in

557
00:34:13,455 --> 00:34:16,850
semantic search and embedding models. People have gotten really, really

558
00:34:16,850 --> 00:34:20,210
good at representing chunks of information using these dense sized

559
00:34:20,210 --> 00:34:23,910
vectors. But once you have thousands, millions,

560
00:34:24,130 --> 00:34:27,890
even billions of vectors across tons of different users, you need a way

561
00:34:27,890 --> 00:34:31,304
of indexing this information to access it really quickly at

562
00:34:31,385 --> 00:34:35,065
scale, especially if your chatbot's gonna be querying this vector database really

563
00:34:35,065 --> 00:34:38,905
often. And so having a specialized data store that can handle that type

564
00:34:38,905 --> 00:34:42,505
of search becomes really useful. That's why Pinecone is here, and that's

565
00:34:42,505 --> 00:34:45,980
why we exist. Interesting. Interesting.

566
00:34:47,320 --> 00:34:51,160
One of the other interesting things from your bio, aside from

567
00:34:51,160 --> 00:34:53,820
the origami,

568
00:34:55,494 --> 00:34:58,194
Tell me about this. So you

569
00:34:59,095 --> 00:35:02,695
your crew does your do you create the YouTube videos, or do you use your

570
00:35:02,695 --> 00:35:05,655
tools, or is it something completely it's just part of your job as a developer

571
00:35:05,655 --> 00:35:09,415
advocate? So it is just part of my job as a

572
00:35:09,415 --> 00:35:13,150
developer advocate. Oh, okay. Like, often that, you

573
00:35:13,150 --> 00:35:16,830
know, I do that because we are interviewing people or because there's a new

574
00:35:16,830 --> 00:35:20,590
concept we wanna teach people, so on and so forth. Or we do a webinar,

575
00:35:20,590 --> 00:35:24,030
and we just upload it to YouTube. Oh, very cool. Very cool.

576
00:35:24,030 --> 00:35:27,835
Yeah. I started my career in developer

577
00:35:27,835 --> 00:35:31,055
advocacy. Once it was called evangelism. So I was a Microsoft

578
00:35:31,355 --> 00:35:34,795
evangelist for a while. So yeah. Yeah. Cool. YouTube

579
00:35:34,795 --> 00:35:38,635
is very important. Yep. But it's

580
00:35:38,635 --> 00:35:41,375
also, I think, speaks to how people learn,

581
00:35:43,380 --> 00:35:47,060
but, how people learn. YouTube University is very

582
00:35:47,060 --> 00:35:50,820
real. Right? And Yep. You know, not not a knock on

583
00:35:50,820 --> 00:35:54,660
traditional schools, not a knock on traditional publishing, but this space

584
00:35:54,660 --> 00:35:58,444
is moving so fast that if it weren't for YouTubers like 3Blue1Brown

585
00:35:59,545 --> 00:36:03,325
I think his real name is Grant Sanderson. I think that's his real name.

586
00:36:04,184 --> 00:36:07,005
Somebody will send me hate mail if I get it wrong. But,

587
00:36:08,515 --> 00:36:12,070
he is, like, really good at explaining these

588
00:36:12,070 --> 00:36:15,770
really abstract mathematical concepts. And

589
00:36:15,910 --> 00:36:19,510
unlike you, I didn't study math undergrad. I didn't I mean, I had to. I

590
00:36:19,510 --> 00:36:23,030
only took the requirements. Right? But I have comp sci degrees. So, like, for me

591
00:36:23,030 --> 00:36:26,845
to kind of fall in love with math again or for the first time, depending

592
00:36:26,845 --> 00:36:30,685
on depending on how you wanna say that, for me, that

593
00:36:30,685 --> 00:36:34,385
was very helpful. And having an understanding of this, if you're a data engineer

594
00:36:34,445 --> 00:36:37,805
and, you know, or wanna get into this space, it's

595
00:36:37,805 --> 00:36:41,500
definitely vector databases, for a traditional kinda SQL kinda

596
00:36:41,500 --> 00:36:45,339
RDBMS person will look very awkward at first. But

597
00:36:45,339 --> 00:36:48,300
I know a lot of people that have made the transition, and they kinda love

598
00:36:48,300 --> 00:36:51,280
it. Right? Because in a lot of ways, it's way more efficient,

599
00:36:52,780 --> 00:36:56,195
than, I dare say, traditional data stores. But when you're

600
00:36:56,195 --> 00:36:59,795
processing the large blocks of text, it's really good for kind of

601
00:36:59,795 --> 00:37:03,475
parsing through that. But

602
00:37:03,475 --> 00:37:07,095
that's really cool. So, we do have the preset

603
00:37:07,220 --> 00:37:09,779
questions if you're good for doing those. I'll put them in the chat in case

604
00:37:09,779 --> 00:37:13,539
you don't have them. Sure. They're not brain teasers

605
00:37:13,539 --> 00:37:16,279
or anything like that. They are pretty basic of,

606
00:37:17,700 --> 00:37:20,839
questions, and I will paste them in the chat.

607
00:37:22,155 --> 00:37:25,855
So the first question is, how did you find your way into

608
00:37:26,075 --> 00:37:29,915
AI? Did you find AI, or did

609
00:37:29,915 --> 00:37:33,515
AI find you? So this is a little bit of a

610
00:37:33,515 --> 00:37:36,495
crazy story, but AI definitely found me.

611
00:37:37,110 --> 00:37:40,950
So when I was in college, when I was looking for my 1st

612
00:37:40,950 --> 00:37:44,790
internship, I couldn't find any internships, basically, because I had, like, no

613
00:37:44,790 --> 00:37:48,390
previous experience in working at tech or anything like that. And,

614
00:37:48,710 --> 00:37:51,990
the first company I worked for, Speeko, took a chance on me because they were

615
00:37:51,990 --> 00:37:55,645
building public speaking tools to kind of help people learn how to do

616
00:37:55,645 --> 00:37:59,405
public speaking better, for an iOS app. And I had some

617
00:37:59,405 --> 00:38:02,205
public speaking experience. They were, like, close enough. We'll have you come on and kind

618
00:38:02,205 --> 00:38:05,805
of help us, like, work things out. And while I was there, it was

619
00:38:05,805 --> 00:38:09,240
made very obvious to me how important it was to build

620
00:38:10,580 --> 00:38:14,260
very basic deep learning systems and AI systems to kind

621
00:38:14,260 --> 00:38:17,940
of accomplish really specific tasks that could help serve an

622
00:38:17,940 --> 00:38:21,220
ultimate goal. Like, what we were trying to do is just, like, see how many

623
00:38:21,220 --> 00:38:24,925
filler words people are using or how quickly or slowly you were speaking.

624
00:38:24,925 --> 00:38:28,464
And that requires a lot of complicated

625
00:38:28,525 --> 00:38:31,405
processing because you have to do transcription and because you have to figure out what

626
00:38:31,405 --> 00:38:34,525
words are being said, so on and so forth. So kind of experiencing that and

627
00:38:34,525 --> 00:38:37,970
seeing that firsthand really opened my eyes to how powerful

628
00:38:38,350 --> 00:38:42,190
the technology had been even back in, like, 2017. And ever

629
00:38:42,190 --> 00:38:45,730
since then, I started learning more and more and more about statistics,

630
00:38:45,950 --> 00:38:49,184
AI, natural language processing through my internships,

631
00:38:49,565 --> 00:38:52,944
learning more complicated problems, reading research papers, so on and so forth.

632
00:38:53,405 --> 00:38:56,845
And I got to where I am now. A lot of where I learned is

633
00:38:56,845 --> 00:39:00,125
just out of pure curiosity. Just like, okay. There's this new thing. I wanna learn

634
00:39:00,125 --> 00:39:03,619
about it. That's where I wanna be. And that's kind of how I fell into

635
00:39:03,619 --> 00:39:06,980
large language models and AI, just by wanting to learn about what was going to

636
00:39:06,980 --> 00:39:10,740
happen and then eventually being there. So it definitely found me. I was

637
00:39:10,740 --> 00:39:14,415
not looking for it. Didn't even know I liked statistics until I started doing

638
00:39:14,415 --> 00:39:17,935
statistical modeling. And I was like, wait. This is really fun. I wanna do a

639
00:39:17,935 --> 00:39:20,735
lot more of this. I wanna learn a lot more of this. And I knew

640
00:39:20,735 --> 00:39:24,335
that, once I was in college and I bought a statistics book for fun, and

641
00:39:24,335 --> 00:39:27,160
I was like, okay. I'm past the point of no return. Like, this is

642
00:39:27,160 --> 00:39:30,040
definitely Right. Right. Right. Right. That might be one of the first times in

643
00:39:30,040 --> 00:39:33,560
history that that's been said. Right. Because I learned statistics for

644
00:39:33,560 --> 00:39:37,320
fun. I took stats in college.

645
00:39:37,320 --> 00:39:40,715
I hated it. Hated every minute of it. But

646
00:39:40,715 --> 00:39:43,775
when I got into data science,

647
00:39:44,635 --> 00:39:48,315
the first two weeks were not fun. I'm not gonna lie. Yep. But

648
00:39:48,315 --> 00:39:51,535
just like the vi editor, once you stick with it,

649
00:39:51,835 --> 00:39:55,610
Stockholm syndrome kicks in, and you start loving

650
00:39:55,610 --> 00:39:59,450
it. That's cool. 2, what's your favorite

651
00:39:59,450 --> 00:40:03,210
part of your current gig? The favorite part of my

652
00:40:03,210 --> 00:40:06,670
current job is being able to learn interesting,

653
00:40:06,810 --> 00:40:10,375
fun, even complicated things in data science and AI,

654
00:40:10,675 --> 00:40:14,115
and figuring out how to communicate them to a wide

655
00:40:14,115 --> 00:40:17,635
audience. It's a really fun challenge. It's really similar to, like,

656
00:40:17,635 --> 00:40:21,235
what 3Blue1Brown does all the time on the YouTube channel, and it's

657
00:40:21,235 --> 00:40:24,940
something that I get to learn and practice and keep keep doing. That's the best

658
00:40:24,940 --> 00:40:28,060
part of the job. I love learning things and, like, teaching other people about them

659
00:40:28,060 --> 00:40:31,820
and learning even more things. And the fact that I have an opportunity to do

660
00:40:31,820 --> 00:40:35,260
that every single day is, like, the best. That's cool. That's

661
00:40:35,260 --> 00:40:39,025
cool. We have 3 complete sentences. When I'm

662
00:40:39,025 --> 00:40:42,705
not working, I enjoy blank. When I'm

663
00:40:42,705 --> 00:40:46,385
not working, I enjoy, baking sweet treats and

664
00:40:46,385 --> 00:40:50,099
goods. I can't have any dairy. So very often, I had to kind

665
00:40:50,099 --> 00:40:52,980
of give up a lot of the cakes and desserts that I loved eating when

666
00:40:52,980 --> 00:40:56,019
I was younger. So now I, like, spend my time trying to figure out how

667
00:40:56,019 --> 00:40:59,460
I can make them again without dairy so they taste really good. So that's that's

668
00:40:59,460 --> 00:41:02,835
something I enjoy I really enjoy doing. Very cool.

669
00:41:04,015 --> 00:41:07,155
Next, complete the sentence. I think the coolest thing in technology

670
00:41:07,375 --> 00:41:10,815
today is blank. I

671
00:41:10,815 --> 00:41:14,340
thought really hard about this question because we're living in a

672
00:41:14,420 --> 00:41:18,180
crazy time of technological development. But the thing that really

673
00:41:18,180 --> 00:41:22,020
stuck out to me and the thing that was also the moment for me

674
00:41:22,020 --> 00:41:25,540
when I started working with, like, chatbots and LLMs was code

675
00:41:25,540 --> 00:41:29,195
generation models. The first time I learned how to

676
00:41:29,195 --> 00:41:32,875
use, GitHub Copilot specifically, I

677
00:41:32,875 --> 00:41:36,475
was I was completing some function, and it completed it before I was done typing

678
00:41:36,475 --> 00:41:40,075
it. And I was like, what the heck? This is amazing. Like, this this this

679
00:41:40,075 --> 00:41:43,860
actually figured out exactly what I needed. And because I was still, like,

680
00:41:43,860 --> 00:41:47,000
a budding developer, it was extremely helpful because I could learn

681
00:41:47,380 --> 00:41:51,220
faster rather than having already a huge kind of store of knowledge already in my

682
00:41:51,220 --> 00:41:54,680
brain and kind of pulling from that. So I could see it benefiting my workflow.

683
00:41:54,740 --> 00:41:58,125
So I think the development of those tools and modern tools like

684
00:41:58,125 --> 00:42:01,965
Cursor, so on and so forth, extremely cool. And I can't wait to

685
00:42:01,965 --> 00:42:05,805
see, like, what the next generation of those technologies will look like. Yeah. I

686
00:42:05,805 --> 00:42:09,420
mean, that's a that's a great example. It's almost like you don't

687
00:42:09,420 --> 00:42:13,180
need, you know, the the classic 10,000 hours to master a skill or something like

688
00:42:13,180 --> 00:42:16,940
that. It's almost like you can leverage the AI to take on the

689
00:42:16,940 --> 00:42:20,780
lion's share of the 10,000 hours. You're still gonna need to know something. You still

690
00:42:20,780 --> 00:42:23,825
have to put in some reps, but not to the degree that you used to.

691
00:42:23,825 --> 00:42:27,505
No. I think that's gonna be very transformative. I mean, I mean, I'm

692
00:42:27,505 --> 00:42:31,105
learning JavaScript and Next.js on the side because it's something I have no

693
00:42:31,105 --> 00:42:34,805
experience in. Right. And I was able to build my personal website

694
00:42:35,025 --> 00:42:38,579
entirely through using Cursor and Progression. Nice. I

695
00:42:38,579 --> 00:42:42,420
often check that out. Which is insane. Right? Which is, like, really, really

696
00:42:42,420 --> 00:42:45,859
fascinating. And and I'm not gonna claim to, like, suddenly be an expert in

697
00:42:45,859 --> 00:42:49,140
Next.js or anything like that. Right? Right. Right. Right. I still wanna learn, like, exactly

698
00:42:49,140 --> 00:42:52,075
what's going on under the hood, but having a project that you can kind of,

699
00:42:52,075 --> 00:42:55,755
like, tinker on that's, like, pretty small in scale and that you can kind of

700
00:42:55,755 --> 00:42:59,115
afford to make a few mistakes on and having, like, an expert system kind of

701
00:42:59,115 --> 00:43:02,955
help you go through that, expert, quote, unquote, being close enough, really cool

702
00:43:02,955 --> 00:43:06,760
learning experience. No. That's a great way to put it because, like, I I

703
00:43:06,980 --> 00:43:10,580
I don't have any apps on the modern devices. Right? Like,

704
00:43:10,580 --> 00:43:14,420
so, it would be nice if I

705
00:43:14,420 --> 00:43:18,040
had an Android app that could kick off some automation process that I have.

706
00:43:18,234 --> 00:43:21,855
Right? Or do some kind of tie in with, you know, Copilot

707
00:43:21,994 --> 00:43:25,675
into that or things like that. Like, where, you know, I

708
00:43:25,675 --> 00:43:29,115
originally wrote a content automation system I wrote. I originally wrote in

709
00:43:29,115 --> 00:43:32,494
.NET, but I ported it to Python with the help of

710
00:43:33,039 --> 00:43:36,880
the help of AI. And I could well, that's just it. Right?

711
00:43:36,880 --> 00:43:40,640
It really the true valuable resource in in life is

712
00:43:40,640 --> 00:43:44,480
time. Right? Yes. It's not Yes. I mean, I could have done it by hand.

713
00:43:44,480 --> 00:43:47,619
I could have done it by myself, but it was one of those things where

714
00:43:48,425 --> 00:43:52,045
am I gonna do it because it's gonna take x number of hours or whatever?

715
00:43:53,065 --> 00:43:56,585
But if I can just kinda here's the .NET version that I, you know,

716
00:43:56,585 --> 00:44:00,185
I posted. This is before there was Copilot, so I pasted it into

717
00:44:00,185 --> 00:44:03,900
ChatGPT. And it basically spit out a Python

718
00:44:03,900 --> 00:44:07,740
version, had some errors. You know, this was a while ago. But I

719
00:44:07,740 --> 00:44:11,180
was able to, inside of a day, get it done as opposed to

720
00:44:11,180 --> 00:44:14,865
before. Like, I know how my ADD works. Right? Like, I'll start it.

721
00:44:14,945 --> 00:44:18,705
First 3 days, working on it, grinding on it, and then

722
00:44:18,705 --> 00:44:22,465
I don't touch it again for 2 weeks. And it never gets built. But

723
00:44:22,465 --> 00:44:25,985
with this, I'm able to kinda harness the the spark of

724
00:44:25,985 --> 00:44:29,529
inspiration and and execute much faster. Now I think I don't think

725
00:44:29,529 --> 00:44:33,289
people fully realize, like, you know, it's not all doom and gloom. Nobody's

726
00:44:33,289 --> 00:44:37,130
gonna have any programming jobs. There's a lot of upside too. And I

727
00:44:37,130 --> 00:44:40,910
guess that's just where we are in the hype cycle. As you said.

728
00:44:41,210 --> 00:44:44,924
Yeah. Yeah. Yeah. Exactly. That's a good segue into I look forward to

729
00:44:44,924 --> 00:44:48,684
the day when I can use technology to blank. I look

730
00:44:48,684 --> 00:44:52,525
forward to the day where I can use technology to get a high quality

731
00:44:52,525 --> 00:44:56,000
education on any subject for free. So Nice.

732
00:44:56,380 --> 00:45:00,220
Free education is really important to me. A lot of

733
00:45:00,220 --> 00:45:03,980
what I learned about large language models, deep learning, all that

734
00:45:03,980 --> 00:45:07,500
stuff was online courses that I took for free on places like

735
00:45:07,500 --> 00:45:11,005
edX, Coursera, so on and so forth. Or people sharing

736
00:45:11,005 --> 00:45:14,205
articles and kind of learning from them, or YouTube videos, or all that sort of

737
00:45:14,205 --> 00:45:16,965
things, in addition to my education. But there's a lot of things you kinda have

738
00:45:16,965 --> 00:45:20,605
to learn after that. Right? And I think that especially with, like,

739
00:45:20,605 --> 00:45:24,349
code generation models, it's, like, very easy to be, like, okay. Build me this app

740
00:45:24,349 --> 00:45:26,829
and, like, just make it work. And you can sit there for a couple hours,

741
00:45:26,829 --> 00:45:29,730
and it'll, like, work. But I think the missing piece is

742
00:45:30,190 --> 00:45:33,710
creating a structured kind of learning path that's, like,

743
00:45:33,710 --> 00:45:37,365
personalized to whoever you are for the

744
00:45:37,365 --> 00:45:41,125
thing that you're really interested in with the context of

745
00:45:41,125 --> 00:45:44,885
having, like, these tools that can help you do that thing. And I'm not sure

746
00:45:44,885 --> 00:45:48,645
if we have anybody or any offering that can

747
00:45:48,645 --> 00:45:51,590
kind of do that technologically, because you need a lot of information about what the

748
00:45:51,590 --> 00:45:54,710
user knows or doesn't know. You need to be able to create ability, and then

749
00:45:54,710 --> 00:45:57,830
you need to be able to kind of create, like, an entire mini course that's

750
00:45:57,830 --> 00:46:01,510
personalized to whatever that person needs. But if we can do that, we can solve

751
00:46:01,510 --> 00:46:05,085
so many wonderful problems. Absolutely. I'm

752
00:46:05,085 --> 00:46:08,845
thinking about special education needs and things like that. I don't think we're that

753
00:46:08,845 --> 00:46:12,445
far off from this. No. But I

754
00:46:12,605 --> 00:46:15,965
the biggest issue is going to be just hallucinations. Right? And,

755
00:46:15,965 --> 00:46:19,810
hopefully, people can build, like, RAG systems using tools like Pinecone to kind

756
00:46:19,810 --> 00:46:23,490
of reduce those hallucinations. But we will also for for something like

757
00:46:23,490 --> 00:46:26,930
that specific use case, we probably need, like, another breakthrough in

758
00:46:26,930 --> 00:46:30,530
indexing information or kind of presenting it, or we need a process that

759
00:46:30,530 --> 00:46:34,125
really allows people to create this information quickly

760
00:46:34,345 --> 00:46:38,025
and verifiably in order to kind of make that happen. But if if that is

761
00:46:38,025 --> 00:46:41,225
a future that we can live in, where technology can can kind of, like, help

762
00:46:41,225 --> 00:46:44,985
people learn, like, really important things really well, that would be

763
00:46:44,985 --> 00:46:48,125
wonderful. And I think that would be, like, amazing for for humanity.

764
00:46:48,730 --> 00:46:52,490
Oh, absolutely. Share something different

765
00:46:52,490 --> 00:46:54,670
about yourself, but remember it's a family podcast.

766
00:46:57,130 --> 00:47:00,490
One of my favorite hobbies for about a decade is

767
00:47:00,490 --> 00:47:04,255
designing and folding origami. And it's really fun.

768
00:47:04,255 --> 00:47:07,935
It's very easy, but it's also very hard. There's a lot

769
00:47:07,935 --> 00:47:11,695
of comp complexity inside it as well. One thing people

770
00:47:11,695 --> 00:47:14,995
don't know about that is that there's a lot of mathematical complexity.

771
00:47:15,320 --> 00:47:19,160
So once you get to a point where you wanna design a model with

772
00:47:19,160 --> 00:47:22,680
really specific qualities, really specific features, it suddenly

773
00:47:22,680 --> 00:47:26,520
becomes a paper optimization problem where you

774
00:47:26,520 --> 00:47:30,145
have, like, a fixed size square, and you have different

775
00:47:30,145 --> 00:47:33,825
regions of that paper that you're allocating to portions of the model you're

776
00:47:33,825 --> 00:47:37,125
designing. And it turns out that there are entire mathematical

777
00:47:37,424 --> 00:47:40,944
principles and procedures to solve this problem. So much

778
00:47:40,944 --> 00:47:44,410
so that one of the leading, like, practitioners in the

779
00:47:44,410 --> 00:47:48,250
field is, like, this physicist who wrote a textbook on how to do origami design,

780
00:47:48,250 --> 00:47:51,870
and that's, like, the textbook everyone looks at. So, like, learn how to solve it.

781
00:47:51,930 --> 00:47:55,470
Yeah. I'm not surprised. There's definitely there's definitely a a correlation

782
00:47:55,530 --> 00:47:59,185
between the mathematics of that. And I look at origami creations, and I

783
00:47:59,185 --> 00:48:03,025
just fascinated that could be done from a single sheet. Like, it's

784
00:48:03,025 --> 00:48:06,705
just how is that I mean, that's just mind bending. Now it's

785
00:48:06,785 --> 00:48:09,984
and and makes sense that there's a mathematical because you have a certain type of

786
00:48:09,984 --> 00:48:13,260
constraint, and there's obviously

787
00:48:14,039 --> 00:48:17,799
folds factor into it and things like that. And, yeah, that's that's

788
00:48:17,799 --> 00:48:20,920
interesting. I I should what's the name of that book? I should pick it up.

789
00:48:20,920 --> 00:48:24,680
It's called Origami Design Secrets. Got it. Alright. I will check

790
00:48:24,680 --> 00:48:28,394
it out. So where can people learn more about

791
00:48:28,394 --> 00:48:32,075
you and Pinecone? Of course. You wanna learn more about Pinecone? The

792
00:48:32,075 --> 00:48:35,914
best place is our website, pinecone.io. You can also find

793
00:48:35,914 --> 00:48:39,295
us on LinkedIn and on X and other social media platforms.

794
00:48:39,830 --> 00:48:42,870
You wanna learn more about me? You can go to my LinkedIn, which you can

795
00:48:42,870 --> 00:48:46,710
find at Arjun Kirti Patel, or you can go to my website, which is also

796
00:48:46,710 --> 00:48:50,070
my name, arjun, k i r t i p

797
00:48:50,070 --> 00:48:53,885
a t e l.com. Cool. And we can also check out your

798
00:48:53,885 --> 00:48:57,565
Next.js skills there too. Exactly. Hopefully, nothing is

799
00:48:57,565 --> 00:49:01,405
broken, but, you can you can see you can see how well I've gotten by

800
00:49:01,405 --> 00:49:05,210
with the Awesome. Trust me.

801
00:49:05,210 --> 00:49:07,630
JavaScript alone is is a is a frustration

802
00:49:08,890 --> 00:49:09,789
creation device.

803
00:49:12,410 --> 00:49:15,609
Audible sponsors the podcast. Do you do audiobooks? Is there a book that you

804
00:49:15,609 --> 00:49:19,115
would recommend? I do do audiobooks, but I've just

805
00:49:19,115 --> 00:49:22,955
started recently, so I don't have a huge, audiobook library. But

806
00:49:22,955 --> 00:49:26,715
there is I I am a huge fan of short story collections, and

807
00:49:26,715 --> 00:49:30,329
kind of the one that comes to mind is really anything by Ted

808
00:49:30,329 --> 00:49:33,289
Chiang, who does a lot of kind of sci fi short stories. If you've seen

809
00:49:33,289 --> 00:49:37,130
the movie Arrival, the short story based on that is Story of Your Life,

810
00:49:37,130 --> 00:49:40,650
and it's wonderfully written. It's one of my favorite short stories ever.

811
00:49:40,650 --> 00:49:44,255
Yep. So highly recommend that. I believe the collection is

812
00:49:44,255 --> 00:49:47,694
called Story of Your Life and Others, something like that. So

813
00:49:47,934 --> 00:49:51,295
Oh, interesting. Careful with audiobooks. They are very

814
00:49:51,295 --> 00:49:54,850
addictive. So,

815
00:49:55,710 --> 00:49:58,590
Audible is a sponsor of the show. So if you go to

816
00:49:58,590 --> 00:50:02,130
thedatadrivenbook.com, you'll get routed to Audible and

817
00:50:02,350 --> 00:50:05,650
you'll get a free book on us. And if you

818
00:50:06,105 --> 00:50:09,305
choose to subscribe, we'll get a little bit of kickback. It helps run the show

819
00:50:09,305 --> 00:50:13,145
and helps, helps us bring, bring some good stuff to to

820
00:50:13,145 --> 00:50:16,445
the masses. So any any parting thoughts?

821
00:50:18,425 --> 00:50:21,145
No. But thank you so much for having me on, Frank. This was a ton

822
00:50:21,145 --> 00:50:24,160
of fun. I learned a lot from you, and I hope I I helped you

823
00:50:24,160 --> 00:50:28,000
learn one one small thing as well. Absolutely. It was it was

824
00:50:28,000 --> 00:50:31,600
a great conversation, and, we'll let the nice British lady finish the

825
00:50:31,600 --> 00:50:35,305
show. And that's a wrap for this episode of Data Driven, where we

826
00:50:35,305 --> 00:50:38,765
journeyed from the intricacies of vector databases to the surprising

827
00:50:38,905 --> 00:50:42,665
elegance of origami. A huge thank you to Arjun Patel for

828
00:50:42,665 --> 00:50:46,505
sharing his insights on retrieval augmented generation and his passion

829
00:50:46,505 --> 00:50:50,330
for making AI accessible to all. From turning raw data

830
00:50:50,330 --> 00:50:54,010
into actionable knowledge to turning paper into art, Arjun

831
00:50:54,010 --> 00:50:57,850
proves there's beauty in both precision and creativity. If today's

832
00:50:57,850 --> 00:51:01,610
episode left you curious, inspired, or just itching to fold a

833
00:51:01,610 --> 00:51:04,994
piece of paper into something meaningful, be sure to check out

834
00:51:04,994 --> 00:51:08,535
Arjun's work and Pinecone's innovative tools. Remember,

835
00:51:08,755 --> 00:51:12,515
knowledge might be power, but sharing it makes you a force to be reckoned

836
00:51:12,515 --> 00:51:16,275
with. As always, I'm Bailey, your semi-sentient guide to

837
00:51:16,275 --> 00:51:19,660
all things data, reminding you that while AI might shape our

838
00:51:19,660 --> 00:51:23,340
future, it's the human touch or sometimes the paper fold that

839
00:51:23,340 --> 00:51:26,720
gives it meaning. Until next time, stay curious,

840
00:51:27,020 --> 00:51:30,160
stay analytical, and don't forget to back up your data.

841
00:51:30,540 --> 00:51:31,040
Cheerio.