Speaker:

Welcome back to Data Driven, the podcast where we chart the thrilling

Speaker:

terrains of data science, AI, and everything in between.

Speaker:

I'm Bailey, your semi-sentient host with a penchant for

Speaker:

sarcasm and a wit sharper than a histogram spike.

Speaker:

Today's episode promises a delightful mix of the analytical and the

Speaker:

artistic as we dive into the fascinating world of vector databases,

Speaker:

retrieval augmented generation, and origami. Yes.

Speaker:

You heard that right. Origami, the ancient art of

Speaker:

folding paper, somehow finds itself intersecting with AI,

Speaker:

proving that the future really does have layers or should I say folds.

Speaker:

Our guest, Arjun Patel, is a developer advocate at Pinecone

Speaker:

who's on a mission to demystify vector databases and semantic

Speaker:

search, turning complex AI concepts into snackable bits of

Speaker:

brilliance. He's also a self taught origami artist and a

Speaker:

former statistics student who actually enjoyed it. So if

Speaker:

you're ready to unravel the secrets of modern AI and maybe pick up a trick

Speaker:

or two about folding life into geometric perfection, you're in the

Speaker:

right place.

Speaker:

Hello, and welcome back to Data Driven, the podcast where we explore the emergent

Speaker:

fields of data science, AI, data engineering.

Speaker:

Now today, due to a scheduling conflict, my most favorite data engineer

Speaker:

in the world will not be able to make it. But I will

Speaker:

continue on, despite the recent snowstorms that we've had here in

Speaker:

the DC Baltimore area. With me today, I have

Speaker:

Arjun Patel, a developer advocate at Pinecone,

Speaker:

who aims to make vector databases, retrieval augmented generation,

Speaker:

also known as RAG, and semantic search accessible by

Speaker:

creating engaging YouTube videos, code notebooks, and blog

Speaker:

posts that transform complex AI concepts

Speaker:

into easily understandable content. After graduating with

Speaker:

a BA in statistics from the University of Chicago, his journey through

Speaker:

the tech world spans from making speech coaching

Speaker:

accessible with AI at Speeko to tackling AI

Speaker:

generated content detection at Appen. Arjun's

Speaker:

interest spans traditional natural language processing into modern

Speaker:

large language model development and applications.

Speaker:

Beyond his technical prowess, Arjun has been designing and folding his

Speaker:

own origami creations for over a decade. Interesting.

Speaker:

Seamlessly blending analytical thinking with artistic expression in his

Speaker:

professional and personal pursuits. Welcome to the show, Arjun.

Speaker:

Hey. Nice to meet you, Frank. Thanks for having me on. Excited to be here.

Speaker:

Awesome. Awesome. There's a lot to unpack from there, but I think it's interesting to

Speaker:

note that you have a BA in statistics. Yes. So you were probably

Speaker:

studying, this sort of stuff before it was cool?

Speaker:

Yeah. Yeah. A lot of the old school ways of analyzing

Speaker:

data, understanding what's going on, so on and so forth.

Speaker:

It was kind of, like, made clear to me pretty early that

Speaker:

understanding how to work with data at small scale and at large scale is gonna

Speaker:

be very important going into the future. So I kinda just took that and ran

Speaker:

with it with my education. Very cool. It was

Speaker:

definitely, you know, one of those things where I don't

Speaker:

think people realized how important statistics would be until,

Speaker:

you know, until the revolution happened, so to speak. And it's also

Speaker:

interesting to see because there's a lot of people that I think could benefit from,

Speaker:

you know, picking up an old statistics book and

Speaker:

reading through it and understanding, like, a lot of the fundamentals. Obviously, there's a lot

Speaker:

of new things, but a lot of the fundamentals are largely the

Speaker:

same. You know, just I'll

Speaker:

use this example. You know, McDonald's can add a McRib sandwich,

Speaker:

but it's still a McDonald's. Right? Like, it's This

Speaker:

is what happens when you're shoveling snow. Like, your

Speaker:

brain gets I absolutely agree. And, like,

Speaker:

another proof on that point is that Anthropic just released a

Speaker:

blog recently kind of recapping how to do statistical analysis when you're

Speaker:

comparing different large language models. And when you read the paper in the blog,

Speaker:

it's basically just, like, two-sample t-tests and kind of going over really,

Speaker:

like, not introductory, but still statistics that's easily accessible for people to

Speaker:

learn and understand. So it's still relevant, and it's still important.

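A minimal sketch of that kind of two-sample comparison, using SciPy; the per-run accuracy scores for the two models are made up purely for illustration:

```python
# Two-sample t-test comparing two models' accuracy across eval runs.
# equal_var=False gives Welch's variant, which doesn't assume equal variances.
from scipy import stats

model_a_scores = [0.82, 0.91, 0.77, 0.88, 0.85, 0.79, 0.90, 0.84]
model_b_scores = [0.78, 0.84, 0.75, 0.80, 0.82, 0.77, 0.83, 0.79]

t_stat, p_value = stats.ttest_ind(model_a_scores, model_b_scores, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```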
Speaker:

Interesting. One of the things that that that stood out in your in your bio

Speaker:

was, people tend to forget that there

Speaker:

was a natural language processing field prior

Speaker:

to ChatGPT launching.

Speaker:

How do you, you know,

Speaker:

do you wanna talk about the difference between those two? Sure.

Speaker:

So one of the first and probably only

Speaker:

course I took in college related to natural language processing was

Speaker:

called Geometric Models of Meaning. And everything I learned in that

Speaker:

course was like everything before, what we now would

Speaker:

consider, like, modern embedding models. So bag of

Speaker:

word methods, understanding how to represent documents and text purely

Speaker:

based on, like, the frequency of the words that exist in the text,

Speaker:

and then trying to understand, like, okay. Based on that information, how can

Speaker:

we learn about the concepts that exist in text from the words that are being

Speaker:

used? Like, what is the framework we can use to understand what these

Speaker:

words mean based on their co-occurrences with the other words and

Speaker:

texts that you're working with and based on, what those

Speaker:

words mean as well. So, like, what the words' neighbors are, what their meanings

Speaker:

are, and what those words are doing. And I think a lot of traditional

Speaker:

natural language processing, methodologies kinda stem from that, and

Speaker:

there's a lot of mileage you can get out of just thinking about

Speaker:

approaching problems there before you step into these more complicated methods,

Speaker:

like, these modern embedding models that exist. So that's kind of, like, what I

Speaker:

would consider, like, traditional NLP, like, doing named entity recognition,

Speaker:

trying to understand how to, find keywords really

Speaker:

quickly. And then once you get really good at that, there's a whole host of

Speaker:

problems that you encounter afterward that kind of modern techniques try to

Speaker:

solve. Right. That's interesting. So

Speaker:

what was it, what were your thoughts

Speaker:

when you first, like given that you were an NLP practitioner

Speaker:

prior to the release of transformers and things like that, what was your initial thought?

Speaker:

Because I'm curious, because there are a

Speaker:

lot of experts today that really kind of started a couple of years ago. No

Speaker:

fault on them. They see where the industry is going. Totally understand it. But what

Speaker:

were your thoughts when

Speaker:

you first saw the attention all you need? The

Speaker:

Attention Is All You Need paper. So that would have been

Speaker:

probably around the time I graduated college, around

Speaker:

maybe a year or 2 after I took the course that I was just describing.

Speaker:

So I just started learning about, like, okay. Like, this is

Speaker:

how, like, old school, quote unquote, like, embedding

Speaker:

methodologies work. And the biggest takeaway that I got from those is that they work

Speaker:

pretty well. They work pretty well for, like, lots of different kinds of

Speaker:

queries. And I think what the Attention Is All You Need paper did

Speaker:

was it kinda helped you, understand how

Speaker:

to rigorously create representations of text that

Speaker:

generalize way better than, any sort of, like,

Speaker:

normal, keyword based, bag of word based search methodology.

Speaker:

And I think that at the time, I probably didn't

Speaker:

grasp how much impact the Attention Is All You Need paper would have on the

Speaker:

field until we started getting embedding models that people could use really

Speaker:

easily, like RoBERTa or BERT. And we're like, okay. Now we can do, like,

Speaker:

multilingual search without any issue. Now we can represent,

Speaker:

like, any sentence without keyword overlap when we

Speaker:

wanna find some document that's interesting, without doing any

Speaker:

additional work. Like, once those papers started hitting the scene, I think now we start

Speaker:

seeing, like, okay, this is what attention is doing for us. This is what the

Speaker:

ability to, like, contextualize our vector embeddings is doing for us.

Speaker:

And now we can see what's kind of getting benefited there. But I

Speaker:

think my understanding of how beneficial that

Speaker:

was kind of lagged until we started seeing these other models kind of hit. And

Speaker:

I'm like, okay. Now I can kinda see why this is important and why, like,

Speaker:

future models are gonna get better and better based on this architecture.

Speaker:

Interesting. So for those that don't know, and even I'm rusty on

Speaker:

this. Right? Yeah. One of the things that was interesting about this was the

Speaker:

inverse appearance. What was it? You just described it a

Speaker:

minute ago, but it was something like the prevalence of a word

Speaker:

in a bit of text versus the lack of prevalence and how that

Speaker:

metric was very important in

Speaker:

I'll call it classical natural language processing.

Speaker:

Right. So this is the idea that if you have words that co

Speaker:

occur together in some document space, the meaning of those words are gonna be

Speaker:

more similar than words that don't co occur in some other given document

Speaker:

space. This is rooted in something called the

Speaker:

distributional hypothesis, which is basically this idea, and the related

Speaker:

idea that concepts cluster in this type of

Speaker:

space. So what does that mean actually? Right? So if you have a word

Speaker:

like hot dog, it's probably gonna be seen in a corpus that's

Speaker:

near other food related words than it would be if you picked some

Speaker:

other word like space or moon. And there's something we can

Speaker:

learn from that relationship to infer the meaning of what that word

Speaker:

is and how we can use that meaning of that word to learn about what

Speaker:

other words are doing. So this is kind of, like, the theoretical

Speaker:

basis of, like, why we can represent words geometrically,

Speaker:

with a little bit of hand waving. But that's kind of the core idea.

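A toy sketch of that core idea in Python: represent each word by counts of the other content words it co-occurs with in the same sentence, then compare those count vectors. The corpus and stopword list are purely illustrative:

```python
# Distributional hypothesis, minimally: words are represented by the
# counts of the other content words that co-occur with them.
from collections import Counter, defaultdict
import math

corpus = [
    "i ate a hot dog with mustard for lunch",
    "i ate a burger with ketchup for lunch",
    "the rocket flew to the moon through space",
    "a probe flew past the moon into deep space",
]
stopwords = {"i", "a", "the", "with", "for", "to", "through", "into", "past"}

cooc = defaultdict(Counter)
for sentence in corpus:
    words = [w for w in sentence.split() if w not in stopwords]
    for w in words:
        for other in words:
            if other != w:
                cooc[w][other] += 1

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# "dog" shares contexts (ate, lunch) with "burger" but none with "moon".
print(cosine(cooc["dog"], cooc["burger"]))  # positive similarity
print(cosine(cooc["dog"], cooc["moon"]))    # 0.0 in this toy corpus
```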
Speaker:

And attention kind of takes this a little further by allowing the

Speaker:

representation of these tokens or words to be altered based

Speaker:

on the words that occur in a given sentence. So you might have a

Speaker:

word like "does." Like, "Does this mean something?"

Speaker:

You might say something like that. Or you might say, "I saw some

Speaker:

does in the forest." Both spelled exactly the same, but they have

Speaker:

completely different meanings based on their context. And if you used a

Speaker:

traditional, maybe, bag of words model where you're just counting the

Speaker:

words that occur in a given document and kind of creating a representation of what

Speaker:

that document looks like based on the words that are composed in there, you're gonna

Speaker:

conflate the meanings of the word

Speaker:

"does" and "does," because they're spelled exactly the same. They might look

Speaker:

exactly the same with this type of representation. But if you have a way of

Speaker:

informing what that word means with its context, which is what attention

Speaker:

allows us to do, then you can completely change how that's being

Speaker:

represented in your downstream system, which allows you to do interesting things

Speaker:

with search. So that's kind of, like, the biggest benefit that's coming out of

Speaker:

that type of methodology, and that kinda enables what is now known as

Speaker:

semantic search and retrieval augmented generation and so on and so forth.

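A small sketch of that effect using the Hugging Face transformers library: embed the word "does" in the two sentences above with a pretrained BERT and compare the two context-dependent vectors. The model choice is illustrative:

```python
# Attention in action: the same spelling gets different vectors in context.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]  # the contextual vector for `word`

verb = embedding_of("does this mean something?", "does")
deer = embedding_of("i saw some does in the forest.", "does")

# Same token, different contexts: similarity is noticeably below 1.
print(torch.cosine_similarity(verb, deer, dim=0).item())
```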
Speaker:

I was gonna say, that sounds very, it's almost like it was, like, the old era before

Speaker:

that: the vectorization of this and the distance in

Speaker:

that vector in that geometric space. I guess

Speaker:

we've been doing that for a lot longer than most people realize in in a

Speaker:

sense. Yeah. I mean,

Speaker:

looking through, indexes or document stores with some sort of

Speaker:

vectorization has been

Speaker:

something that people have done, except instead of being dense vectors, which is, like,

Speaker:

you have some fixed size representation that isn't necessarily interpretable

Speaker:

to the human eye for some given query or document, it would

Speaker:

be, like, the size of your vocabulary. So you think of, like, Wikipedia. You

Speaker:

can find, like, every unique word on Wikipedia, and, like, that is gonna be how

Speaker:

big your vector's gonna be. And every time you have a new document come in,

Speaker:

a new article somebody kind of, like, wrote up and published to Wikipedia, like, you're

Speaker:

representing that in terms of its vocabulary. But now instead of doing that, we

Speaker:

have, like, this magical fixed sized box that allows us

Speaker:

to represent chunks of text in a way that is

Speaker:

extremely fascinating and abstract. And every time I think about it, it just, like, blows

Speaker:

my mind, but that's kind of, like, the main kind of difference is the way

Speaker:

we're representing that information and how compact that is and how

Speaker:

generalizable it has become.

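A quick sketch of the vocabulary-sized representation being described, using scikit-learn's CountVectorizer; the documents are illustrative:

```python
# Vocabulary-sized document vectors: one column per unique word.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "rockets reach the moon",
]
bow = CountVectorizer().fit_transform(docs)

print(bow.shape)         # (3 documents, one column per vocabulary word)
print(bow.toarray()[0])  # word counts, mostly zeros as the vocabulary grows
```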
Speaker:

Yeah. It's almost like you're, you know, correct me if I'm wrong, but, you know,

Speaker:

creating these vectors, these large vector databases, right, with, you

Speaker:

know, 10,000 or 12,000 dimensions, right, of how these words

Speaker:

are measured in relationship to others.

Speaker:

It's almost as a consequence of training a large language

Speaker:

model, you create a knowledge graph. Is that is that true? Is that really the

Speaker:

case where, you know, like, you know, dog is most likely to be

Speaker:

next to, you know, the word pet, you know, or

Speaker:

it has the same distance. Is that I'm not

Speaker:

explaining it right. No. No. You're on the right track exactly.

Speaker:

And I think this is, like, one of the most fascinating qualities

Speaker:

of even, like, what people would consider, like, older

Speaker:

embedding models is this idea that you can take, like, a training task that

Speaker:

seems completely unrelated to the quality that you want in a downstream model,

Speaker:

and it turns out that that actually achieves that quality. So, what you were referring

Speaker:

to, Frank, is this idea that you might have, like, a sentence. You

Speaker:

might have, like, I took my dog out on a walk, and you might say,

Speaker:

okay. I'm gonna remove the word walk, and

Speaker:

I'm gonna train some model that tries to predict what that word

Speaker:

I removed was. This is masked language modeling, which is this idea that you're

Speaker:

kind of getting at of, like, okay, what are the words and how are they

Speaker:

in relation to the other words in that sentence? And it turns out that if

Speaker:

you, like, do this with, like, hundreds of thousands or millions of sentences and

Speaker:

words, in some corpus that is somewhat representative of

Speaker:

how people use human language, you

Speaker:

will get really good at this task, number one, because you're training the

Speaker:

model on that task exactly. But if you are training a neural

Speaker:

network on that task, some intermediate layer representation

Speaker:

in that network, so somewhere in that set of matrix

Speaker:

multiplications where you're turning this input sentence into some fixed size

Speaker:

vector representation is gonna be a good representation

Speaker:

of what that word or that token or that sentence is going to be.

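A minimal sketch of that masked-word prediction task, using the Hugging Face fill-mask pipeline with a pretrained BERT; the model choice and sentence are illustrative:

```python
# Mask a word and ask a pretrained BERT to predict it from context.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("I took my dog out on a [MASK]."):
    print(f"{prediction['token_str']:>10}  {prediction['score']:.3f}")
```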
Speaker:

And the fact that that works is not intuitive. Right?

Speaker:

The the fact that that works has been shown empirically, and it turns out that

Speaker:

we can kind of do that and kind of have these models work really well.

Speaker:

And nowadays, in addition to kind of doing that, which is what we would consider

Speaker:

pretraining on some large corpus, we now fine tune those

Speaker:

embedding models on specific tasks that are important to us

Speaker:

for retrieval. Like, okay, we have this query or question we're

Speaker:

asking. We have the set of documents that might answer this question or might

Speaker:

not. We want a model that makes it so that the query's embedding and the

Speaker:

relevant documents' embeddings are in the same vector space. So you're on the right track.

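A sketch of queries and documents sharing one vector space, using the sentence-transformers library; the model name and texts are illustrative:

```python
# Queries and documents embedded into the same vector space, so the
# relevant document scores highest by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I reset my password?"
docs = [
    "To reset your password, open Settings and choose Account, then Security.",
    "Our office is closed on public holidays.",
]

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

print(util.cos_sim(query_emb, doc_embs))  # higher score for the first doc
```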
Speaker:

That's, like, basically how these models are able to learn these things. I don't know

Speaker:

if I would call them a graph representation, maybe a little bit

Speaker:

of being pedantic on, like, use of words there because that can

Speaker:

be a little bit different in how you're organizing that information.

Speaker:

But you can make the argument that the way that these large language models are

Speaker:

representing information is a compressed form of, like, the giant dataset that they're

Speaker:

trained on. And we don't actually know exactly, like, where that

Speaker:

information lies inside that neural network. There's some research that's,

Speaker:

like, trying to get at answering that question, But you could, for the sake of

Speaker:

argument, be like, yeah, there's probably, like, a dog

Speaker:

node somewhere in this neural network that knows a ton about dogs, and that's how

Speaker:

we're able to kind of learn this information. That is the stuff that we don't

Speaker:

exactly know. Interesting. Because, there was a really good

Speaker:

video by 3Blue1Brown, which you probably know. I love that

Speaker:

channel. Where he gives examples where, you know, famous historical

Speaker:

leaders from Britain have the same distance

Speaker:

from Britain that, if you change the country to Italy

Speaker:

or the United States, their leaders have from those countries. So you can kind

Speaker:

of infer, and I'm not saying that the AI, it

Speaker:

almost seems like this knowledge graph is also a byproduct

Speaker:

of building this out. Like, there's some

Speaker:

type of encoding or semantic meaning, I guess, is really what it is, right, that you get with it.

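A sketch of those parallel-offset analogies with pretrained GloVe vectors loaded through gensim; the vector set named here is one of gensim's standard downloadable models:

```python
# Parallel offsets in a pretrained word-vector space: the same vector
# difference encodes the same relationship across different word pairs.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained GloVe set

# king - man + woman lands near queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The country-to-capital offset behaves the same way.
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```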
Speaker:

And, I wanna get

Speaker:

your thoughts because yesterday, I caught the

Speaker:

first half of the Jensen Huang keynote at CES,

Speaker:

which, you know, we're recording this on January 8th. Right? And one of the

Speaker:

things that the video starts off with is, you know, the idea

Speaker:

that tokens are kind of fundamental elements of

Speaker:

knowledge. And I did a live stream where I'm like, well, I never really thought

Speaker:

about it this way. Right? They're the building blocks of knowledge or the pixels, if

Speaker:

you will, of knowledge. And I wanted to get your

Speaker:

thoughts on that because, like, that kind of blew my mind and maybe I'm simple.

Speaker:

I don't know. Maybe I'm not. But it seems like we've been kinda

Speaker:

dancing around this idea where and now NVIDIA is really

Speaker:

fully, you know, going all in on this, the idea that, you know,

Speaker:

these are not, this isn't an AI system. It's a token factory

Speaker:

or a token store. What are your thoughts on that? I'm curious.

Speaker:

So when I started learning about how, like, tokenization works

Speaker:

and how we're able to kind of, like, basically build these

Speaker:

models without having massive, massive vocabularies,

Speaker:

it is pretty

Speaker:

interesting to be, like, okay. Like, maybe there's some

Speaker:

abstract notion of information that each token has that

Speaker:

is what the model is learning during training time. And then

Speaker:

we're just combining these sets of information in order to kind of, like, understand

Speaker:

what words mean or what documents mean, so on and so forth. Because when you

Speaker:

look at how tokenizers work and the number of

Speaker:

tokens for, like, maybe the English language or maybe, like, a really multilingual

Speaker:

model like RoBERTa or multilingual-e5-large, they're a lot

Speaker:

fewer than you would expect. Like, it's on the order of, like, maybe 100,000,

Speaker:

200,000, or 300,000 tokens.

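A quick sketch of those vocabulary sizes, using a pretrained tokenizer from the transformers library; BERT's English wordpiece vocabulary is even smaller, around 30,000 entries:

```python
# Subword tokenization keeps the vocabulary small: rare words are
# decomposed into pieces rather than each getting their own entry.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.vocab_size)  # about 30,000 wordpieces

print(tokenizer.tokenize("Tokenization handles unfathomable words."))
```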
Speaker:

So it is kind of

Speaker:

odd to think about whether those tokens

Speaker:

themselves hold information that's readily interpretable for us. But I

Speaker:

think that we've gotten so far with using

Speaker:

systems that are just combining, the operations on top of

Speaker:

these tokens in order to retrieve the information that these systems have learned, that there's

Speaker:

definitely something important there. And I would love to, like, know

Speaker:

exactly, like, what is happening when we're able to do that. The

Speaker:

heuristic that I like to use is, large

Speaker:

language models are generally reflections of the training datasets that they've been trained on,

Speaker:

and they're basically creating, like, really efficient indexes over that

Speaker:

information. And sometimes those indices hallucinate. And the reason

Speaker:

why is because, when we ask, quote, unquote,

Speaker:

a question to a large language model or query a large language model, we

Speaker:

are kind of conditioning that model, on a probability

Speaker:

space where every token being generated after is

Speaker:

likely to exist given the query or the context or whatever we're passing to

Speaker:

it. And once you think about it that way, then it just feels like

Speaker:

instead of thinking about what each of the tokens are doing, you're kind of just

Speaker:

querying what the model has been trained on and what it will tell you

Speaker:

based on what it, quote unquote, learned or knows.

Speaker:

And then you can kind of run with that metaphor a lot and build systems

Speaker:

on on top of that. That seems, much more actionable than thinking about,

Speaker:

like, what each of the tokens are doing individually. Does that kinda make sense? No.

Speaker:

That makes a lot of sense. I think the whole gestalt of it is what

Speaker:

really makes it magical. Right? Like Yeah. You know, you can you

Speaker:

can obviously, I don't know, this is not, like, the newest iPhone

Speaker:

or whatever. But, you know, if you go through the the text auto complete,

Speaker:

you can maybe make a sentence that sounds like

Speaker:

something you would write. But much beyond that, it starts getting weird. And

Speaker:

early generative AI was very much like that, particularly the images.

Speaker:

Well, you know. Like, yes. A hundred percent

Speaker:

understand. I started learning about generative, text

Speaker:

generation before we had instruction fine-tuned models. So are you

Speaker:

familiar with, like, the concept of instruction fine tuning, Frank? I think I am,

Speaker:

but I IBM slash Red Hat defines it one way. I would like to get

Speaker:

your opinion. Yeah. So, this is the idea that

Speaker:

you can train or fine tune large language models to follow

Speaker:

instructions to complete tasks. So, before we had,

Speaker:

like, models that could that we could just, like, ask questions of and just, like,

Speaker:

receive answers directly, you had to craft text

Speaker:

that would increase the probability that the document that you want to

Speaker:

generate would happen. So if you wanted a story about, like, unicorns or something,

Speaker:

you would have to start your query to the LLM as there

Speaker:

once was, like, a set of unicorns living in the forest. Blah blah blah blah.

Speaker:

And then it would just, like, complete the sentence, just like a fancy version of autocomplete.

Speaker:

Right. And that's kind of, like, what we used to have, and that was

Speaker:

pretty hard to work with. And then once researchers kinda cracked, like, wait a second.

Speaker:

We can create a dataset of, like, instruction pairs and, like, document

Speaker:

sets and fine tune models on them. And it turns out now we can just,

Speaker:

like, ask models to do things, and they will do them. Whether or not

Speaker:

those are correct is kind of the next part of the story. But getting to

Speaker:

that point, it was, like, pretty interesting and pretty significant.

Speaker:

Interesting. Interesting. When I think of

Speaker:

fine tuning, I think of

Speaker:

primarily InstructLab, where you basically kinda have a

Speaker:

LoRA layer on top of the base LLM doing

Speaker:

that. Is that the same thing? Or is it kind of slightly

Speaker:

it sounds like it's slightly nuanced. So the nuance there

Speaker:

is that, one, the methodology that I'm

Speaker:

describing is mostly dataset driven. So you have, like, your original LLM,

Speaker:

and then you have, like, a new dataset that allows the LLM to learn a

Speaker:

specific task. Or in this case, like, a generalized form of tasks,

Speaker:

which is you have instruction, answer, user query,

Speaker:

give it an instruction. Whereas in your case, you're kind of, like, adding another layer

Speaker:

to the LLM and, like, forcing the LLM to learn all the new

Speaker:

methodology inside that layer in order to accomplish a specific

Speaker:

task. So that's kind of like what the fine tuning ends up doing. So the other

Speaker:

way, there's multiple ways to do this, it seems. Right? Like, there's that way

Speaker:

we add the layer, but there's also kind of I hate the term prompt engineering

Speaker:

because it's just so overblown. But, like, giving it

Speaker:

more context and samples. And now that the the token context

Speaker:

window is large enough that you don't have to be well, if you wanna

Speaker:

save money, you have to be very mindful of that. But if you're running it

Speaker:

locally, like, doesn't really matter. Well, you could give it an example of

Speaker:

let's just say you had I'm trying to think of a short story or a

Speaker:

novel. I don't know. Let's pretend,

Speaker:

Moby Dick was only a 100 pages. Right? I

Speaker:

could give it that as part of the prompt. Let's say, write a sequel

Speaker:

to this book based on what happens in this one. Is that what you're talking

Speaker:

about? Were you kinda giving an example as part of the prompt? Or is there

Speaker:

some and not part of the layer? Or some combination thereof? Or was some third

Speaker:

thing entirely? So this would be like, what what

Speaker:

you're describing is more like few shot learning, which is you gave kind of an

Speaker:

example, and then you're, like, okay. Like, given these examples, can you do this other

Speaker:

task that I've described on this unseen example? What I'm describing is

Speaker:

kind of, like, slightly before that. So, like, before we had the ability to, like,

Speaker:

give models examples, we had to

Speaker:

create the ability to follow instructions. And then once you have the ability to

Speaker:

follow instructions, you can be like, okay. Here are the instructions. Here's

Speaker:

examples of correctly completing the instruction, now do the instruction.

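A sketch of that progression in prompt form: the same sentiment task posed as plain completion, as a bare instruction, and as an instruction plus a few worked examples. The wording is illustrative, not any particular vendor's format:

```python
# The same task three ways: completion, instruction, and few-shot.

completion_style = "The review 'I loved this movie' expresses the sentiment"

instruction_style = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: I loved this movie.\n"
    "Sentiment:"
)

few_shot_style = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: The plot dragged on forever.\nSentiment: negative\n"
    "Review: A stunning, heartfelt film.\nSentiment: positive\n"
    "Review: I loved this movie.\nSentiment:"
)

for prompt in (completion_style, instruction_style, few_shot_style):
    print(prompt, end="\n---\n")
```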
Speaker:

And the reason why that happens in that order is

Speaker:

because first, you have, like, just, like, sequence completion, like,

Speaker:

autocomplete. Then you have, like, okay, given this

Speaker:

task, given this set of instructions, just follow the instruction instead of,

Speaker:

like, trying to do autocomplete. And then you have, okay, now you know how to

Speaker:

follow instructions. I'm gonna give you a few data points in order to

Speaker:

learn a new task. Now do this new task. So you're kind of,

Speaker:

like, moving from a situation where you need tons and tons

Speaker:

of data just to get the, sequence completion. And then you need

Speaker:

a smaller set of data to, like, get the capability to follow instructions.

Speaker:

And then you need a very, very, very small amount of data, like,

Speaker:

maybe 3 data points or 10 examples or 15 examples to complete kind of, like,

Speaker:

a new task. So there's a lot of kind of nuance in, like, how

Speaker:

modern LLMs are being used and how they're kind of trained and fine tuned, so

Speaker:

on and so forth. And I think there's a lot of, like,

Speaker:

importance in, like, learning what happened kind of

Speaker:

before because the advancements have happened so quickly. It can be really hard to kind

Speaker:

of differentiate, like, oh, why do models perform like this? Why

Speaker:

do things kind of happen like that? And even though, prompt

Speaker:

engineering has kind of, like, let's say, traveled through the

Speaker:

hype cycle where people were, like, really excited about it, and then we're, like, this

Speaker:

is not actually that interesting. Right. What's interesting is that,

Speaker:

building a good RAG, or retrieval augmented generation, system,

Speaker:

you really need to be good at prompt engineering in a sense

Speaker:

because you're assembling the correct context for this model

Speaker:

to answer some downstream question, And it's not

Speaker:

intuitive how to assemble that context. So understanding, like, how are these

Speaker:

models are trained, like, whether they can follow instructions, how good they are at

Speaker:

doing so, how many examples of information they need in order to accomplish some task

Speaker:

really affects how you build that knowledge base in order to help the

Speaker:

model do some sort of new thing. Interesting.

Speaker:

So RAG is obviously all the rage now.

Speaker:

Yep. But there's also a relatively new approach, because this

Speaker:

space changes rapidly. Like, I mean, I took 2 weeks off in December, and

Speaker:

I feel completely disconnected from the cutting edge, you know.

Speaker:

Because when I was watching the keynote from CES, and I'm like, wow. That's

Speaker:

really cool. And I was texting, you know, slacking with a coworker, and he goes,

Speaker:

oh, no. This is a retread of their, like, last keynote they did. Like

Speaker:

and I'm like, okay. Wow. Blink and you missed

Speaker:

something. So what

Speaker:

you're describing, the fine tuning, is that really what RAFT is, where the

Speaker:

idea is that you have kind of retrieval augmented fine tuning, which I think is what

Speaker:

the acronym stands for. Is that not I'm

Speaker:

not familiar with how RAFT works. So I don't wanna, like, kind of venture

Speaker:

and guess without knowing what it is. But do you remember, like, what context

Speaker:

you encountered this in? Basically, it's the idea that

Speaker:

you can fine tune the results. It sounds very

Speaker:

similar to what you're doing, and I haven't read the paper in a while.

Speaker:

Back when I was a Microsoft MVP, like, you know,

Speaker:

Microsoft Research had this thing for their calls, and they

Speaker:

were all raving about it. The paper had just come out and things like that.

Speaker:

It's the idea that you can kind of give it pretraining examples.

Speaker:

You start with a base LLM, and you give it pretraining examples, and then

Speaker:

you add the retrieval

Speaker:

augmented portion on top of it. It's very similar, not to

Speaker:

plug my you know, for my day job. I work at Red Hat. That's why

Speaker:

there's a fedora there. We have a product called RHEL

Speaker:

AI, which is based on an upstream open source project called

Speaker:

InstructLab. And it's a similar idea in that you

Speaker:

basically give it a set of data.

Speaker:

And then, there's a little more to it because there's a

Speaker:

teacher model, and basically what it'll do is synthetic data generation.

Speaker:

So you can start with a modest document set.

Speaker:

And based on the questions and answers that you

Speaker:

form and

Speaker:

the taxonomy that you attach to it, it will

Speaker:

create a LoRA layer on top of an existing LLM.

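A minimal sketch of attaching a LoRA adapter to a base model with the Hugging Face peft library; the base model and hyperparameters are illustrative, not the specific setup InstructLab or RAFT uses:

```python
# Attach a small trainable LoRA adapter to a frozen base model.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's attention projection layer
    lora_dropout=0.05,
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```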
Speaker:

And it could be that it's not quite exactly the same as

Speaker:

RAFT, but it's definitely in the same direction. Same thing as, like, BERT, ELMo,

Speaker:

and, you know, RoBERTa, which, I think

Speaker:

I think I understand. So I think the

Speaker:

problem it might be addressing is really similar to the problem that

Speaker:

traditional RAG tries to address, except in a more kind of deliberate fashion

Speaker:

Exactly. Yeah. Where you have some document store internally. Like, let's say we

Speaker:

both work at some company, and we have a giant customer support document store.

Speaker:

You take some LLM off the shelf. It's not necessarily gonna know the

Speaker:

contents of your internal kind of documents. So how can you get

Speaker:

it to, like, successfully help answer tickets or triage tickets that

Speaker:

you're trying to build, so that you can answer, like, the most difficult tickets and

Speaker:

kind of work toward that. In this situation, maybe you

Speaker:

want to, inject some of the knowledge of

Speaker:

the documents in addition to having the

Speaker:

model being able to search over the document store. So maybe, like, what this

Speaker:

LoRA layer is doing is, like, absorbing Yeah. Some of the knowledge from the

Speaker:

document store so that you can kind of more

Speaker:

efficiently query, the database and so

Speaker:

that you don't have to, like, query it all the time. The only,

Speaker:

issue, quote, unquote, I'd have with that method is that you'd have to, like, keep

Speaker:

that updated from time to time, and that's, like, nontrivial. Whereas

Speaker:

if you just do, like, traditional RAG, you just need to

Speaker:

update your vector store, and then you can just have the model

Speaker:

query that new information when you need to. But, you know, it's always best to

Speaker:

use whatever solution works best for your, given use case.

Speaker:

And experimenting with different use cases is always really important. But I imagine that's, like,

Speaker:

kind of what that is trying to address. That is basically it.

Speaker:

I, you know, I don't wanna go down that rabbit hole. But

Speaker:

but, basically, the idea is that, if

Speaker:

you train an LLM or you have a layer on top of an

Speaker:

LLM that not only does retrieval from a source document

Speaker:

store. Right? I think that's a pretty set pattern. But it also has a

Speaker:

better understanding of your business, your industry, the jargon.

Speaker:

Right. Right. Blah blah blah. Right? The idea is that the retrieval success

Speaker:

rate will be higher. Now we're not publishing the numbers yet,

Speaker:

but the research is still ongoing. But basically, it's a

Speaker:

pretty substantial, from what I've seen. Well, I haven't

Speaker:

seen the actual numbers yet, but from what I've been told by

Speaker:

the researcher, it is a substantial improvement

Speaker:

where the juice is worth the squeeze in that regard.

Speaker:

And it's also, computationally, you're not quite training the

Speaker:

whole thing again. You're just kinda putting a new Instagram filter, so to

Speaker:

speak, together on top of the base. So it definitely

Speaker:

does some things. Now when we get the hard

Speaker:

numbers, then, you know, I mean, I can

Speaker:

say them publicly, then I think we'll know what the

Speaker:

squeeze-to-juice ratio is.

Speaker:

But, I can confidently say publicly now, like, there's a there

Speaker:

there. Yeah. And, you know, we'll have those numbers soon

Speaker:

enough. But it's it's interesting because you're right. I mean, this paper

Speaker:

came out in 2019. Right? There was just an

Speaker:

explosion of these different mechanisms. You mentioned BERT. You mentioned RoBERTa.

Speaker:

Fun fact, my wife's name is Roberta. So that was kind of fun.

Speaker:

There was ELMo. There was ERNIE. There was a whole Sesame

Speaker:

Street themed zoo of model

Speaker:

types. That branching out in

Speaker:

those different directions seems to have stalled, and we're going into more of

Speaker:

these retrieval augmented generation systems. So for those who because

Speaker:

not all of our listeners know exactly what retrieval

Speaker:

augmented generation systems are. Could you give kind of a

Speaker:

level 200 elevator explanation? Sure.

Speaker:

So, when you speak to a modern chatbot,

Speaker:

what's happening is that they've learned information through their pre

Speaker:

training processes, the large corpus of basically the entire Internet,

Speaker:

and are generating information based on the query that you're passing in.

Speaker:

The problem that often occurs is that

Speaker:

these AI models might err, and the error could

Speaker:

be making information up that doesn't

Speaker:

exist. For example, if a model is trained before a period of time,

Speaker:

like, it might not know about that period of time, which happens more

Speaker:

often than you think. The information could be false, untruthful, or it could

Speaker:

just be incorrect in a way that's not, like, bad, but still not

Speaker:

helpful. And the reason for this is the way that these

Speaker:

models are accessing that information. The idea behind retrieval

Speaker:

augmented generation is that instead of having the model try

Speaker:

to, generate the correct document or the correct

Speaker:

response given its pretraining process, you instead

Speaker:

add factual content to the query that you're asking

Speaker:

the model for. You first search for that content, which is where

Speaker:

the retrieval part comes, and then you augment the generation of what that

Speaker:

model is going to create based on that content, hence

Speaker:

retrieval augmented generation. There's usually a querying

Speaker:

step. So you take in a user query, you hit it against some sort

Speaker:

of database, usually a vector database. In our case, it could be Pinecone.

Speaker:

You find a set of relevant documents. You pass that to the generating LLM.

Speaker:

The generating LLM uses those documents to generate a final

Speaker:

response. And it turns out that if you do this, you can reduce the rate of

Speaker:

hallucinations. And that makes sense because if the model was given true

Speaker:

information and then conditioned its generation on that information, it

Speaker:

follows that the probability of generating information that is correct could be higher.

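A minimal sketch of the retrieve-then-generate loop just described; embed, vector_db, and llm are hypothetical stand-ins for whatever embedding model, vector database, and generating LLM you actually use, not a specific vendor's API:

```python
# Retrieve, augment, generate: the three steps described above.
def retrieval_augmented_answer(question, vector_db, embed, llm, top_k=3):
    # 1. Retrieval: embed the user query and search the vector database.
    query_vector = embed(question)
    documents = vector_db.query(query_vector, top_k=top_k)

    # 2. Augmentation: put the retrieved content into the prompt.
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

    # 3. Generation: the LLM conditions its answer on the retrieved facts.
    return llm(prompt)
```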
Speaker:

That's a good

Speaker:

explanation. So you're basically giving it a

Speaker:

crash course in what documents you care about. Right? Like

Speaker:

Exactly. Interesting. And that's a good segue

Speaker:

because you work for Pinecone. So so tell me about Pinecone. What is Pinecone?

Speaker:

Yeah. So Pinecone is a knowledge layer for AI. It's

Speaker:

kind of the way we like to describe it. The main product that

Speaker:

we provide is a vector database. So this is a way of storing

Speaker:

information, information that has been vectorized, in a really

Speaker:

efficient manner. And it turns out that if you have the ability to store information

Speaker:

in this manner, you can search against it really quickly, with

Speaker:

low latency and to find the things that you need to find really interesting for

Speaker:

these types of semantic search and rag systems. Pinecone has a few other

Speaker:

offerings now that kind of help people build these systems a lot easier. There's

Speaker:

Pinecone Inference, which lets you embed data in order to do that querying

Speaker:

step. Pinecone Assistant, which lets you just build a RAG

Speaker:

system immediately just by upserting documents into our vector database,

Speaker:

so on and so forth. But the reason why, like, you

Speaker:

need a vector database is because of all of this advance in

Speaker:

semantic search and embedding models. People have gotten really, really

Speaker:

good at representing chunks of information using these dense, fixed-size

Speaker:

vectors. But once you have thousands, millions,

Speaker:

even billions of vectors across tons of different users, you need a way

Speaker:

of indexing this information to access it really quickly at

Speaker:

scale, especially if your chatbot's gonna be querying this vector database really

Speaker:

often. And so having a specialized data store that can handle that type

Speaker:

of search becomes really useful. That's why Pinecone is here, and that's

Speaker:

why we exist.

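A sketch using the Pinecone Python client; the API key, index name, and tiny four-dimensional vectors are placeholders, and a real index would match your embedding model's dimensionality:

```python
# Upsert a few vectorized chunks, then run a nearest-neighbor query.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("example-index")  # assumes this index already exists

index.upsert(vectors=[
    {"id": "doc-1", "values": [0.1, 0.3, 0.5, 0.7],
     "metadata": {"text": "Pinecone is a vector database."}},
    {"id": "doc-2", "values": [0.9, 0.1, 0.2, 0.4],
     "metadata": {"text": "Origami is the art of paper folding."}},
])

results = index.query(vector=[0.1, 0.3, 0.5, 0.6], top_k=1, include_metadata=True)
for match in results.matches:
    print(match.id, match.score, match.metadata["text"])
```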
Speaker:

Interesting. Interesting. One of the other interesting things from your bio, aside from

Speaker:

the origami, was the YouTube videos.

Speaker:

Tell me about this. So do you

Speaker:

or your crew create the YouTube videos, do you use outside

Speaker:

tools, or is it something completely, it's just part of your job as a developer

Speaker:

advocate? So it is just part of my job as a

Speaker:

developer advocate. Oh, okay. Like, often, you

Speaker:

know, I do that because we are interviewing people or because there's a new

Speaker:

concept we wanna teach people, so on and so forth. Or we do a webinar,

Speaker:

and we just upload it to YouTube. Oh, very cool. Very cool.

Speaker:

Yeah. I started my career in developer

Speaker:

advocacy. When it was called evangelism. So I was a Microsoft

Speaker:

evangelist for a while. So yeah. Yeah. Cool. YouTube

Speaker:

is very important. Yep. But it's

Speaker:

also, I think, a sign of how people learn.

Speaker:

YouTube University is very

Speaker:

real. Right? Yep. You know, not a knock on

Speaker:

traditional schools, not a knock on traditional publishing, but this space

Speaker:

is moving so fast that if it weren't for YouTubers like 3Blue1Brown

Speaker:

I think his real name is, Grant Sanderson. I think that's his real name.

Speaker:

Somebody will send me hate mail if I get it wrong. But,

Speaker:

he is, like, really good at explaining these

Speaker:

really abstract mathematical concepts. And

Speaker:

unlike you, I didn't study math undergrad. I mean, I had to. I

Speaker:

only took the requirements. Right? But I have comp sci degrees. So, like, for me

Speaker:

to kind of fall in love with math again or for the first time, depending

Speaker:

on depending on how you wanna say that, for me, that

Speaker:

was very helpful. And having an understanding of this, if you're a data engineer

Speaker:

or, you know, wanna get into this space, it's

Speaker:

definitely true that vector databases, for a traditional kinda SQL kinda

Speaker:

RDBMS person, will look very awkward at first. But

Speaker:

I know a lot of people that have made the transition, and they kinda love

Speaker:

it. Right? Because in a lot of ways, it's way more efficient,

Speaker:

than, I dare say, traditional data stores. But when you're

Speaker:

processing large blocks of text, it's really good for kind of

Speaker:

parsing through that. But

Speaker:

that's really cool. So, we do have the preset

Speaker:

questions, if you're good with doing those. I'll put them in the chat in case

Speaker:

you don't have them. Sure. They're not brain teasers

Speaker:

or anything like that. They are pretty basic

Speaker:

questions, and I will paste them in the chat.

Speaker:

So the first question is, how did you find your way into

Speaker:

AI? Did you find AI, or did

Speaker:

AI find you? So this is a little bit of a

Speaker:

crazy story, but AI definitely found me.

Speaker:

So when I was in college, when I was looking for my first

Speaker:

internship, I couldn't find any internships, basically, because I had, like, no

Speaker:

previous experience in working at tech or anything like that. And,

Speaker:

the first company I worked for, Speeko, took a chance on me because they were

Speaker:

building public speaking, tools to kind of help people learn how to do

Speaker:

public speaking better, for an iOS app. And I had some

Speaker:

public speaking experience. They were, like, close enough. We'll have you come on and kind

Speaker:

of help us, like, work work things out. And while I was there, it was

Speaker:

made very obvious to me how important building

Speaker:

very basic deep learning systems and AI systems to kind

Speaker:

of accomplish really specific tasks that could help serve an

Speaker:

ultimate goal. Like, what we were trying to do is just, like, see how many

Speaker:

filler words people are using or how quickly or slowly you were speaking.

Speaker:

And that requires a lot of, complicated

Speaker:

processing because you have to do transcription and because you have to figure out what

Speaker:

words are being said, so on and so forth. So kind of experiencing that and

Speaker:

seeing that firsthand really opened my eyes to how powerful

Speaker:

the technology had been even back in, like, 2017. And ever

Speaker:

since then, I started learning more and more and more about statistics,

Speaker:

AI, natural language processing through my internships,

Speaker:

learning more complicated problems, reading research papers, so on and so forth.

Speaker:

And I got to where I am now. A lot of where I learned is

Speaker:

just out of pure curiosity. Just like, okay. There's this new thing. I wanna learn

Speaker:

about it. That's where I wanna be. And that's kind of how I fell into

Speaker:

large language models and AI, just by wanting to learn about what was going to

Speaker:

happen and then eventually being there. So it definitely found me. I was

Speaker:

not looking for it. Didn't even know I liked statistics until I started doing

Speaker:

statistical modeling. And I was like, wait. This is really fun. I wanna do a

Speaker:

lot more of this. I wanna learn a lot more of this. And I knew

Speaker:

that, once I was in college and I bought a statistics book for fun, and

Speaker:

I was like, okay. I'm I'm past the point of no return. Like, this is

Speaker:

definitely Right. Right. Right. That might be one of the first times in

Speaker:

history that that's been said. Right. Because I learned statistics for

Speaker:

fun. I I took stats in college.

Speaker:

I hated it. Hated every minute of it. But

Speaker:

when I got into data science,

Speaker:

the first two weeks were not fun. I'm not gonna lie. Yep. But

Speaker:

just like the vi editor, once you stick with it,

Speaker:

Stockholm syndrome kicks in, And you start loving

Speaker:

it. That's cool. Two: what's your favorite

Speaker:

part of your current gig? The favorite part of my

Speaker:

current job is being able to learn interesting,

Speaker:

fun, even complicated things in data science and AI,

Speaker:

and figuring out how to communicate them to a wide

Speaker:

audience. It's a really fun challenge. It's really similar to, like,

Speaker:

what 3Blue1Brown does all the time on the YouTube channel, and it's

Speaker:

something that I get to learn and practice and keep keep doing. That's the best

Speaker:

part of the job. I love learning things and, like, teaching other people about them

Speaker:

and learning even more things. And the fact that I have an opportunity to do

Speaker:

that every single day is, like, the best. That's cool. That's

Speaker:

cool. We have three complete-the-sentence questions. When I'm

Speaker:

not working, I enjoy blank. When I'm

Speaker:

not working, I enjoy baking sweet treats and

Speaker:

goods. I can't have any dairy. So very often, I had to kind

Speaker:

of give up a lot of the cakes and desserts that I loved eating when

Speaker:

I was younger. So now I, like, spend my time trying to figure out how

Speaker:

I can make them again without dairy so they taste really good. So that's that's

Speaker:

something I enjoy I really enjoy doing. Very cool.

Speaker:

Next, complete the sentence. I think the coolest thing in technology

Speaker:

today is blank. I

Speaker:

thought really hard about this question because we're living in a

Speaker:

crazy time of technological development. But the thing that really

Speaker:

stuck out to me and the thing that was also the moment for me

Speaker:

when I started working with, like, chatbots and LLMs was code

Speaker:

generation models. The first time I learned how to

Speaker:

use, GitHub Copilot specifically, I

Speaker:

was I was completing some function, and it completed it before I was done typing

Speaker:

it. And I was like, what the heck? This is amazing. Like, this

Speaker:

actually figured out exactly what I needed. And because I was still, like,

Speaker:

a budding developer, it was extremely helpful because I could learn

Speaker:

faster rather than having already a huge kind of store knowledge already in my

Speaker:

brain and kind of pulling from that. So I could see it benefiting my workflow.

Speaker:

So I think the development of those tools and modern tools like

Speaker:

Cursor, so on and so forth, extremely cool. And I can't wait to

Speaker:

see, like, what the next generation of those technologies will look like. Yeah. I

Speaker:

mean, that's a great example. It's almost like you don't

Speaker:

need, you know, the classic 10,000 hours to master a skill or something like

Speaker:

that. It's almost like you can leverage the AI to take on the

Speaker:

lion's share of the 10,000 hours. You're still gonna need to know something. You still

Speaker:

have to put in some reps, but not to the degree that you used to.

Speaker:

No. I think that's gonna be very transformative. I mean, I'm

Speaker:

learning JavaScript and Next.js on the side because it's something I have no

Speaker:

experience in. Right. And I was able to build my personal website

Speaker:

entirely through using Cursor and Progression. Nice. I

Speaker:

oughta check that out. Which is insane. Right? Which is, like, really, really

Speaker:

fascinating. And I'm not gonna claim to, like, suddenly be an expert in

Speaker:

Next.js or anything like that. Right? Right. Right. I still wanna learn, like, exactly

Speaker:

what's going on under the hood, But having a project that you can kind of,

Speaker:

like, tinker on that's, like, pretty small in scale and that you can kind of

Speaker:

afford to make a few mistakes on and having, like, an expert system kind of

Speaker:

help you go through that, expert, quote, unquote, being close enough, really cool

Speaker:

learning experience. No. That's a great way to put it because, like, I

Speaker:

I don't have any apps on the modern devices. Right? Like,

Speaker:

so, it would be nice if I

Speaker:

had an Android app that could kick off some automation process that I have.

Speaker:

Right? Or do some kind of tie in with, you know, Copilot

Speaker:

into that or things like that. Like, where, you know, I

Speaker:

originally wrote a content automation system. I originally wrote it in

Speaker:

.NET, but I ported it to Python with

Speaker:

the help of AI. And I could well, that's just it. Right?

Speaker:

Really, the true valuable resource in life is

Speaker:

time. Right? Yes. I mean, I could have done it by hand.

Speaker:

I could have done it by myself, but it was one of those things where

Speaker:

am I gonna do it because it's gonna take x number of hours or whatever?

Speaker:

But if I can just kinda here's the dot net version that I, you know,

Speaker:

I posted. This is before there was Copilot, so I pasted it into Chat

Speaker:

GPT. And it basically spit out a Python

Speaker:

version, had some errors. You know, this was a while ago. But I

Speaker:

was able to, inside of a day, get it done as opposed to

Speaker:

before. Like, I know how my ADD works. Right? Like, I'll start it.

Speaker:

First 3 days, working on it, grinding on it, and then

Speaker:

I don't touch it again for 2 weeks. And it never gets built. But

Speaker:

with this, I'm able to kinda harness the the spark of

Speaker:

inspiration and execute much faster. Now, I don't think

Speaker:

people fully realize, like, you know, it's not all doom and gloom. Nobody's

Speaker:

gonna have any programming jobs. There's a lot of upside too. And I

Speaker:

guess that's just where we are in the hype cycle. As you said.

Speaker:

Yeah. Yeah. Yeah. Exactly. That's a good segue into I look forward to

Speaker:

the day when I can use technology to blank. I look

Speaker:

forward to the day where I can use technology to get a high quality

Speaker:

education on any subject for free. So Nice.

Speaker:

Free education is really important to me. A lot of

Speaker:

what I learned about large language models, deep learning, all that

Speaker:

stuff was online courses that I took for free on places like

Speaker:

EDX, Coursera, so on and so forth. Or people sharing

Speaker:

articles and kind of learning from them, or YouTube videos, or all that sort of

Speaker:

things, in addition to my education. But there's a lot of things you kinda have

Speaker:

to learn after that. Right? And I think that especially with, like,

Speaker:

code generation models, it's, like, very easy to be, like, okay, build me this app

Speaker:

and, like, just make it work. And you can sit there for a couple hours,

Speaker:

and it'll, like, work. But I think the missing piece is

Speaker:

creating a structured kind of learning path that's, like,

Speaker:

personalized to whoever you are for the

Speaker:

thing that you're really interested in with the context of

Speaker:

having, like, these tools that can help you do that thing. And I'm not sure

Speaker:

if we have anybody or any offering that can

Speaker:

kind of do that technologically, because you need a lot of information about what the

Speaker:

user knows or doesn't know. You need to be able to assess ability, and then

Speaker:

you need to be able to kind of create, like, an entire mini course that's

Speaker:

personalized to whatever that person needs. But if we can do that, we can solve

Speaker:

so many wonderful problems. Absolutely. I'm

Speaker:

thinking about special education needs and things like that. I don't think we're that

Speaker:

far off from this. No. But

Speaker:

the biggest issue is going to be just hallucinations. Right? And,

Speaker:

hopefully, people can build, like, RAG systems using tools like Pinecone to kind

Speaker:

of reduce those hallucinations. But also, for something like

Speaker:

that specific use case, we probably need, like, another breakthrough in

Speaker:

indexing information or kind of presenting it, or we need a process that

Speaker:

really allows people to create this information quickly

Speaker:

and verifiably in order to kind of make that happen. But if if that is

Speaker:

a future that we can live in, where technology can can kind of, like, help

Speaker:

people learn, like, really important things really well, that would be

Speaker:

wonderful. And I think that would be, like, amazing for for humanity.

Speaker:

Oh, absolutely. Share something different

Speaker:

about yourself, but remember, this is a family podcast.

Speaker:

One of my favorite hobbies for about a decade is

Speaker:

designing and folding origami. And it's really fun.

Speaker:

It's very easy, but it's also very hard. There's a lot

Speaker:

of complexity inside it as well. One thing people

Speaker:

don't know about that is that there's a lot of mathematical complexity.

Speaker:

So once you get to a point where you wanna design a model with

Speaker:

really specific qualities, really specific features, it suddenly

Speaker:

becomes a paper optimization problem where you

Speaker:

have, like, a fixed size square, and you have different

Speaker:

regions of that paper that you're allocating to portions of the model you're

Speaker:

designing. And it turns out that there are entire mathematical

Speaker:

principles and procedures to solve this problem. So much

Speaker:

so that one of the leading, like, practitioners in the

Speaker:

field is, like, this physicist who wrote a textbook on how to do origami design,

Speaker:

and that's, like, the textbook everyone looks at to, like, learn how to solve it.

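A toy sketch of that paper-allocation idea with SciPy: place flap centers on a unit square to maximize the smallest non-overlapping circle of paper each flap can claim. This is a crude stand-in for real origami design algorithms, not a faithful implementation of them:

```python
# Maximize the smallest usable circle radius around each flap center,
# limited by the square's edges and by neighboring flaps.
import numpy as np
from scipy.optimize import minimize

n_flaps = 4

def negative_min_radius(flat_points):
    points = flat_points.reshape(n_flaps, 2)
    # Radius allowed by the nearest edge of the unit square...
    edge = np.minimum(points, 1 - points).min()
    # ...and by the nearest neighboring flap (half the center distance).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    pair = dists[np.triu_indices(n_flaps, k=1)].min() / 2
    return -min(edge, pair)

rng = np.random.default_rng(0)
result = minimize(negative_min_radius, rng.random(n_flaps * 2),
                  bounds=[(0, 1)] * (n_flaps * 2))
print(result.x.reshape(n_flaps, 2))
print("radius:", -result.fun)
```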
Speaker:

Yeah. I'm not surprised. There's definitely a correlation

Speaker:

with the mathematics of that. And I look at origami creations, and I'm

Speaker:

just fascinated that it could be done from a single sheet. Like, it's

Speaker:

just, how is that? I mean, that's just mind bending. And it

Speaker:

makes sense that there's a mathematical side, because you have a certain type of

Speaker:

constraint. And there's obviously

Speaker:

folds that factor into it and things like that. And, yeah, that's

Speaker:

interesting. What's the name of that book? I should pick it up.

Speaker:

It's called Origami Design Secrets. Got it. Alright. I will check

Speaker:

it out. So where can people learn more about

Speaker:

you and Pinecone? Of course. You wanna learn more about Pinecone? The

Speaker:

best place is our website, pinecone.io. You can also find

Speaker:

us on LinkedIn and on X and other social media platforms.

Speaker:

You wanna learn more about me? You can go to my LinkedIn, which you can

Speaker:

find at Arjun Kirti Patel, or you can go to my website, which is also

Speaker:

my name, arjun, k-i-r-t-i-p

Speaker:

a-t-e-l dot com. Cool. And we can also check out your

Speaker:

Next.js skills there too. Exactly. Hopefully, nothing is

Speaker:

broken, but you can see how well I've gotten by

Speaker:

with the Awesome. Trust me.

Speaker:

JavaScript alone is a frustration

Speaker:

creation device.

Speaker:

Audible sponsors the podcast. Do you do audiobooks? Is there a book that you

Speaker:

would recommend? I do do audiobooks, but I've just

Speaker:

started recently, so I don't have a huge, audiobook library. But

Speaker:

there is. I am a huge fan of short story collections, and

Speaker:

kind of the one that comes to mind is really anything by Ted

Speaker:

Chiang, who does a lot of kind of sci fi short stories. If you've seen

Speaker:

the movie Arrival, the short story it's based on is Story of Your Life,

Speaker:

and it's wonderfully written. It's one of my favorite short stories ever.

Speaker:

Yep. So highly recommend that. I believe the collection is

Speaker:

called Stories of Your Life and Others, or something like that. So

Speaker:

Oh, interesting. Careful with audiobooks. They are very

Speaker:

addictive. So,

Speaker:

Audible is a sponsor of the show. So if you go to thedata

Speaker:

drivenbook.com, you'll get routed to Audible and

Speaker:

you'll get a free book on us. And if you

Speaker:

choose to subscribe, we'll get a little bit of kickback. It helps run the show

Speaker:

and helps us bring some good stuff to

Speaker:

the masses. So any any parting thoughts?

Speaker:

No. But thank you so much for having me on, Frank. This was a ton

Speaker:

of fun. I learned a lot from you, and I hope I helped you

Speaker:

learn one small thing as well. Absolutely. It was

Speaker:

a great conversation, and, we'll let the nice British lady finish the

Speaker:

show. And that's a wrap for this episode of Data Driven, where we

Speaker:

journeyed from the intricacies of vector databases to the surprising

Speaker:

elegance of origami. A huge thank you to Arjun Patel for

Speaker:

sharing his insights on retrieval augmented generation and his passion

Speaker:

for making AI accessible to all. From turning raw data

Speaker:

into actionable knowledge to turning paper into art, Arjun

Speaker:

proves there's beauty in both precision and creativity. If today's

Speaker:

episode left you curious, inspired, or just itching to fold a

Speaker:

piece of paper into something meaningful, be sure to check out

Speaker:

Arjun's work and Pinecone's innovative tools. Remember,

Speaker:

knowledge might be power, but sharing it makes you a force to be reckoned

Speaker:

with. As always, I'm Bailey, your semi-sentient guide to

Speaker:

all things data. Reminding you that while AI might shape our

Speaker:

future, it's the human touch or sometimes the paper fold that

Speaker:

gives it meaning. Until next time, stay curious,

Speaker:

stay analytical, and don't forget to back up your data.

Speaker:

Cheerio.