1
00:00:00,300 --> 00:00:06,370
I'm Miko Pawlikowski
and this is HockeyStick.

2
00:00:07,130 --> 00:00:09,590
Today we talk about algorithms
in machine learning.

3
00:00:09,970 --> 00:00:14,229
I'm joined by Vadim Smolyakov, the
author of "Machine Learning Algorithms

4
00:00:14,250 --> 00:00:18,820
in Depth" by Manning, a data scientist
in the Enterprise and Security

5
00:00:18,820 --> 00:00:24,659
DI R&D team at Microsoft and a former
PhD student in AI at MIT CSAIL.

6
00:00:25,190 --> 00:00:28,650
His job today is to simplify ML
algorithms enough for me to understand.

7
00:00:29,190 --> 00:00:32,340
And if that wasn't hard enough, he's
not allowed to use any pictures.

8
00:00:33,160 --> 00:00:37,009
Welcome to this episode and
thank you for flying HockeyStick.

9
00:00:37,649 --> 00:00:41,720
I don't get to speak to a lot of
people who have been through MIT CSAIL.

10
00:00:41,759 --> 00:00:47,430
It's like a mystical, legendary course at
this stage with all the hype around AI.

11
00:00:48,200 --> 00:00:49,320
Maybe let's start there.

12
00:00:49,399 --> 00:00:50,180
How was it?

13
00:00:50,670 --> 00:00:51,560
How did you enjoy it?

14
00:00:52,050 --> 00:00:53,850
It was definitely an experience.

15
00:00:53,950 --> 00:00:58,729
I really liked the theoretical aspects
of the content treatment. Right now

16
00:00:58,779 --> 00:01:03,579
there are a lot of news articles that
come out about AI, and people try to

17
00:01:03,629 --> 00:01:09,379
catch up on the latest large language
models, which really took over in the past

18
00:01:09,379 --> 00:01:15,769
few years. But what I think is really
unique about MIT CSAIL is the

19
00:01:15,779 --> 00:01:20,899
theoretical treatment of the subject and
really getting in depth and understanding

20
00:01:20,939 --> 00:01:23,159
under the hood, how things work.

21
00:01:23,802 --> 00:01:27,172
It also seems to be like the
who's who, a lot of names that are

22
00:01:27,172 --> 00:01:28,862
recognizable now went through that.

23
00:01:28,862 --> 00:01:32,232
So you focused on Bayesian inference.

24
00:01:32,262 --> 00:01:34,442
What was your thesis about?

25
00:01:35,077 --> 00:01:38,769
My focus was on
Bayesian nonparametrics.

26
00:01:39,279 --> 00:01:44,009
And it's a very interesting set of models
in which the number of parameters grows with data.

27
00:01:45,519 --> 00:01:50,589
So one example is Dirichlet process
K-means, where you

28
00:01:50,609 --> 00:01:55,899
are trying to classify a number of,
let's say, species, and you don't

29
00:01:55,899 --> 00:01:57,559
know how many there are, right?

30
00:01:57,559 --> 00:02:00,809
So as you keep discovering new
species, you add new clusters.

31
00:02:01,449 --> 00:02:05,369
And for K-means to work, you normally need
to set the number of clusters, K.

32
00:02:06,339 --> 00:02:11,329
With Dirichlet process K-means, this
number of clusters is set automatically,

33
00:02:11,599 --> 00:02:14,469
which is one of the main
advantages of the algorithm.

34
00:02:15,309 --> 00:02:18,899
So Bayesian nonparametrics deals
with models in which the number

35
00:02:18,899 --> 00:02:20,469
of parameters grows with data.

36
00:02:21,519 --> 00:02:26,589
It's a clever way of expanding
the model size and capacity

37
00:02:26,589 --> 00:02:29,159
to fit the data available.
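
To make the "clusters set automatically" idea concrete, here is a minimal Python sketch in the style of DP-means. It is not code from the book or the episode. The function name `dp_means`, the penalty `lam`, and seeding with the first point are all assumptions made for illustration: a point opens a new cluster whenever its squared distance to every existing centroid exceeds `lam`.

```python
def dp_means(points, lam, n_iters=10):
    """DP-means-style clustering sketch (hypothetical helper).

    points: list of equal-length numeric tuples
    lam:    squared-distance penalty; larger values give fewer clusters
    """
    # Simplifying assumption: seed with the first point.
    centroids = [tuple(points[0])]
    assignments = [0] * len(points)

    for _ in range(n_iters):
        # Assignment step: nearest centroid, or open a new cluster.
        for i, p in enumerate(points):
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            k = min(range(len(centroids)), key=d2.__getitem__)
            if d2[k] > lam:
                centroids.append(tuple(p))
                k = len(centroids) - 1
            assignments[i] = k

        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))

    return centroids, assignments
```

With three well-separated groups of points and a moderate `lam`, the sketch ends up with three clusters without K ever being specified; choosing `lam` replaces choosing K.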

38
00:02:30,279 --> 00:02:33,899
And did you manage to bring that kind
of research, and expand on it, into

39
00:02:33,899 --> 00:02:35,559
what you do at Microsoft at the moment?

40
00:02:35,999 --> 00:02:41,759
Yeah, at Microsoft at the moment, I'm
a local ML expert on the security team.

41
00:02:41,840 --> 00:02:46,280
So the nature of the machine learning problems
switched from Bayesian inference to

42
00:02:46,280 --> 00:02:48,280
more anomaly detection type problems.

43
00:02:48,570 --> 00:02:54,100
Essentially, I worked at Microsoft
on time series anomaly detection.

44
00:02:54,140 --> 00:02:58,840
I worked on support ticket
classification and routing.

45
00:02:58,940 --> 00:03:00,550
I worked on hyper-personalization.

46
00:03:01,755 --> 00:03:04,775
And an LLM data copilot most recently.

47
00:03:04,875 --> 00:03:09,005
The principles carry over, but
the Bayesian nonparametric nature

48
00:03:09,005 --> 00:03:12,815
of the work doesn't necessarily
extend to my work right now.

49
00:03:14,210 --> 00:03:17,410
The way we actually
met is through your book,

50
00:03:17,420 --> 00:03:20,220
Machine Learning Algorithms in Depth.

51
00:03:21,350 --> 00:03:25,680
Can you tell us a little bit about
the origin story of your book?

52
00:03:25,910 --> 00:03:26,820
Why did you write it?

53
00:03:26,915 --> 00:03:30,065
I've always liked writing
even before graduate school.

54
00:03:30,165 --> 00:03:33,675
And to me, working on a machine
learning project in grad school

55
00:03:33,675 --> 00:03:38,425
and then writing it up in eight
pages of a publication was not enough.

56
00:03:38,475 --> 00:03:39,545
I wanted to do more.

57
00:03:39,545 --> 00:03:43,345
I wanted to blog about the
concepts I was learning.

58
00:03:43,375 --> 00:03:46,945
I wanted to have a journal. And then

59
00:03:47,785 --> 00:03:51,055
I've done a number of blog
posts and I realized that again,

60
00:03:51,065 --> 00:03:52,495
even that wasn't enough for me.

61
00:03:52,495 --> 00:03:57,595
I wanted to compile them into a collection
of algorithms, collection of books.

62
00:03:57,615 --> 00:04:03,685
And at the same time, I was writing
this library of algorithms as part

63
00:04:03,685 --> 00:04:06,855
of graduate studies. I was getting
more experience and I figured,

64
00:04:07,395 --> 00:04:14,095
wouldn't it be nice one day to put
all of this together in a format which

65
00:04:14,395 --> 00:04:16,495
would be accessible to a wide audience?

66
00:04:17,915 --> 00:04:21,645
And I was personally
transitioning my area of study

67
00:04:21,645 --> 00:04:25,535
from wireless communications,
which I did during my master's, to

68
00:04:25,575 --> 00:04:27,915
more machine learning during my PhD.

69
00:04:29,010 --> 00:04:33,330
So I had to learn a lot of
these concepts from scratch.

70
00:04:33,940 --> 00:04:36,620
So there was a steep learning
curve and I figured if I can do

71
00:04:36,620 --> 00:04:38,300
it, then so can other people.

72
00:04:39,060 --> 00:04:43,470
And that was a big motivation
behind writing the book: to be able

73
00:04:43,500 --> 00:04:48,920
to teach these cool concepts that I
learned in graduate school to a wide

74
00:04:48,990 --> 00:04:51,060
audience interested in the topic.

75
00:04:51,060 --> 00:04:54,250
So that sounds like a
long time coming, right?

76
00:04:54,300 --> 00:04:58,390
From the moment you started writing
these blog posts to a finished book.

77
00:04:58,390 --> 00:04:59,630
How many years did that take?

78
00:04:59,850 --> 00:05:05,470
I would say it was two years of just
writing the book, but I also had a lot

79
00:05:05,470 --> 00:05:07,820
of materials prepared ahead of time.

80
00:05:08,080 --> 00:05:13,120
like code and some ideas on what to
write about, which would take another two

81
00:05:13,140 --> 00:05:16,160
years just to compile everything together.

82
00:05:16,830 --> 00:05:20,060
Sounds a little bit like some of
the authors I speak to and some

83
00:05:20,060 --> 00:05:26,870
of my friends are of this camp who
basically like using writing as a

84
00:05:26,870 --> 00:05:28,650
tool to understand things better.

85
00:05:29,010 --> 00:05:32,010
There's this saying that if you can't
explain something simply enough,

86
00:05:32,010 --> 00:05:33,390
you don't understand it well enough.

87
00:05:33,687 --> 00:05:38,537
Are you also of that mind that writing a
book is like the best way for yourself to

88
00:05:38,537 --> 00:05:42,897
organize this information in a way that
you really can explain it to other people?

89
00:05:42,972 --> 00:05:43,772
Yeah, definitely.

90
00:05:43,772 --> 00:05:47,412
And it takes several passes, you
know, like you have one kind

91
00:05:47,412 --> 00:05:50,852
of point of view of an algorithm and
then you start writing it and it's like,

92
00:05:50,852 --> 00:05:54,332
Oh, there are actually concepts, like, I'm
thinking of decision trees right now.

93
00:05:55,012 --> 00:05:57,822
There, you have a certain exposure
to decision trees at first as

94
00:05:57,822 --> 00:05:59,672
interpretable models, and then

95
00:06:00,762 --> 00:06:04,052
you realize that, hey, it's
actually a recursive algorithm

96
00:06:04,052 --> 00:06:08,122
and, you grow the trees recursively
until they reach maximum depth.

97
00:06:08,952 --> 00:06:12,392
And these parameters, like max
depth, start making a lot of

98
00:06:12,392 --> 00:06:16,172
sense, and then you start thinking
about, like, bias-variance trade-offs.

99
00:06:16,202 --> 00:06:20,262
And you really understand
the algorithm in depth.

100
00:06:20,692 --> 00:06:24,042
Okay, so dear listeners, I think
you know where this is going.

101
00:06:24,052 --> 00:06:27,782
Now we're going to try to go through some
of those algorithms from the book and

102
00:06:27,792 --> 00:06:32,142
give you a sneak peek enough to understand
some of the things you might not know.

103
00:06:33,127 --> 00:06:39,197
And also enough to go and buy Vadim's
book, obviously, but before we do that,

104
00:06:39,847 --> 00:06:41,897
so who's the target audience of the book?

105
00:06:41,907 --> 00:06:42,647
Who is it for?

106
00:06:42,647 --> 00:06:44,197
And who's it not for?

107
00:06:44,310 --> 00:06:50,150
I wanted to make the book intermediate
level so that, anyone who has some

108
00:06:50,150 --> 00:06:54,120
experience with machine learning could
benefit from it, but also somebody

109
00:06:54,120 --> 00:06:57,200
who's new to machine learning will
be able to pick up the concepts.

110
00:06:57,730 --> 00:07:03,230
So specifically, I'd say the audience
could be graduate students, it could

111
00:07:03,240 --> 00:07:07,590
be undergraduate students who are
interested in the topic, it could be,

112
00:07:07,650 --> 00:07:10,710
people who are trying to get into the
field of machine learning, but are

113
00:07:10,730 --> 00:07:14,390
working in the industry right now
as, let's say, a software developer.

114
00:07:15,095 --> 00:07:18,425
The book does derive
algorithms from scratch.

115
00:07:18,425 --> 00:07:22,205
So there's some requirements in terms
of mathematics that are good to know,

116
00:07:22,315 --> 00:07:25,295
linear algebra, probability, calculus.

117
00:07:26,225 --> 00:07:31,615
So I would say, anyone with interest
in machine learning should be

118
00:07:31,615 --> 00:07:33,095
able to benefit from this book.

119
00:07:34,355 --> 00:07:34,715
Okay.

120
00:07:34,755 --> 00:07:36,345
But who shouldn't read it?

121
00:07:36,465 --> 00:07:39,875
What kind of expectations
would be misguided for

122
00:07:39,875 --> 00:07:42,065
people approaching your book?

123
00:07:42,125 --> 00:07:45,655
The book is in depth, written
for somebody who's interested in

124
00:07:45,975 --> 00:07:50,755
understanding the algorithms from
scratch, how they work under the hood.

125
00:07:51,365 --> 00:07:55,225
So if you don't have interest in that,
and you just want to use the libraries,

126
00:07:55,475 --> 00:08:01,015
import scikit-learn or Hugging Face, you
wouldn't benefit as much from reading it.

127
00:08:01,020 --> 00:08:01,660
Fair enough.

128
00:08:02,030 --> 00:08:04,980
So, with that warning in mind.

129
00:08:05,030 --> 00:08:08,040
Imagine that you're speaking
to a five year old software

130
00:08:08,040 --> 00:08:12,040
engineer, which we basically are
right now, where should we start?

131
00:08:12,200 --> 00:08:14,950
What's the first example
that you cover in the book?

132
00:08:15,020 --> 00:08:17,850
Does it have Bayesian next to its name?

133
00:08:19,035 --> 00:08:22,175
Yeah, the first thing I talk about
is the Bayesian worldview.

134
00:08:22,945 --> 00:08:26,655
Basically, it's a way to
view the world in which you start

135
00:08:26,655 --> 00:08:28,565
with some prior knowledge, right?

136
00:08:28,605 --> 00:08:30,455
Bayesians, they talk a lot about priors.

137
00:08:30,495 --> 00:08:35,115
You start with a prior knowledge of
a particular aspect of the world.

138
00:08:35,165 --> 00:08:37,605
The world is too complex to
have priors over everything.

139
00:08:37,625 --> 00:08:40,095
So you typically try to
model a particular problem.

140
00:08:40,095 --> 00:08:41,565
You start with a prior knowledge.

141
00:08:42,425 --> 00:08:47,095
And then as you observe data,
you update that prior knowledge

142
00:08:47,115 --> 00:08:48,825
into what's called the posterior.

143
00:08:49,275 --> 00:08:53,545
It's the probability of the
parameters given the data, right?

144
00:08:53,545 --> 00:08:58,665
So as you're observing data, you're evolving
your understanding of the world into

145
00:08:59,155 --> 00:09:01,675
something new, a posterior distribution.
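
As a rough illustration of the prior-to-posterior update described here, the following Python sketch uses a coin-flip (Beta-Bernoulli) model. The model choice and the helper names `update_beta` and `beta_mean` are assumptions for illustration and are not taken from the episode or the book.

```python
def update_beta(alpha, beta, observations):
    """Turn a Beta(alpha, beta) prior into a posterior after 0/1 observations."""
    for x in observations:
        if x:
            alpha += 1  # one more success observed
        else:
            beta += 1   # one more failure observed
    return alpha, beta


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Start from a flat prior, Beta(1, 1), then observe six successes and two failures.
prior = (1, 1)
posterior = update_beta(*prior, [1, 1, 0, 1, 1, 1, 0, 1])
```

Here the prior mean is 0.5 and the posterior is Beta(7, 3) with mean 0.7: the belief has moved toward what the data showed, which is the whole update in miniature.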

146
00:09:02,295 --> 00:09:05,235
So what's an example of
an algorithm like that?

147
00:09:06,047 --> 00:09:08,767
It could be anything you
import from scikit-learn

148
00:09:09,327 --> 00:09:12,217
that is an example of an
algorithm of this nature.

149
00:09:12,227 --> 00:09:16,827
So, let's say, anything with a
graphical model. A Gaussian

150
00:09:16,847 --> 00:09:18,437
mixture model is an example of that.

151
00:09:19,067 --> 00:09:23,677
In a Gaussian mixture model, you're
modeling the distribution of

152
00:09:23,697 --> 00:09:29,527
data points using a mixture or a
collection of Gaussian distributions.

153
00:09:30,377 --> 00:09:34,817
So essentially, the model is a
scaled sum of Gaussians that are

154
00:09:34,817 --> 00:09:37,877
parametrized by a mean and covariance.

155
00:09:39,037 --> 00:09:42,157
And the idea is to learn the
mean and the covariance matrix,

156
00:09:42,447 --> 00:09:44,277
and the mixture proportions

157
00:09:45,417 --> 00:09:46,717
from the data itself.

158
00:09:47,747 --> 00:09:50,037
So there are several
algorithms for learning it.

159
00:09:50,077 --> 00:09:52,967
One popular
algorithm is the EM algorithm.

160
00:09:53,787 --> 00:09:55,427
I talk about it in the book.

161
00:09:55,437 --> 00:10:03,357
But the idea is to be able to
describe the data in this kind of

162
00:10:03,367 --> 00:10:06,307
form of Gaussians, really closely,

163
00:10:07,402 --> 00:10:10,662
in a way that maximizes
the likelihood of the data.

164
00:10:11,382 --> 00:10:16,262
we may start off with a knowledge
that all data is distributed as

165
00:10:16,262 --> 00:10:19,492
a uniform Gaussian distribution.

166
00:10:19,992 --> 00:10:23,212
And as we observe more points, we
update our knowledge,

167
00:10:23,362 --> 00:10:27,542
we evolve the shape of a uniform
into a distribution that actually

168
00:10:28,562 --> 00:10:30,632
covers the points in a close way.

169
00:10:31,587 --> 00:10:35,697
So that would be one example of how
the Bayesian approach applies here.
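
The mixture-fitting idea above can be sketched with the EM algorithm in plain Python. This is a minimal one-dimensional illustration under stated assumptions, not the book's implementation: the function names `gaussian_pdf` and `em_gmm_1d`, the fixed iteration count, and the variance floor are all hypothetical choices. It is also a maximum-likelihood fit, so it shows "learn the means, variances and mixture proportions from data" rather than a full Bayesian posterior.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iters=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given guesses."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            scores = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                      for j in range(k)]
            total = sum(scores)
            resp.append([s / total for s in scores])

        # M-step: re-estimate parameters from the weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
            weights[j] = nj / len(data)
    return means, variances, weights


data = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.1, 2.9, 3.2, 2.8]
means, variances, weights = em_gmm_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

On this toy data the two means settle close to -2 and 3 with roughly equal mixture weights, starting from deliberately poor initial guesses.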

170
00:10:36,475 --> 00:10:39,415
So it sounds like basically some kind
of iterative process where you're

171
00:10:39,415 --> 00:10:45,885
taking new data and nudging
your parameters in the right direction

172
00:10:46,265 --> 00:10:51,355
to fit your new data more closely.
So that's a worldview algorithm.

173
00:10:51,445 --> 00:10:54,995
You also mentioned previously
when we were talking about your

174
00:10:54,995 --> 00:10:57,145
background, non parametrics.

175
00:10:57,145 --> 00:10:59,795
Can you tell us a bit more about that?

176
00:11:00,047 --> 00:11:05,337
Bayesian nonparametric models are the ones
where the number of parameters grows with

177
00:11:05,337 --> 00:11:07,407
the amount of data.

178
00:11:07,407 --> 00:11:11,977
The number of parameters automatically
gets inferred from the data itself.

179
00:11:13,597 --> 00:11:18,217
One example, since we just talked about
Gaussian mixture models, is the Dirichlet

180
00:11:18,217 --> 00:11:24,117
process mixture model, which is an
extension of Gaussian mixtures with

181
00:11:24,347 --> 00:11:26,347
a potentially infinite number of mixtures,

182
00:11:26,972 --> 00:11:32,082
obviously constrained towards the
simplest model, that describes the data

183
00:11:32,282 --> 00:11:34,682
best, like Occam's razor principle.

184
00:11:35,462 --> 00:11:39,052
But yeah, a Dirichlet process mixture
model where you don't know the number

185
00:11:39,052 --> 00:11:42,882
of clusters and you infer that
from the data itself automatically.

186
00:11:44,382 --> 00:11:50,482
Yeah, so that's an example of a Bayesian
nonparametric model and its main advantage.

187
00:11:51,452 --> 00:11:51,792
Okay.

188
00:11:52,122 --> 00:11:56,192
So that's a little bit more abstract than
the previous example you gave of

189
00:11:56,222 --> 00:11:58,642
clustering various species of plants.

190
00:11:59,232 --> 00:12:05,992
Where does it have practical application
in the day-to-day life of a developer?

191
00:12:06,702 --> 00:12:10,102
So clustering is a type of
unsupervised learning where you

192
00:12:10,102 --> 00:12:13,762
are interested in understanding
underlying patterns in data.

193
00:12:14,002 --> 00:12:16,052
It's a very abstract kind of notion.

194
00:12:16,072 --> 00:12:18,412
And the applications are

195
00:12:19,057 --> 00:12:19,927
many, right?

196
00:12:19,987 --> 00:12:25,077
So one example of clustering could
be to detect anomalies, for instance,

197
00:12:25,097 --> 00:12:29,817
if you group data and there's an
outlier, a point that's sufficiently

198
00:12:29,817 --> 00:12:34,427
far away from all the existing points,
you could see that as an anomaly.

199
00:12:34,607 --> 00:12:40,397
And basically, that would be one
application where clustering is important.

200
00:12:40,427 --> 00:12:44,997
Another application could
be customer segmentation.

201
00:12:45,177 --> 00:12:51,667
You're interested in figuring out
different cohorts of customers and their

202
00:12:52,207 --> 00:12:55,027
lifetime value for a particular product.

203
00:12:55,647 --> 00:13:00,597
The example I give in the book is
that of classifying iris species.

204
00:13:01,337 --> 00:13:03,467
It's a classic machine learning dataset.

205
00:13:03,547 --> 00:13:05,847
And yeah, it's simple
enough to understand.

206
00:13:05,887 --> 00:13:09,027
So it's a toy example, but
the applications are numerous.

207
00:13:09,187 --> 00:13:13,187
A lot of people listening
to this come from a classical

208
00:13:13,187 --> 00:13:14,357
software engineering background.

209
00:13:14,377 --> 00:13:19,737
And when we say algos, they think about
binary search and stuff like that, and

210
00:13:20,287 --> 00:13:26,007
chasing down the complexity and thinking
about constraints and stuff like that.

211
00:13:26,682 --> 00:13:32,762
And then we've got the machine learning
algorithms that somehow sound exotic.

212
00:13:32,792 --> 00:13:36,512
And with all the hype around AI,
everybody's wondering, oh, should

213
00:13:36,512 --> 00:13:37,812
I be looking into more of that?

214
00:13:37,862 --> 00:13:41,792
You mentioned the Bayesian
worldview and non parametrics

215
00:13:41,842 --> 00:13:43,532
and a few applications of those.

216
00:13:43,872 --> 00:13:48,832
I wonder if this is like a representative
sample of machine learning algorithms.

217
00:13:49,432 --> 00:13:53,602
And the second part of the question
is assuming that's the case, what other

218
00:13:53,622 --> 00:13:59,822
algorithms would you place firmly in
this basic 101 machine learning algorithm

219
00:13:59,822 --> 00:14:01,682
set that everybody should be aware of?

220
00:14:01,947 --> 00:14:05,997
So first I want to make a distinction
between kind of classical algorithms

221
00:14:06,017 --> 00:14:09,867
and machine learning algorithms.
You have some sort of task that

222
00:14:09,867 --> 00:14:11,177
you're trying to solve, right?

223
00:14:11,187 --> 00:14:14,807
An algorithm is essentially a sequence
of steps in solving that task.

224
00:14:15,477 --> 00:14:19,877
So an example could be, like you said,
binary search over a sorted array.

225
00:14:20,387 --> 00:14:23,087
Or it could be sorting itself.

226
00:14:23,397 --> 00:14:28,307
And you're interested in runtime and
memory complexity to characterize the

227
00:14:28,347 --> 00:14:33,827
algorithm, and to run it in the fastest
possible time using a small amount of memory.

228
00:14:34,507 --> 00:14:41,267
So for instance, comparison-based sorting
has n log n runtime complexity.

229
00:14:41,337 --> 00:14:43,867
The same carries over to
machine learning as well.

230
00:14:43,867 --> 00:14:48,387
But with the difference that
in machine learning, you're given a

231
00:14:48,387 --> 00:14:53,337
collection of input output pairs, and
you try to learn the rules to map the

232
00:14:53,337 --> 00:14:55,607
inputs to the outputs during training.

233
00:14:56,417 --> 00:15:00,797
So instead of having a fixed set of
instructions, like quicksort, for example,

234
00:15:01,297 --> 00:15:08,707
if you're classifying points,
then you are learning the classification

235
00:15:08,707 --> 00:15:10,497
boundaries between the existing points.

236
00:15:11,087 --> 00:15:14,607
So you're learning the rules when it
comes to machine learning algorithms.

237
00:15:14,777 --> 00:15:19,967
And we did talk about nonparametrics
and we talked about Bayesian algorithms.

238
00:15:20,507 --> 00:15:24,927
Actually, a lot of algorithms are derived
from principles of applied probability,

239
00:15:25,967 --> 00:15:32,547
like Bayes' rule. Examples include naive
Bayes and mixture models.

240
00:15:33,577 --> 00:15:37,657
Some principles, like
maximizing likelihood, are a common

241
00:15:37,657 --> 00:15:40,077
theme across a variety of algorithms.

242
00:15:40,197 --> 00:15:46,197
Just like in deep learning, choosing the
loss function is a common theme across

243
00:15:46,317 --> 00:15:48,320
a variety of deep learning models.

244
00:15:49,090 --> 00:15:49,790
So definitely,

245
00:15:49,890 --> 00:15:52,640
a big category of algorithms

246
00:15:52,650 --> 00:15:55,960
that we haven't touched upon yet
is deep learning algorithms.

247
00:15:57,760 --> 00:16:02,540
And in general, to classify the
algorithm types, they come in

248
00:16:02,540 --> 00:16:04,440
supervised and unsupervised fashion.

249
00:16:05,065 --> 00:16:11,445
So the supervised algorithms have a
label associated with every example.

250
00:16:11,735 --> 00:16:17,615
So in other words, what the right answer
looks like given the problem. And given

251
00:16:17,615 --> 00:16:23,145
enough of these right answers, the
algorithm is learning how to create right

252
00:16:23,145 --> 00:16:27,415
answers by itself, through generalization.

253
00:16:28,055 --> 00:16:31,785
What I mean by that is, the goal of
machine learning is to generalize to

254
00:16:31,815 --> 00:16:35,705
unseen data, to be able to demonstrate
that something has been learned.

255
00:16:36,965 --> 00:16:39,915
So you mentioned supervised
and unsupervised.

256
00:16:39,915 --> 00:16:44,525
And from what you said, I understand
that supervised is basically, some

257
00:16:44,525 --> 00:16:47,635
kind of underlying function that
we're trying to approximate, right?

258
00:16:48,045 --> 00:16:48,745
So that

259
00:16:49,360 --> 00:16:55,120
as many unknown data points
land as close to what we would like

260
00:16:55,120 --> 00:16:57,890
them to by giving it examples, right?

261
00:16:57,940 --> 00:17:03,420
By comparison, what does it actually
mean when the algorithm is unsupervised?

262
00:17:04,415 --> 00:17:10,635
So when an algorithm's unsupervised, we don't
have a label to learn

263
00:17:10,635 --> 00:17:16,695
from, but instead what we're interested
in is understanding patterns in data.

264
00:17:17,255 --> 00:17:20,045
We're interested in making
sense of a lot of data.

265
00:17:20,045 --> 00:17:23,475
Clustering is one example of
unsupervised learning where we group

266
00:17:23,475 --> 00:17:28,185
data into clusters and then try to
make sense of each individual cluster

267
00:17:28,245 --> 00:17:30,385
or interpret it for our application.

268
00:17:30,970 --> 00:17:33,950
What would be some other
examples of unsupervised learning?

269
00:17:34,545 --> 00:17:39,365
Another example that comes to mind
is extracting features from data.

270
00:17:40,075 --> 00:17:44,095
Essentially, autoencoders
take an input and they reconstruct

271
00:17:44,565 --> 00:17:49,760
an output from the input, but there is
a bottleneck layer in between, which

272
00:17:50,190 --> 00:17:54,620
forces the autoencoder to learn a
compressed representation of the input

273
00:17:54,620 --> 00:17:57,540
data, before generating the output, right?

274
00:17:57,540 --> 00:18:03,000
So this bottleneck layer, which means
that it has fewer parameters than the

275
00:18:03,010 --> 00:18:09,275
input, kind of forces the autoencoder to
learn something useful about the data,

276
00:18:09,385 --> 00:18:12,495
and this could be used as a feature
later on in a downstream algorithm.

277
00:18:13,240 --> 00:18:13,710
So

278
00:18:15,920 --> 00:18:18,190
this all makes sense at
the high level, right?

279
00:18:18,280 --> 00:18:24,530
But I'm trying to come up with a more
concrete example of the most basic version

280
00:18:24,580 --> 00:18:30,005
of an algorithm that you can have and
to understand what it would look like,

281
00:18:30,425 --> 00:18:34,185
because like you said, the main difference
being between something like binary

282
00:18:34,185 --> 00:18:39,855
search, when you've got a well understood
algorithm that's well analyzed.

283
00:18:40,910 --> 00:18:44,730
And then you just apply it to
data and you get some output.

284
00:18:44,760 --> 00:18:48,880
Whereas in the machine learning algorithm
world, you're doing kind of the opposite.

285
00:18:48,890 --> 00:18:53,740
You're learning the rules and trying
to come up with the actual algorithm.

286
00:18:53,800 --> 00:18:54,430
Really?

287
00:18:54,560 --> 00:18:56,170
I don't know if that's the
right way of saying that.

288
00:18:56,220 --> 00:19:01,130
But, that's the bit that you're trying to
figure out rather than just applying it.

289
00:19:01,970 --> 00:19:05,960
I'm trying to think, what's the
simplest, algorithm that we could maybe

290
00:19:05,960 --> 00:19:10,730
talk a little bit more in detail, of
how it works, because, like I said,

291
00:19:10,740 --> 00:19:14,500
it's abstract and it might be a little
bit hard to wrap your head around

292
00:19:14,540 --> 00:19:16,450
how that actually works in practice.

293
00:19:17,125 --> 00:19:20,065
In my book, I talk about a
lot of different algorithms.

294
00:19:20,105 --> 00:19:22,285
We can touch on a number
of different algorithms.

295
00:19:22,315 --> 00:19:24,295
But let's start with decision trees.

296
00:19:24,935 --> 00:19:26,685
I think they're widely used.

297
00:19:26,705 --> 00:19:28,765
They're interpretable models.

298
00:19:29,005 --> 00:19:33,205
Essentially, a decision tree
learns to construct a sequence

299
00:19:33,205 --> 00:19:35,755
of if-else conditions, right?

300
00:19:36,230 --> 00:19:41,250
You could trace the reasoning
behind the decision tree just

301
00:19:41,250 --> 00:19:45,500
by looking at how decisions are
made through that if-else tree.

302
00:19:45,860 --> 00:19:51,150
For example, if you're applying for
a loan and the loan gets rejected,

303
00:19:51,160 --> 00:19:56,800
then you could analyze this decision,
why the loan got rejected by looking

304
00:19:56,830 --> 00:20:00,040
at the decision tree and figuring
out what branch of the decision tree

305
00:20:00,090 --> 00:20:02,730
was taken to lead to the outcome.

306
00:20:03,440 --> 00:20:07,900
And yeah, in some cases it's an
important design choice to use an

307
00:20:07,900 --> 00:20:09,860
interpretable model like a decision tree.

308
00:20:10,530 --> 00:20:15,390
And the interpretability extends
to an ensemble of these models.

309
00:20:16,260 --> 00:20:20,680
For example, a random forest is an ensemble of
decision trees, and we could extract

310
00:20:20,690 --> 00:20:23,340
feature importances from that.

311
00:20:23,850 --> 00:20:26,240
Let's talk about decision trees in detail.

312
00:20:26,880 --> 00:20:30,290
Essentially, it's a greedy and
recursive algorithm that starts

313
00:20:30,290 --> 00:20:31,890
with a certain depth of a tree.

314
00:20:32,315 --> 00:20:37,805
And it grows in depth on each iteration
until the maximum depth is reached.

315
00:20:38,330 --> 00:20:41,820
It's trying to optimize the Gini
index, which is a measure of impurity.

316
00:20:42,880 --> 00:20:48,210
And at each iteration, we are
trying to understand how to

317
00:20:49,690 --> 00:20:57,790
divide our feature range into one
that optimizes for the Gini index.

318
00:20:58,345 --> 00:21:02,945
And once we complete one level, we move
on to the next level of the tree and so

319
00:21:02,945 --> 00:21:04,755
on until the maximum depth is reached.

320
00:21:05,115 --> 00:21:07,825
So it's a greedy algorithm and
it's a recursive algorithm.

321
00:21:09,095 --> 00:21:12,615
And the one I'm talking
about is called CART, C-A-R-T.
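
The greedy, recursive growth described here can be sketched in plain Python. This is a simplified CART-style illustration for a single numeric feature, not the book's code: the names `gini`, `best_split`, `build_tree` and `predict`, the midpoint thresholds, and the loan labels in the usage lines are assumptions made for the example.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(xs, ys):
    """Greedy step: the threshold with the lowest weighted child impurity."""
    best_t, best_score = None, float("inf")
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between neighbouring values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


def build_tree(xs, ys, depth=0, max_depth=2):
    """Recursive step: split until the node is pure or max_depth is reached."""
    majority = max(sorted(set(ys)), key=ys.count)
    if depth == max_depth or gini(ys) == 0.0:
        return majority
    t, _ = best_split(xs, ys)
    if t is None:  # all feature values identical, nothing to split on
        return majority
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return {
        "threshold": t,
        "left": build_tree([x for x, _ in left], [y for _, y in left],
                           depth + 1, max_depth),
        "right": build_tree([x for x, _ in right], [y for _, y in right],
                            depth + 1, max_depth),
    }


def predict(tree, x):
    """Follow the if-else branches down to a leaf label."""
    while isinstance(tree, dict):
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree


# Hypothetical toy data: one feature, two outcomes.
xs = [1, 2, 3, 10, 11, 12]
ys = ["deny", "deny", "deny", "grant", "grant", "grant"]
tree = build_tree(xs, ys)
```

The returned tree is a nested if-else structure, which is why a prediction can be explained by reading off the branch that was taken. `max_depth` is the hyperparameter that stops the recursion.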

322
00:21:12,752 --> 00:21:18,522
So is that a deterministic way of
doing that? And the maximum depth

323
00:21:18,602 --> 00:21:23,782
you mentioned, is that an arbitrary
decision, a hyperparameter effectively?

324
00:21:23,995 --> 00:21:26,545
The algorithm itself, being greedy,

325
00:21:26,555 --> 00:21:27,875
is deterministic.

326
00:21:27,935 --> 00:21:31,445
However, there's a way to introduce
randomness and this is what's done

327
00:21:31,445 --> 00:21:35,675
in random forests. You could
introduce randomness in several ways.

328
00:21:35,705 --> 00:21:40,405
You could sample the features that
you're evaluating at each iteration.

329
00:21:40,955 --> 00:21:42,865
That sampling introduces randomness.

330
00:21:43,375 --> 00:21:50,705
You could also run the algorithm on a
subset of the data, so that the data that the

331
00:21:50,705 --> 00:21:52,675
algorithm sees is different each time.

332
00:21:53,385 --> 00:21:57,135
And this is important because
you are trying to reduce the

333
00:21:57,135 --> 00:22:00,985
variance of the algorithm.

334
00:22:01,610 --> 00:22:05,460
Basically, if you're working in the
regression setting where you're trying

335
00:22:05,460 --> 00:22:11,320
to predict a continuous quantity using
random forest, for example, so you have

336
00:22:11,320 --> 00:22:15,880
a random forest regressor, then you
want to minimize the mean squared error.

337
00:22:17,440 --> 00:22:20,850
And mean squared error could be
written as bias squared plus variance.

338
00:22:21,490 --> 00:22:24,830
So to minimize mean squared error,
you want to minimize bias, and

339
00:22:24,870 --> 00:22:26,960
you want to minimize variance.

340
00:22:28,090 --> 00:22:33,000
The way to minimize variance is by taking
an average of a large number of trees.

341
00:22:34,030 --> 00:22:37,160
And it's important to make sure
that the trees are decorrelated.

342
00:22:37,720 --> 00:22:40,930
Because this will actually
help minimize the variance.

343
00:22:41,830 --> 00:22:46,340
And then injecting randomness
into individual decision trees

344
00:22:46,630 --> 00:22:48,720
will help decorrelate them.

345
00:22:49,050 --> 00:22:51,860
So they're basically
different looking trees.
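
The argument about averaging decorrelated trees can be written out as a short derivation sketch. The notation is an assumption made for illustration: B trees T_b that are identically distributed, each with variance sigma^2 and pairwise correlation rho, predicting a true function f at a point x. The first line is the decomposition mentioned above (ignoring irreducible noise); the second is the standard variance of an average of correlated variables.

```latex
\mathrm{MSE}(\hat{f}) = \mathbb{E}\big[(\hat{f}(x) - f(x))^2\big]
  = \mathrm{Bias}(\hat{f})^2 + \mathrm{Var}(\hat{f})

\mathrm{Var}\Big(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\Big)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

As B grows, the second term vanishes, but the first term stays. That is why injecting randomness to lower rho matters: averaging more trees alone cannot push the variance below rho times sigma^2.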

346
00:22:52,445 --> 00:22:56,705
So in a practical sense, let's
go back to the example you

347
00:22:56,755 --> 00:23:04,065
mentioned, a decision
tree to either grant or deny your request

348
00:23:04,065 --> 00:23:10,728
for a loan. Would that decision tree be
recalculated on the fly, or do you run the

349
00:23:10,728 --> 00:23:16,018
algorithm once, get your current
best model for deciding whether to

350
00:23:16,018 --> 00:23:19,878
give people loans and you version that?

351
00:23:20,178 --> 00:23:23,528
because I understand that the decision
tree is the actual output of your

352
00:23:23,608 --> 00:23:25,218
machine learning algorithm, right?

353
00:23:25,358 --> 00:23:29,208
And then the decision tree is like
an algorithm in itself, right?

354
00:23:29,958 --> 00:23:32,578
That you run to evaluate whether
you give the loan or not.

355
00:23:33,048 --> 00:23:34,888
So how does that work in practice?

356
00:23:34,998 --> 00:23:38,308
I would say there are two different
modes of machine learning algorithms.

357
00:23:38,308 --> 00:23:40,638
One is training and the other is testing.

358
00:23:41,278 --> 00:23:45,268
So in training, you're learning all
the parameters that are learnable

359
00:23:45,418 --> 00:23:46,708
in the machine learning algorithm.

360
00:23:47,548 --> 00:23:49,868
and, you need to have
the right data for it.

361
00:23:49,938 --> 00:23:52,118
you need to have the labels in this case.

362
00:23:52,358 --> 00:23:55,718
While during testing, you fix the
parameters that you've learned

363
00:23:56,088 --> 00:24:01,878
and you're focusing on prediction,
meaning given new input data, what

364
00:24:01,908 --> 00:24:06,718
would the output be like given a
new customer with their own profile?

365
00:24:06,748 --> 00:24:09,948
What should the output be
for that particular person?
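
A minimal sketch of the two modes, assuming a one-split decision tree (a "stump") on a single numeric feature. The class name, the income feature, and the toy loan data are all hypothetical and only serve to show parameters being learned in training and then held fixed for prediction.

```python
class DecisionStump:
    """One-split decision tree on a single numeric feature (illustrative only)."""

    def __init__(self):
        self.threshold = None  # the learnable parameter, set during fit()

    def fit(self, xs, labels):
        # Training: try each observed value as a threshold and keep the one with
        # the fewest misclassifications for the rule "x >= threshold -> 1".
        best_errors = None
        for candidate in sorted(set(xs)):
            errors = sum((x >= candidate) != bool(y) for x, y in zip(xs, labels))
            if best_errors is None or errors < best_errors:
                best_errors, self.threshold = errors, candidate
        return self

    def predict(self, x):
        # Testing: the threshold is fixed; just apply the learned rule.
        return int(x >= self.threshold)


if __name__ == "__main__":
    incomes = [20, 35, 50, 65, 80]   # hypothetical applicant incomes
    approved = [0, 0, 1, 1, 1]       # hypothetical past decisions (labels)
    stump = DecisionStump().fit(incomes, approved)
    print(stump.threshold, stump.predict(70), stump.predict(30))
```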

366
00:24:10,065 --> 00:24:10,435
Okay.

367
00:24:10,565 --> 00:24:14,425
So what I'm picturing is like a
massive database of, okay, this person

368
00:24:14,425 --> 00:24:16,005
with all the parameters about them.

369
00:24:16,005 --> 00:24:17,325
This is their business plan.

370
00:24:17,325 --> 00:24:19,655
This is their, previous
exits and stuff like that.

371
00:24:19,655 --> 00:24:24,005
And this is the amount they want and
the decisions that were previously

372
00:24:24,005 --> 00:24:28,785
made by humans. You use that to somehow
feed into the decision tree maker.

373
00:24:29,055 --> 00:24:30,645
Is that the right way of saying that?

374
00:24:30,720 --> 00:24:31,030
Yeah.

375
00:24:31,975 --> 00:24:35,375
and it spits out a decision tree version

376
00:24:35,375 --> 00:24:35,530
1.7

377
00:24:36,270 --> 00:24:38,130
that you start running, right?

378
00:24:38,130 --> 00:24:39,114
is that how it works?

379
00:24:39,834 --> 00:24:40,994
yeah, I would imagine so.

380
00:24:42,014 --> 00:24:47,394
what are some of the difficulties in
terms of actual software implementation

381
00:24:47,394 --> 00:24:53,014
of these things? Again, going back to
binary search, we've got

382
00:24:53,044 --> 00:24:58,184
some people who thought about that,
they came up with this optimized idea.

383
00:24:58,234 --> 00:25:02,704
Then we got a few people who sat down
and optimized that for whatever hardware.

384
00:25:03,304 --> 00:25:06,984
And we've got a pretty speedy binary
search or I don't know, quicksort.

385
00:25:08,264 --> 00:25:11,864
These algorithms seem to
be much more custom and much

386
00:25:11,884 --> 00:25:13,554
more aligned with the data.

387
00:25:14,354 --> 00:25:17,984
So what are some of the
complications of that in terms

388
00:25:17,984 --> 00:25:19,464
of actually implementing this?

389
00:25:19,494 --> 00:25:23,314
Or maybe that's a completely wrong
way of thinking about that and

390
00:25:23,584 --> 00:25:24,864
if that's the case, just tell me.

391
00:25:25,124 --> 00:25:25,384
yeah.

392
00:25:25,384 --> 00:25:28,684
When it comes to implementation, some
of the computer science principles

393
00:25:28,684 --> 00:25:32,664
that you mentioned, they carry over,
and I can talk about some of the

394
00:25:32,674 --> 00:25:37,014
computer science paradigms, like
algorithmic paradigms, later as well.

395
00:25:37,814 --> 00:25:40,014
but yeah, it's a matter
of getting it right.

396
00:25:40,334 --> 00:25:44,604
I think the correctness of the
algorithm is very important.

397
00:25:45,654 --> 00:25:51,574
computational complexity, like runtime
and, memory complexity are also important.

398
00:25:52,629 --> 00:25:55,299
Being able to scale the
algorithm is important.

399
00:25:55,399 --> 00:25:57,539
it's an important challenge.

400
00:25:58,759 --> 00:26:03,689
some algorithms like random forests are
more amenable to parallelization because

401
00:26:03,709 --> 00:26:08,629
the trees are generated in parallel,
whereas another ensemble, like boosted

402
00:26:08,639 --> 00:26:16,184
algorithms, works by sequentially
fitting residuals of trees. They

403
00:26:16,194 --> 00:26:20,934
work in a sequential manner, so they
are less amenable to parallelization.

404
00:26:22,084 --> 00:26:26,474
I would say the number one challenge
is to get the math right, and

405
00:26:26,474 --> 00:26:28,304
then to translate that math into code.

406
00:26:29,124 --> 00:26:33,444
And then from there on to have low
computational, low memory complexity.

407
00:26:33,444 --> 00:26:36,514
So I guess it's not all
that different after all.

408
00:26:37,244 --> 00:26:42,334
I'm guessing some of the things will be
common, you mentioned the greedy aspect

409
00:26:42,384 --> 00:26:44,844
that comes from, the classic algorithms.

410
00:26:45,414 --> 00:26:48,564
I'm guessing a lot of that will be,
dynamic programming, and you're probably

411
00:26:48,564 --> 00:26:53,244
going to apply all the usual tricks, like
divide and conquer and stuff like that

412
00:26:53,284 --> 00:27:00,164
Wherever you can, but is there anything
like particularly common and unusual that

413
00:27:00,164 --> 00:27:04,964
you wouldn't be doing with, classical
algorithms that you do a lot in ML?

414
00:27:05,569 --> 00:27:09,819
there's different phases like
training and testing, right?

415
00:27:09,859 --> 00:27:12,279
Learning the parameters and
predicting the parameters.

416
00:27:13,089 --> 00:27:18,409
the notion of learnable parameters
themselves, I think, is a key difference.

417
00:27:18,829 --> 00:27:19,499
What's that?

418
00:27:19,519 --> 00:27:21,599
What are learnable parameters?

419
00:27:22,064 --> 00:27:25,692
essentially like variables that
you try to fit, variables that

420
00:27:25,692 --> 00:27:28,082
you try to optimize for data.

421
00:27:28,082 --> 00:27:33,260
It's like room for growth or room for,
adaptability in an algorithm itself.

422
00:27:33,260 --> 00:27:35,960
Having an objective function
is another key differentiator.

423
00:27:36,370 --> 00:27:40,740
a lot of Bayesian algorithms, they
maximize the log likelihood or

424
00:27:40,740 --> 00:27:42,590
minimize the negative log likelihood.

425
00:27:42,700 --> 00:27:44,270
that's another difference.

426
00:27:44,320 --> 00:27:48,420
a methodology for learning these
parameters would be another difference.

427
00:27:48,540 --> 00:27:52,690
For example, it could be backpropagation
in deep learning, right?

428
00:27:52,830 --> 00:27:56,570
There's a methodology for learning
the parameters of the model.

429
00:27:57,430 --> 00:28:01,360
or it could be, Bayes rule as a
way of updating the parameters,

430
00:28:01,510 --> 00:28:03,360
in a graphical model, for example.
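
A small sketch tying the three ideas together: one learnable parameter, an objective function (negative log likelihood), and a methodology for learning (plain gradient descent). It assumes a Gaussian with unit variance and a single parameter, the mean; the function names, learning rate, and data are illustrative assumptions.

```python
import math


def negative_log_likelihood(mu, data):
    """NLL of the data under a Gaussian with mean mu and unit variance."""
    n = len(data)
    return 0.5 * n * math.log(2 * math.pi) + 0.5 * sum((x - mu) ** 2 for x in data)


def fit_mean(data, learning_rate=0.1, steps=200):
    mu = 0.0  # the single learnable parameter, arbitrary starting value
    for _ in range(steps):
        # Gradient of the average NLL with respect to mu.
        grad = sum(mu - x for x in data) / len(data)
        mu -= learning_rate * grad
    return mu


if __name__ == "__main__":
    data = [2.5, 3.0, 3.5, 2.0, 4.0]
    mu_hat = fit_mean(data)
    print(mu_hat, negative_log_likelihood(mu_hat, data))
```

For this model the minimizer is just the sample mean, so the loop is only there to show the fit-by-optimization pattern.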

431
00:28:04,175 --> 00:28:05,065
That makes me think.

432
00:28:05,835 --> 00:28:10,505
So is it true that at the moment
all of ML is being completely

433
00:28:10,545 --> 00:28:12,745
dominated by deep learning?

434
00:28:13,825 --> 00:28:17,575
And when people talk about ML,
they basically talk about deep

435
00:28:17,575 --> 00:28:19,085
learning most of the time?

436
00:28:20,255 --> 00:28:22,525
Backpropagation and
stuff like that has been a

437
00:28:23,110 --> 00:28:27,040
super hot topic because of the
ChatGPTs of the world and stuff like that.

438
00:28:27,070 --> 00:28:31,720
And the rest is becoming a little
bit, less in, fashion at the moment?

439
00:28:31,998 --> 00:28:32,998
it comes in waves.

440
00:28:33,158 --> 00:28:37,498
I tend to focus on fundamentals because
fundamentals are never going to be

441
00:28:37,498 --> 00:28:42,988
out of fashion: a solid background in
applied probability, calculus, linear

442
00:28:42,988 --> 00:28:48,798
algebra, Bayesian inference, deep
learning, these are all going to be

443
00:28:48,798 --> 00:28:51,148
in fashion for a really long time.

444
00:28:51,888 --> 00:28:58,528
definitely large language models showed,
so much, growth in the past few years.

445
00:28:59,268 --> 00:29:03,698
And these are deep learning models,
starting from natural language

446
00:29:03,758 --> 00:29:10,638
machine translation, encoder-decoder
type architectures, and going to GPT-4

447
00:29:11,778 --> 00:29:17,748
and wherever the next GPT
is, in size and in performance.

448
00:29:17,748 --> 00:29:19,138
It's interesting to

449
00:29:19,138 --> 00:29:19,728
think about it.

450
00:29:19,768 --> 00:29:23,098
I'm really happy that, they
took off at such speed and

451
00:29:23,108 --> 00:29:24,668
there's so much interest in AI,

452
00:29:24,778 --> 00:29:25,838
so what was that?

453
00:29:25,948 --> 00:29:31,118
2019 or something like that when the
first version of ChatGPT came out, right?

454
00:29:31,188 --> 00:29:32,428
it's been a few years now.

455
00:29:33,508 --> 00:29:39,318
As someone who specializes in a lot
of these fundamental algorithms and

456
00:29:39,348 --> 00:29:43,278
understands how they're derived and where
they come from and their limitations.

457
00:29:43,878 --> 00:29:47,098
What do you think of all the hype
that's currently flowing around,

458
00:29:47,158 --> 00:29:53,508
AGI being just around the corner and
AI taking your job and all of that?

459
00:29:53,838 --> 00:29:56,018
I'm a believer in copilots.

460
00:29:56,028 --> 00:29:59,638
So I think, AI is helping
people with their job.

461
00:29:59,848 --> 00:30:02,958
I'm not sure if they're going to be
taking over the job, but, also a big

462
00:30:02,958 --> 00:30:08,603
believer in automation, automation as
a way of helping a developer deal with

463
00:30:08,603 --> 00:30:10,903
less pleasant aspects of the job, right?

464
00:30:10,983 --> 00:30:13,393
if AI can do that, that's fantastic.

465
00:30:13,903 --> 00:30:17,033
But I think a lot of the planning
and thinking is still up to the

466
00:30:17,033 --> 00:30:19,663
human, to reason, to decide.

467
00:30:19,953 --> 00:30:22,953
Yeah, I benefited a lot from copilots.

468
00:30:23,003 --> 00:30:28,933
They're really great at summarizing a lot
of resources available online, and through

469
00:30:29,313 --> 00:30:31,543
retrieval-augmented generation systems

470
00:30:32,333 --> 00:30:33,733
you could accomplish a lot.

471
00:30:33,783 --> 00:30:35,593
I'm a big believer in copilots.

472
00:30:36,285 --> 00:30:39,745
I think this is probably something
that might be getting a little bit

473
00:30:40,615 --> 00:30:45,285
of a bad rep, because everybody just
wants like the final step, right?

474
00:30:45,575 --> 00:30:47,885
It was the same thing
with self driving cars.

475
00:30:48,615 --> 00:30:53,855
my Tesla is driving itself pretty
well, maybe 95% of the time, if I'm

476
00:30:53,855 --> 00:30:59,200
on like a longer route and I'm on the
motorway or whatever, It's doing most

477
00:30:59,200 --> 00:31:02,870
of the work already pretty well, I'm
still responsible for it and I have to

478
00:31:02,870 --> 00:31:06,200
look but what everybody wants is like
the final step when you can just kick

479
00:31:06,200 --> 00:31:12,450
back and relax and not do any of that
and I think that's understandable.

480
00:31:12,460 --> 00:31:17,590
But at the same time it's like making
the current intermediate step of a

481
00:31:17,590 --> 00:31:21,720
copilot situation maybe sound a little
bit less glamorous than it actually

482
00:31:21,740 --> 00:31:23,430
is because it's already pretty cool.

483
00:31:24,400 --> 00:31:26,140
so totally agree with you on that.

484
00:31:26,160 --> 00:31:31,240
we've done one example
of the decision tree.

485
00:31:32,020 --> 00:31:38,240
I wonder what would be like your top
three hall of fame machine learning algorithms.

486
00:31:38,270 --> 00:31:38,700
I

487
00:31:39,155 --> 00:31:42,555
saw your book, and there are some
of the things that I keep seeing

488
00:31:42,595 --> 00:31:47,365
elsewhere, like Markov chains
and Monte Carlo stuff like that.

489
00:31:47,415 --> 00:31:50,695
there are some of the things that sound
interesting, like genetic algorithms

490
00:31:50,695 --> 00:31:52,375
and, I wonder what that actually means.

491
00:31:52,375 --> 00:31:56,965
But if you were to give us like your
top three favorite, Hall of Fame

492
00:31:57,005 --> 00:32:01,655
algorithms and tell us a little bit
how they work, high level again for

493
00:32:01,715 --> 00:32:03,285
a five year old software engineer.

494
00:32:03,830 --> 00:32:05,070
What will be your selection?

495
00:32:05,120 --> 00:32:06,050
What's on that menu?

496
00:32:06,235 --> 00:32:09,135
I definitely have to mention that one
of them would be a Markov chain

497
00:32:09,135 --> 00:32:10,645
Monte Carlo type algorithm.

498
00:32:10,695 --> 00:32:15,285
So a Markov chain is essentially
a sequence of random variables where

499
00:32:15,335 --> 00:32:18,015
the future is independent of the past.

500
00:32:18,065 --> 00:32:22,795
So the future state of random variable
only depends on the present state, which

501
00:32:22,795 --> 00:32:26,425
reminds me of a quote that, doesn't
really matter where you're coming from,

502
00:32:26,455 --> 00:32:30,325
all that really matters is where you're
going. So in Markov chain Monte Carlo,

503
00:32:30,415 --> 00:32:35,015
one of my favorite algorithms in that
area is the Metropolis-Hastings algorithm.

504
00:32:35,715 --> 00:32:39,315
And the idea there is that you're
after a posterior distribution.

505
00:32:39,765 --> 00:32:44,065
you want to draw samples from
this posterior distribution.

506
00:32:44,095 --> 00:32:46,145
You want to, study it, analyze it.

507
00:32:46,525 --> 00:32:48,855
Posterior is like the goal, the answer.

508
00:32:49,385 --> 00:32:53,265
But it's hard to sample from
it because, in real

509
00:32:53,265 --> 00:32:54,965
life, the models are complex.

510
00:32:55,665 --> 00:32:58,935
And, what you do instead is you
approximate it with something

511
00:32:58,935 --> 00:33:00,585
called a proposal distribution.

512
00:33:01,495 --> 00:33:04,405
And a proposal distribution
is easier to sample from.

513
00:33:04,405 --> 00:33:08,905
So what happens is you draw samples
from a proposal distribution, and then

514
00:33:08,905 --> 00:33:13,445
based on the Metropolis-Hastings ratio,
you evaluate these samples, and you

515
00:33:13,445 --> 00:33:15,555
either accept them or reject them.

516
00:33:15,665 --> 00:33:17,850
You either take them or you drop them.

517
00:33:18,560 --> 00:33:20,810
And you repeat this process many times.
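
The propose, evaluate, accept-or-reject loop described above can be sketched as follows. This is random-walk Metropolis, the special case of Metropolis-Hastings with a symmetric Gaussian proposal, so the proposal terms cancel in the ratio. The target density (a Gaussian with mean 2), the step size, and the function names are assumptions chosen so the result is easy to check.

```python
import math
import random


def log_target(x):
    """Unnormalized log density of the target: a Gaussian with mean 2, variance 1."""
    return -0.5 * (x - 2.0) ** 2


def metropolis(log_density, n_samples, step=1.0, x0=0.0, seed=0):
    rng = random.Random(seed)
    x = x0
    samples = []
    accepted = 0
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)  # draw from the symmetric proposal
        log_ratio = log_density(proposal) - log_density(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
        samples.append(x)  # on rejection the current state is repeated
    return samples, accepted / n_samples


if __name__ == "__main__":
    samples, rate = metropolis(log_target, n_samples=20000)
    kept = samples[2000:]  # discard an initial burn-in stretch
    print(sum(kept) / len(kept), rate)
```

Only the unnormalized density is needed, which is why the method works when the posterior's normalizing constant is unknown.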

518
00:33:22,280 --> 00:33:26,270
So Metropolis-Hastings enables
sampling from these high-dimensional

519
00:33:26,280 --> 00:33:31,960
distributions, and it's simple
enough to implement from scratch.

520
00:33:31,990 --> 00:33:33,880
It's a great algorithm overall.

521
00:33:33,890 --> 00:33:35,800
There are various
improvements on top of it.

522
00:33:35,800 --> 00:33:38,870
It's definitely not the most efficient

523
00:33:40,605 --> 00:33:42,445
algorithm, but it's a really good one.

524
00:33:42,735 --> 00:33:45,105
That's why I bring it up.

525
00:33:45,135 --> 00:33:48,215
but how do you come up with
this proposal distribution?

526
00:33:48,605 --> 00:33:52,245
Proposals are something
that's easier to sample from.

527
00:33:52,245 --> 00:33:55,935
So it could be a Gaussian with a certain
mean and covariance, like a multivariate

528
00:33:55,975 --> 00:33:57,985
Gaussian in high-dimensional problems.

529
00:33:58,870 --> 00:34:03,750
You know, typically you want to
have a high acceptance ratio, so

530
00:34:03,750 --> 00:34:07,340
the closer your proposal is to the
actual target distribution, the

531
00:34:07,370 --> 00:34:09,540
target posterior, then the better.

532
00:34:09,590 --> 00:34:15,680
so you're trying to estimate,
based on domain knowledge or

533
00:34:15,680 --> 00:34:19,960
otherwise, the proximity, how
close can you get to the target.

534
00:34:20,011 --> 00:34:20,421
I see.

535
00:34:20,861 --> 00:34:23,601
because I keep, thinking
the classical way about it.

536
00:34:23,611 --> 00:34:26,981
So it's not like one of those
algorithms where you just have

537
00:34:26,981 --> 00:34:31,541
the steps, there's a step which
is basically suggest a reasonable

538
00:34:32,601 --> 00:34:34,911
distribution that approximates it.

539
00:34:35,206 --> 00:34:38,826
with something that's well known and
look at the data and come up with,

540
00:34:38,876 --> 00:34:40,656
something that should be reasonable.

541
00:34:40,686 --> 00:34:44,766
And then you use
Metropolis-Hastings to evaluate, basically.

542
00:34:44,943 --> 00:34:46,613
it's much more artisanal, right?

543
00:34:46,623 --> 00:34:51,223
There's always this step of staring
at the data and looking and using

544
00:34:51,223 --> 00:34:54,778
a mix of your experience
and creativity to come up with

545
00:34:54,778 --> 00:34:56,218
something that sounds about right.

546
00:34:56,968 --> 00:34:57,788
Which is scary

547
00:34:58,108 --> 00:35:01,068
for someone who comes from the
very exact world of algorithms.

548
00:35:01,108 --> 00:35:02,808
This is scary stuff.

549
00:35:02,808 --> 00:35:05,888
All right, cool.

550
00:35:05,918 --> 00:35:10,218
So Metropolis-Hastings, is that two names?

551
00:35:10,298 --> 00:35:14,008
I believe the algorithm is named
after its inventors.

552
00:35:14,248 --> 00:35:15,748
So that's an interesting approach.

553
00:35:15,748 --> 00:35:18,708
What would be number
two of your top three?

554
00:35:18,916 --> 00:35:24,066
I would pick approximate nearest
neighbors because of its popularity.

555
00:35:24,076 --> 00:35:29,026
In current retrieval-augmented
generation systems, it's used

556
00:35:30,071 --> 00:35:30,641
everywhere.

557
00:35:31,411 --> 00:35:35,991
Essentially, approximate nearest neighbors
is an improvement on k-nearest neighbors.

558
00:35:36,041 --> 00:35:41,531
With k-nearest neighbors, if you're given
a query point, you want to compute the

559
00:35:41,531 --> 00:35:46,711
distance between that query point and all
the other points in the training data set.

560
00:35:46,761 --> 00:35:50,451
You want to compute the distances and
then sort these distances and then

561
00:35:50,481 --> 00:35:53,221
select the k closest points.

562
00:35:53,971 --> 00:35:57,341
So this is a highly computationally
intensive operation.

563
00:35:57,391 --> 00:36:01,371
First of all, you have to compute
n d-dimensional distances.

564
00:36:01,431 --> 00:36:06,251
Then you have to sort, which is
n log n, and select the top k.
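
The exact, brute-force version being described can be sketched in a few lines: compute the distance from the query to every point, sort, and keep the first k. The function name and the toy points are illustrative assumptions; `math.dist` requires Python 3.8 or later.

```python
import math


def k_nearest(query, points, k):
    """Exact kNN by brute force: n distance computations, then an O(n log n) sort."""
    by_distance = sorted(points, key=lambda p: math.dist(query, p))
    return by_distance[:k]


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0), (9.0, 9.0)]
    print(k_nearest((0.0, 0.0), points, k=2))
```

Every query touches all n points, which is the cost the approximate methods try to avoid.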

565
00:36:06,771 --> 00:36:09,831
So approximate nearest neighbors
is a way to get around it.

566
00:36:10,851 --> 00:36:15,356
And there are three approximate nearest
neighbor flavors that I could talk

567
00:36:15,366 --> 00:36:22,226
about. One is tree-based ANN. Essentially,
what you do with tree-based ANN is

568
00:36:23,466 --> 00:36:29,253
you divide up the space into regions,
and each leaf in the tree is a region.

569
00:36:29,253 --> 00:36:31,596
So one example is k-d trees.

570
00:36:31,616 --> 00:36:34,376
and then based on the problem,
is it a classification

571
00:36:34,376 --> 00:36:35,926
problem or regression problem?

572
00:36:36,496 --> 00:36:41,386
You can compute the final answer
either by taking a majority vote

573
00:36:41,776 --> 00:36:46,346
for classification or taking
an average of points in the

574
00:36:46,356 --> 00:36:48,136
region for a regression problem.

575
00:36:48,801 --> 00:36:53,211
Can we take a quick detour? Because
those are words that have meaning,

576
00:36:53,211 --> 00:36:58,291
but they probably have a more particular
meaning in the machine learning context.

577
00:36:58,291 --> 00:37:02,311
So could you quickly tell us what's
the difference between regression and

578
00:37:02,491 --> 00:37:04,431
classification types of algorithms?

579
00:37:04,613 --> 00:37:08,473
So in regression, you're interested
in estimating a continuous quantity,

580
00:37:09,438 --> 00:37:14,668
so a real value, such as, let's say,
a stock price. In classification,

581
00:37:14,678 --> 00:37:18,368
you're interested in
estimating a discrete quantity.

582
00:37:19,038 --> 00:37:23,258
For example, it could be a
particular customer age group.

583
00:37:24,088 --> 00:37:27,808
So the difference is in the quantity you're
estimating: for continuous, it's regression;

584
00:37:27,838 --> 00:37:29,328
for discrete, it's classification.

585
00:37:30,173 --> 00:37:36,043
Back to approximate nearest neighbors:
in tree-based ANN, we have our

586
00:37:36,043 --> 00:37:38,943
space, which we divide into regions,

587
00:37:40,053 --> 00:37:43,633
and, based on the points in each
region and the task at hand,

588
00:37:43,643 --> 00:37:47,843
we either average the points, if
we're looking for a regression

589
00:37:48,693 --> 00:37:54,023
answer, or we take a majority vote,
looking at the class labels and taking

590
00:37:54,023 --> 00:37:59,553
the majority label as the answer, if
the problem is a classification problem.

591
00:37:59,603 --> 00:38:00,403
Another

592
00:38:01,573 --> 00:38:05,083
example of approximate nearest
neighbors is locality-sensitive hashing.

593
00:38:06,133 --> 00:38:11,423
what we do there is we essentially
group points into buckets, based

594
00:38:11,423 --> 00:38:13,223
on their proximity with each other.

595
00:38:13,953 --> 00:38:17,963
And instead of searching through all
the points, we only look inside the

596
00:38:17,963 --> 00:38:20,813
bucket to find the k nearest neighbors.

597
00:38:21,603 --> 00:38:25,303
So this helps reduce computational
complexity dramatically.

598
00:38:26,478 --> 00:38:29,648
So first cluster them using one
of the other algorithms, and then

599
00:38:29,648 --> 00:38:31,208
you just look inside the cluster.

600
00:38:32,458 --> 00:38:34,718
That would be the third type, clustering.

601
00:38:35,448 --> 00:38:40,018
In the clustering approach, we
group the points into clusters

602
00:38:40,018 --> 00:38:41,798
and only look inside the cluster.

603
00:38:42,303 --> 00:38:45,903
So the buckets we were talking about,
how are they different from clusters?

604
00:38:46,003 --> 00:38:48,273
the buckets are formed in
a slightly different way.

605
00:38:48,373 --> 00:38:52,993
Essentially, you can visualize it as
points on a high-dimensional sphere,

606
00:38:53,433 --> 00:38:59,113
and you slice the sphere with
hyperplanes, and points that are captured

607
00:38:59,113 --> 00:39:03,503
between the same hyperplanes
are placed into the same bucket.

608
00:39:03,513 --> 00:39:05,543
So they're based on locality there.

609
00:39:06,193 --> 00:39:10,143
Points that are closer together on that
sphere get grouped into the same bucket.
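
A minimal sketch of the hyperplane idea, assuming the random-hyperplane (sign of dot product) variant of locality-sensitive hashing. The function names, the number of planes, and the seed are illustrative assumptions; each vector's bucket is the tuple of bits recording which side of each hyperplane it falls on.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_planes, dim, seed=0):
    """Random hyperplanes through the origin, each given by a Gaussian normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def bucket_key(vector, hyperplanes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(
        int(sum(h * v for h, v in zip(plane, vector)) >= 0.0)
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Map each bucket key to the list of vector indices that hash into it."""
    index = defaultdict(list)
    for i, vec in enumerate(vectors):
        index[bucket_key(vec, hyperplanes)].append(i)
    return index


if __name__ == "__main__":
    planes = make_hyperplanes(n_planes=8, dim=3)
    vectors = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
    index = build_index(vectors, planes)
    # At query time, only the vectors in the query's own bucket are compared.
    print(index[bucket_key([1.0, 2.0, 3.0], planes)])
```

Vectors pointing in the same direction always share a bucket, and vectors at a small angle share one with high probability, which is the locality property.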

610
00:39:11,246 --> 00:39:11,726
Okay.

611
00:39:12,016 --> 00:39:12,416
All right.

612
00:39:12,863 --> 00:39:17,133
So these three methods, tree-based
ANN, locality-sensitive hashing

613
00:39:17,133 --> 00:39:23,283
ANN, and clustering-based ANN, help
reduce the complexity of exact kNN.

614
00:39:23,283 --> 00:39:28,183
So what would be some of the applications
of this approximate nearest neighbors?

615
00:39:28,253 --> 00:39:31,323
where would we see that in
practice maybe in production?

616
00:39:32,273 --> 00:39:33,503
can you give us an example?

617
00:39:34,036 --> 00:39:40,226
So in retrieval-augmented generation
systems, you have a vector store and

618
00:39:40,236 --> 00:39:44,466
you're interested in retrieving the closest item.
Or in semantic search, for example,

619
00:39:44,466 --> 00:39:45,646
you're interested in retrieving the

620
00:39:46,176 --> 00:39:49,106
closest unit from the vector store.

621
00:39:49,976 --> 00:39:52,666
And like a bunch of dimensions.

622
00:39:52,666 --> 00:39:52,966
Yeah.

623
00:39:54,041 --> 00:39:57,321
So you can use this approximation
to get something quicker.

624
00:39:57,831 --> 00:40:00,201
Even if it's not exact,
which is pretty cool.

625
00:40:00,371 --> 00:40:00,801
All right.

626
00:40:01,031 --> 00:40:04,891
So that was number two
on your top three list.

627
00:40:05,581 --> 00:40:06,441
What's number three?

628
00:40:06,998 --> 00:40:12,308
I would say attention and
transformers would be my third

629
00:40:12,478 --> 00:40:16,878
on the list of favorite algorithms.
Self-attention methods

630
00:40:16,878 --> 00:40:19,228
really revolutionized the space.

631
00:40:20,538 --> 00:40:26,628
And the idea there is
to attend to the context,

632
00:40:27,518 --> 00:40:30,188
originating from neural
machine translation.

633
00:40:30,728 --> 00:40:35,478
if we are translating a target word,
to a different language, we need to

634
00:40:35,478 --> 00:40:38,058
understand the context around that word.

635
00:40:38,208 --> 00:40:42,008
We need to understand the whole sentence
around it before we can translate a

636
00:40:42,008 --> 00:40:48,138
single word, and self-attention mechanisms
enable us to do just that, and in a

637
00:40:48,208 --> 00:40:55,828
parallelized fashion. It could also be seen
as a soft dictionary lookup with query,

638
00:40:55,828 --> 00:41:04,138
key, and value triples in which the target
word is a query and you're computing the inner

639
00:41:04,168 --> 00:41:08,478
product between the query and the
key, and multiplying that by the value

640
00:41:08,478 --> 00:41:11,348
stored in that soft lookup dictionary.

641
00:41:12,168 --> 00:41:15,548
And, that's how you get the
famous formula involving

642
00:41:15,548 --> 00:41:16,758
those three variables.
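
The formula referred to here is presumably scaled dot-product attention from the "Attention Is All You Need" paper, softmax(Q K^T / sqrt(d_k)) V. Below is a minimal single-query sketch of it in plain Python. The function name and the toy keys and values are illustrative assumptions, and real implementations operate on whole matrices of queries at once.

```python
import math


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the weighted sum of the values and the attention weights."""
    d_k = len(query)
    # Similarity of the query to every key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Numerically stable softmax turns the scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Soft lookup: blend the values according to the weights.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


if __name__ == "__main__":
    keys = [[10.0, 0.0], [0.0, 10.0]]
    values = [[1.0, 0.0], [0.0, 1.0]]
    print(attention([10.0, 0.0], keys, values))
```

A hard dictionary returns exactly one value for a matching key; here every value contributes in proportion to how well its key matches the query.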

643
00:41:16,758 --> 00:41:21,518
essentially we're trying to understand
the context and the contribution of

644
00:41:21,578 --> 00:41:25,378
every word in the sentence to the
target word which we're translating.

645
00:41:26,268 --> 00:41:32,148
And we have learnable parameters,
so we're keeping track of word

646
00:41:32,158 --> 00:41:36,108
embeddings and we're keeping track of
word position. We have these learnable

647
00:41:36,138 --> 00:41:40,998
parameters, which help us find the
closest match in the target language

648
00:41:40,998 --> 00:41:42,588
to the word which we're translating.

649
00:41:42,598 --> 00:41:42,628
Okay.

650
00:41:45,008 --> 00:41:48,178
Okay, so I got, a sentence.

651
00:41:48,258 --> 00:41:48,938
I don't know.

652
00:41:49,028 --> 00:41:50,688
I like cats.

653
00:41:51,638 --> 00:41:56,828
And we want to understand, the like,
how is it connected to the cats, right?

654
00:41:57,208 --> 00:41:57,598
Uh huh.

655
00:41:57,928 --> 00:42:02,828
so does it mean that we're calculating
like the complete product of, connections

656
00:42:02,828 --> 00:42:08,438
between all the pairs of words, embeddings
or whatever is underlying there.

657
00:42:09,418 --> 00:42:10,158
How does it work?

658
00:42:10,178 --> 00:42:15,148
I do have a chapter in my book on
self-attention and transformers. It

659
00:42:15,148 --> 00:42:19,068
comes back to the "Attention Is All You Need"
architecture, the encoder-decoder

660
00:42:19,068 --> 00:42:21,498
architecture; the paper first introduced

661
00:42:21,578 --> 00:42:21,728
it.

662
00:42:21,798 --> 00:42:23,088
Famous paper.

663
00:42:23,438 --> 00:42:26,078
yeah, essentially we're
predicting one word at a time.

664
00:42:27,038 --> 00:42:29,628
in a masked, causal way, right?

665
00:42:29,758 --> 00:42:34,348
and we're looking at all the words
that came before that word and

666
00:42:34,388 --> 00:42:39,378
figuring out the highest probability
next word in our dictionary.

667
00:42:39,948 --> 00:42:43,478
And of course there are different
varieties of architectures now when

668
00:42:43,478 --> 00:42:48,998
it comes to transformers: there's the
decoder-only GPT family, then there are

669
00:42:49,008 --> 00:42:54,458
encoder-only models like BERT, and then there are
encoder-decoder architectures like T5.

670
00:42:54,538 --> 00:42:57,498
And they're suitable for
different applications.

671
00:42:58,018 --> 00:43:00,858
GPT has been very popular when
it comes to generative AI.

672
00:43:00,908 --> 00:43:04,518
We've got the top three from Vadim.

673
00:43:04,538 --> 00:43:08,598
For anybody else who, is
struggling a little bit like me

674
00:43:08,628 --> 00:43:12,998
to, go through that, probably the
best way, is to go grab a book.

675
00:43:13,048 --> 00:43:15,138
the book is still in MEAP, right?

676
00:43:15,138 --> 00:43:17,068
The Manning Early Access Program.

677
00:43:17,128 --> 00:43:19,688
And, I think I looked
it up on the website.

678
00:43:20,133 --> 00:43:23,613
I think it said August this
year for final version.

679
00:43:23,683 --> 00:43:24,323
Is that right?

680
00:43:25,093 --> 00:43:29,813
Everything is written and finished. It's
up to the production folks at Manning to

681
00:43:30,093 --> 00:43:32,573
actually have the print version ready.

682
00:43:32,603 --> 00:43:37,683
the PDF is available and all the
contents are there right now.

683
00:43:38,553 --> 00:43:39,063
Got it.

684
00:43:39,083 --> 00:43:40,713
Manning, please hurry up.

685
00:43:41,343 --> 00:43:42,663
We want the book finished.

686
00:43:43,783 --> 00:43:45,343
What's next for you, Vadim?

687
00:43:45,403 --> 00:43:50,593
I've been thinking about maybe making an
online course on a machine learning topic.

688
00:43:51,723 --> 00:43:53,753
I'm exploring different media right now.

689
00:43:53,803 --> 00:43:55,123
Writing a book is one medium.

690
00:43:55,123 --> 00:43:57,153
I'm getting into YouTube
a little bit more.

691
00:43:57,173 --> 00:43:59,453
I started posting content on YouTube.

692
00:44:00,223 --> 00:44:04,553
I also have an Instagram channel,
at the life guide now, on which I talk

693
00:44:04,563 --> 00:44:10,143
about inspirational, motivational
content related to different quotes

694
00:44:10,143 --> 00:44:14,433
and different things that helped me
grow and go through, difficult periods,

695
00:44:14,593 --> 00:44:18,993
kind of things that help me and
things I want to share with the world.

696
00:44:19,923 --> 00:44:20,983
Yeah, growing those

697
00:44:22,003 --> 00:44:26,323
channels and maybe looking at
online courses is my next step.

698
00:44:27,788 --> 00:44:32,638
Yeah, I think to be honest with you,
YouTube is probably my favorite way

699
00:44:32,638 --> 00:44:34,138
of learning things at the moment.

700
00:44:34,558 --> 00:44:37,798
It's basically got everything
and anything that you need.

701
00:44:37,808 --> 00:44:41,818
And on any topic, really, you're
going to find something, and on many

702
00:44:41,818 --> 00:44:45,058
topics, you're going to find so many
different ways of explaining something.

703
00:44:45,778 --> 00:44:48,878
And it's a nice medium because
it's so flexible, right?

704
00:44:48,938 --> 00:44:53,978
You can explain, you can show, you can
give examples, you can demonstrate.

705
00:44:54,448 --> 00:44:55,028
It's amazing.

706
00:44:55,393 --> 00:45:00,533
if, our civilization fails some
thousand years from now, I hope that

707
00:45:00,543 --> 00:45:04,633
YouTube survives because for the next
one to pick it up, that's a lot of

708
00:45:04,653 --> 00:45:09,063
knowledge that's encoded in there
and in a very nice-to-consume way.

709
00:45:09,063 --> 00:45:11,743
I'm going to ask you, before I let you go,

710
00:45:12,193 --> 00:45:13,683
for some predictions.

711
00:45:13,973 --> 00:45:18,913
Given the crazy rate of
acceleration in all the things.

712
00:45:18,963 --> 00:45:22,313
There seems to be an AI
startup on every corner now.

713
00:45:22,923 --> 00:45:26,043
And they seem to be going almost
as quickly as they're coming.

714
00:45:27,013 --> 00:45:31,623
Where do you think we're going to see
the most development in the coming years?

715
00:45:32,173 --> 00:45:36,433
Where would you personally love to
see development in the coming years?

716
00:45:36,778 --> 00:45:40,128
I actually recently attended a keynote
at the machine learning and data science

717
00:45:40,128 --> 00:45:42,208
conference, MLADS, at Microsoft.

718
00:45:42,228 --> 00:45:47,748
And I was really inspired by these
agents and autonomous thinking units

719
00:45:48,093 --> 00:45:52,373
as part of copilots and
assistants, and there's so much

720
00:45:52,373 --> 00:45:55,783
room for growth in that space.

721
00:45:55,823 --> 00:45:59,933
There are different form factors. Like, we're
used to our phones and laptops, right?

722
00:45:59,933 --> 00:46:00,863
But imagine

723
00:46:01,408 --> 00:46:06,338
having a copilot that's not on your
phone or on your laptop, but somebody

724
00:46:06,338 --> 00:46:10,958
who's portable, somebody who's with
you, somebody who understands you

725
00:46:10,958 --> 00:46:16,538
really well and helps you do your
tasks or helps you have a good time.

726
00:46:17,348 --> 00:46:20,558
Yeah, so different form factors,
like a portable copilot,

727
00:46:20,708 --> 00:46:22,318
like a device that could

728
00:46:23,068 --> 00:46:28,238
be with you and learn from
you and interact with you.

729
00:46:29,098 --> 00:46:36,008
So I think redesigning what
we have today in terms of LLM

730
00:46:36,038 --> 00:46:41,218
agents or a population of agents,
not just focused on language, but

731
00:46:41,298 --> 00:46:44,938
other types of agents, I think is
going to be the next step forward.

732
00:46:45,911 --> 00:46:49,181
I would like to challenge you a little
bit on that, because I was thinking

733
00:46:49,211 --> 00:46:53,831
like that initially when I was watching,
for example, the Rabbit R1 keynote,

734
00:46:54,271 --> 00:46:57,931
and they were giving this demo of
how you're just going to talk to it.

735
00:46:57,961 --> 00:47:02,701
And it's going to effectively
use the UIs in various apps.

736
00:47:02,701 --> 00:47:04,281
And I was like, Oh, that's a great idea.

737
00:47:04,281 --> 00:47:08,351
All these apps, they have weird UI
things, and I don't want to click that.

738
00:47:08,351 --> 00:47:09,311
I don't want to learn it.

739
00:47:09,351 --> 00:47:11,871
I just wish it was automated.

740
00:47:12,611 --> 00:47:14,001
And there was also Humane AI.

741
00:47:15,936 --> 00:47:18,286
And they both seem to suck quite a lot.

742
00:47:18,346 --> 00:47:22,646
Like, I watched some of the reviews, I even
ordered the Rabbit R1, and it just

743
00:47:22,646 --> 00:47:27,136
doesn't seem to be working all that well.
I think Humane AI was already talking

744
00:47:27,136 --> 00:47:30,916
about hoping to be acquired by someone
who can take it in a better direction.

745
00:47:31,676 --> 00:47:36,476
And my thinking was actually,
what is so wrong with the phones?

746
00:47:36,506 --> 00:47:40,376
There's a smartphone, it's already
evolved and it's already got

747
00:47:40,876 --> 00:47:44,276
basically everything you need to run
a reasonably sized model already.

748
00:47:44,931 --> 00:47:49,291
So why don't we like the idea
of just having that naturally evolve

749
00:47:49,301 --> 00:47:52,111
to be more prominent in your phone?

750
00:47:52,141 --> 00:47:54,881
And why do we need a new device for that?

751
00:47:54,881 --> 00:47:56,321
What do you think about that?

752
00:47:56,516 --> 00:47:57,966
It has to make sense, right?

753
00:47:58,076 --> 00:48:03,336
If it's not working as expected, then
people are not gonna buy it, right?

754
00:48:04,036 --> 00:48:07,906
But it has to add value to our lives.

755
00:48:07,906 --> 00:48:10,896
It could be the
interaction with the device.

756
00:48:10,986 --> 00:48:15,646
Instead of clicking, you simply use
eye-tracking software and you could click

757
00:48:15,656 --> 00:48:22,206
using your eyes, as an example. Something
seamless, something that removes the

758
00:48:22,206 --> 00:48:28,036
bottlenecks, instead of typing. Of course,
we have now all the interactions with

759
00:48:28,036 --> 00:48:33,386
our devices, but something that simplifies
our lives, it has to have value.

760
00:48:34,353 --> 00:48:35,393
Yeah, that's for sure.

761
00:48:36,073 --> 00:48:40,108
I'm just wondering, there are a
few startups now that are working

762
00:48:40,108 --> 00:48:41,828
on these humanoid robots, right?

763
00:48:41,828 --> 00:48:45,908
There's obviously like the Tesla
Optimus and a bunch of others.

764
00:48:45,938 --> 00:48:53,188
I think Unitree announced that you can now
order their $16,000 mini, four or five

765
00:48:53,268 --> 00:48:58,338
feet tall humanoid, which is,
I guess it's not mini anymore.

766
00:48:58,348 --> 00:49:06,128
That's pretty big, but I'm just
wondering if I actually need that yet.

767
00:49:06,188 --> 00:49:06,888
Don't get me wrong.

768
00:49:06,918 --> 00:49:08,298
I would love to get one of these.

769
00:49:08,298 --> 00:49:11,668
And if I had 16 grand lying
around that I had no use for, I

770
00:49:11,668 --> 00:49:12,958
would have ordered one already.

771
00:49:13,798 --> 00:49:19,668
But I do wonder whether that's literally
around the corner or whether this is

772
00:49:19,668 --> 00:49:23,668
going to be another one of the
self-driving car situations where it's

773
00:49:23,688 --> 00:49:27,608
been next year for a decade and a half
at least now. Have you ordered one?

774
00:49:28,173 --> 00:49:30,893
No, we do have a robo vacuum though.

775
00:49:32,013 --> 00:49:32,163
Oh

776
00:49:32,213 --> 00:49:32,633
if there's

777
00:49:32,653 --> 00:49:32,713
a

778
00:49:32,913 --> 00:49:33,523
are awesome.

779
00:49:34,903 --> 00:49:38,183
If there's a way I could reduce
the amount of chores I need to do,

780
00:49:38,203 --> 00:49:41,623
that could free up my time, but
I know it's not an easy problem.

781
00:49:41,683 --> 00:49:45,933
Even things like grasping are not
an easy problem for robotics.

782
00:49:45,933 --> 00:49:48,523
So, it might be a few more years.

783
00:49:48,523 --> 00:49:52,803
I'm glad that you're a fellow
vacuum cleaner aficionado.

784
00:49:52,913 --> 00:49:53,933
I love mine.

785
00:49:54,273 --> 00:49:58,473
I upgraded last year to one
that finally has the mop thing.

786
00:49:58,483 --> 00:50:03,283
It not only vacuums, but also mops
the floor and cleans itself up and

787
00:50:03,293 --> 00:50:04,893
dries itself up and everything.

788
00:50:05,633 --> 00:50:08,443
And it's been like easily one of
the best investments that I've made.

789
00:50:08,823 --> 00:50:13,353
I did have to basically change my
flat layout quite significantly.

790
00:50:13,483 --> 00:50:18,983
I got rid of all the carpets now, and I
laid better flooring just so that I know

791
00:50:18,983 --> 00:50:22,663
that all of the floors can be mopped
and cleaned by the robot and it does

792
00:50:22,663 --> 00:50:24,523
it every day and I couldn't be happier.

793
00:50:25,063 --> 00:50:27,863
In that respect, robots,
I'm looking forward.

794
00:50:28,953 --> 00:50:31,063
I can definitely bring one home.

795
00:50:31,063 --> 00:50:33,193
All right, Vadim, it's been a pleasure.

796
00:50:33,213 --> 00:50:35,703
That was probably the most
challenging episode we've done.

797
00:50:36,213 --> 00:50:40,713
When you try to talk about algorithms
without actually being able to show them

798
00:50:40,863 --> 00:50:48,263
and give an example or point to some code.
What I'm hoping we achieved here was

799
00:50:48,263 --> 00:50:54,603
a high-level map that people can now go
and look up in books like yours. Again,

800
00:50:54,613 --> 00:50:55,713
let me plug that.

801
00:50:55,773 --> 00:50:56,583
It's called

802
00:50:56,753 --> 00:50:59,233
"Machine Learning Algorithms
in Depth" by Manning.

803
00:51:00,003 --> 00:51:02,383
My guest was Vadim Smolyakov.

804
00:51:02,403 --> 00:51:03,163
Vadim, thank you very much.

805
00:51:03,163 --> 00:51:03,783
I'll see you next time.

806
00:51:04,088 --> 00:51:04,468
Thank you.