1 00:00:01,720 --> 00:00:09,080 I'm Miko Pawlikowski, and this is HockeyStick. 2 00:00:09,100 --> 00:00:13,200 LLMs, or Large Language Models, are taking the world by storm. 3 00:00:13,279 --> 00:00:18,069 This breakthrough artificial intelligence technology promises to fundamentally 4 00:00:18,069 --> 00:00:20,140 reshape the way we work with computers. 5 00:00:20,299 --> 00:00:23,700 Over the last year, we've witnessed its Hockey Stick moment, and as 6 00:00:23,700 --> 00:00:28,485 of early 2024, we're firmly in the Cambrian explosion phase. 7 00:00:28,645 --> 00:00:32,765 Today, we're taking a deep dive into how these models came from humble beginnings to 8 00:00:32,765 --> 00:00:35,505 making people scared of imminent Skynet. 9 00:00:35,595 --> 00:00:39,235 I'm joined by two experts, Chris Brousseau, staff machine learning 10 00:00:39,235 --> 00:00:43,655 engineer at JP Morgan, and Matthew Sharp, MLOps engineer at LTK, the 11 00:00:43,864 --> 00:00:49,115 authors of "Production LLMs", currently available in early access at manning.com. 12 00:00:49,265 --> 00:00:52,685 In this conversation, we'll cover the intricacies of human language 13 00:00:52,685 --> 00:00:54,335 and how machines can understand it, 14 00:00:54,425 --> 00:00:58,325 give you the vocab to sound smart at the next family gathering, and discuss the 15 00:00:58,325 --> 00:01:02,975 various mathematical ideas and models ultimately leading to LLMs, as well as 16 00:01:02,975 --> 00:01:05,735 some noteworthy examples beyond ChatGPT. 17 00:01:05,855 --> 00:01:08,315 Welcome to this episode and please enjoy. 18 00:01:08,416 --> 00:01:09,256 Where should we start? 19 00:01:09,571 --> 00:01:11,431 How did you guys meet? 20 00:01:11,481 --> 00:01:13,631 We happen to both live in Utah, and we 21 00:01:13,631 --> 00:01:16,101 actually met at a meetup. 22 00:01:16,121 --> 00:01:19,881 An MLOps meetup was the primary one where we met. 
23 00:01:20,981 --> 00:01:25,331 It happens once a month and we'd get together, and so that's our origin story. 24 00:01:25,411 --> 00:01:28,531 We became friends through there, started helping each other with 25 00:01:28,581 --> 00:01:32,321 content creation. Chris was starting a YouTube channel, I write on 26 00:01:32,381 --> 00:01:35,131 LinkedIn, just giving each other feedback and helping each other out. 27 00:01:35,131 --> 00:01:37,961 It was especially helpful because I was trying to figure out how 28 00:01:38,001 --> 00:01:42,351 best to present a lot of the material that's in our book now. 29 00:01:42,921 --> 00:01:45,001 How do you explain a transformer model? 30 00:01:45,071 --> 00:01:49,251 And Matt was fantastic about helping me find my voice on YouTube. 31 00:01:49,301 --> 00:01:54,431 Okay, so going from meeting someone at a meetup to committing 32 00:01:54,431 --> 00:01:57,901 to spending a couple of years working on a book with someone: 33 00:01:58,231 --> 00:01:59,531 that's a little bit of a difference. 34 00:01:59,901 --> 00:02:01,911 Was there any particular moment where it just clicked? 35 00:02:01,951 --> 00:02:03,611 "Oh, we need to write a book". 36 00:02:03,841 --> 00:02:05,751 How did you come up with the idea? 37 00:02:05,801 --> 00:02:10,541 I was approached and, I would love to write a book, but I don't 38 00:02:10,551 --> 00:02:12,361 know a lot about that process. 39 00:02:12,491 --> 00:02:15,681 And obviously, I didn't really have an authorship voice. 40 00:02:15,851 --> 00:02:18,381 I am not experienced in content creation. 41 00:02:19,061 --> 00:02:23,001 And while I was going through the process of talking with some different 42 00:02:23,001 --> 00:02:28,651 publishers, Matt approached me and said: "Hey, I was a technical reviewer 43 00:02:28,681 --> 00:02:32,751 on 'Fundamentals of Data Engineering' by Joe Reis and Matt Housley". 
44 00:02:33,641 --> 00:02:39,681 And so he had experience and he had subject matter expertise, and he was 45 00:02:39,681 --> 00:02:42,331 giving me some advice and I said, "You know what, why don't you just 46 00:02:42,331 --> 00:02:47,491 come on as a coauthor? You obviously could help a lot here, and I need 47 00:02:47,491 --> 00:02:49,971 it, so let's just do it together". 48 00:02:50,031 --> 00:02:54,871 Yeah, I think that it worked out really well because Chris has that background in 49 00:02:54,881 --> 00:02:59,211 linguistics, he understands the natural language processing side better than 50 00:02:59,331 --> 00:03:04,751 anyone else I've met in person, and I was coming more from the MLOps side, 51 00:03:04,751 --> 00:03:06,201 how do we actually deploy these things? 52 00:03:06,201 --> 00:03:13,481 And so I think it's really rounded out our book better than anything else I'm seeing 53 00:03:13,551 --> 00:03:15,721 out there that you could buy and read. 54 00:03:15,721 --> 00:03:19,271 Getting that diverse perspective, I think, really helps our book out. 55 00:03:19,816 --> 00:03:24,246 I was very excited when you said 'yes' to coming onto this because since last 56 00:03:24,246 --> 00:03:30,434 year, I think in most people's minds sometime early last year, with ChatGPT, 57 00:03:30,914 --> 00:03:34,944 all of a sudden, everybody started talking about large language 58 00:03:34,964 --> 00:03:40,254 models, and some people started worrying about impending doom and 59 00:03:40,274 --> 00:03:42,414 robot apocalypse, and all of that. 60 00:03:43,224 --> 00:03:47,324 But from a perspective of someone who's worked with that for the best 61 00:03:47,324 --> 00:03:49,754 part of a decade now, I'm wondering. 
62 00:03:50,259 --> 00:03:54,839 What was the point when you realized that these LLMs, they're really onto 63 00:03:54,839 --> 00:04:01,689 something and they're moving from a demo to an actual legitimate technology 64 00:04:01,689 --> 00:04:02,959 that's going to change things? 65 00:04:02,999 --> 00:04:06,289 What was the hockey stick moment for LLMs? 66 00:04:06,326 --> 00:04:07,006 Oh, boy. 67 00:04:07,056 --> 00:04:11,276 For me, without a doubt, that was the release of T5. 68 00:04:12,021 --> 00:04:18,091 And looking at Google's paper about the text-to-text transformer, that set really 69 00:04:18,131 --> 00:04:20,821 the groundwork for prompting, right? 70 00:04:20,831 --> 00:04:25,631 They had a whole bunch of different tasks that you didn't have to change 71 00:04:25,701 --> 00:04:28,531 anything other than some statement 72 00:04:29,191 --> 00:04:32,931 for the model to do that task, and then a colon and then whatever 73 00:04:32,931 --> 00:04:34,611 your input was going to be anyway. 74 00:04:34,611 --> 00:04:36,811 That was groundbreaking to me. 75 00:04:36,811 --> 00:04:39,281 I had been messing around with GPT-2. 76 00:04:39,301 --> 00:04:41,811 I'd been playing with that and trying to shoehorn it into a 77 00:04:41,811 --> 00:04:43,271 product where I was working. 78 00:04:43,541 --> 00:04:49,791 T5 did everything that we were trying to do with GPT-2, and it was incredibly 79 00:04:49,791 --> 00:04:54,371 flexible, it was easy to fine-tune, and for me, that was the hockey stick moment 80 00:04:54,371 --> 00:04:56,871 that "oh wow, no, they're really cooking". 81 00:04:56,871 --> 00:04:57,731 When was that, 82 00:04:57,732 --> 00:05:00,994 for anybody who hasn't heard of 83 00:05:01,049 --> 00:05:01,139 T5? 84 00:05:01,139 --> 00:05:04,927 I think it was 2019. Yeah, "Exploring the Limits of Transfer Learning 85 00:05:04,927 --> 00:05:08,817 with a Unified Text-to-Text Transformer" was October 2019. 
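The text-to-text format described above (a task statement, then a colon, then the input) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `to_text_to_text` is hypothetical, the two prefixes are ones used in the T5 paper, and no model is actually called.

```python
def to_text_to_text(task_prefix: str, text: str) -> str:
    """Build a T5-style input string: '<task statement>: <input text>'.

    Hypothetical helper for illustration; it only formats the string.
    """
    return f"{task_prefix.strip()}: {text.strip()}"


# The same model input format serves different tasks; only the prefix changes.
examples = [
    to_text_to_text("translate English to German", "That is good."),
    to_text_to_text("summarize", "LLMs are taking the world by storm."),
]

for line in examples:
    print(line)
```

The point of the design is that switching tasks means changing only the leading statement, not the model or its output layer.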
86 00:05:08,877 --> 00:05:10,177 It came out in October. 87 00:05:10,197 --> 00:05:13,537 I think I picked it up in November-December of 2019. 88 00:05:13,964 --> 00:05:18,934 Yeah, I think for my hockey stick moment, I was in the industry, 89 00:05:18,944 --> 00:05:23,604 been paying attention, obviously GPT-2 coming around, T5, etc. 90 00:05:23,654 --> 00:05:30,444 But wasn't really seeing the adoption that someone who's working in MLOps 91 00:05:30,714 --> 00:05:35,024 cares more about. I was seeing these models can do really cool things, 92 00:05:35,024 --> 00:05:36,854 but people weren't caring about them. 93 00:05:36,944 --> 00:05:40,774 Sam Altman even said it was like, "we didn't think GPT-3 94 00:05:40,794 --> 00:05:42,624 would be that big of a success. 95 00:05:42,624 --> 00:05:44,974 We thought that would happen once GPT-4 came out". 96 00:05:45,714 --> 00:05:49,304 But I just remember, January 2023, 97 00:05:50,024 --> 00:05:51,784 ChatGPT's been out a month. 98 00:05:52,024 --> 00:05:53,704 It's still essentially in beta. 99 00:05:53,784 --> 00:05:57,674 They just released it to get feedback and to start collecting data 100 00:05:57,674 --> 00:05:59,204 to start improving their model. 101 00:05:59,734 --> 00:06:01,144 But it blew up, right? 102 00:06:01,174 --> 00:06:07,634 I just remember being at a church function and this guy sitting 103 00:06:07,634 --> 00:06:12,224 across the table from me who has no idea anything about AI, right? 104 00:06:12,244 --> 00:06:17,324 I was stuck at this table for an hour and all he could talk about was GPT-3. 105 00:06:17,684 --> 00:06:19,164 He was obsessed with it. 106 00:06:19,564 --> 00:06:20,704 I'm like, oh, wow. 107 00:06:21,364 --> 00:06:26,984 Even people who don't know anything about machine learning or AI or the 108 00:06:26,984 --> 00:06:32,474 industry were really going gung-ho. And his wife was an English teacher. 
109 00:06:32,964 --> 00:06:36,564 She was really scared of it and was like, "how are we gonna help kids 110 00:06:36,564 --> 00:06:42,234 learn how to write and read when they can just go online and now cheat 111 00:06:42,234 --> 00:06:43,424 and write these things and stuff?" 112 00:06:44,104 --> 00:06:47,544 The very beginning of what everyone's had conversations about now. 113 00:06:47,594 --> 00:06:54,284 But he talked about how his brother-in-law owned a website that made fake 114 00:06:54,284 --> 00:06:58,854 articles, you can think like The Onion, and so once it came out, in that month, like 115 00:06:58,854 --> 00:07:04,869 I said, ChatGPT still wasn't a product yet, and anyone who's been following 116 00:07:04,869 --> 00:07:08,749 it knows a lot of those demos just shut down and then never came back up. 117 00:07:08,859 --> 00:07:14,189 His brother-in-law ended up firing like a hundred writers because he's 118 00:07:14,189 --> 00:07:19,739 like: "Oh, ChatGPT can make these funny fake articles and we're good, right?" 119 00:07:19,779 --> 00:07:24,099 That was my hockey stick moment of "okay, we really are changing 120 00:07:24,149 --> 00:07:28,049 when some random guy at church is talking about it all the time". 121 00:07:28,774 --> 00:07:29,974 Yeah, I love that example. 122 00:07:30,004 --> 00:07:34,364 But even for people who are in tech who weren't directly following that 123 00:07:34,364 --> 00:07:36,554 very closely, that was a scary moment. 124 00:07:36,564 --> 00:07:42,364 I remember when I first used Copilot, I was like, what, it just does that. 125 00:07:42,574 --> 00:07:45,454 And three times out of four, it would actually work. 126 00:07:45,724 --> 00:07:46,804 That was a scary moment. 127 00:07:46,854 --> 00:07:51,654 It reverberated through a lot of levels of society, including our own. 
128 00:07:51,884 --> 00:07:57,504 And, I think in many ways, technology and writing code might be the easiest 129 00:07:57,514 --> 00:07:59,714 use case for these kinds of models, right? 130 00:07:59,714 --> 00:08:00,554 Do you agree with that? 131 00:08:00,554 --> 00:08:04,844 I don't know if I completely agree with it, because code is incredibly 132 00:08:04,844 --> 00:08:06,844 syntactically dependent, right? 133 00:08:06,914 --> 00:08:11,594 Every developer who's worked with JavaScript or C++ and then moves 134 00:08:11,594 --> 00:08:13,514 to Python, they feel it, right? 135 00:08:13,534 --> 00:08:16,774 That's one of the biggest complaints: "I hate Python syntax". 136 00:08:16,814 --> 00:08:21,444 "I hate that whitespace matters". It's a little bit more complex than just 137 00:08:21,444 --> 00:08:25,394 repeating whatever natural language happened, but you're absolutely right, 138 00:08:25,414 --> 00:08:28,344 that is one of the best use cases so far. 139 00:08:29,281 --> 00:08:33,631 Because it's better structured than just spoken language, or are there any 140 00:08:33,681 --> 00:08:37,541 other reasons that make it so well suited for that particular application? 141 00:08:37,591 --> 00:08:40,551 Programming languages are not real languages, right? 142 00:08:40,551 --> 00:08:44,571 One of the things that makes it simultaneously very well and ill-suited 143 00:08:44,591 --> 00:08:49,471 for it is how much gets repeated. You use the exact same words, 144 00:08:49,756 --> 00:08:54,706 the exact same tokens to define every function that you make, but then the 145 00:08:54,706 --> 00:08:57,226 function's name can be whatever you want. 146 00:08:57,996 --> 00:09:01,326 And so using the exact same tokens is awesome. 147 00:09:01,326 --> 00:09:03,666 That provides landmarks for the probability as it's 148 00:09:03,666 --> 00:09:04,746 going through all of this. 
149 00:09:05,156 --> 00:09:09,041 But then that input to just say whatever you want and put it in camel 150 00:09:09,041 --> 00:09:13,331 case or snake case or whatever, tons of different formatting for functions. 151 00:09:14,751 --> 00:09:16,761 it makes it a little bit more difficult. 152 00:09:17,146 --> 00:09:19,206 Especially while you're trying to tokenize that, 153 00:09:19,396 --> 00:09:24,876 one of the big benefits with code is the amount of data we have around code. 154 00:09:24,876 --> 00:09:26,546 lots of people are writing code. 155 00:09:26,716 --> 00:09:31,106 they all have very similar ideas of what they're trying to do, of 156 00:09:31,106 --> 00:09:33,711 what they're trying to architect, of what they're trying to design. 157 00:09:33,711 --> 00:09:38,181 and so we're not necessarily worrying about, hallucinations or 158 00:09:38,181 --> 00:09:42,071 fake news or, people disagreeing or other things like that. 159 00:09:42,111 --> 00:09:45,851 there's just a lot of data, that all agrees with each other and 160 00:09:45,851 --> 00:09:47,231 pushes in the same direction. 161 00:09:47,471 --> 00:09:48,251 It makes it good. 162 00:09:48,341 --> 00:09:53,281 there's obviously some negatives of just assuming, some of these LLMs writing 163 00:09:53,281 --> 00:09:57,811 code is going to do things well, but, I think Chris highlighted that already. 164 00:09:58,619 --> 00:10:02,029 it's actually really similar to how regular languages work. 165 00:10:02,129 --> 00:10:06,559 If we have more python data, like Matt's saying, it's going to do better at python. 
166 00:10:07,019 --> 00:10:11,709 And that can create a little bit of a positive feedback loop with LLMs, where 167 00:10:11,709 --> 00:10:15,889 a lot of people want to get into Python, and they're very good at it, but then 168 00:10:15,889 --> 00:10:21,279 when you look at emerging languages like Mojo, for example, it's really difficult 169 00:10:21,289 --> 00:10:25,979 to find that data and so LLMs are worse at it, similar to natural languages 170 00:10:25,979 --> 00:10:30,479 that have a lower number of speakers, a lower presence on the internet. 171 00:10:31,686 --> 00:10:37,096 So is the solution to use an LLM to generate a lot of Mojo and make it 172 00:10:37,096 --> 00:10:39,326 a significant percentage of GitHub? 173 00:10:41,269 --> 00:10:42,309 That'd be fun, dude. 174 00:10:42,599 --> 00:10:46,919 I think there are some problems with synthetic data that can lead 175 00:10:46,919 --> 00:10:48,379 to stuff like model collapse. 176 00:10:48,569 --> 00:10:51,199 I don't know if we're going to see that in the code space, though. 177 00:10:51,539 --> 00:10:53,549 I think we could see that in natural language. 178 00:10:53,749 --> 00:10:55,869 So that might be a valid solution. 179 00:10:56,716 --> 00:10:57,076 Okay. 180 00:10:57,126 --> 00:11:03,276 The date is 13 February, the day before Valentine's Day 2024. 181 00:11:03,276 --> 00:11:05,126 I'm going to ask you for a wild prediction. 182 00:11:05,146 --> 00:11:06,616 Where do you see that going? 183 00:11:06,766 --> 00:11:12,816 Should all kinds of, or maybe any subset of, programmers who produce code as a 184 00:11:12,816 --> 00:11:16,246 job, should they start at least worrying? 185 00:11:16,736 --> 00:11:21,566 Is that something that's going to decrease the pool of available jobs? 186 00:11:22,864 --> 00:11:26,844 No, I don't think it's really going to impact the amount of work. 
187 00:11:27,754 --> 00:11:32,584 I just think about my job, and even when I'm in very technical roles, and I'm 188 00:11:32,584 --> 00:11:38,304 spending 50% of my time on the keyboard, still, it feels like a majority of the 189 00:11:38,304 --> 00:11:42,744 work is still just communicating with stakeholders, understanding exactly what 190 00:11:42,744 --> 00:11:48,254 the problems are, technical writing, design docs, really understanding at 191 00:11:48,254 --> 00:11:50,234 a high level, what you want to build. 192 00:11:50,284 --> 00:11:53,434 To be fair, programmers have been automating the 'writing the 193 00:11:53,434 --> 00:11:55,794 code' portion forever, right? 194 00:11:55,874 --> 00:11:56,794 From the beginning. 195 00:11:57,651 --> 00:12:02,401 yeah, with massive amounts of like scripts and configs that they use. 196 00:12:02,401 --> 00:12:06,651 And that's why they love Vim or Emacs still, right? 197 00:12:06,651 --> 00:12:08,961 It's because they have it configured just right. 198 00:12:08,961 --> 00:12:12,691 And they can move really quickly, because it provides a lot of that 199 00:12:12,741 --> 00:12:16,521 automation for them already, but this is just helping junior engineers 200 00:12:16,521 --> 00:12:21,041 already have all that configuration and set up really quickly, right? 201 00:12:21,091 --> 00:12:27,291 It mostly will just make our jobs a little bit easier, it doesn't remove the need to 202 00:12:27,301 --> 00:12:31,701 really understand the engineering aspect, the architecture aspect, the design 203 00:12:31,721 --> 00:12:34,441 aspect that still is involved with coding. 204 00:12:35,139 --> 00:12:35,689 Oh, yeah. 205 00:12:36,329 --> 00:12:39,909 this is why we love comparing LLMs to a printing press. 206 00:12:40,259 --> 00:12:42,158 That Johannes Gutenberg. 207 00:12:42,159 --> 00:12:44,919 Because did that destroy the writing industry? 
208 00:12:44,969 --> 00:12:49,379 All it did was it destroyed the monopoly that certain organizations 209 00:12:49,519 --> 00:12:50,919 had on publishing books. 210 00:12:51,489 --> 00:12:55,329 Before you had to get a scribe and you had to pay the scribe and you had to 211 00:12:55,329 --> 00:12:59,669 have access to scribes You couldn't just walk up to a printing press and 212 00:13:00,039 --> 00:13:02,669 hit it and then boom you have a book. 213 00:13:02,749 --> 00:13:04,949 You have to have knowledge You have to have an idea. 214 00:13:05,159 --> 00:13:09,319 The printing press just gives you a lower barrier to entry 215 00:13:10,114 --> 00:13:11,964 Which is what we love, right? 216 00:13:12,354 --> 00:13:16,914 For coding, I think Matt is exactly right, that it's a lower barrier to 217 00:13:16,924 --> 00:13:21,204 entry for junior engineers to be able to produce significantly better work. 218 00:13:21,424 --> 00:13:26,104 and in some ways it actually accelerates it, because when you copy and paste what 219 00:13:26,104 --> 00:13:30,654 an LLM gave you and it doesn't work, you have to go figure it out, right? 220 00:13:31,004 --> 00:13:35,384 With the junior engineers, it also helps speed up senior engineers, and 221 00:13:35,684 --> 00:13:37,304 staff engineers and principal engineers. 222 00:13:37,364 --> 00:13:42,194 it's good, and lowers the barrier for the entire industry, we like that. 223 00:13:43,361 --> 00:13:43,701 Yeah. 224 00:13:43,806 --> 00:13:47,606 I've lately been spending lots of time writing chapter 10 of our book, 225 00:13:47,606 --> 00:13:51,981 and in chapter 10, we actually go through a project, where we help you 226 00:13:51,981 --> 00:13:58,471 build your own co pilot and we build the VS Code extension to get it in. 227 00:13:58,481 --> 00:14:03,751 if you want to be running your own LLM on your own computer with your own data, 228 00:14:04,241 --> 00:14:05,941 so that way, you can get your own things. 
229 00:14:06,651 --> 00:14:08,301 We walk through all the steps to do that. 230 00:14:08,331 --> 00:14:12,001 And in some aspects, it's interesting 'cause sometimes 231 00:14:12,591 --> 00:14:15,341 adding an extra feature made the model work, right? 232 00:14:15,411 --> 00:14:18,121 There's still just so much to learn about it. 233 00:14:18,171 --> 00:14:20,221 Ultimately, it comes down to your data, right? 234 00:14:20,621 --> 00:14:22,281 How good your coding data is 235 00:14:22,381 --> 00:14:24,541 is really how well the copilot works, right? 236 00:14:24,591 --> 00:14:28,741 SQL is one of the most repetitive of all of the programming languages. 237 00:14:29,106 --> 00:14:34,036 But true skill with SQL does not involve being good at SQL. 238 00:14:34,036 --> 00:14:36,496 It involves knowing the data, right? 239 00:14:36,506 --> 00:14:41,956 It's knowing which tables to query, how to merge them, how window functions work, all of 240 00:14:41,956 --> 00:14:47,176 that stuff. Knowing exactly what you need to be looking at is the true skill in SQL. 241 00:14:47,776 --> 00:14:51,726 And we're hopefully getting to a point where we can help the 242 00:14:51,726 --> 00:14:54,911 model know the data, right? 243 00:14:54,911 --> 00:14:59,101 We can give it some sort of context for the data that it's going to be looking 244 00:14:59,101 --> 00:15:01,371 at, so that it can generate good SQL. 245 00:15:01,421 --> 00:15:02,271 That's a really good point. 246 00:15:02,281 --> 00:15:06,411 I've actually had lots of mentees who are trying to learn SQL for the first time. 247 00:15:07,081 --> 00:15:13,051 I said: "just use ChatGPT". Generating SQL is actually something that it's really 248 00:15:13,051 --> 00:15:18,111 good at. You don't need GPT-4, like even GPT-3, like even GPT-2, it's not 249 00:15:18,111 --> 00:15:20,841 hard to generate really good SQL syntax. 250 00:15:20,841 --> 00:15:24,321 'Cause it's so simple, it follows a very similar structure. 
251 00:15:24,821 --> 00:15:28,741 But ultimately, you can have it write the SQL, but you're going to have to 252 00:15:28,741 --> 00:15:33,361 go back and figure out how to connect all the pieces and understand your 253 00:15:33,361 --> 00:15:35,081 database and understand your data. 254 00:15:35,181 --> 00:15:38,471 That's a perfect example: understanding how to write the 255 00:15:38,471 --> 00:15:39,761 code is only half the problem. 256 00:15:39,761 --> 00:15:43,021 Understanding how to integrate it is really the bigger problem. 257 00:15:43,021 --> 00:15:47,331 What's the most terrible use case that people are currently 258 00:15:47,331 --> 00:15:48,961 trying to use LLMs for? 259 00:15:49,091 --> 00:15:55,471 What does an LLM in general, or LLMs, what do they suck at the most? 260 00:15:56,591 --> 00:16:03,251 I'm going to say they suck at sequence prediction, which sounds so off. 261 00:16:03,816 --> 00:16:07,876 Because that's what they're made for, but one of the things that I'm seeing 262 00:16:07,876 --> 00:16:12,946 people do is try and automate entire workflows with LLMs, and they're trying 263 00:16:12,946 --> 00:16:18,786 to get the LLM to just do the whole workflow, and they suck at that. 264 00:16:18,966 --> 00:16:20,836 They need all of this stuff to help them. 265 00:16:20,836 --> 00:16:26,486 They need tools, they need RAG, they need specific fine-tuning landmarks 266 00:16:26,516 --> 00:16:30,846 and they need few-shot prompting, they need all sorts of stuff to make 267 00:16:30,846 --> 00:16:34,246 it work, and then it's still up in the air about whether or not it will 268 00:16:34,246 --> 00:16:36,066 do the right task in the right order. 269 00:16:36,878 --> 00:16:39,858 Yeah, I was thinking, I don't know how much I'm seeing this. 
270 00:16:39,858 --> 00:16:45,468 But, three months, six months ago, I was hearing a hundred horror stories 271 00:16:45,468 --> 00:16:51,118 about essentially CEOs being like, "we need LLMs", and like they're magic, 272 00:16:51,118 --> 00:16:55,798 they can do anything. And so it didn't matter what the problem was: "oh, we need 273 00:16:55,808 --> 00:16:59,708 to do outlier detection using LLMs". 274 00:17:00,148 --> 00:17:00,638 No, 275 00:17:00,741 --> 00:17:02,021 use stats for that. 276 00:17:02,548 --> 00:17:05,058 Yeah, outlier detection is really a statistical problem. 277 00:17:05,058 --> 00:17:07,308 It's really a data and math problem. 278 00:17:07,348 --> 00:17:09,508 LLMs are good at natural language. 279 00:17:09,928 --> 00:17:13,658 And so when we can solve a problem using words and communication, 280 00:17:13,908 --> 00:17:15,288 that's when LLMs can get in. 281 00:17:15,288 --> 00:17:21,068 But problems like outlier detection or weather prediction or these 282 00:17:21,068 --> 00:17:22,878 other things, we have algorithms. 283 00:17:22,878 --> 00:17:25,781 Stock market prediction, Super Bowl prediction, 284 00:17:25,941 --> 00:17:30,111 all these things, we have better ways to make predictions. 285 00:17:30,486 --> 00:17:31,966 And it's called math, right? 286 00:17:32,056 --> 00:17:35,906 Fourier transforms, other machine learning algorithms, other things like that. 287 00:17:37,106 --> 00:17:41,006 LLMs are not good at doing those things, 'cause we don't talk 288 00:17:41,016 --> 00:17:43,266 about them in natural language. 289 00:17:43,396 --> 00:17:47,086 We've invented other languages like math just to describe them. 290 00:17:47,556 --> 00:17:48,726 And that's why they're not good. 291 00:17:48,736 --> 00:17:53,896 We can make tools, you can build functions for an LLM to use to do Fourier 292 00:17:53,896 --> 00:17:56,156 transforms and whatever else, right? 
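As a small illustration of the "use stats for that" remark above, here is a minimal sketch of outlier detection with the common interquartile-range (IQR) rule, using only Python's standard library. The function name `iqr_outliers`, the sample data, and the 1.5 multiplier are assumptions chosen for the example; the speakers do not name a specific method.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the conventional choice; it is an assumption here, not a rule.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical sample: response times in milliseconds with one obvious spike.
latencies_ms = [10, 11, 12, 11, 10, 12, 11, 95]
print(iqr_outliers(latencies_ms))
```

A rule like this is deterministic and cheap to run, which is the contrast being drawn with asking a language model to spot the anomaly.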
293 00:17:57,276 --> 00:18:02,026 But getting the LLM to know that it needs to do that is really difficult. 294 00:18:02,026 --> 00:18:07,226 Probably just as difficult as explaining what the Fourier transform 295 00:18:07,256 --> 00:18:11,356 is to an LLM within your training data to get it to be able to replicate it. 296 00:18:11,856 --> 00:18:16,416 This is one thing that makes it almost miraculous when stuff does 297 00:18:16,416 --> 00:18:20,666 work, and that's that feeling that we're chasing right now, and that's 298 00:18:20,756 --> 00:18:24,656 the replicability that we're trying to help people get to in the book. 299 00:18:25,016 --> 00:18:28,096 How do you actually do it, and how do you make sure that your scope 300 00:18:28,096 --> 00:18:31,756 is small enough that it will work repeatedly and you can build a 301 00:18:31,756 --> 00:18:33,586 product off of it? That's difficult. 302 00:18:33,636 --> 00:18:34,906 I'm a big fan of chess. 303 00:18:35,156 --> 00:18:41,461 And, since ChatGPT came out, lots of people have been making memes, or just 304 00:18:41,461 --> 00:18:47,481 like: "Hey, I'll play ChatGPT in chess", and ChatGPT can play chess because we 305 00:18:47,481 --> 00:18:48,961 can talk about it in language, right? 306 00:18:48,971 --> 00:18:54,276 Like e4, move the pawn, or knight to g6, whatever it is. 307 00:18:54,886 --> 00:18:58,896 We have language for it, but ChatGPT has no idea. 308 00:18:58,946 --> 00:19:04,216 It has no idea of the model behind those letter-number combinations. 309 00:19:04,226 --> 00:19:06,956 All it knows is that there's certain things it can do, right? 310 00:19:07,396 --> 00:19:11,746 It writes words, and so when they do this in these videos or 311 00:19:11,766 --> 00:19:14,676 memes, they just let ChatGPT do whatever it says, right? 
312 00:19:14,676 --> 00:19:18,556 It just magically creates a knight out of nowhere, and magically will take 313 00:19:18,556 --> 00:19:23,586 its own pieces as it moves its pieces around, it's always pretty funny. 314 00:19:23,596 --> 00:19:27,216 And even though it's cheating the entire way, it almost always loses, right? 315 00:19:27,216 --> 00:19:31,061 'Cause it doesn't have an understanding of chess, like it doesn't 316 00:19:31,091 --> 00:19:32,721 have that model underneath it. 317 00:19:33,771 --> 00:19:36,831 Sure, we can talk about it in language, but not really, right? 318 00:19:36,831 --> 00:19:42,571 So we still have better ways to play chess, AlphaZero, et cetera. 319 00:19:43,001 --> 00:19:46,851 Stockfish, like there are engines out there that play chess really well. 320 00:19:46,931 --> 00:19:51,661 And we don't need to make LLMs good at chess, but that's a very good example 321 00:19:51,671 --> 00:19:53,231 of one of the things it's not good at. 322 00:19:53,801 --> 00:19:59,901 I've seen someone on Twitter who said "I'm gonna give an LLM $1000 or 323 00:19:59,901 --> 00:20:04,751 whatever initial amount, and I'm gonna ask it how to best invest it". 324 00:20:04,781 --> 00:20:06,321 I didn't follow where it went. 325 00:20:06,341 --> 00:20:09,621 But I think a lot of people had the same idea. 326 00:20:09,621 --> 00:20:11,381 This is some kind of genius system. 327 00:20:11,421 --> 00:20:16,031 I'm just gonna be its flesh and bones agent in the real world 328 00:20:16,896 --> 00:20:18,076 and hope for the best. 329 00:20:18,276 --> 00:20:20,286 So I think that kind of goes back to your chess thing. 330 00:20:20,286 --> 00:20:24,846 So excuse me for that, but I have to ask you about AGI, 331 00:20:25,866 --> 00:20:27,976 Artificial General Intelligence. 332 00:20:28,046 --> 00:20:30,746 Any chance for that happening anytime soon? 333 00:20:31,076 --> 00:20:32,006 What's your prediction? 
334 00:20:32,006 --> 00:20:33,836 Not with our current systems. 335 00:20:33,896 --> 00:20:38,826 No, I don't think AGI is ever going to come out of quadratic 336 00:20:38,836 --> 00:20:42,126 equations, like not a single chance. 337 00:20:43,166 --> 00:20:47,966 Maybe if there are better drop-in sub-quadratic replacements, stuff 338 00:20:47,966 --> 00:20:49,886 like Hyena, I've tested that out. 339 00:20:49,886 --> 00:20:51,076 I think it's really cool. 340 00:20:51,386 --> 00:20:55,696 But, the fact that attention, the query key value attention, 341 00:20:55,746 --> 00:20:58,496 ultimately generates complex numbers. 342 00:20:58,556 --> 00:21:03,326 I think that is a little too much for AGI at the moment. 343 00:21:03,326 --> 00:21:07,206 So you're not one of those people who secretly hope that OpenAI has 344 00:21:07,206 --> 00:21:08,886 something they're gonna release soon. 345 00:21:08,936 --> 00:21:10,936 I don't think they have it, right? 346 00:21:10,936 --> 00:21:12,566 I'll be hopeful, sure. 347 00:21:12,566 --> 00:21:14,186 If it comes out, that's great. 348 00:21:14,196 --> 00:21:16,216 Yeah, I'm of the same mind as Chris. 349 00:21:16,216 --> 00:21:17,486 I hope they keep pursuing it. 350 00:21:17,546 --> 00:21:21,346 We've gotten major breakthroughs from what they pursued. 351 00:21:21,396 --> 00:21:25,756 It's very possible AGI will happen in my lifetime, I'm still pretty young. We 352 00:21:25,756 --> 00:21:30,276 keep on making advances really quickly, but are we relatively close to it? 353 00:21:30,276 --> 00:21:30,996 Probably not. 354 00:21:31,106 --> 00:21:31,436 No. 355 00:21:31,436 --> 00:21:37,056 Oh, the thing about progress though is that it's very rarely linear. It 356 00:21:37,506 --> 00:21:39,506 tends to have a very weird curve. 357 00:21:39,506 --> 00:21:43,846 So that's why all the predictions are so funny, but hey, I had to ask you anyway. 358 00:21:43,846 --> 00:21:45,836 No, I think it's a great question. 
359 00:21:46,998 --> 00:21:52,328 Okay, let's delve a little bit into a portion of your book. 360 00:21:52,428 --> 00:21:56,598 It's basically describing the two options that you have today. 361 00:21:56,908 --> 00:22:01,668 You can either go and pay some money to OpenAI, maybe Google, or 362 00:22:01,678 --> 00:22:03,508 somebody else, or you can build. 363 00:22:03,508 --> 00:22:05,108 So you've got buy versus build. 364 00:22:05,798 --> 00:22:10,978 Could you talk to me a little bit about how someone would decide 365 00:22:11,008 --> 00:22:15,353 about this as of February 13, 2024? 366 00:22:16,233 --> 00:22:19,373 What are the things to consider, and what are the weights that 367 00:22:19,373 --> 00:22:21,643 you would put in, and biases? 368 00:22:21,693 --> 00:22:25,663 The basic consideration is just your use case, right? 369 00:22:25,703 --> 00:22:30,343 If you just want to test something out, you're a student and you don't have a 370 00:22:30,343 --> 00:22:34,923 lot of budget, and you want something up and running so that you have LLM 371 00:22:34,923 --> 00:22:43,113 experience, I would say just shell out for that: ChatGPT Plus, or buy Anthropic, 372 00:22:43,123 --> 00:22:48,221 or Google Bard has a fantastic API, or I guess Gemini now. Just do it. 373 00:22:48,311 --> 00:22:49,701 It's not that big of a thing. 374 00:22:49,741 --> 00:22:54,581 If your product that you're trying to ship is inconsequential and you 375 00:22:54,581 --> 00:22:57,891 don't need it to be right every time, you just want to sprinkle the 376 00:22:57,901 --> 00:23:00,341 AI pixie dust on it, just buy it. 
377 00:23:00,581 --> 00:23:04,631 If your use case goes deeper than that, though, if you want to be able to build 378 00:23:04,631 --> 00:23:08,921 your own, if you need to make sure that it says the right things all the time, 379 00:23:09,341 --> 00:23:13,411 if you need it to behave a little bit more deterministically, there have been 380 00:23:13,431 --> 00:23:17,451 probably a thousand case studies in the last year of people building products on 381 00:23:17,451 --> 00:23:24,491 top of ChatGPT and then OpenAI rolling out an update that changes how 382 00:23:24,491 --> 00:23:29,031 ChatGPT behaves, and they don't have any way to measure all of the different 383 00:23:29,181 --> 00:23:30,621 ways that it will change it, right? 384 00:23:30,621 --> 00:23:35,961 There are 175 billion parameters in GPT-3 alone. They don't know it's going 385 00:23:35,961 --> 00:23:37,661 to break your program down the line. 386 00:23:37,931 --> 00:23:41,311 They're just going to update it for what they consider to be better. 387 00:23:41,871 --> 00:23:45,791 And those programs break constantly. 388 00:23:46,391 --> 00:23:47,741 That doesn't mean you can't fix them. 389 00:23:47,741 --> 00:23:51,721 It's just a much bigger problem of maintenance than I think a lot of 390 00:23:51,721 --> 00:23:53,811 people are expecting going into it. 391 00:23:54,361 --> 00:23:57,241 So if you want to have to maintain it less, build your own. 392 00:23:57,241 --> 00:24:01,931 Yeah, I think the other aspect is like you want that control, right? 393 00:24:01,961 --> 00:24:07,451 There are lots of examples of companies who essentially built a small shell 394 00:24:07,471 --> 00:24:11,251 around ChatGPT that did something unique. 395 00:24:11,771 --> 00:24:15,871 And then, months down the line, now ChatGPT just does 396 00:24:15,871 --> 00:24:17,261 that out of the gate, right? 397 00:24:17,361 --> 00:24:20,291 Their value proposition just completely disappeared.
398 00:24:20,321 --> 00:24:22,501 And that's because they didn't have control over the model. 399 00:24:22,521 --> 00:24:27,691 They didn't have control over what it did. It's just interesting, right? 400 00:24:27,711 --> 00:24:30,111 Because I say these things and things have changed over time. 401 00:24:30,111 --> 00:24:33,451 But when ChatGPT first came out, it was free, it was a demo, and they were 402 00:24:33,451 --> 00:24:35,171 specifically doing it to collect data. 403 00:24:35,771 --> 00:24:38,571 And that's what they did, they used collected data to improve their models. 404 00:24:39,201 --> 00:24:41,691 And that's what they continued to do for a while, right? 405 00:24:41,701 --> 00:24:43,121 Oh no, they're back. 406 00:24:43,591 --> 00:24:45,621 It's their terms of service, right? 407 00:24:45,921 --> 00:24:49,271 If you want them to save your chat, so that you can return to 408 00:24:49,271 --> 00:24:52,841 it and ask more questions, they get to train off of your data. 409 00:24:53,321 --> 00:24:57,441 So if you want to put anything private or sensitive in there, like 410 00:24:57,451 --> 00:24:59,461 it's over, you've just leaked it. 411 00:24:59,461 --> 00:25:02,321 They're back and forth about what data they're collecting, what data they're 412 00:25:02,321 --> 00:25:07,991 not collecting, and if you're with an enterprise customer, like maybe you 413 00:25:07,991 --> 00:25:13,131 can make certain rules and things like that, and oftentimes they won't. It's 414 00:25:13,131 --> 00:25:17,421 a minefield for how people are using it, and so it's just something important 415 00:25:17,431 --> 00:25:23,566 to take into consideration. If your LLM is doing something magical, 416 00:25:23,576 --> 00:25:27,476 that's really core to your business, that is really driving customers, 417 00:25:28,286 --> 00:25:29,396 you want to control that. 418 00:25:29,636 --> 00:25:35,226 You want to make sure that the model is working exactly as intended.
419 00:25:35,316 --> 00:25:38,966 You're not getting updates randomly that break your application. 420 00:25:39,416 --> 00:25:45,096 You're also controlling the data flow, you're making sure that you're not 421 00:25:45,306 --> 00:25:48,606 accidentally training your competitor's model, and other things like that. 422 00:25:48,606 --> 00:25:52,496 And there are just lots of aspects where it's just important to 423 00:25:53,636 --> 00:25:55,026 make sure that you own it. 424 00:25:55,366 --> 00:25:59,321 And, no, that's not necessarily everyone's concern, right? 425 00:25:59,411 --> 00:26:02,701 If you're a student or you're just doing some side project or anything, there are 426 00:26:02,701 --> 00:26:07,031 lots of APIs out there that are very cheap that can get you up and running. 427 00:26:07,091 --> 00:26:11,391 There are literally hundreds of Hugging Face Spaces that are free APIs 428 00:26:11,611 --> 00:26:14,061 with LLMs running behind them, and you can just hit 429 00:26:14,061 --> 00:26:15,891 them whenever you want, right? 430 00:26:16,493 --> 00:26:18,843 Unless you're queuing behind a thousand other people. 431 00:26:18,843 --> 00:26:19,843 Yeah, exactly. 432 00:26:19,843 --> 00:26:25,023 I liked the example you gave in the book. I think people at Latitude, the Dungeons 433 00:26:25,193 --> 00:26:29,563 & Dragons people, would agree with a lot of what you're saying now, but can you tell 434 00:26:29,563 --> 00:26:32,013 the story of what happened with them? 435 00:26:32,753 --> 00:26:37,483 Latitude is a local company that was here in Utah. 436 00:26:37,543 --> 00:26:40,693 It was put together by two guys from BYU. 437 00:26:41,093 --> 00:26:43,903 GPT-2 came out several years ago. 438 00:26:43,903 --> 00:26:45,933 They're like, "Oh, this is mind-boggling. 439 00:26:46,243 --> 00:26:48,283 Let's build a game off of it!"
440 00:26:48,813 --> 00:26:52,063 And what they came up with was like a dungeon crawler, a text 441 00:26:52,063 --> 00:26:56,373 based game it was really neat, because it would just generate, an 442 00:26:56,413 --> 00:26:57,833 infinite amount of opportunities. 443 00:26:57,833 --> 00:27:00,443 And so it created this 'choose your own adventure'. 444 00:27:01,503 --> 00:27:05,373 It got relatively big in the space, and lots of people enjoyed playing it. 445 00:27:05,463 --> 00:27:11,523 things were going really good, and then OpenAI GPT-3 came out, they offered it to 446 00:27:11,523 --> 00:27:15,803 them, hey, we can, we have this new model, it's a lot better, why don't you try it? 447 00:27:15,853 --> 00:27:19,353 they played around with it, and "oh yeah, this is, it's much more descriptive, 448 00:27:19,353 --> 00:27:23,878 it's much more interesting, it's really great", There was a lot of excitement 449 00:27:23,878 --> 00:27:29,338 around it, however, it turned out that the model itself, had a propensity 450 00:27:29,338 --> 00:27:34,638 to, generate smut, and it got really concerning people would write like, 451 00:27:34,638 --> 00:27:39,428 "I'm an eight year old girl", and then the model would complete it saying 452 00:27:39,438 --> 00:27:41,168 "....and I'm wearing a skimpy outfit", 453 00:27:41,168 --> 00:27:44,418 And oh, whoa, like the player didn't want that, but like the model generated it. 454 00:27:45,038 --> 00:27:50,108 there became this big feud between OpenAI and Latitude about creating filters. 455 00:27:50,598 --> 00:27:53,338 "hey, we don't want your players doing that. 456 00:27:53,348 --> 00:27:54,518 We don't like that". 457 00:27:54,528 --> 00:27:58,048 And, Latitude's "okay, we'll create some filters" and things like that. 458 00:27:58,048 --> 00:28:00,428 And it devolved really quickly. 
459 00:28:00,448 --> 00:28:03,568 Latitude being a very startup, not necessarily knowing everything 460 00:28:03,568 --> 00:28:08,668 they were doing, they built a very shaky filtering system, and then 461 00:28:09,078 --> 00:28:10,678 OpenAI was "that's not good enough". 462 00:28:10,678 --> 00:28:13,908 So then they started banning players, and so eventually we got to this 463 00:28:13,918 --> 00:28:18,668 territory where players - paying customers would be playing a game, the 464 00:28:18,668 --> 00:28:23,028 model would randomly generate, something that the filtering system didn't 465 00:28:23,028 --> 00:28:24,618 like, and then they would get banned. 466 00:28:24,618 --> 00:28:29,178 Cause it's like the game just did itself. 467 00:28:30,238 --> 00:28:33,668 It was a very complicated time, and there was lots of back and 468 00:28:33,668 --> 00:28:38,488 forth between Latitude, who's a small company, and OpenAI. 469 00:28:38,488 --> 00:28:45,408 There's lots of ' he said they said' going on, but ultimately, it's just this 470 00:28:46,108 --> 00:28:53,868 position where Latitude They had this game that was completely dependent on OpenAI's 471 00:28:53,898 --> 00:29:01,268 model to generate good output, and it really caused a lot of drama between 472 00:29:01,268 --> 00:29:07,878 the players and Latitude and, OpenAI in the background and that is a critical 473 00:29:07,968 --> 00:29:13,868 example of LLM was very critical to their business, If they owned it, then they 474 00:29:13,868 --> 00:29:17,708 could have controlled it, they could have made sure that from the model aspect, 475 00:29:17,708 --> 00:29:21,538 they could have trained the model to make sure it didn't do any of those things. 476 00:29:22,068 --> 00:29:26,118 And then they would never need to play the little blame game, right? 477 00:29:26,128 --> 00:29:27,518 Nobody likes to play that game. 478 00:29:27,638 --> 00:29:31,218 That's whose fault is it, that the model is generating bad stuff. 
479 00:29:31,248 --> 00:29:33,618 Is it the player who's prompting it? 480 00:29:33,848 --> 00:29:38,868 Is it Latitude who has some systems for tokenizing and preparing player 481 00:29:38,868 --> 00:29:40,718 output before it goes to OpenAI? 482 00:29:41,048 --> 00:29:43,898 Is it OpenAI because their model is generating that? 483 00:29:43,928 --> 00:29:48,333 Is it Latitude for post processing the content from OpenAI before 484 00:29:48,333 --> 00:29:49,163 they serve it to the player. 485 00:29:49,223 --> 00:29:51,603 I don't even know if it really matters who's to blame. 486 00:29:51,703 --> 00:29:53,403 it's just a sucky game to play. 487 00:29:53,453 --> 00:29:58,693 and that's like the ultimate example of why you might want to consider 488 00:29:58,693 --> 00:30:04,403 build versus buy is if you buy from any provider, we're picking on OpenAI here, 489 00:30:04,403 --> 00:30:08,983 because they're a big player, but you buy from Anthropic, you buy from the guys down 490 00:30:08,983 --> 00:30:13,113 the street, the startup that just barely came up and they're offering for half 491 00:30:13,113 --> 00:30:15,883 the price of whatever, Buy from anybody, 492 00:30:15,903 --> 00:30:18,363 and you will eventually have to play that blame game. 493 00:30:18,363 --> 00:30:22,883 we had another example in there of some lawyers who generated, cases that didn't 494 00:30:22,883 --> 00:30:30,773 exist they asked ChatGPT about cases and it came up with a perfect response. 495 00:30:31,433 --> 00:30:32,683 a little too perfect. 496 00:30:32,693 --> 00:30:34,833 It hallucinated stuff that didn't exist. 497 00:30:34,833 --> 00:30:37,583 and, is it ChatGPT's fault? 498 00:30:37,603 --> 00:30:41,083 Is it OpenAI's fault for, allowing their model to make 499 00:30:41,083 --> 00:30:43,063 stuff up and behave dishonestly? 500 00:30:43,583 --> 00:30:46,023 Or is it the lawyer's fault for not checking it? 501 00:30:46,043 --> 00:30:46,763 who cares? 
502 00:30:46,763 --> 00:30:49,453 The problem is that it's not locked down. 503 00:30:49,453 --> 00:30:50,823 It's non-deterministic. 504 00:30:50,823 --> 00:30:56,653 Yeah, in a way, as I was reading the chapter on that, it makes 505 00:30:56,653 --> 00:31:03,308 me think of using a machine to maybe do some farm work. 506 00:31:03,318 --> 00:31:06,648 Let's say that you're plowing a field and you're using a 507 00:31:06,658 --> 00:31:08,628 horse versus a machine, right? 508 00:31:08,628 --> 00:31:11,388 A machine might break, but in a predictable way. 509 00:31:11,428 --> 00:31:14,118 And if you've got a mechanic around, they'll come and fix it. 510 00:31:14,178 --> 00:31:18,578 A horse can get scared, or it has a bad day, or it can be moody. 511 00:31:19,478 --> 00:31:21,538 And it can come up with something new. 512 00:31:21,638 --> 00:31:23,778 So you always have to be careful with that. 513 00:31:23,908 --> 00:31:28,608 Is that an accurate feeling for someone who's working with these LLMs day-to-day? 514 00:31:29,578 --> 00:31:31,608 You work with some kind of animal? 515 00:31:32,388 --> 00:31:36,268 One of the most annoying things is even if you set the seed of it, so 516 00:31:36,268 --> 00:31:40,138 the random generator is going to be the same every single time, you 517 00:31:40,138 --> 00:31:43,598 can still give it the same prompt and get something different out. 518 00:31:43,648 --> 00:31:49,298 The truly awesome thing about LLMs is the number of non-linear activations 519 00:31:49,348 --> 00:31:51,718 that are going through the model, right? 520 00:31:52,138 --> 00:31:57,578 It's creating incredible, non-linear jumps throughout that dimensional 521 00:31:57,578 --> 00:31:59,328 space that the embeddings are in. 522 00:31:59,998 --> 00:32:01,438 You just can't really predict it. 523 00:32:01,438 --> 00:32:02,798 It is a little bit like an animal. 524 00:32:05,585 --> 00:32:08,895 The fact that like we can prompt engineer at all,
525 00:32:09,480 --> 00:32:11,480 it's a little bit telling of where we are, right? 526 00:32:11,480 --> 00:32:15,980 Cause like prompt engineering, you can change the spaces, the white space 527 00:32:15,980 --> 00:32:19,870 inside of your prompt and it can end up giving you a completely different result. 528 00:32:19,970 --> 00:32:24,290 We're still in a very interesting area, where we're trying to create 529 00:32:24,300 --> 00:32:28,960 better ways to communicate with the LLM and get predictable outputs. 530 00:32:28,970 --> 00:32:31,670 But the fact that we can do that at all is 531 00:32:32,070 --> 00:32:33,380 a bit of a miracle, right? 532 00:32:33,480 --> 00:32:34,720 You can't do that with a human. 533 00:32:35,350 --> 00:32:39,000 A human isn't going to be tricked into saying something different. 534 00:32:39,020 --> 00:32:41,120 Humans are tricked all the time, but not necessarily in the 535 00:32:41,120 --> 00:32:42,710 same way that we do with LLMs. 536 00:32:42,710 --> 00:32:46,665 It's a very interesting world we are in, and a lot of people are having 537 00:32:46,665 --> 00:32:49,145 that horse versus machine experience. 538 00:32:49,195 --> 00:32:51,775 Let's talk about the cost a little bit. 539 00:32:51,955 --> 00:32:57,345 You mentioned that it's super cheap to pay some big company to use their thing. 540 00:32:57,845 --> 00:33:02,685 Let's focus for a minute on the cost of actually building your own LLM. 541 00:33:02,745 --> 00:33:06,275 If I wanted to build one of these foundational models, 542 00:33:06,365 --> 00:33:13,305 let's say that I take one of those 75TB corpora from the internet and I'm 543 00:33:13,305 --> 00:33:16,975 feeling particularly GPU poor that day. 544 00:33:17,245 --> 00:33:22,965 How much money do I need to have in my little piggy bank to get something useful? 545 00:33:24,075 --> 00:33:25,265 That's difficult, man. 546 00:33:26,765 --> 00:33:31,195 Because you're either paying for a GPU, right?
547 00:33:31,215 --> 00:33:36,455 Or a suite of GPUs in order to parallelize it so that you can ingest 548 00:33:36,455 --> 00:33:38,285 that over a short period of time. 549 00:33:38,785 --> 00:33:43,835 Or technically with a lot of this stuff, you can load it onto a [Geforce] 3090, 550 00:33:43,835 --> 00:33:50,885 I've done this personally, you can train in FP16, you can train up to, about, 13 551 00:33:50,895 --> 00:33:55,835 billion parameters pretty effectively, and pretty cheaply, on a 3090. 552 00:33:56,555 --> 00:33:59,655 You have to be a little bit smart about your data loading, you have to make 553 00:33:59,655 --> 00:34:03,295 sure you're streaming stuff you have to pay for the data storage anyway, it's 554 00:34:03,335 --> 00:34:07,385 incredibly slow, you have to do gradient checkpointing, you have to, do like 555 00:34:07,395 --> 00:34:13,365 gradient accumulation steps, which slow down the training even more, I trained a 556 00:34:13,715 --> 00:34:19,195 little bit bigger than that, it was about a 20 billion parameter model on my 3090, 557 00:34:19,665 --> 00:34:27,295 but what I don't, generally talk about is it took a year of just running to do that. 558 00:34:27,395 --> 00:34:32,275 it was horrendous and that all culminated in a company giving me a 559 00:34:32,275 --> 00:34:36,785 cease and desist, so I couldn't even release it, so you're either paying. 560 00:34:37,125 --> 00:34:42,765 A lot of money, hundreds of thousands of dollars in order to get something quick. 561 00:34:42,895 --> 00:34:47,465 Especially with 75TB of text or more, grab your own data, get 562 00:34:47,475 --> 00:34:51,655 more data, and you're paying to store and to process all of that. 563 00:34:51,905 --> 00:34:53,595 And that costs tons of money. 564 00:34:53,855 --> 00:35:00,750 Or you are not paying the money, but it takes a really long time and makes all 565 00:35:00,750 --> 00:35:04,400 of your shareholders really frustrated because you're ruining go to market. 
566 00:35:04,410 --> 00:35:06,190 You're taking too long. 567 00:35:06,210 --> 00:35:10,390 You're not going to be the first in the space. It's a huge trade-off. 568 00:35:11,208 --> 00:35:15,278 As with many things, you can trade time or money, and 569 00:35:15,448 --> 00:35:17,108 training an LLM is very similar. 570 00:35:17,158 --> 00:35:23,338 I think they estimated, huge models that we see, like ChatGPT things, 571 00:35:23,408 --> 00:35:27,798 you're probably paying somewhere like, what was it, like a half million? 572 00:35:27,808 --> 00:35:31,958 I think they say, and that's just for the training, we're not even 573 00:35:31,958 --> 00:35:37,108 talking about all the experts you have to pay and buy in order 574 00:35:37,250 --> 00:35:38,130 data curation, 575 00:35:38,200 --> 00:35:38,460 man. 576 00:35:39,388 --> 00:35:42,478 On the very far end on the expensive side, 577 00:35:42,528 --> 00:35:45,998 it gets really expensive really quickly to train these models, just because of 578 00:35:46,048 --> 00:35:51,538 buying enough GPUs in order to parallelize this to do it within reasonable time, and 579 00:35:51,548 --> 00:35:55,528 just the sheer volume of data you have to run through to train all the parameters. 580 00:35:55,528 --> 00:36:00,958 It gets really expensive, but on the other end there are lots of good 581 00:36:00,958 --> 00:36:04,928 open source models that have done that main pre-training already. 582 00:36:04,988 --> 00:36:10,138 And so you can grab one of those, you can train it with something like 583 00:36:10,138 --> 00:36:16,868 LoRA, where you only need a handful of samples and maybe like 10 minutes 584 00:36:16,898 --> 00:36:21,888 if that, and you can train it on a very simple GPU and you have something 585 00:36:21,888 --> 00:36:26,443 fine-tuned for what you need, and getting under $200 is very reasonable. 586 00:36:26,593 --> 00:36:27,978 $150, $20.
587 00:36:28,803 --> 00:36:33,143 It's very possible to train, these models with certain 588 00:36:33,143 --> 00:36:34,433 methods to get what you need. 589 00:36:35,753 --> 00:36:42,193 So does it mean that in a kind of natural, almost biological like evolution we're 590 00:36:42,193 --> 00:36:48,173 going to end up with few primary models that a lot of the different models branch 591 00:36:48,223 --> 00:36:51,153 off of, instead of, reinventing the wheel? 592 00:36:51,713 --> 00:36:53,263 That's where we're at currently. 593 00:36:53,648 --> 00:37:00,238 I hope that it doesn't stay that way, because I really enjoy seeing 594 00:37:00,238 --> 00:37:03,898 new people create new models for new use cases and all this stuff. 595 00:37:04,428 --> 00:37:10,313 so I hope it doesn't stay that way, but I do see a lot of value in creating industry 596 00:37:10,313 --> 00:37:15,293 standards, at least around how you are actually writing the binary files, how 597 00:37:15,293 --> 00:37:17,063 are the weights actually being stored? 598 00:37:17,073 --> 00:37:19,083 What do the different layers look like? 599 00:37:19,153 --> 00:37:22,943 I, think that standardizing what the model looks like so that you can load 600 00:37:22,943 --> 00:37:25,653 it as flexibly as possible is awesome. 601 00:37:27,183 --> 00:37:32,363 I would like to see more open source models, which is funny considering 602 00:37:32,383 --> 00:37:37,483 there are thousands of open source fine tuned versions and hundreds 603 00:37:37,553 --> 00:37:42,563 of open source foundational models on the Hugging Face Hub right now. 604 00:37:42,943 --> 00:37:44,033 I want more, right? 605 00:37:44,113 --> 00:37:44,983 I'm greedy, man. 
606 00:37:46,785 --> 00:37:51,355 To me, it sounds like basically every week there is another one that's better 607 00:37:51,355 --> 00:37:56,805 at something, and if you look at the Hugging Face LLM leaderboard, it's 608 00:37:56,805 --> 00:38:04,465 changing by the hour, literally, and it looks like a gold rush in many ways, but 609 00:38:04,655 --> 00:38:08,355 I like this gold rush much better than the crypto one a couple of years ago. 610 00:38:09,558 --> 00:38:12,198 Yeah, man, there's a lot higher chance that you'll come out 611 00:38:12,198 --> 00:38:17,468 of this gold rush with a great product than with the crypto one. 612 00:38:17,518 --> 00:38:22,108 Yeah, there's a lot there, and just to summarize that into one sentence, 613 00:38:22,608 --> 00:38:29,908 you can probably fine-tune even a gigantic model for around $200 to $500. 614 00:38:32,233 --> 00:38:33,693 And you can go lower than that, 615 00:38:33,713 --> 00:38:37,993 even, if you are smart about how you're doing it, versus training from scratch, 616 00:38:38,013 --> 00:38:42,883 which either is going to take an inordinate amount of time or will cost 617 00:38:43,533 --> 00:38:45,373 thousands and thousands of dollars. 618 00:38:46,530 --> 00:38:50,530 So I'm willing to bet money that a lot of our listeners are going to pause 619 00:38:50,550 --> 00:38:52,590 this now and start Googling furiously: 620 00:38:52,590 --> 00:38:55,410 How do I fine-tune a model? 621 00:38:56,000 --> 00:38:58,790 Where would you point them as a good starting point? 622 00:38:58,910 --> 00:39:04,275 Any particular paper, any particular company, anything that's a 623 00:39:04,275 --> 00:39:06,435 good place to start with that? 624 00:39:07,303 --> 00:39:09,663 A bit selfishly, I would say you should buy our book.
625 00:39:09,713 --> 00:39:16,863 We talk about probably the main ways to train in chapter 5 of our book. 626 00:39:16,893 --> 00:39:18,113 I was going to say that, but I 627 00:39:18,113 --> 00:39:19,943 was going to say it last, right? 628 00:39:19,963 --> 00:39:21,223 Cause we do go over it. 629 00:39:21,523 --> 00:39:26,473 The book is primarily about production environments, but you can't really 630 00:39:26,473 --> 00:39:29,518 put a model in production if you don't know how to work with it. 631 00:39:29,518 --> 00:39:31,348 So we have stuff on fine-tuning. 632 00:39:31,348 --> 00:39:34,328 We have stuff on parameter-efficient fine-tuning, on low-rank 633 00:39:34,328 --> 00:39:35,908 adaptation, the whole deal. 634 00:39:36,438 --> 00:39:41,298 YouTube is actually probably one of your best resources right now, because 635 00:39:41,348 --> 00:39:48,108 it has amazing content creators that show you how to do it in whatever 636 00:39:48,118 --> 00:39:49,658 format you're comfortable in. 637 00:39:49,658 --> 00:39:55,448 So if you're a C++ developer, there are YouTube videos on how to fine-tune a model 638 00:39:55,498 --> 00:39:59,238 and create a LoRA using llama.cpp, right? 639 00:39:59,238 --> 00:40:01,118 It's not even all that difficult. 640 00:40:01,118 --> 00:40:05,258 You just have to convert a model into a GGUF format and boom, you're there. 641 00:40:05,258 --> 00:40:06,668 You can do it on a CPU. 642 00:40:06,718 --> 00:40:10,463 It'll take a long time, but you can do it in whatever quantization 643 00:40:10,463 --> 00:40:11,363 you want and everything.
644 00:40:12,053 --> 00:40:17,363 YouTube will meet you where you're at. If you want to learn something a little bit 645 00:40:17,363 --> 00:40:21,563 more industry-standard so that you could potentially get employment in this area, 646 00:40:22,173 --> 00:40:27,798 PyTorch has amazing documentation, fantastic tutorials, and they're one of the 647 00:40:27,808 --> 00:40:32,098 best at really making it feel like you're playing with, let's say, "big boy Legos". 648 00:40:32,728 --> 00:40:39,928 You're like building the model using their little Lego pieces. Pretty cool. If you need 649 00:40:39,928 --> 00:40:42,193 something a bit more high level than that, 650 00:40:42,573 --> 00:40:48,613 Hugging Face, I think, is the industry standard for working in between a whole 651 00:40:48,613 --> 00:40:52,583 bunch of different frameworks, whether that's PyTorch or TensorFlow or whatever 652 00:40:52,583 --> 00:40:54,593 other framework you're working with, ONNX. 653 00:40:55,233 --> 00:40:59,193 Hugging Face has abstracted away a lot of the difficulty of setting 654 00:40:59,213 --> 00:41:04,003 up models for fine-tuning, cause in PyTorch you have to build out the 655 00:41:04,003 --> 00:41:07,623 exact model architecture just to load the weights and then fine-tune it. 656 00:41:08,113 --> 00:41:10,973 Hugging Face already has the class built for you. 657 00:41:13,188 --> 00:41:16,078 I would point to those. If you need more explanation, like 658 00:41:16,078 --> 00:41:17,798 Coursera is a fantastic place. 659 00:41:17,798 --> 00:41:21,688 DeepLearning.AI on Coursera and on their own site, 660 00:41:22,178 --> 00:41:24,418 that's Andrew Ng's education stuff. 661 00:41:24,428 --> 00:41:28,368 That's where I got my start with machine learning, was Andrew Ng's 662 00:41:28,388 --> 00:41:30,738 machine learning course on Coursera. 663 00:41:30,748 --> 00:41:31,778 It was awesome. 664 00:41:31,808 --> 00:41:32,608 Fantastic.
665 00:41:32,658 --> 00:41:36,578 Jeremy Howard is also amazing in that area of creating content for 666 00:41:36,578 --> 00:41:40,388 people starting out and learning from beginner to advanced level. 667 00:41:40,498 --> 00:41:42,198 He's at fast.ai. 668 00:41:42,198 --> 00:41:44,188 I, yeah, I strongly recommend all of those 669 00:41:46,200 --> 00:41:47,030 and your book. 670 00:41:48,358 --> 00:41:51,288 Yeah, we ingested a lot of those in order to write the book. 671 00:41:52,438 --> 00:41:58,618 Our book is a very nice high-level overview of the key things you want 672 00:41:58,618 --> 00:42:02,978 to be looking at and like different methodologies, from training from 673 00:42:02,978 --> 00:42:05,658 scratch to basic fine-tuning to 674 00:42:06,913 --> 00:42:11,323 model distillation to LoRA and PEFT and things like that. 675 00:42:11,373 --> 00:42:15,253 We definitely give a high-level overview, we give code samples and show you that. 676 00:42:15,303 --> 00:42:19,203 But ultimately, if you really wanted to get into it, yeah, there 677 00:42:19,223 --> 00:42:21,243 are other resources out there. 678 00:42:21,263 --> 00:42:25,383 I know Manning has another book coming out, specifically 679 00:42:25,393 --> 00:42:28,153 all about training LLMs. 680 00:42:28,203 --> 00:42:30,563 There are definitely other places you can go, but 681 00:42:30,913 --> 00:42:34,093 if you're looking for the quick, summarized version of all of 682 00:42:34,093 --> 00:42:37,033 these things, our book is actually a really good resource for it. 683 00:42:37,083 --> 00:42:41,313 One other thing that I like about your book is the part where you 684 00:42:41,363 --> 00:42:46,833 build up the different breakthrough moments throughout the world of 685 00:42:46,833 --> 00:42:52,043 mathematics that ultimately led to 'Attention Is All You Need', and 686 00:42:52,173 --> 00:42:54,853 what is it, seven years later now?
687 00:42:54,943 --> 00:42:56,623 The gold rush that we're observing. 688 00:42:56,653 --> 00:43:00,923 But just before we jump into that, there is a little bit of vocabulary 689 00:43:00,983 --> 00:43:05,653 that one needs to have in order to basically talk or even read 690 00:43:05,663 --> 00:43:07,833 a lot of these papers. Could you 691 00:43:08,753 --> 00:43:11,843 talk us through briefly that vocabulary? 692 00:43:11,873 --> 00:43:16,813 I'm talking about phonetics, syntax, semantics, pragmatics, morphology, that 693 00:43:16,833 --> 00:43:22,743 until I read your book actually made me think mostly of blood tests and semiotics. 694 00:43:23,503 --> 00:43:27,603 Could you give us like the MVP version of what you need to know about these 695 00:43:27,623 --> 00:43:30,203 things to be able to read papers? 696 00:43:30,203 --> 00:43:31,473 Oh, absolutely. 697 00:43:31,573 --> 00:43:34,653 Matt has been learning a lot of this too, he might be better at it than me. 698 00:43:34,653 --> 00:43:36,303 I will throw other jargon into it. 699 00:43:36,623 --> 00:43:40,503 Writing this book with Chris over the last year has been mind-opening for me. 700 00:43:40,563 --> 00:43:43,753 Until you can understand these words, like you were saying, it's really 701 00:43:43,753 --> 00:43:48,833 hard to dive into the deep end, but we go over it in our book just because 702 00:43:48,923 --> 00:43:53,993 we do find it so valuable. It really helped me understand very quickly: 703 00:43:54,053 --> 00:43:56,303 "Oh, this is what my LLMs are good at. 704 00:43:56,303 --> 00:43:59,523 This is what LLMs are not", and that was one of the first things we started with. 705 00:43:59,553 --> 00:44:04,998 But the first one, semantics, that is just like the structure of words, how things 706 00:44:04,998 --> 00:44:07,348 go, whether or not it sounds correct. 707 00:44:07,838 --> 00:44:09,528 That is what LLMs are really good at.
708 00:44:09,538 --> 00:44:14,098 They're really good at making sure like the semantics of words align really well. 709 00:44:14,148 --> 00:44:19,378 But after that, you've got pragmatics, which is what LLMs have no idea about. 710 00:44:19,428 --> 00:44:22,618 That is all the information around 711 00:44:23,083 --> 00:44:24,663 that isn't said, right? 712 00:44:24,663 --> 00:44:30,713 So when you say "I'm going to find the eggs the Easter Bunny left", right? 713 00:44:30,733 --> 00:44:33,653 You have to understand what Easter is, what the Easter 714 00:44:33,653 --> 00:44:35,833 Bunny is, why a bunny has eggs. 715 00:44:35,873 --> 00:44:38,563 There's a lot of context around it that you have to understand, 716 00:44:38,983 --> 00:44:40,823 and that's all pragmatics. 717 00:44:40,903 --> 00:44:42,553 It's information that isn't said. 718 00:44:43,673 --> 00:44:45,273 And that's what LLMs generally lack. 719 00:44:45,353 --> 00:44:47,223 Actually, I'm gonna, I'm gonna jump in here real quick. 720 00:44:47,293 --> 00:44:51,043 Miko, did you like the Veľká noc example that I gave in there? 721 00:44:51,480 --> 00:44:53,020 Yeah, I thought it was 722 00:44:53,043 --> 00:44:53,443 Yeah. 723 00:44:53,563 --> 00:44:54,393 Was that pretty good? 724 00:44:55,663 --> 00:44:59,653 I just wanted to ask because I remember experiencing that in Slovakia. 725 00:44:59,663 --> 00:45:05,293 Like I lived there for years and that was a hugely beneficial portion to me 726 00:45:05,373 --> 00:45:11,126 to help figure out that 'no, tons of people have tons of ways of looking at 727 00:45:11,126 --> 00:45:12,903 things', and LLMs don't know about it. 728 00:45:13,288 --> 00:45:16,628 You would have to explain every bit of it to them in order to get them 729 00:45:16,628 --> 00:45:18,038 to understand the same things as you. 730 00:45:18,368 --> 00:45:19,268 Anyway, sorry, Matt.
731 00:45:20,185 --> 00:45:24,855 I find like those two words in general, semantics and pragmatics, understanding 732 00:45:24,855 --> 00:45:29,715 those is going to get you significantly farther and just understanding 733 00:45:29,715 --> 00:45:31,475 how LLMs work, what they're doing. 734 00:45:31,985 --> 00:45:34,955 there's obviously a lot of other words that we talk about, 735 00:45:35,005 --> 00:45:36,305 like morphology and stuff. 736 00:45:36,305 --> 00:45:39,965 And I'll hand it off to Chris to talk about what he wants to add to there. 737 00:45:40,558 --> 00:45:41,578 I would agree with Matt. 738 00:45:41,858 --> 00:45:45,518 Just understanding semantics and pragmatics would get you probably 60% 739 00:45:45,518 --> 00:45:49,718 of the way there, and you could read new papers that come out and immediately 740 00:45:49,718 --> 00:45:52,288 see like where are they amazing? 741 00:45:52,368 --> 00:45:53,538 Where are they failing? 742 00:45:53,608 --> 00:45:58,778 I end up using The relationship between those two, just the literal 743 00:45:58,818 --> 00:46:00,628 encoded meaning of your words. 744 00:46:00,678 --> 00:46:04,988 if I say, "I'm married to my ex-wife", there's immediately, 745 00:46:05,018 --> 00:46:06,588 boom, semantic problem there. 746 00:46:06,898 --> 00:46:08,608 How can I be married to my ex-wife? 747 00:46:09,258 --> 00:46:11,058 The words don't agree with each other. 748 00:46:12,268 --> 00:46:16,048 Versus, exactly as Matt was saying, if we talk about Easter, if we talk about 749 00:46:16,048 --> 00:46:20,288 traditions, if we talk about rituals that people have, just like the stuff 750 00:46:20,288 --> 00:46:24,548 that you say, if you ask someone in Slovakia, they're going to respond to you. 751 00:46:24,958 --> 00:46:25,818 That's normal. 752 00:46:25,988 --> 00:46:27,508 it's a question, they respond. 
753 00:46:27,628 --> 00:46:34,298 LLMs don't have that, and you have to have them ingest tons and tons of data in order 754 00:46:34,298 --> 00:46:37,538 to even get as far as giving a response. 755 00:46:38,158 --> 00:46:42,138 The other ones that we can think about: syntax. I would say 756 00:46:42,138 --> 00:46:44,238 that syntax is largely solved 757 00:46:44,438 --> 00:46:49,868 at this point. Syntax is your structure around the words, like what order do 758 00:46:49,868 --> 00:46:51,838 the words go in for them to be correct? 759 00:46:52,058 --> 00:46:56,818 Is it 'I go to the store' or is it 'I to the store go' or all of that stuff. 760 00:46:57,238 --> 00:46:58,178 That's syntax. 761 00:46:58,178 --> 00:47:02,138 It's the structure that holds your sentences, your utterances together. 762 00:47:02,138 --> 00:47:07,218 Morphology is delving into something that I consider to be very important in LLMs. 763 00:47:08,043 --> 00:47:11,053 I'm not going to say the most important, cause I think that's still semantics. 764 00:47:11,113 --> 00:47:12,283 There's a lot of work there. 765 00:47:14,053 --> 00:47:17,603 But morphology would be how words are built. 766 00:47:18,223 --> 00:47:21,703 What are the fundamental units of meaning, the morphemes? Do those 767 00:47:21,703 --> 00:47:23,633 even exist? That sort of stuff. 768 00:47:23,703 --> 00:47:25,963 And we don't have to delve really deep into that. 769 00:47:25,973 --> 00:47:30,313 That's largely solved by tokenization, but we can see 770 00:47:30,363 --> 00:47:33,523 with newer models that come out that it really matters. 771 00:47:33,533 --> 00:47:39,583 You have much smaller models that have more novel tokenization, more novel 772 00:47:39,613 --> 00:47:43,503 morphology that end up outperforming larger models on tasks that they 773 00:47:43,503 --> 00:47:45,103 didn't even train on all that much. 774 00:47:45,793 --> 00:47:47,843 If we can put it all together really quick:
775 00:47:48,153 --> 00:47:50,093 The model solves syntax. 776 00:47:50,793 --> 00:47:55,533 Embeddings try to solve semantics, but semantics is difficult, 777 00:47:55,643 --> 00:47:57,023 and so they're not perfect. 778 00:47:57,393 --> 00:48:01,743 Pragmatics is stuff like RAG, your Retrieval Augmented Generation, and 779 00:48:02,163 --> 00:48:07,343 having repeated sequences within your training data, it gives it landmarks, it's 780 00:48:07,343 --> 00:48:09,933 context around the syntax and semantics. 781 00:48:10,203 --> 00:48:16,068 Morphology is your tokenization, which, if I would Give that an example, your 782 00:48:16,068 --> 00:48:22,258 tokenization provides your model with stuff that it sees, it changes from text 783 00:48:22,268 --> 00:48:24,458 into what does the model actually see. 784 00:48:25,648 --> 00:48:28,658 And, your embedding strategy is moot if you don't have it. 785 00:48:28,658 --> 00:48:31,968 Just your morphology gives your model glasses, if you want to call it that. 786 00:48:32,448 --> 00:48:34,998 And then phonetics is the one that we haven't even talked about. 787 00:48:35,338 --> 00:48:41,188 Phonetics is the reason why we are doing a podcast and we're talking instead of just 788 00:48:41,188 --> 00:48:42,988 texting each other or emailing each other. 789 00:48:42,988 --> 00:48:46,388 Can you imagine trying to ingest a podcast that's just emails? 790 00:48:46,998 --> 00:48:47,978 It's horrendous. 791 00:48:48,643 --> 00:48:54,563 And it's because there's so much richness and depth in meaning in the language that 792 00:48:54,573 --> 00:48:59,143 is just lost when you strip it of its phonetic, I'm going to call it a medium. 793 00:48:59,673 --> 00:49:04,133 And that can lead people to think that it has to do with sound, that's the 794 00:49:04,133 --> 00:49:08,513 most common modality for people, but sign language has phonetics, they have 795 00:49:08,513 --> 00:49:11,323 particular places where they, make signs. 
796 00:49:11,343 --> 00:49:15,053 They have particular ways that they do them to inflect and express more emotion. 797 00:49:15,413 --> 00:49:19,563 Their phonetics exists even outside of the verbal modality. 798 00:49:20,233 --> 00:49:25,033 that's important because that's where I see the most improvements coming to LLMs 799 00:49:25,073 --> 00:49:27,623 in the future is being able to process. 800 00:49:28,498 --> 00:49:32,708 phonetic information without having to convert it into text 801 00:49:32,808 --> 00:49:36,288 or process phonetic information and compare it against the text. 802 00:49:36,288 --> 00:49:39,068 that can be incredibly helpful for your model's understanding. 803 00:49:39,068 --> 00:49:43,078 those are the five features of language that we break 804 00:49:43,078 --> 00:49:44,578 things down into in the book. 805 00:49:44,598 --> 00:49:46,388 And they're largely agreed upon. 806 00:49:46,388 --> 00:49:49,738 There are some other linguistic features that are incredibly important, stuff like 807 00:49:49,738 --> 00:49:52,488 dialogue, that we haven't even covered. 808 00:49:52,988 --> 00:49:53,748 beyond that. 809 00:49:53,813 --> 00:49:55,883 Yeah, we can talk about semiotics too. 810 00:49:55,923 --> 00:50:01,933 That's, Charles Sanders Peirce, smart dude from the 1800s just created, a lot 811 00:50:01,933 --> 00:50:07,253 of structure and organizations we dive into that very lightly in the book. 812 00:50:07,303 --> 00:50:11,593 I don't think that you need a grounding in semiotics in order to improve 813 00:50:11,593 --> 00:50:13,323 your ability to interact with LLMs. 814 00:50:13,878 --> 00:50:18,178 But it is helpful for organizing all of these other concepts. 815 00:50:18,218 --> 00:50:22,668 how do we create a mental map for how stuff needs to be processed 816 00:50:22,668 --> 00:50:24,308 within a machine learning pipeline? 
817 00:50:24,668 --> 00:50:28,648 How do we make sure that we're not mixing things up and inadvertently destroying 818 00:50:28,688 --> 00:50:30,738 our model's ability to see things, right? 819 00:50:30,758 --> 00:50:36,248 If we put embeddings before tokenization, it breaks your process. 820 00:50:36,468 --> 00:50:40,458 it's helpful for organizing things and it's also helpful for understanding 821 00:50:40,468 --> 00:50:45,028 how conversation happens and how I say something and it moves through 822 00:50:45,028 --> 00:50:47,078 your mind to create an interpretation. 823 00:50:47,078 --> 00:50:50,788 that's by far like the most theoretical out there concept that 824 00:50:50,788 --> 00:50:52,098 we get into in the whole book. 825 00:50:53,720 --> 00:50:59,440 And together you came up with this language definition as being, as a 826 00:50:59,450 --> 00:51:05,760 concept, "an abstraction of feelings and thoughts that occur to us in our heads". 827 00:51:05,840 --> 00:51:08,750 And I'll be honest, I initially thought it sucked. 828 00:51:09,165 --> 00:51:12,525 because it's a little bit, it's a little bit wishy washy. 829 00:51:13,145 --> 00:51:14,835 I wanted something a bit more concrete. 830 00:51:14,855 --> 00:51:18,455 But then, as I looked up all the other definitions in different contexts, I 831 00:51:18,455 --> 00:51:23,795 was like, Okay, I can clearly not come up with anything better than that. 832 00:51:23,795 --> 00:51:27,555 So I think I'm ready to yield now and say that this is actually 833 00:51:27,555 --> 00:51:29,275 capturing it pretty well. 834 00:51:29,985 --> 00:51:35,060 Putting abstraction in it, sounds also vaguely techie, so that helps. 835 00:51:35,610 --> 00:51:37,590 How did you come up with that definition? 836 00:51:38,868 --> 00:51:39,388 I didn't. 837 00:51:39,728 --> 00:51:41,708 I would love to take credit for that. 
838 00:51:41,718 --> 00:51:45,718 No, that definition has been around for a long time within the linguistic 839 00:51:45,728 --> 00:51:51,298 community, and one of the best examples of why it really works is babies, right? 840 00:51:51,678 --> 00:51:56,708 Babies have no idea how to express their thoughts, but somehow they get it across. 841 00:51:56,988 --> 00:52:03,128 when a baby is happy, we can tell when a baby is crying, we can infer that 842 00:52:03,128 --> 00:52:07,508 it needs something, babies are able to communicate without language, meaning 843 00:52:07,508 --> 00:52:13,008 that language is something that we created to shorten the conversation. 844 00:52:13,348 --> 00:52:17,118 The reason I called it an abstraction is we have abstract ideas. 845 00:52:17,128 --> 00:52:22,478 You probably come up to a situation where you're feeling something, and you don't 846 00:52:22,488 --> 00:52:24,718 know the words to really express it. 847 00:52:24,968 --> 00:52:29,908 I think that's a pretty universal human adult thing that has happened 848 00:52:29,908 --> 00:52:30,868 at least once in your life. 849 00:52:30,868 --> 00:52:34,768 That's happened to me a bunch of times, and it really illustrates that 850 00:52:34,768 --> 00:52:40,118 "Oh man, the language that we use is actually describing "what's in 851 00:52:40,118 --> 00:52:42,168 here", it isn't "what isn't here". 852 00:52:42,738 --> 00:52:43,938 it's a hard concept. 853 00:52:44,368 --> 00:52:48,748 Once you get there though, it really helps with LLMs, because you realize that the 854 00:52:48,748 --> 00:52:50,588 language that we're using is a crutch. 855 00:52:51,128 --> 00:52:54,458 And that's all that the LLMs have in the first place. 856 00:52:54,958 --> 00:52:57,828 And so this is another thing that goes towards the miraculous 857 00:52:57,828 --> 00:52:59,338 nature of them working at all. 858 00:52:59,818 --> 00:53:04,698 Is they're dealing with an abstraction of an abstraction at least. 
859 00:53:04,858 --> 00:53:06,938 In order to communicate with us. 860 00:53:06,988 --> 00:53:08,918 So let's say that I buy that. 861 00:53:09,018 --> 00:53:13,948 my first question, would be going back to your baby example, isn't what the 862 00:53:13,948 --> 00:53:17,148 baby's doing some form of a language? 863 00:53:17,408 --> 00:53:18,298 what's the line 864 00:53:18,345 --> 00:53:19,005 I'd like it to 865 00:53:19,068 --> 00:53:19,958 what is and what 866 00:53:20,058 --> 00:53:20,398 isn't? 867 00:53:20,398 --> 00:53:23,258 what's the line between, a language and communication? 868 00:53:23,858 --> 00:53:24,608 I like that. 869 00:53:24,618 --> 00:53:27,678 That's a question that a lot of people I bet have and It'll 870 00:53:27,678 --> 00:53:28,448 probably go in the appendix. 871 00:53:28,898 --> 00:53:32,478 We'll probably talk about this in an appendix for curious readers so the line 872 00:53:32,478 --> 00:53:36,948 between just straight up communication and a language is the ability to talk. 873 00:53:37,008 --> 00:53:40,558 there, there are a lot, but one of my favorite ones is the ability to talk about 874 00:53:40,558 --> 00:53:42,728 something that is not physically present. 875 00:53:42,968 --> 00:53:44,578 bees have communication. 876 00:53:44,798 --> 00:53:46,588 gibbons have communication. 877 00:53:46,658 --> 00:53:48,058 Babies have communication. 878 00:53:48,498 --> 00:53:53,448 Babies, though, are unable to express any ideas about stuff that is not 879 00:53:53,698 --> 00:53:58,588 physically present, you can't talk to a baby about theoretical physics. 880 00:53:58,718 --> 00:54:00,888 I mean you can, but what are you gonna get back? 881 00:54:01,798 --> 00:54:06,988 You can talk to a baby about my Star Wars posters, right? 
882 00:54:07,008 --> 00:54:11,118 I can point at them because they're right there, but if I'm in a different 883 00:54:11,118 --> 00:54:15,198 room, baby's not gonna be able to talk to me about them And that's 884 00:54:15,198 --> 00:54:16,778 the difference, It's one of them. 885 00:54:16,808 --> 00:54:19,928 That's the one that I'd like to highlight though is that the fact that 886 00:54:20,558 --> 00:54:26,118 we can speak about things that are not physically right here with us, that we 887 00:54:26,118 --> 00:54:30,158 can point at, that's the distinction between communication and language, 888 00:54:30,178 --> 00:54:31,598 because babies are communicating. 889 00:54:33,758 --> 00:54:37,528 But once they get to that point, it really deepens the interaction 890 00:54:37,528 --> 00:54:39,128 that you're able to have with them. 891 00:54:39,178 --> 00:54:44,628 So now, equipped, with all that knowledge, I'm gonna try to prompt 892 00:54:44,648 --> 00:54:47,438 engineer you and give you this prompt. 893 00:54:47,528 --> 00:54:53,028 I'm a five year old baby, that has language now, and who's very curious 894 00:54:53,068 --> 00:55:00,733 about understanding how we got from bag of words, counting frequencies all the way to 895 00:55:00,833 --> 00:55:07,653 LLMs and ChatGPT and people worrying about the Terminator actually coming into life. 896 00:55:08,603 --> 00:55:12,723 Could you walk me through the high level ideas that were important, 897 00:55:13,603 --> 00:55:15,513 build up to what we're seeing today. 898 00:55:16,956 --> 00:55:22,686 The bag of words is really easy to think about, especially if you keep 899 00:55:22,686 --> 00:55:24,886 your tokenization incredibly easy. 900 00:55:25,226 --> 00:55:27,996 Sorry, this is, I'm already out of five year old territory. 901 00:55:28,936 --> 00:55:31,156 You just count words. 902 00:55:32,626 --> 00:55:35,556 If I take that sentence, "you ; just ; count ; words". 
903 00:55:35,596 --> 00:55:37,166 Each of those has a count of one. 904 00:55:37,976 --> 00:55:41,496 If I add another sentence, "I like Star Wars". 905 00:55:41,746 --> 00:55:44,416 All of those still have a count of just one word. 906 00:55:44,856 --> 00:55:48,036 And then if I add another, "do you like Star Wars?" 907 00:55:48,386 --> 00:55:51,346 You and star and wars all go up to two. 908 00:55:52,976 --> 00:55:53,456 That's it. 909 00:55:53,456 --> 00:55:55,156 That's a bag of words model. 910 00:55:56,513 --> 00:55:57,763 Why is it important? 911 00:55:57,813 --> 00:55:58,373 What can it 912 00:55:58,383 --> 00:55:58,693 do? 913 00:56:00,506 --> 00:56:06,366 I think that bag of words is the first model that we really have 914 00:56:06,556 --> 00:56:08,706 to explain being data-driven. 915 00:56:09,321 --> 00:56:10,961 It's just keeping track of things. 916 00:56:11,051 --> 00:56:16,551 If you look at a bag of words model for your workouts, it's just how 917 00:56:16,561 --> 00:56:18,511 often do you do certain things? 918 00:56:18,551 --> 00:56:22,941 How often are you doing a bicep workout versus doing a pectoral workout? 919 00:56:22,961 --> 00:56:25,421 How often are you doing which thing? 920 00:56:25,521 --> 00:56:27,091 It's just being data-driven. 921 00:56:27,181 --> 00:56:29,371 It's the first step, right? 922 00:56:29,751 --> 00:56:31,461 You're not looking at any features. 923 00:56:31,551 --> 00:56:35,071 You're not really caring about how these things interact with each other. 924 00:56:35,151 --> 00:56:36,301 You're just keeping track. 925 00:56:37,738 --> 00:56:41,508 So I guess with that information from your example, I can guess whether 926 00:56:41,508 --> 00:56:46,298 you are skipping leg days, and I can see what's important to you. 927 00:56:47,008 --> 00:56:51,238 Or, if I'm counting words in U. 928 00:56:51,238 --> 00:56:51,408 S.
929 00:56:51,408 --> 00:56:57,318 presidents' speeches, I can say, like you described in your book, whether it's a 930 00:56:57,318 --> 00:57:02,228 wartime or a peacetime president, and what they really try to get across. 931 00:57:02,781 --> 00:57:08,641 This is something that you can use for anything. If you count in soccer 932 00:57:09,261 --> 00:57:13,501 which players make goals and how often, that is a bag of words model. 933 00:57:13,501 --> 00:57:14,681 You're not tracking words. 934 00:57:14,801 --> 00:57:18,021 It's a bag of goals or it's a bag of whatever else. 935 00:57:20,028 --> 00:57:21,708 So what's the next step from there? 936 00:57:21,708 --> 00:57:26,308 Bag of words was really monumental just because it's so simple, but it's so 937 00:57:26,308 --> 00:57:30,568 powerful because, you know, the words you use when you're describing sports are very different 938 00:57:30,568 --> 00:57:35,508 from the words you use describing politics. And so just picking up on certain words 939 00:57:35,508 --> 00:57:40,728 and their counts helps us understand the overall subject of what it is. 940 00:57:40,758 --> 00:57:45,748 But it really lacked any sort of structure, because the order 941 00:57:45,748 --> 00:57:47,828 of words also matters, right? 942 00:57:47,828 --> 00:57:54,298 So 'the cat in the hat' versus 'the cat's hat', they both have the word 'cat', they 943 00:57:54,298 --> 00:57:58,373 both have 'hat', but mean different things because of the order of the words, and 944 00:57:58,373 --> 00:58:01,613 so that kind of led to n-gram models. 945 00:58:01,663 --> 00:58:06,383 Instead of just simple words, we would also take n-grams, which are 946 00:58:06,383 --> 00:58:11,533 n number of words in a certain order, and we would start cataloging those. 947 00:58:11,583 --> 00:58:14,973 And so, more than just words, we're getting n-grams.
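The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the helper names (tokenize, bag_of_words, ngrams) and the very naive tokenization are assumptions made for the example. It counts single words first, then pairs of adjacent words (bigrams, the n = 2 case of n-grams).

```python
from collections import Counter


def tokenize(text):
    # Hypothetical, deliberately naive tokenizer: lowercase, split on
    # whitespace, and strip a few punctuation marks from each word.
    words = (w.strip('?.,!"').lower() for w in text.split())
    return [w for w in words if w]


def bag_of_words(sentences):
    # A bag of words is just a tally of how often each token occurs.
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


def ngrams(tokens, n):
    # Slide a window of n tokens over the sequence, keeping word order.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentences = ["You just count words", "I like Star Wars", "Do you like Star Wars?"]

bow = bag_of_words(sentences)
print(bow["you"], bow["star"], bow["wars"], bow["just"])

bigram_counts = Counter()
for sentence in sentences:
    bigram_counts.update(ngrams(tokenize(sentence), 2))
print(bigram_counts[("star", "wars")])
```

With this particular tokenization, "like" also reaches a count of two, alongside "you", "star" and "wars". The bigram tally keeps the same counting idea but preserves local word order, which is the extra bit of syntax the n-gram step adds.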
948 00:58:15,423 --> 00:58:20,473 And that is improving our understanding of the language because now we 949 00:58:20,633 --> 00:58:22,573 have embedded some syntax in it. 950 00:58:22,573 --> 00:58:29,173 We understand some ordering of words and that's able to improve our categorization. 951 00:58:30,343 --> 00:58:35,983 however, from there though, we're not really able to make any predictions 952 00:58:35,983 --> 00:58:39,423 of what next words about to come up or anything like that, when it comes 953 00:58:39,423 --> 00:58:42,653 to bag of words or n-grams they're really more for categorization. 954 00:58:43,253 --> 00:58:45,873 And so that kind of led to Bayesian techniques 955 00:58:47,453 --> 00:58:50,183 and so not to really go deeply 956 00:58:50,183 --> 00:58:52,743 into Bayesian statistics, but 957 00:58:52,816 --> 00:58:53,046 Yeah. 958 00:58:53,046 --> 00:58:53,836 I'm sorry. 959 00:58:53,846 --> 00:58:55,766 Sorry to all Bayesian fanboys. 960 00:58:55,766 --> 00:58:59,506 We're going to go about as deep into this as we did to pragmatics. 961 00:59:00,263 --> 00:59:05,243 it's just you know, based off of the priors of the words that came before we 962 00:59:05,243 --> 00:59:11,343 can then predict the next word to come up and so if every single time after 963 00:59:11,373 --> 00:59:16,963 in text we saw 'I am a man' then it's going to predict that the next word is 964 00:59:16,963 --> 00:59:22,163 man instead of other words that easily could have come up like woman or girl 965 00:59:22,163 --> 00:59:25,123 or boy or cook or professional athlete. 
966 00:59:25,173 --> 00:59:28,513 Certain things that could come up are gonna be a lot rarer, like 'I am an 967 00:59:28,523 --> 00:59:33,473 astronaut'. A lot fewer people have been astronauts in order to say that, so 968 00:59:33,883 --> 00:59:37,693 it's gonna have a very low probability of being the next word predicted, but 969 00:59:37,723 --> 00:59:41,563 it gives us this opportunity to look at what is the next word predicted. 970 00:59:42,123 --> 00:59:47,148 From there, we move on to what's called Markov chains. We're swinging 971 00:59:47,158 --> 00:59:52,588 back towards the n-gram model, but it gives us a bit of prediction next. 972 00:59:53,028 --> 01:00:00,278 I actually really love Markov chains because they provide very fast 973 01:00:00,338 --> 01:00:07,028 predictive text. Markov chains are essentially what's been fueling 974 01:00:07,028 --> 01:00:12,138 the predictive text for Google search and things like that. It has been the 975 01:00:12,138 --> 01:00:15,248 technology that's really been leading that charge for a really long time. 976 01:00:15,298 --> 01:00:20,138 And it's just a very basic way that we're using n-grams now to 977 01:00:20,168 --> 01:00:22,498 make predictions of the future. 978 01:00:22,678 --> 01:00:23,768 You can think about it there, 979 01:00:24,053 --> 01:00:26,543 that is obviously I'm 980 01:00:27,223 --> 01:00:28,073 reducing it. 981 01:00:28,093 --> 01:00:31,743 That's not exactly how it works, but it's a bag of n-grams where you 982 01:00:31,743 --> 01:00:36,713 take a state, at each point in a sequence, and look at all the times 983 01:00:36,723 --> 01:00:40,633 that previous things have occurred in that sequence, and then from that you can 984 01:00:40,643 --> 01:00:42,863 model probability about what comes next. 985 01:00:42,913 --> 01:00:48,863 Instead of just looking at each n-gram by itself, you give it state. 986 01:00:49,663 --> 01:00:51,213 And it's a bag of n-grams.
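As a rough sketch of that idea, a first-order Markov chain over words can be built from nothing more than bigram counts: the "state" is the current word, and the probability of each next word is how often it followed that state in the training text. The tiny corpus and the function names below are made up for illustration, and real predictive-text systems are considerably more involved.

```python
from collections import Counter, defaultdict


def train_bigram_chain(sentences):
    # For every word (the state), tally which words were seen right after it.
    transitions = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            transitions[current][following] += 1
    return transitions


def next_word_probs(transitions, state):
    # Turn the raw follow-up counts for one state into probabilities.
    counts = transitions.get(state, Counter())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Made-up toy corpus: "man" follows "a" more often than the alternatives.
corpus = ["i am a man", "i am a man", "i am a woman", "i am a cook"]

chain = train_bigram_chain(corpus)
probs = next_word_probs(chain, "a")
print(probs)
print(max(probs, key=probs.get))
```

Rarer continuations simply end up with smaller probabilities, and a state that never appeared in the training text yields an empty distribution, which is one of the practical limits of a pure counting approach.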
987 01:00:51,233 --> 01:00:52,053 It's really fun. 988 01:00:52,213 --> 01:00:54,753 It's a probabilistic bag of n-grams. 989 01:00:55,743 --> 01:00:56,893 That's how the chains work. 990 01:00:57,738 --> 01:01:01,758 One of my favorite parts, and I like that you kept track of this quote 991 01:01:01,768 --> 01:01:05,608 here, that Markov models represent the first comprehensive attempt to 992 01:01:05,608 --> 01:01:09,788 actually model language, which is funny, because Markov was not trying 993 01:01:09,788 --> 01:01:13,908 to model language initially, he was just trying to win an argument. 994 01:01:14,578 --> 01:01:18,978 And He eventually used it to, he looked at distributions in 995 01:01:19,018 --> 01:01:20,498 particular Russian authors. 996 01:01:20,528 --> 01:01:25,598 He looked at distributions in, Russian government official speeches. 997 01:01:25,638 --> 01:01:29,988 he knew what he had and he believed in it, and I love that, what a 998 01:01:29,988 --> 01:01:32,418 great piece of history anyway. 999 01:01:32,418 --> 01:01:33,728 continuous bag of words. 1000 01:01:34,368 --> 01:01:39,688 Is where we, start essentially taking the logic of a Markov chain where, 1001 01:01:39,738 --> 01:01:45,828 "oh, if we keep track of where things appear and how often they appear there, 1002 01:01:45,888 --> 01:01:52,348 then it helps us, be able to model for what could appear next", right? 1003 01:01:52,438 --> 01:01:56,903 And this is the first moment where we're really coming full circle all 1004 01:01:56,903 --> 01:02:01,133 together and going right back to bag of words and just adding context 1005 01:02:01,143 --> 01:02:04,103 for position and adding context. 1006 01:02:05,203 --> 01:02:10,563 from the context of the bag of words, the literal counting of things, we're 1007 01:02:10,563 --> 01:02:12,023 able to create embeddings, right? 
1008 01:02:12,073 --> 01:02:15,523 I don't know if a lot of people are aware, but bag of words 1009 01:02:15,533 --> 01:02:17,753 is how Word2vec came to be. 1010 01:02:18,318 --> 01:02:25,798 Word2vec was huge in, I think, 2015, 2016, and it stayed huge, Gensim 1011 01:02:25,808 --> 01:02:29,828 is still one of the most downloaded natural language processing libraries 1012 01:02:29,828 --> 01:02:33,138 in Python for Word2vec and for GloVe. 1013 01:02:33,678 --> 01:02:36,808 Continuous bag of words, just adding that one little thing. 1014 01:02:37,143 --> 01:02:40,773 adds all this context so that we can create embeds. 1015 01:02:41,013 --> 01:02:44,833 We can create vectors that we can compare between words. 1016 01:02:44,923 --> 01:02:50,083 this all comes from the logic of I forgot that dude's name. 1017 01:02:50,813 --> 01:02:55,353 Tell me the company that a word keeps, and I'll tell you what that word means. 1018 01:02:55,353 --> 01:02:57,293 just what's around the word. 1019 01:02:57,603 --> 01:03:01,798 influences its meaning, which goes directly against a lot 1020 01:03:01,798 --> 01:03:05,468 of previous linguists' thought that, syntax and semantics are 1021 01:03:05,498 --> 01:03:07,208 absolutely not related at all. 1022 01:03:07,218 --> 01:03:12,168 That's one of the big things from Chomsky, the colorless green ideas sleep furiously, 1023 01:03:12,308 --> 01:03:14,458 nonsense, there's some semblance to it. 1024 01:03:14,488 --> 01:03:18,608 There's some sense to it and taking advantage of that with continuous 1025 01:03:18,608 --> 01:03:20,748 bag of words, we can create. 1026 01:03:21,113 --> 01:03:23,663 like I said, these vectors that we can then compare, and 1027 01:03:23,663 --> 01:03:25,423 that's really interesting. 1028 01:03:25,433 --> 01:03:30,453 that is what fuels LLMs now, is this exact same continuous bag 1029 01:03:30,453 --> 01:03:31,653 of words modeling technique. 
1030 01:03:32,003 --> 01:03:36,383 It's been built upon a little bit, but that bag of words is still fundamental 1031 01:03:36,383 --> 01:03:38,003 to how embeddings are created. 1032 01:03:38,193 --> 01:03:45,298 Bag of words and positionality and, like we can get into, the RoPE scaling, 1033 01:03:45,308 --> 01:03:51,078 all of these rotational plugins that you can use to get longer sequences 1034 01:03:51,358 --> 01:03:54,148 embedded correctly, or at least better. 1035 01:03:54,568 --> 01:03:57,288 That's one of the hard things when we're talking about language modeling 1036 01:03:57,288 --> 01:03:58,828 is what is good and what is better. 1037 01:03:59,478 --> 01:04:02,388 A lot of people like to appeal to 'this is how humans do it'. 1038 01:04:02,548 --> 01:04:05,428 I don't know if humans are incredibly efficient when we do it, but. 1039 01:04:06,258 --> 01:04:07,978 Like it's fine. 1040 01:04:08,088 --> 01:04:11,258 Then we get into the 1960s, the very first 1041 01:04:11,748 --> 01:04:12,598 perceptrons, 1042 01:04:12,825 --> 01:04:16,565 Before we go there, can we spend a little longer on what 1043 01:04:16,565 --> 01:04:18,335 the embeddings actually are? 1044 01:04:18,375 --> 01:04:23,065 You mentioned Word2vec, you mentioned the word vectors and embeddings, but for 1045 01:04:23,065 --> 01:04:27,905 somebody listening to us from the start, that's probably not clear what that is. 1046 01:04:27,925 --> 01:04:29,125 Can we delve a little bit? 1047 01:04:30,158 --> 01:04:31,218 Yeah, absolutely. 1048 01:04:31,218 --> 01:04:36,088 So embeddings are the vectors that come out of models like 1049 01:04:36,098 --> 01:04:37,498 continuous bag of words. 1050 01:04:37,938 --> 01:04:42,618 When you look at a modern machine learning pipeline, there are multiple models that 1051 01:04:42,618 --> 01:04:46,628 you go through and we just abstract all of it and call it a model, just one model.
1052 01:04:46,638 --> 01:04:54,718 When you look at GPT-3, ChatGPT, it has a model that they call it, a byte pair 1053 01:04:54,748 --> 01:04:56,938 encoding model to do its tokenization. 1054 01:04:56,968 --> 01:04:59,598 And then it has a model to do embeddings. 1055 01:05:00,228 --> 01:05:03,998 that model is fundamentally a continuous bag of words. 1056 01:05:03,998 --> 01:05:07,448 It's built on top of it a little bit with, like I said, keeping track. 1057 01:05:07,753 --> 01:05:11,573 Not just how many times a word occurs, but how many times a word 1058 01:05:11,573 --> 01:05:13,653 occurs in particular positions. 1059 01:05:13,653 --> 01:05:19,053 and then on top of that, it keeps track of the, flip. 1060 01:05:19,053 --> 01:05:24,833 It's either an odd or an even position within a sentence and it assigns 1061 01:05:24,833 --> 01:05:29,243 it cosine or sine based on whether it's an odd or an even position. 1062 01:05:29,243 --> 01:05:35,043 in order to try to insert some of that meaning back into it, that was taken out 1063 01:05:35,173 --> 01:05:41,243 from the tokenization, cause tokenization is just assign each token a number in 1064 01:05:41,243 --> 01:05:45,863 a dictionary, and you have a way to get all words into that dictionary, and 1065 01:05:45,863 --> 01:05:47,373 then come back out of that dictionary. 1066 01:05:47,383 --> 01:05:49,223 So it takes all of the meaning out of it. 1067 01:05:49,233 --> 01:05:50,353 It's just one number. 1068 01:05:51,333 --> 01:05:56,153 The embeddings attempt to put some of the meaning back into it using positionality, 1069 01:05:56,163 --> 01:05:58,743 using continuous language modeling 1070 01:05:58,743 --> 01:05:59,373 techniques. 
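The two steps described here, mapping tokens to dictionary numbers and then putting position information back in with sine and cosine, can be sketched as follows. This is an illustrative assumption rather than what any particular GPT model does: the vocabulary is a plain whitespace word list instead of byte pair encoding, and the position signal follows the sinusoidal scheme from the original Transformer paper, where sine and cosine alternate across the even and odd dimensions of the vector for every position. All function names are hypothetical.

```python
import math


def build_vocab(tokens):
    # Tokenization as a dictionary: each distinct token gets one integer id.
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab):
    return [vocab[token] for token in tokens]


def decode(ids, vocab):
    # Invert the dictionary to get from numbers back to tokens.
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token[idx] for idx in ids]


def positional_encoding(position, d_model):
    # Sinusoidal position signal: even dimensions use sine, odd dimensions
    # use cosine, with a wavelength that grows with the dimension index.
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


tokens = "i like star wars".split()
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(ids)
print(decode(ids, vocab))
print(positional_encoding(0, 4))
print(positional_encoding(1, 4))
```

The ids on their own carry no meaning, which is the point made above. In the Transformer design, a position vector like this is added to each token's embedding so the model can tell where in the sequence the token sits.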
1071 01:06:00,143 --> 01:06:05,813 Embeddings, really simply, they're not perfect, they're just an approximation 1072 01:06:05,863 --> 01:06:11,588 of that meaning, and because we are able to put it into a vectorized 1073 01:06:11,598 --> 01:06:14,508 space, we're able to take these words, put them in a vectorized space. 1074 01:06:14,508 --> 01:06:20,088 We can start doing things that start to make sense and start to make us feel 1075 01:06:20,088 --> 01:06:21,918 like we're headed in the right direction. 1076 01:06:21,968 --> 01:06:25,698 The classic example is, when we first discovered embeddings, we 1077 01:06:25,698 --> 01:06:29,748 took the embedding of 'king', we subtracted 'man' from it. 1078 01:06:30,328 --> 01:06:35,218 We then added the embedding of 'woman' and the closest 1079 01:06:35,528 --> 01:06:41,088 embedding to that was 'queen'. With that, we start to get this vectorized 1080 01:06:41,098 --> 01:06:42,598 space that starts to make sense. 1081 01:06:42,598 --> 01:06:46,408 These words start to have connection to each other and they start 1082 01:06:46,408 --> 01:06:49,328 to make semantic sense to us as humans. 1083 01:06:50,558 --> 01:06:52,698 However, embeddings are still an approximation, right? 1084 01:06:52,698 --> 01:06:56,578 So if you were to do that with kind of every combination, it's interesting, 1085 01:06:56,588 --> 01:07:02,078 what do you get when you start taking words that don't necessarily make any 1086 01:07:02,078 --> 01:07:04,548 sense, like adding or subtracting them together. 1087 01:07:04,548 --> 01:07:05,068 What do you get? 1088 01:07:05,171 --> 01:07:07,991 A good quintessential example of that is you take the vector for 1089 01:07:07,991 --> 01:07:11,451 'king', you subtract the vector for 'wolf', and you add the 1090 01:07:11,451 --> 01:07:14,291 vector for 'prince', and you get the vector for 'village'. 1091 01:07:14,871 --> 01:07:16,111 Or at least pretty close to it.
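The 'king' minus 'man' plus 'woman' arithmetic can be tried on a toy example. The three-dimensional vectors below are invented by hand purely so the arithmetic is easy to follow; real embeddings are learned from data, have hundreds of dimensions, and only approximate this behaviour. The helper names are hypothetical.

```python
import math

# Hand-made toy vectors (NOT learned embeddings). The three dimensions can be
# read loosely as "royalty", "male" and "female".
toy_embeddings = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(target, embeddings, exclude=()):
    # Return the word whose vector points most nearly the same way as target.
    candidates = [word for word in embeddings if word not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


king, man, woman = (toy_embeddings[w] for w in ("king", "man", "woman"))
target = [k - m + w for k, m, w in zip(king, man, woman)]

print(nearest(target, toy_embeddings, exclude=("king", "man", "woman")))
```

Excluding the three input words from the search is a common convention in these analogy demonstrations, because the nearest vector to the result is otherwise often one of the inputs. With learned embeddings the same arithmetic on unrelated words gives results that do not mean much, which is the approximation caveat raised above.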
1092 01:07:16,601 --> 01:07:17,791 That doesn't make any sense. 1093 01:07:17,951 --> 01:07:22,831 There's still lots of, okay, these are starting to add meaning, not 1094 01:07:22,831 --> 01:07:27,631 always, but sometimes. It's an approximation, and embeddings ultimately, 1095 01:07:28,206 --> 01:07:30,806 it's something we're constantly trying to learn and improve. 1096 01:07:30,983 --> 01:07:36,508 If your listeners are wondering how to keep up in this space, embeddings are 1097 01:07:36,508 --> 01:07:41,588 probably the number one thing to keep track of. OpenAI recently released logic 1098 01:07:41,598 --> 01:07:46,128 for being able to change the size of embeddings. To me, being pretty 1099 01:07:46,128 --> 01:07:47,868 deep into this, it feels groundbreaking. 1100 01:07:48,438 --> 01:07:52,688 Because normally you have to structure these vectors so that they're all the same 1101 01:07:52,688 --> 01:07:59,318 size and each point within that vector represents meaning, negative or positive, 1102 01:07:59,348 --> 01:08:05,148 and it's very structured and not malleable. And so the idea that you could take 1103 01:08:05,278 --> 01:08:10,343 all of your embedding space and change the size of it at your whim is just amazing. 1104 01:08:10,843 --> 01:08:14,723 That's one of the things that I see as a huge groundbreaking piece of technology 1105 01:08:14,773 --> 01:08:17,113 that OpenAI is continuing to lead in. 1106 01:08:17,163 --> 01:08:20,563 Yeah, and if you're ever in doubt for, oh man, is this paper important? 1107 01:08:20,903 --> 01:08:25,053 If it's about embeddings and doing really cool things with embeddings, probably. 1108 01:08:27,033 --> 01:08:31,253 I think the one question for anybody to like picture that, so what's 1109 01:08:31,253 --> 01:08:33,953 the dimension of all these vectors? 1110 01:08:33,983 --> 01:08:36,173 Is that the entire vocabulary? 1111 01:08:36,943 --> 01:08:39,253 Are there different techniques?
1112 01:08:39,253 --> 01:08:45,003 Yeah, currently the number one dimensionality that is an 1113 01:08:45,023 --> 01:08:47,613 unspoken industry standard is 768. 1114 01:08:47,623 --> 01:08:50,873 That's a number that pretty much every NLP practitioner knows. 1115 01:08:51,223 --> 01:08:55,543 Like the reason OpenAI's embeddings initially were like really cool 1116 01:08:55,663 --> 01:08:59,193 and they thought they were super dense is they were, what, 536, or 1117 01:08:59,193 --> 01:09:04,653 1536, which is 768 doubled, right? 1118 01:09:04,663 --> 01:09:08,463 You're gonna see multiples of 768 all over the place here. 1119 01:09:09,323 --> 01:09:13,573 And that's not because that number is super significant, that's just 1120 01:09:13,593 --> 01:09:17,693 the first embedding space that we found that tended to work better 1121 01:09:17,713 --> 01:09:18,553 than the others. 1122 01:09:19,375 --> 01:09:19,975 So that's the 1123 01:09:19,975 --> 01:09:23,365 more art than science part of this 1124 01:09:24,070 --> 01:09:24,290 for 1125 01:09:24,403 --> 01:09:26,093 It's the brute force testing. 1126 01:09:26,173 --> 01:09:33,613 Yeah, before going through and testing, 767, 766, 765 and landed on 1127 01:09:33,613 --> 01:09:37,763 that one and it worked, that's the best one that we've found so far. 1128 01:09:38,243 --> 01:09:42,993 Even the doubled embeddings from OpenAI offer a marginal improvement 1129 01:09:43,003 --> 01:09:44,473 in that understanding space. 1130 01:09:45,230 --> 01:09:49,950 I think we can move on to the multilayer perceptrons. 1131 01:09:51,088 --> 01:09:51,428 Okay. 1132 01:09:51,958 --> 01:09:56,038 A perceptron is essentially just a linear transformation of data.
1133 01:09:56,078 --> 01:10:02,098 If you look at it from a statistical standpoint, if you have three things 1134 01:10:02,178 --> 01:10:07,723 about something, you can just add those things together and you get 1135 01:10:08,003 --> 01:10:10,333 a description of that thing, right? 1136 01:10:10,333 --> 01:10:17,163 Just summing them, and that's like abstracting it a little bit much, 1137 01:10:17,453 --> 01:10:20,303 especially if machine learning practitioners are listening to that. 1138 01:10:20,303 --> 01:10:23,843 Like we can do linear transformations. 1139 01:10:24,413 --> 01:10:28,563 The easiest way to think about it for me is you perform one 1140 01:10:28,903 --> 01:10:33,053 action on a group of features and you get something out of it. 1141 01:10:33,473 --> 01:10:35,263 That's not, by itself, 1142 01:10:36,413 --> 01:10:37,413 really helpful. 1143 01:10:37,463 --> 01:10:43,073 Once you get into having multiple layers of the, this is the MLP, the 1144 01:10:43,073 --> 01:10:47,213 multilayer perceptron, once you get into multiple layers where you are 1145 01:10:47,543 --> 01:10:51,753 adding these transformations together, and in between those layers you have 1146 01:10:51,763 --> 01:10:56,773 nonlinear activation functions so that you can create 1147 01:10:56,803 --> 01:11:02,463 nonlinear relationships between sets of linear transformations. 1148 01:11:02,823 --> 01:11:04,873 You can get into really cool spaces. 1149 01:11:04,953 --> 01:11:10,223 And one of the first things that any machine learning practitioner learns, 1150 01:11:10,793 --> 01:11:16,293 at least in a lot of the cases that I've talked to, is that just adding 1151 01:11:16,293 --> 01:11:18,113 more layers does not make it better. 1152 01:11:18,113 --> 01:11:22,283 In fact, the cool part is finding the minimum number of layers that 1153 01:11:22,283 --> 01:11:26,573 you need in order to model the relationship between two points.
1154 01:11:26,673 --> 01:11:30,183 that's a little bit abstract, I think the quintessential example is like 1155 01:11:30,363 --> 01:11:33,073 detecting which type of iris flower. 1156 01:11:33,783 --> 01:11:37,513 It is from an image, the, we don't necessarily know how many features 1157 01:11:37,513 --> 01:11:43,443 there are, but we can vectorize the entire picture of an iris flower. 1158 01:11:43,483 --> 01:11:48,043 And then we can discover that the, I think minimum number of layers is 1159 01:11:48,043 --> 01:11:53,493 like five in order to go through and actually get really good accuracy on 1160 01:11:53,493 --> 01:11:56,223 detecting which iris flower it is. 1161 01:11:57,513 --> 01:12:02,018 yeah, multi layered perceptrons are The feed forward networks. 1162 01:12:02,038 --> 01:12:05,808 Those are the basis of everything that comes after it whether it's recurrent 1163 01:12:05,818 --> 01:12:12,458 or even Transformers have feed forward networks inside them and that's the basis 1164 01:12:12,458 --> 01:12:13,128 of it right there. 1165 01:12:13,785 --> 01:12:18,955 How do you choose the sizes and is it all just trial and error 1166 01:12:18,975 --> 01:12:23,335 as well for the number of layers, the sizes of the hidden layers? 1167 01:12:23,995 --> 01:12:24,325 Are there 1168 01:12:24,498 --> 01:12:24,898 Not any 1169 01:12:24,905 --> 01:12:25,835 rules that always 1170 01:12:25,835 --> 01:12:26,315 work? 1171 01:12:28,588 --> 01:12:34,008 Yeah, so going through a feed forward network and this comes from trial and 1172 01:12:34,008 --> 01:12:38,128 error, it comes from a lot of people trying different stuff, but generally 1173 01:12:38,128 --> 01:12:43,928 you have your Initial dimensionality could be something like 768, right? 1174 01:12:43,928 --> 01:12:45,598 Your initial hidden layer. 1175 01:12:45,608 --> 01:12:47,288 that's a good number for it. 
1176 01:12:47,318 --> 01:12:51,018 That's an embedding dimension that we're familiar with, but then we want the 1177 01:12:51,018 --> 01:12:52,728 next hidden layer to be double that. 1178 01:12:52,808 --> 01:12:56,538 And then we want to go smaller and smaller until we hit our 1179 01:12:56,538 --> 01:12:58,508 final output classification layer. 1180 01:12:58,508 --> 01:13:01,658 So we want to have a big jump and then small. 1181 01:13:02,028 --> 01:13:08,138 What to think about that theoretically is you want to model the number of features 1182 01:13:08,168 --> 01:13:13,248 that you are looking for, and then you want to just model double that is just 1183 01:13:13,248 --> 01:13:17,208 a good way of saying all the features that we might not know about that we 1184 01:13:17,208 --> 01:13:18,728 might not even be keeping track of. 1185 01:13:18,728 --> 01:13:21,288 Let's see if the model can figure them out mathematically. 1186 01:13:21,648 --> 01:13:24,003 And then we want to narrow it down. 1187 01:13:24,333 --> 01:13:25,183 Narrow it down. 1188 01:13:25,183 --> 01:13:28,933 Narrow it down until we get to our actual classification, which in language 1189 01:13:28,953 --> 01:13:31,433 modeling is what is the next word, right? 1190 01:13:31,483 --> 01:13:31,993 Got it. 1191 01:13:33,103 --> 01:13:37,543 So double it and then boil it down to the size that you're actually looking 1192 01:13:37,543 --> 01:13:40,723 for across a bunch of layers and hope for 1193 01:13:40,723 --> 01:13:41,283 the best. 1194 01:13:42,143 --> 01:13:42,633 Okay. 1195 01:13:42,746 --> 01:13:46,556 and that's why when OpenAI doubled the embedding layers, it was a 1196 01:13:46,556 --> 01:13:50,046 marginal improvement, but it's predictable because that's normal. 1197 01:13:50,636 --> 01:13:51,326 People do that. 
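The "double it, then narrow it down" heuristic can be sketched as a small feed-forward network in plain Python. This is only an illustration under stated assumptions: the helper names (layer_widths, make_mlp, forward), the halving rule used for narrowing, the ReLU activation, and the random untrained weights are all made up for this sketch; the conversation only specifies starting at the embedding size, doubling once, and shrinking toward the output layer.

```python
import random

def layer_widths(embed_dim, num_outputs):
    """Widths for: embed_dim -> 2 * embed_dim -> halve repeatedly -> num_outputs."""
    widths = [embed_dim, 2 * embed_dim]
    # Assumed narrowing rule: keep halving while still wider than the output.
    while widths[-1] // 2 > num_outputs:
        widths.append(widths[-1] // 2)
    widths.append(num_outputs)
    return widths

def relu(vector):
    """Nonlinear activation applied between the linear layers."""
    return [max(0.0, x) for x in vector]

def linear(vector, weights, bias):
    """One linear transformation; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def make_mlp(widths, seed=0):
    """Build random (untrained) weights for each consecutive pair of widths."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(widths, widths[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def forward(layers, vector):
    """Run the input through every layer, with ReLU between layers."""
    for index, (weights, bias) in enumerate(layers):
        vector = linear(vector, weights, bias)
        if index < len(layers) - 1:  # no activation after the output layer
            vector = relu(vector)
    return vector

# Small stand-in sizes so the sketch runs instantly; 768 would work the same way.
widths = layer_widths(embed_dim=8, num_outputs=3)
scores = forward(make_mlp(widths), [1.0] * 8)
```

Calling layer_widths(768, 10) gives the same shape at realistic scale: one jump up to 1536, then a steady narrowing down to the ten output classes.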
1198 01:13:52,383 --> 01:13:58,353 Are there any particular, well known kind of configurations of these neural 1199 01:13:58,353 --> 01:14:02,183 networks that just work for a bunch of problems, something that 1200 01:14:02,203 --> 01:14:06,323 you keep seeing over and over, or is it more custom for every problem, 1201 01:14:06,763 --> 01:14:09,493 you just follow the heuristics that you just described? 1202 01:14:09,543 --> 01:14:14,843 As far as model architecture, no, it's basically the heuristics that I 1203 01:14:14,843 --> 01:14:21,133 described, and then people will experiment and tune them and find that, oh man, 1204 01:14:21,143 --> 01:14:26,023 statistically, if this layer of the model is bigger, then it works better, 1205 01:14:26,413 --> 01:14:28,043 but it follows that general structure. 1206 01:14:28,063 --> 01:14:32,493 I think one of the papers that I would point to for this is 1207 01:14:33,133 --> 01:14:39,433 ULMFiT, where it's basically a methodology for fine-tuning. 1208 01:14:40,003 --> 01:14:45,473 But it experiments with gradual unfreezing of layers, where when you're 1209 01:14:45,473 --> 01:14:50,253 training, you will start with only the very last classification layer and 1210 01:14:50,283 --> 01:14:51,823 everything else is exactly the same. 1211 01:14:51,853 --> 01:14:53,483 And you only train that one. 1212 01:14:53,513 --> 01:14:58,933 And then you unfreeze, unfreeze, and test each layer as you're training. 1213 01:14:58,953 --> 01:15:03,273 And that tends to help things. Like even now, that is abstracted within 1214 01:15:03,613 --> 01:15:05,063 the Hugging Face Trainer class. 1215 01:15:05,093 --> 01:15:07,413 And that's abstracted within pretty much every 1216 01:15:07,653 --> 01:15:11,193 model.fit methodology because it works. 1217 01:15:13,420 --> 01:15:13,820 Awesome. 1218 01:15:14,810 --> 01:15:16,500 What's next in our journey?
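Gradual unfreezing, as described in the answer above, boils down to a schedule of which layers are allowed to train at each stage. The sketch below only builds that schedule; the layer names and the function unfreezing_schedule are hypothetical, and a real training loop would use the schedule to switch gradient updates on or off in whatever framework is in use.

```python
def unfreezing_schedule(layer_names, num_stages):
    """List the trainable layers per stage, starting with only the last layer."""
    schedule = []
    for stage in range(1, num_stages + 1):
        # Unfreeze one more layer per stage, counting back from the output.
        count = min(stage, len(layer_names))
        schedule.append(layer_names[-count:])
    return schedule

# Hypothetical layer names, ordered from input to output.
layers = ["embedding", "block_1", "block_2", "classifier"]
schedule = unfreezing_schedule(layers, num_stages=3)
```

Here the first stage trains only the classifier, the second adds block_2, and the third adds block_1, while everything not listed stays frozen.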
1219 01:15:16,520 --> 01:15:19,640 Probably just the fact that multilayer perceptrons 1220 01:15:19,650 --> 01:15:22,450 struggle with sequences, right? 1221 01:15:22,660 --> 01:15:26,690 Even if you try to embed things and try and keep some of that 1222 01:15:26,690 --> 01:15:30,105 positional encoding within your embeddings, they struggle to model 1223 01:15:30,525 --> 01:15:33,625 multiple things where the order of them matters, right? 1224 01:15:34,175 --> 01:15:38,225 Which is language: the order matters sometimes, right? 1225 01:15:38,445 --> 01:15:43,295 Sometimes it's normal to say gibberish, and knowing when is extremely 1226 01:15:43,295 --> 01:15:48,450 difficult. And to solve that, I don't know if we need to necessarily go 1227 01:15:48,450 --> 01:15:51,780 into recurrent neural networks, but we definitely need to talk about 1228 01:15:51,820 --> 01:15:56,090 LSTMs, the long short-term memories, which are recurrent neural networks 1229 01:15:56,090 --> 01:16:01,270 to start with, but they added some really important things. For 1230 01:16:01,270 --> 01:16:03,710 example, when I'm talking, you are 1231 01:16:04,090 --> 01:16:08,550 kind of consciously predicting what I might be saying. You can hear what I'm 1232 01:16:08,550 --> 01:16:12,410 saying and you're trying to figure it out as it goes on to understand it. 1233 01:16:12,410 --> 01:16:13,560 We call that active listening. 1234 01:16:13,560 --> 01:16:14,370 That's what happens. 1235 01:16:14,710 --> 01:16:19,050 Long short-term memories model that a little bit, in that they take the sequences 1236 01:16:19,580 --> 01:16:24,250 and they allow the model to try to predict both going forwards and backwards, 1237 01:16:25,020 --> 01:16:26,930 instead of just doing the one way. 1238 01:16:26,940 --> 01:16:30,060 So that bidirectionality, it's computationally expensive.
1239 01:16:30,060 --> 01:16:33,770 It takes a lot longer, which is why I think these are not used as much 1240 01:16:33,770 --> 01:16:38,220 anymore, but it's really novel and it did help a lot in predicting sequences. 1241 01:16:38,220 --> 01:16:40,410 It was phenomenal for language modeling. 1242 01:16:40,410 --> 01:16:44,030 Beyond that, there's attention. 1243 01:16:44,300 --> 01:16:49,080 Within LSTMs, when attention came out, adding attention to whatever you 1244 01:16:49,080 --> 01:16:56,750 were doing was phenomenal, where it added an extra layer of nonlinearity. When it 1245 01:16:56,750 --> 01:17:00,930 was going through and trying to search for what word might come next, it not 1246 01:17:00,930 --> 01:17:04,620 only had all the modeling that we've already talked about, it also had the 1247 01:17:04,620 --> 01:17:10,070 ability to search now and search for not that exact thing, but something similar. 1248 01:17:11,225 --> 01:17:16,145 And that just exploded in popularity because it works, it was phenomenal. 1249 01:17:16,155 --> 01:17:22,045 However, the difficulty with long short-term memories is they're computationally 1250 01:17:22,045 --> 01:17:27,275 expensive, they're slow, it's a lot of math that you have to do in order to 1251 01:17:27,275 --> 01:17:33,600 get through every single layer of it, let alone trying to predict and stream 1252 01:17:33,600 --> 01:17:38,520 those predictions in a sequence, you're going at one token per 30 seconds. 1253 01:17:38,580 --> 01:17:42,230 And that's difficult for having models that are the same size 1254 01:17:42,310 --> 01:17:43,900 as transformers, for example. 1255 01:17:45,140 --> 01:17:49,960 So yeah, it was a lot of really cool stuff that helped us solve 1256 01:17:49,990 --> 01:17:53,400 basically how to get to the next step. 1257 01:17:53,460 --> 01:17:56,380 It was just computationally expensive and slow.
1258 01:17:56,430 --> 01:18:00,360 basically, not very practical in use, but important. 1259 01:18:01,070 --> 01:18:05,150 talking about practicality, I think it's great that it's accurate, right? 1260 01:18:05,550 --> 01:18:07,680 I think accuracy is incredibly practical. 1261 01:18:07,990 --> 01:18:12,820 I don't think that from a customer experience that's practical, right? 1262 01:18:12,900 --> 01:18:16,110 Customers don't like waiting a long time for the right answer because 1263 01:18:16,110 --> 01:18:18,720 they might be able to find the right answer in that amount of time anyway. 1264 01:18:18,720 --> 01:18:21,250 and then from there, do we jump to the attention? 1265 01:18:22,020 --> 01:18:27,140 at this point, we've gone through the history of, the field modeling 1266 01:18:27,140 --> 01:18:31,840 language, building up and we finally reached attention, right? 1267 01:18:32,480 --> 01:18:36,000 And attention is, the backbone of transformers, which is 1268 01:18:36,000 --> 01:18:37,820 what LLMs are built off of. 1269 01:18:37,860 --> 01:18:40,570 And, attention just adds a non linearity. 1270 01:18:41,040 --> 01:18:45,360 And it was just a breakthrough and how we're able to connect the words, 1271 01:18:45,390 --> 01:18:49,750 so attention really quickly is just, creating these dictionaries, 1272 01:18:49,750 --> 01:18:55,330 key values of, every word to every other word in the token space. 1273 01:18:55,480 --> 01:18:57,470 and then it's able to query it. 1274 01:18:57,470 --> 01:19:00,320 for each other word, we're able to build. 1275 01:19:00,320 --> 01:19:03,400 importance of the other words that are important to it. 
1276 01:19:03,440 --> 01:19:09,180 And it's in a quadratic space, so it's much more than a linear space, but 1277 01:19:09,190 --> 01:19:14,780 it's a reasonable amount of time to compute these kind of dictionaries, 1278 01:19:14,780 --> 01:19:18,530 the key values, and then query them and understand the importance of other 1279 01:19:18,530 --> 01:19:23,450 words. It's the backbone of what all these different models are doing. 1280 01:19:23,470 --> 01:19:27,520 And even as Chris mentioned, like we could inject attention into 1281 01:19:27,580 --> 01:19:33,660 these previous RNNs, LSTMs, et cetera, but it was the backbone 1282 01:19:33,660 --> 01:19:35,700 of building the transformer model, 1283 01:19:35,750 --> 01:19:39,400 which came out in the catchy paper, "Attention Is All You Need", 1284 01:19:40,050 --> 01:19:41,510 where essentially all they use, 1285 01:19:41,513 --> 01:19:42,723 a meme, right? 1286 01:19:43,103 --> 01:19:45,413 That we've seen a whole bunch of other papers afterwards. 1287 01:19:45,413 --> 01:19:46,893 They're like, "no, this is all you need", 1288 01:19:46,893 --> 01:19:49,653 or "no, this is all you need", or "no, you don't need", but the 1289 01:19:49,653 --> 01:19:51,003 reason it's a meme is because they 1290 01:19:51,003 --> 01:19:55,673 took out everything that was supposedly novel about the long 1291 01:19:55,673 --> 01:19:57,343 short-term memory, the LSTM. 1292 01:19:57,493 --> 01:20:00,353 They used only attention and feedforward networks. 1293 01:20:01,163 --> 01:20:04,023 Could you give us an example of what that would look like 1294 01:20:04,023 --> 01:20:06,013 on a very stripped down thing? 1295 01:20:06,023 --> 01:20:09,393 What does that dictionary look like? 1296 01:20:09,653 --> 01:20:10,733 for visualization 1297 01:20:11,136 --> 01:20:11,476 and decode. 1298 01:20:11,903 --> 01:20:13,863 No, just for the attention itself, right?
1299 01:20:13,863 --> 01:20:17,633 You mentioned a key value from basically every combination. 1300 01:20:17,763 --> 01:20:20,703 You have to pre-compute every combination within the vocabulary. 1301 01:20:21,456 --> 01:20:26,296 You can take a sentence that you're feeding in to the attention algorithm, "the 1302 01:20:26,296 --> 01:20:28,366 cat in the hat", since I used that earlier. 1303 01:20:28,366 --> 01:20:33,951 And so essentially you would have a dictionary where "the" is compared to 1304 01:20:33,961 --> 01:20:41,171 every other word, "cat", "in", "the", "hat", and it's coming up with a similarity metric 1305 01:20:41,181 --> 01:20:42,881 of the importance of all the other words. 1306 01:20:43,036 --> 01:20:51,596 And then you would do that for "cat": it's going to do it for "the", "in", "the", "hat". And "in": 1307 01:20:51,906 --> 01:20:58,201 "the", "cat", "the", "hat". And it's going to come up with a dictionary, essentially, of 1308 01:20:58,211 --> 01:21:02,231 key value pairs for all the other words, helping you understand the importance 1309 01:21:02,231 --> 01:21:04,181 of the other words that are in there. 1310 01:21:04,231 --> 01:21:08,301 And then the query algorithm that runs, that essentially helps us 1311 01:21:08,301 --> 01:21:11,881 understand being able to predict the next word that's coming afterwards 1312 01:21:11,911 --> 01:21:15,991 based off of how important all of those kind of dictionaries 1313 01:21:16,041 --> 01:21:17,471 are, and adding them. 1314 01:21:17,471 --> 01:21:17,971 And so all of 1315 01:21:17,971 --> 01:21:19,961 this happens to happen in quadratic time. 1316 01:21:19,961 --> 01:21:20,691 One of the nice 1317 01:21:20,711 --> 01:21:21,091 novel 1318 01:21:21,101 --> 01:21:25,671 things about this is that the query and key vectors, your query vector 1319 01:21:25,671 --> 01:21:28,761 is the word that you're looking at in the utterance and your key 1320 01:21:28,761 --> 01:21:31,191 vector is the key in the dictionary.
1321 01:21:31,191 --> 01:21:34,521 Those two vectors are not one-hot encoded. 1322 01:21:34,701 --> 01:21:37,131 The way that a lot of... we haven't even mentioned this. 1323 01:21:37,131 --> 01:21:43,541 But that's a vector that is 0, 0, 0, 0, 0, 0, 1, 0, 0, 0. That's how a lot of these 1324 01:21:43,541 --> 01:21:50,411 things had been represented previously, coming off of the bag of words, the idea 1325 01:21:50,441 --> 01:21:53,461 that, hey, we can model these things. 1326 01:21:53,481 --> 01:21:56,601 We can create vectors that are just: did this word appear, 1327 01:21:56,931 --> 01:21:58,241 or did it not? 1328 01:21:58,301 --> 01:21:59,391 And where did it appear? 1329 01:21:59,411 --> 01:22:04,391 That was a positionality. And with "Attention Is All You Need", you can immediately see 1330 01:22:04,391 --> 01:22:08,571 a problem with one-hot encoding in that it's very sparse, especially as you're 1331 01:22:08,571 --> 01:22:11,951 getting into 768 dimensions, right? 1332 01:22:12,471 --> 01:22:17,161 You have just one 1 and a whole bunch of zeros and those zeros don't really matter. 1333 01:22:17,406 --> 01:22:21,806 And so one of the breakthroughs here was using dense vectors 1334 01:22:21,816 --> 01:22:25,626 for queries and keys in order to get values that are also dense. 1335 01:22:26,756 --> 01:22:30,576 I think one of my favorite visualizations of it, it's from Jesse Vig. 1336 01:22:30,736 --> 01:22:32,946 It's called BertViz on GitHub. 1337 01:22:33,876 --> 01:22:39,581 I've used this in production environments in order to show that, hey, our model 1338 01:22:39,581 --> 01:22:44,331 is not understanding this because look at the attention, all of it is 1339 01:22:44,331 --> 01:22:50,001 factoring in, all of the queries are related to the key of the wrong word. 1340 01:22:50,071 --> 01:22:53,301 If you look at words with semantic ambiguity, I think the quintessential 1341 01:22:53,301 --> 01:22:55,391 one is "time flies like an arrow".
1342 01:22:56,201 --> 01:23:00,451 Where flies is also another word that could mean multiple small 1343 01:23:00,451 --> 01:23:01,961 little bugs buzzing around. 1344 01:23:02,231 --> 01:23:04,211 How do we know that it's not that word? 1345 01:23:04,211 --> 01:23:09,231 It's because of the position in the sentence that we know that it is a verb. 1346 01:23:09,331 --> 01:23:11,841 and it's referring to time and it's referring to arrow. 1347 01:23:12,331 --> 01:23:16,371 And we can see that predictably within attention, because that 1348 01:23:16,371 --> 01:23:18,131 word is determined to be important. 1349 01:23:18,601 --> 01:23:23,521 That query is determined to be important as it relates to the keys of time and 1350 01:23:23,531 --> 01:23:26,341 arrow within query key value attention. 1351 01:23:26,831 --> 01:23:28,221 That's what that dictionary looks like. 1352 01:23:28,231 --> 01:23:28,681 That's why it's 1353 01:23:28,681 --> 01:23:29,101 useful. 1354 01:23:30,333 --> 01:23:34,073 And, I guess the representation of the importance, how do 1355 01:23:34,073 --> 01:23:35,093 we actually come up with 1356 01:23:35,103 --> 01:23:35,523 that 1357 01:23:37,671 --> 01:23:38,961 I think it's dot product. 1358 01:23:39,521 --> 01:23:43,311 we're comparing the vectors between the query and the key. 1359 01:23:43,381 --> 01:23:47,721 dot product attention is, I'm pretty, that's not where it started, but I 1360 01:23:47,741 --> 01:23:49,371 think that's where we're at right now. 1361 01:23:49,761 --> 01:23:52,361 That's like the industry standard that everybody uses. 1362 01:23:53,051 --> 01:23:54,851 It's just, multiplying the vectors together. 1363 01:23:54,851 --> 01:23:59,181 Essentially you take the dot product of the two vectors, and that's 1364 01:23:59,181 --> 01:24:02,771 where we get the comparison and the relative importance values. 1365 01:24:02,771 --> 01:24:04,371 it's not magic, it's 1366 01:24:04,371 --> 01:24:04,711 math. 
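The dot-product comparison between queries and keys can be sketched in plain Python. Two details here come from the paper "Attention Is All You Need" and not from the conversation: the scores are divided by the square root of the vector size, and a softmax turns them into weights that sum to one. The two-dimensional token vectors are made up, and the same vectors are reused as queries, keys, and values; a real transformer would first pass the embeddings through three separate learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over small lists of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for query in queries:
        # Compare this query with every key: n queries x n keys comparisons
        # in total, which is where the quadratic cost comes from.
        scores = [dot(query, key) / math.sqrt(dim) for key in keys]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        output = [sum(w * value[j] for w, value in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["time", "flies", "like", "an", "arrow"]
# Made-up 2-d vectors, one per token; real models learn these.
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.1, 0.2], [0.9, 0.4]]
outputs, weights = attention(vectors, vectors, vectors)
```

Each row of weights is the "dictionary" for one token: how much importance that token's query assigns to every key in the sentence. This is the kind of table that tools like BertViz draw.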
1367 01:24:04,711 --> 01:24:07,451 kind of the same thing from time to time? 1368 01:24:08,571 --> 01:24:08,951 Okay. 1369 01:24:09,611 --> 01:24:18,091 And then with that, we've got the GPT, the generative pre-trained transformer model. 1370 01:24:18,141 --> 01:24:18,441 What's 1371 01:24:18,441 --> 01:24:20,031 so groundbreaking about that? 1372 01:24:20,081 --> 01:24:24,441 As opposed to the original transformer, they only use a decoder. 1373 01:24:24,451 --> 01:24:29,281 So the original transformer had attention-based encoders, which changed your 1374 01:24:29,281 --> 01:24:34,301 embeddings into essentially another embedding that was then taken by your 1375 01:24:34,311 --> 01:24:37,301 decoder and used to predict the next word. 1376 01:24:37,351 --> 01:24:44,411 So it had two networks linked together in the middle in order to produce 1377 01:24:44,856 --> 01:24:48,896 your next word. And the reason this is important is it goes back to that 1378 01:24:48,896 --> 01:24:52,686 original idea that we talked about, language as an abstraction, right? 1379 01:24:52,696 --> 01:24:58,286 The authors of "Attention Is All You Need" looked at that abstraction and 1380 01:24:58,286 --> 01:24:59,766 were like, hey, can we model that? 1381 01:25:00,151 --> 01:25:01,541 And that's what an encoder is. 1382 01:25:01,541 --> 01:25:07,101 When you look at models like BERT, it's taking your input and putting it 1383 01:25:07,111 --> 01:25:12,070 into a new abstract space with lots 1384 01:25:12,070 --> 01:25:13,744 of nonlinear transformations. 1385 01:25:13,744 --> 01:25:16,371 Incredibly useful. 1386 01:25:16,461 --> 01:25:21,851 And so the GPT models were groundbreaking, because they 1387 01:25:21,851 --> 01:25:22,981 were like, we don't need that.
1388 01:25:22,981 --> 01:25:28,121 We just need the decoder and we're just going to use syntax basically. 1389 01:25:28,141 --> 01:25:33,081 And the thought process there is that syntax is related to semantics deeper than 1390 01:25:33,141 --> 01:25:37,601 linguists are able to really conceptualize in an easy to understand way. 1391 01:25:38,291 --> 01:25:42,871 We know that it's true, and we know that it's predictive, especially 1392 01:25:42,901 --> 01:25:46,151 looking at how good GPT-3, GPT-4 are. 1393 01:25:46,341 --> 01:25:49,371 And even looking at the open source stuff, Llama is a decoder 1394 01:25:49,371 --> 01:25:51,951 only network and it rocks, right? 1395 01:25:52,541 --> 01:25:58,751 I have a suspicion that we're going to hit a point later where Google 1396 01:25:58,751 --> 01:26:02,751 is going to blow everybody out of the water with another T5, like another 1397 01:26:03,221 --> 01:26:05,801 version of that which puts the encoder back in. 1398 01:26:06,171 --> 01:26:08,731 I don't know how we're going to get to that point, though, because the 1399 01:26:08,741 --> 01:26:10,571 decoder only models work so well. 1400 01:26:11,961 --> 01:26:14,721 And they're faster, they're less computationally expensive, because 1401 01:26:14,721 --> 01:26:18,241 you're taking, probably, a third of the model and just throwing it away. 1402 01:26:18,596 --> 01:26:22,196 So you mentioned Llama, and I think that might be a good 1403 01:26:22,246 --> 01:26:28,026 segue from what essentially is about a third of your book. 1404 01:26:28,116 --> 01:26:31,636 So for everybody else who wants to go and jump into more details and 1405 01:26:31,636 --> 01:26:37,026 see actual Python implementations of a lot of what we just covered, 1406 01:26:37,656 --> 01:26:39,896 the book is called Production LLMs. 1407 01:26:39,966 --> 01:26:44,356 It's available on manning.com, and I'm pretty sure you're going to love it.
1408 01:26:45,326 --> 01:26:51,166 So going back to Llama, let's do a little hall of fame rundown 1409 01:26:51,206 --> 01:26:55,476 of the kind of landmark important models from the last few years. 1410 01:26:55,506 --> 01:26:56,136 Where should we start? 1411 01:26:56,216 --> 01:26:59,496 I would probably start with the original transformer, like 1412 01:27:00,326 --> 01:27:01,286 they deserve credit. 1413 01:27:01,346 --> 01:27:05,096 A lot of the, Vaswani et al., a lot of the people who wrote that paper have 1414 01:27:05,096 --> 01:27:09,676 gone on to found or co-found companies that are now competing in this space. 1415 01:27:10,026 --> 01:27:12,406 Whether that's Anthropic or Character.AI, 1416 01:27:12,406 --> 01:27:15,506 those are the people that created that Transformer and 1417 01:27:15,506 --> 01:27:16,586 they're still building on it. 1418 01:27:16,816 --> 01:27:19,336 I think that's the first one that I'd say for the Hall of Fame. 1419 01:27:19,386 --> 01:27:20,226 What would you say, Matt? 1420 01:27:20,226 --> 01:27:24,726 I think part of this question is what is the first LLM versus what is the first Hall 1421 01:27:24,726 --> 01:27:29,956 of Fame model. And yeah, like Transformers, BERT, like BERT is incredibly powerful. 1422 01:27:29,956 --> 01:27:36,316 I think, because it's so small, it's not in the LLM space, it's often overlooked. 1423 01:27:36,356 --> 01:27:42,856 And I think many companies are still looking at these massive 1424 01:27:42,866 --> 01:27:46,916 LLM models for problems they could solve with a simple BERT model. 1425 01:27:46,946 --> 01:27:52,076 But because they're only getting into this space now, 1426 01:27:52,916 --> 01:27:53,066 they 1427 01:27:53,066 --> 01:27:55,556 think immediately, hey, we have to use an LLM, 1428 01:27:55,576 --> 01:27:55,856 right? 1429 01:27:55,896 --> 01:27:56,106 And 1430 01:27:56,409 --> 01:27:58,029 they didn't care in 2017.
1431 01:27:58,459 --> 01:27:58,679 And 1432 01:27:59,211 --> 01:27:59,351 And 1433 01:27:59,401 --> 01:27:59,851 over what 1434 01:27:59,851 --> 01:28:00,361 was there. 1435 01:28:00,371 --> 01:28:03,961 And I go back, I said it before, I love Markov chains, like they're 1436 01:28:04,541 --> 01:28:07,841 amazing and they're really powerful for what they do really well. 1437 01:28:07,891 --> 01:28:12,291 And even then, a lot of people could just use Markov chains for a lot 1438 01:28:12,291 --> 01:28:15,691 of the problems that they're trying to solve with LLMs, but LLMs, 1439 01:28:16,006 --> 01:28:21,876 they do give that flexibility, just their massive levels of computation. 1440 01:28:22,406 --> 01:28:27,686 I think if I was to point to a model that I thought was just really powerful, it 1441 01:28:28,176 --> 01:28:29,926 would be BLOOM, actually. 1442 01:28:29,956 --> 01:28:37,506 BLOOM was essentially the first LLM, massive, large model that was built, 1443 01:28:37,646 --> 01:28:40,036 and it was built completely transparently. 1444 01:28:40,176 --> 01:28:42,486 It was a research project, 1445 01:28:42,746 --> 01:28:46,466 funded in large part by the French government. 1446 01:28:46,476 --> 01:28:49,451 And just, it was built completely transparently and 1447 01:28:49,451 --> 01:28:51,171 completely in the open space. 1448 01:28:51,191 --> 01:28:57,321 And even though the BLOOM model today isn't seen as a very competitive 1449 01:28:57,321 --> 01:29:02,271 model, a lot of the open source learnings, a lot of what 1450 01:29:02,281 --> 01:29:08,856 we have nowadays is because of what those researchers figured out 1451 01:29:08,866 --> 01:29:10,416 while they were working on BLOOM. 1452 01:29:10,666 --> 01:29:14,836 We got amazing libraries out of it, like DeepSpeed 1453 01:29:14,836 --> 01:29:15,956 and other things like that.
1454 01:29:16,016 --> 01:29:20,516 It really boosted the open source community, which has been one of the 1455 01:29:20,526 --> 01:29:25,856 major driving factors of LLMs today, and probably a large part of why 1456 01:29:25,856 --> 01:29:29,746 we could even write our book, because if the open source community wasn't 1457 01:29:30,221 --> 01:29:33,821 at where it is today, there wouldn't be much we could really 1458 01:29:33,821 --> 01:29:38,591 tell people other than, oh, you've got to go work for Google or Microsoft, or 1459 01:29:39,231 --> 01:29:41,011 how would we know any of it, right? 1460 01:29:41,194 --> 01:29:41,634 Yeah. 1461 01:29:42,241 --> 01:29:43,801 We know about it largely 1462 01:29:43,801 --> 01:29:47,991 because we've been involved in the open source and we built off of 1463 01:29:48,021 --> 01:29:49,641 what those scientists at BLOOM did. 1464 01:29:50,111 --> 01:29:50,331 Big 1465 01:29:50,331 --> 01:29:50,811 Science. 1466 01:29:51,924 --> 01:29:53,864 So that's 2022, right? 1467 01:29:53,934 --> 01:29:55,384 That's a couple of years now. 1468 01:29:56,654 --> 01:29:56,964 Yeah. 1469 01:29:57,584 --> 01:30:02,594 And then we had Llama that became important, and Llama 2 1470 01:30:03,791 --> 01:30:04,181 Yeah, 1471 01:30:04,354 --> 01:30:05,344 even more important. 1472 01:30:07,061 --> 01:30:12,331 Yeah, and it's largely just because, I don't remember the username of who 1473 01:30:12,331 --> 01:30:17,561 did it, but whoever put that PR on the original Llama GitHub that had the 1474 01:30:17,561 --> 01:30:22,071 torrent link to leak the weights, that's the hockey stick moment for LLMs, right? 1475 01:30:22,551 --> 01:30:25,351 That's what made them available to everybody.
1476 01:30:25,401 --> 01:30:28,891 That's what enabled Stanford to create Alpaca and show that, oh man, 1477 01:30:28,901 --> 01:30:33,711 you can make the model better with only 50K responses, like you 1478 01:30:33,711 --> 01:30:38,081 don't need tons and tons of data in order to fine-tune and get very good 1479 01:30:38,081 --> 01:30:39,841 results and improve in every metric. 1480 01:30:40,581 --> 01:30:43,991 Yeah, everything since then has just been building off of that 1481 01:30:43,991 --> 01:30:49,011 exact same momentum of whoever leaked that first Llama, and Meta 1482 01:30:49,021 --> 01:30:50,811 has benefited greatly from it too. 1483 01:30:50,811 --> 01:30:58,361 They now have a very open, I wouldn't say completely, but a very open attitude 1484 01:30:58,481 --> 01:31:03,021 towards the space because they recognize how advantageous it is to have other 1485 01:31:03,021 --> 01:31:06,831 people building on top of their model and be considered an industry standard. 1486 01:31:08,234 --> 01:31:11,584 Yeah, they've really leaned into it recently, right? 1487 01:31:11,584 --> 01:31:12,254 And like 1488 01:31:12,326 --> 01:31:13,076 how big was their 1489 01:31:13,076 --> 01:31:13,846 stock jump? 1490 01:31:13,909 --> 01:31:14,394 right? 1491 01:31:14,444 --> 01:31:16,424 all of the underlying architecture, right? 1492 01:31:16,444 --> 01:31:23,529 Like these open source programmers or even just the NVIDIA programmers, 1493 01:31:23,579 --> 01:31:26,429 they're able to go in and because they know everything about Llama, they're able 1494 01:31:26,429 --> 01:31:29,569 to optimize CUDA kernels and everything. 1495 01:31:29,569 --> 01:31:35,809 And so Llama has gotten faster and more proficient. With llama.cpp, we're able to run 1496 01:31:35,809 --> 01:31:42,109 it just on a CPU. There's lots of benefits that came because they gave 1497 01:31:42,109 --> 01:31:45,809 us the architecture. It was leaked, but now they've leaned into it.
1498 01:31:45,819 --> 01:31:47,679 Essentially, they've given it to us. 1499 01:31:47,679 --> 01:31:47,969 And so 1500 01:31:48,856 --> 01:31:51,406 Yeah, we just need them to release the data that they used to train 1501 01:31:51,406 --> 01:31:52,846 on it and it's completely open, 1502 01:31:53,016 --> 01:31:53,306 right? 1503 01:31:53,356 --> 01:31:56,966 But even the data, they've told us a lot about what the data is, right? 1504 01:31:58,716 --> 01:32:03,676 We don't have the exact data, but we know essentially, with RedPajama, what those 1505 01:32:03,676 --> 01:32:05,776 datasets were built off of, what they were. 1506 01:32:05,776 --> 01:32:06,486 And so 1507 01:32:07,416 --> 01:32:08,416 we're able to 1508 01:32:08,466 --> 01:32:11,066 replicate it really closely in the open source community. 1509 01:32:11,116 --> 01:32:14,986 Llama, I don't know if we have a really good list of Hall of 1510 01:32:14,996 --> 01:32:16,026 Famers because 1511 01:32:16,476 --> 01:32:19,686 it's difficult to see what's going to stick around, partially because 1512 01:32:19,686 --> 01:32:23,666 it's so difficult to evaluate these models as opposed to BERT, right? 1513 01:32:23,666 --> 01:32:26,096 Large BERT had 300 million parameters. 1514 01:32:26,766 --> 01:32:30,096 You can run stuff to see how good those parameters are, 1515 01:32:30,706 --> 01:32:31,896 like you can hyper-tune them. 1516 01:32:31,906 --> 01:32:34,826 You can run evaluations to see how each one is performing 1517 01:32:34,966 --> 01:32:37,046 and still go relatively fast. 1518 01:32:38,036 --> 01:32:41,586 When we're getting into the 7 billion parameter range and the 13 1519 01:32:41,596 --> 01:32:45,166 billion parameter range and the 70 billion parameter range, it's much 1520 01:32:45,176 --> 01:32:48,586 more difficult and computationally expensive to evaluate on that level.
1521 01:32:49,426 --> 01:32:51,616 And we don't even have the ability to describe what all 1522 01:32:51,616 --> 01:32:52,746 the parameters are doing. 1523 01:32:52,796 --> 01:32:58,236 And so our evaluation metrics are difficult to gauge. 1524 01:32:58,746 --> 01:33:02,096 You look at MMLU, you look at a lot of the benchmarks that people 1525 01:33:02,096 --> 01:33:03,866 are running, and they're useful. 1526 01:33:04,386 --> 01:33:09,156 But ultimately at this stage, we still have to go download those models 1527 01:33:09,186 --> 01:33:12,356 and test them against our own use cases to see if they perform better. 1528 01:33:13,186 --> 01:33:15,306 And that's incredibly time consuming. 1529 01:33:15,356 --> 01:33:19,066 Like, we could talk about a lot of the models that have come out, like Capybara, 1530 01:33:19,106 --> 01:33:25,011 we can talk about Nous Hermes, we can talk about WizardCoder, and they're all great. 1531 01:33:25,571 --> 01:33:27,821 I don't know which ones are going to be the hall of fame, 1532 01:33:27,831 --> 01:33:29,541 the next industry standard, though. 1533 01:33:29,721 --> 01:33:32,781 There's definitely some other models that we love and we talk about in our 1534 01:33:32,781 --> 01:33:34,971 book, like Falcon, which came out of 1535 01:33:35,811 --> 01:33:38,591 the TII in Abu Dhabi, right? 1536 01:33:38,591 --> 01:33:40,601 Like, amazing model. 1537 01:33:40,931 --> 01:33:41,211 It's, 1538 01:33:41,694 --> 01:33:41,954 Micu. 1539 01:33:43,131 --> 01:33:46,031 The latest Falcon is one of the largest open source models and it's 1540 01:33:46,051 --> 01:33:47,691 come under the Apache 2.0 license. 1541 01:33:47,701 --> 01:33:49,431 So it's completely open source, 1542 01:33:49,491 --> 01:33:51,831 the very first model that's fully open source. 1543 01:33:52,101 --> 01:33:54,901 There's definitely amazing progress being 1544 01:33:54,901 --> 01:33:57,761 made and lots of different models to be paying attention to.
1545 01:33:57,811 --> 01:33:58,981 But yeah, 1546 01:33:59,344 --> 01:34:00,484 One of the biggest ones to 1547 01:34:00,484 --> 01:34:01,234 pay attention to 1548 01:34:01,234 --> 01:34:04,914 right now, I think, is OLMo, not because it's competitive and 1549 01:34:04,914 --> 01:34:09,094 performant, but because, like Falcon, it is 100% open source. 1550 01:34:09,104 --> 01:34:10,544 You can see the data they trained on. 1551 01:34:10,544 --> 01:34:12,724 You can replicate exactly their experiments. 1552 01:34:12,734 --> 01:34:16,114 That's going to be one of the biggest drivers in this field, where you look at 1553 01:34:16,164 --> 01:34:21,489 a lot of the innovation that's happening and it's happening on files that 1554 01:34:21,489 --> 01:34:23,019 people are passing around on torrents. 1555 01:34:23,019 --> 01:34:28,169 It's happening when random users on Reddit are coming up with NTK-aware 1556 01:34:28,169 --> 01:34:30,319 scaling and RoPE scaling after that. 1557 01:34:30,369 --> 01:34:33,059 And they're coming up with more stuff because 1558 01:34:33,789 --> 01:34:37,869 they have time, and they want to help, and a lot of these people are experts 1559 01:34:37,869 --> 01:34:43,209 and they're just anonymous, and that's incredibly important for the space because 1560 01:34:43,769 --> 01:34:49,909 we're finding that people who deal with these models and use them 24/7 have skills 1561 01:34:49,939 --> 01:34:54,659 that the researchers don't necessarily have, and that's difficult to admit, being 1562 01:34:54,659 --> 01:34:56,589 on the research part of it, but it's true. 1563 01:34:57,649 --> 01:35:02,909 So that's the one coming from the Allen Institute for AI, right? 1564 01:35:02,969 --> 01:35:08,249 The one it has, yeah, I think they're also open source in the 1565 01:35:08,289 --> 01:35:09,729 actual training code as well.
1566 01:35:09,739 --> 01:35:10,079 the whole 1567 01:35:10,217 --> 01:35:11,197 they are the whole 1568 01:35:11,197 --> 01:35:11,527 thing. 1569 01:35:12,709 --> 01:35:13,649 That's pretty awesome. 1570 01:35:14,169 --> 01:35:18,879 So with that caveat out of the way, hedging your predictions, we don't 1571 01:35:18,879 --> 01:35:20,329 know what's going to happen tomorrow. 1572 01:35:20,909 --> 01:35:27,649 Do you see any one company kind of getting ahead of the others? 1573 01:35:27,649 --> 01:35:35,549 GPT-4 is still holding up well against a lot of these models, which makes me 1574 01:35:35,559 --> 01:35:37,619 think personally that they have a few 1575 01:35:38,179 --> 01:35:41,769 tweaks and hacks they haven't shared, which helps with 1576 01:35:41,779 --> 01:35:43,399 their multi-billion valuation. 1577 01:35:43,949 --> 01:35:48,129 Do you see anybody running away from the crowd, or is it too late now? 1578 01:35:48,129 --> 01:35:53,279 The cat's out of the bag and the progress is going to come from the mass of people. 1579 01:35:53,279 --> 01:35:56,894 I don't know. I know that I was texting with a couple of people the 1580 01:35:56,894 --> 01:36:02,174 other day talking about GPT-4 and how it is still relevant. Even though people 1581 01:36:02,174 --> 01:36:06,624 talk about the performance decrease, it's still relevant, and every 1582 01:36:06,624 --> 01:36:10,964 week, every model that's coming out is getting compared against GPT-4. 1583 01:36:10,984 --> 01:36:15,514 And they're finding that most models are more performant than 1584 01:36:15,514 --> 01:36:19,814 GPT-4 on certain things, right? 1585 01:36:19,824 --> 01:36:25,324 It's comparing Rain Man to an average human and asking 1586 01:36:25,324 --> 01:36:26,924 what tasks they're good at, right? 1587 01:36:26,954 --> 01:36:30,044 If it's going to McDonald's and ordering your 1588 01:36:30,044 --> 01:36:32,784 own food, Rain Man is not great.
1589 01:36:33,129 --> 01:36:35,279 And you just got to find the model that's better. 1590 01:36:35,769 --> 01:36:38,129 A good example for that with GPT-4 is math. 1591 01:36:38,639 --> 01:36:41,149 If you need a model to perform calculations for you, 1592 01:36:41,724 --> 01:36:42,474 that's not it. 1593 01:36:43,324 --> 01:36:49,054 You have Alpha Wolf, you have Goat, even just vanilla Llama 2 is 1594 01:36:49,054 --> 01:36:53,194 better at math than GPT-4, even though they weren't explicitly training on it. 1595 01:36:53,344 --> 01:37:00,014 And I think that they currently have that first-to-market 1596 01:37:00,274 --> 01:37:01,944 advantage more than anything. 1597 01:37:02,664 --> 01:37:03,904 That's not to say that it's bad. 1598 01:37:03,904 --> 01:37:08,324 That's not to reduce the work that OpenAI has done, because it is phenomenal. 1599 01:37:08,624 --> 01:37:12,504 But what's really keeping them afloat is the first-to-market 1600 01:37:12,504 --> 01:37:14,274 advantage and the ease of use. 1601 01:37:16,807 --> 01:37:21,357 One other question I was holding as you were speaking: you 1602 01:37:21,357 --> 01:37:24,037 mentioned Mixtral and, what is it 1603 01:37:24,077 --> 01:37:24,357 called? 1604 01:37:24,417 --> 01:37:26,287 Mixture of, mixture of experts. 1605 01:37:26,527 --> 01:37:26,827 what's 1606 01:37:26,834 --> 01:37:27,744 Yeah, Mixtral. 1607 01:37:27,744 --> 01:37:30,174 Yeah, it's routing. 1608 01:37:30,234 --> 01:37:34,114 It's being smart and saying, hey, we don't need a dense feed-forward 1609 01:37:34,114 --> 01:37:35,794 network for every single thing. 1610 01:37:36,264 --> 01:37:40,779 Let's have a whole bunch of sparse networks and, just based on the input, 1611 01:37:41,209 --> 01:37:44,799 route it and tell it which expert is actually going to be the best. 1612 01:37:45,029 --> 01:37:50,669 It results in much larger models that are smaller on disk and faster to run.
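[Editor's note: the routing idea described here can be sketched in a few lines of Python. This is only an illustrative toy, not Mixtral's actual implementation: the names `moe_forward` and `make_expert`, the "experts" that merely scale their input, and the hand-picked gate weights are all assumptions made for the example. It shows a gate scoring every expert for a given input, keeping only the top-k, and mixing their outputs, so the experts that were not chosen never run.]

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical stand-in for a feed-forward expert: it just scales its input."""
    return lambda x: [scale * v for v in x]


def moe_forward(x, gate_weights, experts, top_k=2):
    """Route input x to the top_k experts chosen by a linear gate.

    gate_weights holds one row of weights per expert (an assumption of this sketch).
    Returns the mixed output and the indices of the experts that were used.
    """
    # One gate score per expert: dot product of that expert's gate row with x.
    scores = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    # Keep only the top_k highest-scoring experts; the rest are skipped entirely.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Renormalise the gate scores over the chosen experts only.
    mix = softmax([scores[i] for i in chosen])
    out = [0.0] * len(x)
    for weight, idx in zip(mix, chosen):
        expert_out = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, chosen


# Toy usage: four experts, two of which run for this input.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]
output, used = moe_forward([1.0, 0.0], gate_weights, experts, top_k=2)
print(used, output)
```

[In this toy, only two of the four experts are evaluated for the input, which is where the per-token compute saving in a sparse model comes from.]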
1613 01:37:52,066 --> 01:37:56,516 Is that more similar to how the human brain works? 1614 01:37:57,166 --> 01:37:59,026 Because it's obviously not fully 1615 01:37:59,026 --> 01:37:59,686 connected. 1616 01:37:59,786 --> 01:38:02,236 It's got different regions and stuff like that. 1617 01:38:02,886 --> 01:38:04,496 I would love to appeal to that 1618 01:38:04,496 --> 01:38:05,016 authority. 1619 01:38:05,026 --> 01:38:05,766 that didn't rock. 1620 01:38:05,816 --> 01:38:10,916 I don't know though, because you look at MRIs and you can see, oh man, 1621 01:38:10,916 --> 01:38:14,856 this portion is lighting up when you're experiencing that emotion or seeing that 1622 01:38:14,856 --> 01:38:15,236 input. 1623 01:38:15,236 --> 01:38:15,526 But 1624 01:38:16,554 --> 01:38:16,684 who 1625 01:38:16,766 --> 01:38:19,396 we don't really have a really great mapping of 1626 01:38:19,396 --> 01:38:20,426 every person's brain. 1627 01:38:20,476 --> 01:38:26,596 I think the connection between a neural net and actual neurons was 1628 01:38:26,956 --> 01:38:28,786 lost a long time ago, right? 1629 01:38:29,196 --> 01:38:33,266 How does the human brain work and how does it really compare to modern-day models? 1630 01:38:33,276 --> 01:38:37,636 It's hard to really make that argument. We're still 1631 01:38:37,636 --> 01:38:39,506 learning about how we learn. 1632 01:38:39,626 --> 01:38:46,276 And as we do, and as the neuroscience field advances, it ultimately leads to 1633 01:38:46,276 --> 01:38:49,866 advances in the AI space and vice versa. 1634 01:38:49,986 --> 01:38:51,656 There's definitely connections there. 1635 01:38:51,716 --> 01:38:56,766 But yeah, as far as your question goes, I think it's anybody's guess. 1636 01:38:56,766 --> 01:39:00,466 I think this is a perfect note to end on. 1637 01:39:00,496 --> 01:39:02,116 A little bit of suspense.
1638 01:39:02,286 --> 01:39:06,176 we're going to have to get you back at some point when you've finished your 1639 01:39:06,226 --> 01:39:12,386 book and talk a little bit more about the actual technical problems and challenges. 1640 01:39:12,426 --> 01:39:17,506 We haven't really touched upon any of that yet, but today I certainly 1641 01:39:17,536 --> 01:39:22,656 learned a lot from you and I hope a lot of our listeners will as well. 1642 01:39:22,766 --> 01:39:25,306 It was an absolute pleasure to meet you both. 1643 01:39:26,196 --> 01:39:28,246 Thank you so much and see you next time.