Machine Learning Study Group
Contacts: jdberleant@ualr.edu and mgmilanova@ualr.edu
Agenda & Minutes (144th meeting, Jan. 3, 2025)
Table of Contents
* Agenda and minutes
* Transcript (when available)
Agenda and minutes
- Announcements, updates, questions, presentations, etc.
- Recall the master's projects that some students are doing and need our suggestions about:
- Suppose a generative AI like ChatGPT or Claude.ai was used to write a book or content-focused website about a simply stated task, like "how to scramble an egg," "how to plant and care for a persimmon tree," "how to check and change the oil in your car," or any other question like that. Interact with an AI to collaboratively write a book or an informationally near-equivalent website about it!
- BI: Maybe something like "Public health policy."
- LG: Thinking of changing to "How to plan for retirement."
- ET: Gardening (veggies, herbs in particular). Specifically, growing vegetables from seeds. May change topics.
- JK is focusing on prompt engineering with agents and may have comments at some point.
- If anyone else has a project they would like to help supervise, let me know!
- JK proposes complex prompts, etc. (https://drive.google.com/drive/u/0/folders/1uuG4P7puw8w2Cm_S5opis2t0_NF6gBCZ).
- NM project.
- The campus has assigned a group to participate in the AAC&U AI Institute's activity "AI Pedagogy in the Curriculum." IU is on it and may be able to provide updates when available, every now and then but not every week.
- Anything else anyone would like to bring up?
- View more of the Chapter 5 video.
- Here is the latest on readings and viewings:
- Next we will continue to work through chapter 5: https://www.youtube.com/watch?v=wjZofJX0v4M. We got up to 22:15. Next time we do this video, we will go on from there. (When sharing the screen, we need to click the option to optimize for sharing a video.)
- We can work through chapter 6: https://www.youtube.com/watch?v=eMlx5fFNoYc.
- We can work through chapter 7: https://www.youtube.com/watch?v=9-Jl0dxWQs8
- https://www.forbes.com/sites/robtoews/2024/12/22/10-ai-predictions-for-2025/
- https://arxiv.org/pdf/2001.08361
- Computer scientists win Nobel prize in physics! https://www.nobelprize.org/uploads/2024/10/popular-physicsprize2024-2.pdf got an evaluation of 5.0 for a detailed reading.
- We can evaluate https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10718663 for reading & discussion.
- Chapter 6 recommends material by Andrej Karpathy, https://www.youtube.com/@AndrejKarpathy/videos for learning more.
- Chapter 6 recommends material by Chris Olah, https://www.youtube.com/results?search_query=chris+olah
- Chapter 6 recommended https://www.youtube.com/c/VCubingX for relevant material, in particular https://www.youtube.com/watch?v=1il-s4mgNdI
- Chapter 6 recommended Art of the Problem, in particular https://www.youtube.com/watch?v=OFS90-FX6pg
- LLMs and the singularity: https://philpapers.org/go.pl?id=ISHLLM&u=https%3A%2F%2Fphilpapers.org%2Farchive%2FISHLLM.pdf (summarized at: https://poe.com/s/WuYyhuciNwlFuSR0SVEt). 6/7/24: vote was 4 3/7. We read the abstract. We could start it any time. We could even spend some time on this and some time on something else in the same meeting.
Transcript:
ML discussion group
Fri, Jan 3, 2025
4:00 - D.B.
Well, why don't we get started? Welcome back, everyone. Even though most people aren't really back, back. Classes don't start for a couple of weeks. But anyway, this meeting is not even really a university activity, right? It just sort of happens to be a bunch of people mostly affiliated with the university doing it. So we're not linked to a school calendar. Anyway, none of these master's students are here today yet, at least. But they're going to be writing books or equivalent websites on some topic using AIs this semester as their projects. And I thought it would be interesting for them to give us a one-minute update each week, and we can talk about it as needed or not. I'm just kind of curious about what the experiences will be when someone tries to write a book using an AI. I know people do it. If anyone else has a project they'd like to help supervise like that, anything AI-related, just let me know. I'll put out the word and probably find a student for you. J.'s not here. V., do you have any idea if J.'s going to be coming back here, or is he just...
5:39 - V.W.
J. is an enigma that I greatly appreciate, but I do not fully understand, so I am unable to fathom the depths of that. He was coming to the gym, and that's where I held my office hours, walking around the track and talking about projects and technical endeavors. But I have not seen him lately. Of course, for three weeks, we've all been incognito at the gym. But I had my first workout yesterday. He wasn't there, although a number of the other people that are regulars were there. So, yeah, I'm curious. He's been up to really interesting things. And I really hope for the best. You know, I tried to specifically design a project that would suit him: his writing talent, his science fiction interest. And I had hoped that we'd be able to rewrite your Back to the Future book. But he was feeling overcommitted at the time. And I didn't want to really take it on myself and deprive him of an opportunity. I thought about what might be most suitable for his background. So that made me sorry. But J. had been through some really rough times lately. And so I wanted to make sure I was giving him the complete benefit of the doubt. And that's all I know.
6:55 - Unidentified Speaker
Okay.
6:56 - D.B.
Well, if he does come back at some point, I'd like to kind of... he had a website with a bunch of very, very elaborate agent-based prompts, and I thought people would like to hear from him and try them.
7:11 - Unidentified Speaker
Yeah.
7:11 - Multiple Speakers
He's a rising star for sure. And I think he underestimates his own ability to contribute because now that we can all program in English, the English majors are making a very serious comeback.
7:23 - V.W.
Yeah. All right. Anything else anyone would like to bring up? Because if not, then we can.
7:30 - D.B.
I mean, we started this chapter five, quote unquote, video a while ago. We got up to minute 15:50, and we actually did that twice. We sort of took a big break and didn't finish it. We looked at part of it, took a long break, came back, redid the first part. So if we were to start from the beginning, it would be the third time. I'm willing to do it, or we could just start from minute 15:50 on this video. What do you all think? You want to start from the beginning and redo the
8:09 - V.W.
Maybe we could pick up, if you want, where we left off. Through 3Blue1Brown videos in other areas. He has the most fantastic topology video I've ever seen that completely motivates the discipline, defines it as unique and distinct from other branches of mathematics, like linear algebra, calculus, or category theory. And he's just an incredible illustrator and animator, and someone we could all be inspired by.
8:37 - D.B.
OK, well, this does not look... It does it. Skip. Here we go. OK, but I need to unshare, change the sharing parameters to be video friendly, and then reshare again. So unshare, share, optimize for video. Video clip there. Okay, this will hopefully work. All right, and where did I say we were at? We went up to 15:50. Right back here.
9:14 - V.W.
Look at the... when you hovered your mouse over the slider there, it shows the areas that are replayed in this video. That is a spectacular amount of engagement for a video of this type to have. And it shows you just how heavily it's being subscribed to. It's got 139,000 likes, and that's out of probably, what, one and a half million views. So incredible.
9:43 - D.B.
I wonder, you know, why is it that the peaks in this, do y'all see that graph below with the hills and valleys?
9:51 - V.W.
Yeah, the peaks are the times that are most often replayed.
9:55 - D.B.
So this is a plot of engagement versus time.
9:58 - V.W.
And if somebody didn't understand something, or they thought it was very fascinating, they'll go back. And that's the replay map of the whole population of a million some odd people playing this video.
10:08 - D.B.
So if you watch it, and then you go back 10 seconds to catch something... Right. And secondly, it adds like two to the peak instead of one, or something like that. Right.
10:17 - V.W.
And so it also tells you if you were pressed for time, or you wanted to cover the high points, you could just go over the peaks, and then back as necessary.
10:26 - D.B.
Of course, its continuity is so good.
10:30 - V.W.
Every word is valuable.
10:33 - D.B.
Well, we're going to add one to everything after 15:50. We will be that drop in the bucket. Yeah. All right. So anyway, here we go. That flows through it and go step by step.
10:53 - Unidentified Speaker
There are many different kinds of models that you can build using transformers. The point is, it looks like during training the model found it advantageous to choose embeddings such that one direction in this space encodes gender information. There's one guitar pattern that's not talked about on YouTube, and none of the pros will tell you about it, but they... Another example is that if you take the embedding of Italy and you subtract embedding of Germany, and then you add that to the embedding of Hitler, you get something very close to the embedding of Mussolini. Any comments on that? The male, female, or this one?
11:33 - V.W.
It's just that the semantic directions have been a great new way to think for me. That in these very high dimensional spaces, there are these multiplicity of directions that have to do with, in this case, dictator-ness or Axis powers, wartime-ness, et cetera.
11:50 - D.B.
I find the concept that they use embeddings, just that foundation, helps me either understand or maybe mistakenly think about how good that is for doing things like asking it to find the word for X. You say, well, I need a word that expresses this kind of nuance, blah, blah, blah. You give it a paragraph, and your paragraph hopefully will embed at exactly the same point as that word. And if you ask it to find the word, it'll find the word.
12:27 - V.W.
It absolutely excels at that. I was having to name a weird part on this aircraft I'm building to my right, and I had to really rack my brain. And then I went to it and it gave me a very appropriate, generally understood industrial term for the specific part that I was working with. And so that's very good when you're communicating with other people about what it is you're up to. And so I also think that this directionality, sorry, E., is that we're getting the emergent properties from the fact that we have all these semantic directions. We're in a semantic vector space where meanings of different things point in all kinds of different directions. So that causes the emergent properties as a side effect, which we find the most enjoyable about LLMs.
13:14 - E.G.
Here we're just looking at binary interpretations. When we start looking at things like "prince," it'll pull out the monarchy in England. But then you say musical artist, and it'll pull out something else. But it is nice how it tends to adapt to that. So that way it almost has an annealing process to say, OK, as we get more information, the annealing process will hone in on exactly what you're looking for.
13:48 - V.W.
Yeah, like, show me the monarchs of artificial intelligence. Show me the sultans of swing. You know, these kind of directions that we immediately assent to as people is new for us to have as a reasoning tool, you know, in an automated sense. Okay. Anyone else?
14:11 - D.B.
It's as if the model learned to associate some directions with Italian-ness, and others with World War II Axis leaders.
14:18 - Unidentified Speaker
Maybe my favorite example in this vein is how in some models, if you take the difference between Germany and Japan, and you add it to sushi, you end up very close to bratwurst.
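(Editor's note: a minimal numpy sketch of the analogy arithmetic described above. The 4-dimensional vectors and word list are invented purely for illustration; real GPT-3 embeddings are learned and have 12,288 dimensions.)

```python
import numpy as np

# Toy 4-dimensional "embeddings" (made up for illustration only;
# real embeddings are learned, high-dimensional vectors).
emb = {
    "Italy":     np.array([0.9, 0.1, 0.0, 0.2]),
    "Germany":   np.array([0.1, 0.9, 0.0, 0.2]),
    "Hitler":    np.array([0.1, 0.9, 0.8, 0.1]),
    "Mussolini": np.array([0.9, 0.1, 0.8, 0.1]),
    "sushi":     np.array([0.0, 0.0, 0.1, 0.9]),
}

def nearest(vec, table):
    """Return the word whose embedding has the highest cosine similarity to vec."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(table, key=lambda w: cos(vec, table[w]))

# Italy - Germany + Hitler lands near Mussolini in this toy space.
query = emb["Italy"] - emb["Germany"] + emb["Hitler"]
print(nearest(query, emb))  # -> "Mussolini"
```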
14:29 - V.W.
Also, in playing this game of finding nearest neighbors, I was very pleased to see how close Cat was to both Beast and Monster. One bit of mathematical intuition that's helpful to have in mind, especially for the next chapter, is how the dot product of two vectors can be thought of as a way to measure how well they align.
14:50 - Unidentified Speaker
Computationally, dot products involve multiplying all the corresponding components and then adding the results, which is good since so much of our computation has to look like weighted sums. Geometrically, the dot product is positive when vectors point in similar directions, it's zero if they're perpendicular, and it's negative whenever they point opposite directions. For example, let's say you were playing with this model and you hypothesize that the embedding of cats minus cat might represent a sort of plurality direction in this space. To test this, I'm going to take this vector and compute its dot product against the embeddings of certain singular nouns and compare it to the dot products with the corresponding plural nouns. If you play around with this, you'll notice that the plural ones do indeed seem to consistently give higher values than the singular ones.
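(Editor's note: a similar toy sketch of the plurality-direction test described above. The 3-dimensional vectors are invented; in a real model the dot products would be taken against learned embeddings.)

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration.
cat,  cats = np.array([1.0, 0.2, 0.0]), np.array([1.0, 0.2, 0.7])
dog,  dogs = np.array([0.8, 0.5, 0.1]), np.array([0.8, 0.5, 0.8])

plurality = cats - cat          # candidate "plurality direction"

# Plural embeddings project further along this direction, i.e. give
# larger dot products, than their singular forms.
print(float(dog  @ plurality))  # smaller value
print(float(dogs @ plurality))  # larger value
```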
15:39 - D.B.
Okay, I don't get that. Why would the plural ones give higher values? Or does anyone else have any thoughts on this? Indicating that they align. It's also fun how if you take this dot product with the embeddings of the words 1, 2, 3, and so on, they give increasing values, so it's as if we can quantitatively measure how plural the
16:17 - Unidentified Speaker
model finds a given word.
16:18 - V.W.
Again, the specifics for how words get embedded is learned using data.
16:22 - Unidentified Speaker
This embedding matrix, whose columns tell us what happens to each word, is the first pile of weights in our model. And using the GPT-3 numbers, the vocabulary size specifically is 50,257. And again, technically this consists not of words per se, but of tokens. And the embedding dimension is 12,288. Multiplying those tells us this consists of about 617 million weights. Let's go ahead and add this to a running tally, remembering that by the end we should count up to 175 billion.
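(Editor's note: the arithmetic behind the 617 million figure, reproduced as a quick check using the numbers quoted from the video.)

```python
# GPT-3 embedding-matrix tally, per the video.
vocab_size = 50_257        # tokens in the GPT-3 vocabulary
d_embed    = 12_288        # embedding dimension

embedding_params = vocab_size * d_embed
print(f"{embedding_params:,}")            # 617,558,016 (~617 million)
print(f"{embedding_params / 175e9:.2%}")  # ~0.35% of the 175 billion total
```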
16:52 - D.B.
Any thoughts or comments?
16:59 - D.B.
So in those tables, the 12,288 is what, rows? And the 50,000 is columns? Anyway, it makes a matrix. And for embedding, it seems like 12,000 is a little low if you're trying to be able to have a separate value for every meaning that could exist.
17:34 - Unidentified Speaker
In the case of transformers, you really want to think of the vectors in this embedding space as not merely representing individual words. For one thing, they also encode information about the position of that word, which we'll talk about later. But more importantly, you should think of them as having the capacity to soak in context. A vector that started its life as the embedding of the word king, for example, might progressively get tugged and pulled by various blocks in this network so that by the end it points in a much more specific and nuanced direction that somehow encodes that it was a king who lived in Scotland and who had achieved his post after murdering the previous king.
18:19 - D.B.
So that suggests that all those king locations are on the same King... Monarchy axis or something. You know, an axis, like if it's got 12,000, I mean, is it 12,000 axes then?
18:38 - E.G.
Yeah, but I think king in and of itself would have one axis. And as you start adding another semantic layer, as you saw with the different layers... King would be one; then geography, Scotland, would send it in a different direction.
19:08 - Multiple Speakers
OK, so if you if you go back, you'll actually see those layers.
19:17 - E.G.
Basically. Augment. Yeah.
19:19 - V.W.
The more tokens you have, the more you have this notion of this notionality space. And the more words you have, the more specific you're being about the notion you're trying to represent. And therefore, it's likely as you become more specific, there will be fewer other things connected to that. So as your phrase increases in the number of tokens, you will eventually converge to more and more specific things. But that's offset by the fact that some very wordy types of phrases can themselves refer to lots of things. Like "all the molecules that wiggle" is still a pretty big set.
19:57 - D.B.
Well, so all these different kinds of kings, you know, the one that lived in Scotland, the one that had a murdered predecessor, the one that lived in France, the one that lived in England, right? They're all going to be pretty high. They'll all probably have almost the same value on the king axis. But they'll have way different values on the Scotland axis versus the France axis and so on.
20:23 - Unidentified Speaker
Right.
20:24 - V.W.
There's the king notion that lives in the dictionary, like the Oxford dictionary, and there's the king notion that lives in the Wikipedia for H.VIII. And the H.VIII notion of a king has associated with it these geographic and time and rise to power associations, while the king that lives in the dictionary has only the limited scope of title and role.
20:51 - M.M.
Remember for all of this, you have to have all of this content in the training. So if you don't have it in the training, you cannot have it in this vector space. Right.
21:05 - V.W.
And those vectors won't tug on each other because they won't be there.
21:10 - M.M.
Exactly. So everything is related to the training, to whatever content you have, and embedding is just part of it. But I actually sent you the small videos of simple embedding of a three-word sentence, a five-word sentence, because this is complicated. This is a very huge multidimensional space.
21:31 - D.B.
Yeah, this one doesn't show that all these different kings here, there's four of them, are actually in the same place. On the king axis.
21:42 - A.B.
Yeah.
21:42 - V.W.
In fact, this is pejorative because he's showing this in a three-dimensional space when it's really like a 27,000 dimensional space.
21:50 - M.M.
Exactly.
21:50 - V.W.
And it's helpful for us to think of these vector displacements in a geometric world that we live in, but the actual thing that's going on is much richer and more complex.
22:02 - M.M.
Yeah. If it's trained with all of this content. All right.
22:06 - D.B.
Well, V., or anybody, what about WordNet? A number of years ago, there was WordNet, which was an attempt by a guy named M., a very famous cognitive psychologist. He called it WordNet, and it was basically a neural network or a semantic network of words.
22:28 - V.W.
That was like the precursor to BERT, wasn't it?
22:33 - M.M.
Exactly. Simple neural networks can do this job. I show you the videos with this simple multi-layer perceptron is doing this job.
22:44 - D.B.
Yes, you're correct. Could they incorporate WordNet into this system? Another way to state that would be that WordNet is an ancestor of this system.
22:57 - V.W.
So it wouldn't be a matter of incorporating it, it would be a matter that this thing we're looking at evolved from WordNet and BERT and the early large language model attempts and culminated with the transformer that was originally designed simply to be a good translator between words and phrases and meanings. And then these emergent meanings popped out and lo and behold, the AI winter had thawed.
23:26 - D.B.
Well, couldn't it be that... I mean, isn't it sort of reinventing the wheel to throw away WordNet, which was elaborately generated partly by hand, I would guess, in favor of just sort of brute-force processing of text from the net?
23:44 - Multiple Speakers
Who says they threw it away? Maybe there's still pieces of the WordNet and BERT code in what we see now.
23:53 - M.M.
It is inside, but you add the positional encoding, you add the attention mechanism, you add some things. But the main idea, like you mentioned, is WordNet. It is. It is. All right.
24:07 - D.B.
So, here's another related question. So, there's this effort, Cyc, C-Y-C, which was D.L.'s lifetime project.
24:14 - V.W.
He died a few years ago, it turns out.
24:18 - D.B.
But he was trying to get a- Codify all human knowledge.
24:23 - Multiple Speakers
Well, he was trying to come up with a database of common sense, physical, maybe not just physical, including common sense, physical knowledge. And the DOD was all over him to, you know, let us use this for our intelligence purposes.
24:39 - Unidentified Speaker
Yeah.
24:40 - D.B.
So why not? Couldn't, since it was so elaborately developed by hand, you know, they were, people had, were curating this, this database of common sense facts, you know, like, like a, a smaller cup can fit in a bigger cup. Okay. That's a common sense, physical knowledge.
24:57 - V.W.
T.W., world type things.
24:59 - D.B.
But on steroids, because T.W. was a micro-world guy, as everyone was in those days. And he was trying to scale up, but he didn't realize that you had to scale up like this. So he did it by hand. His people did it by hand. But couldn't you take his database and kind of combine it with something like this to get an LLM with common-sense physical knowledge? That'd be better. I would prefer to look at it as that.
25:26 - V.W.
one LLM eats another, like a big fish eats a smaller fish. And the little fish is inside the big fish. And we go from very specialized fish to very big, multi-capable fish.
25:40 - D.B.
Well, how do you get an LLM or ChatGPT to be the big fish that eats Cyc?
25:46 - V.W.
So you could take Cyc and you could run it side-by-side with ChatGPT 4.0 or Pro or whatever, and you could see how they perform, and are there things that the smaller, earlier model does worse or better or the same as the newer one? That's a pretty interesting thought.
26:04 - D.B.
So you could take a query and somehow expand it into a query for ChatGPT and another related, you know, a parallel query for...
26:12 - V.W.
Right. And you could race them and you could say...
26:15 - Multiple Speakers
I got another one for you. When you take a look at how we learn things, we build on what's been built before.
26:24 - E.G.
We synthesize what is useful for us. We incorporate that and we throw away what isn't.
26:35 - D.B.
Aren't you explaining what these models are doing?
26:40 - V.W.
And there was another one for antibiotics that was really good at understanding antibiotic resistance in context. These were domain-specific expert systems, as they were termed at the time, that did a very strict subset of the kind of thing we're doing now, that we would just openly engage in a chat about.
27:06 - D.B.
All right. Anyone else have a comment? All right, we'll continue. And who's being described in Shakespearean language?
27:15 - Unidentified Speaker
Think about your own understanding of a given word, the meaning of that word is clearly informed by the surroundings. And sometimes this includes context from a long distance away. So in putting together a model that has the ability to predict what word comes next, the goal is to somehow empower it to incorporate context efficiently. To be clear, in that very first step, when you create the array of vectors based on the input text, each one of those is simply plucked out of the embedding matrix. So initially, each one can only encode the meaning of a single word without any input from its surroundings.
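(Editor's note: a toy sketch of that first step, with made-up sizes; GPT-3's embedding matrix is 12,288 by 50,257. Each token id simply selects a column of the embedding matrix, with no context mixed in yet.)

```python
import numpy as np

d_embed, vocab_size = 4, 10
rng = np.random.default_rng(1)
W_E = rng.normal(size=(d_embed, vocab_size))   # learned during training; random here

token_ids = [3, 7, 7, 1]                       # a hypothetical tokenized prompt
vectors = W_E[:, token_ids]                    # one column per token, shape (4, 4)
print(vectors.shape)
# The two occurrences of token 7 start out identical; only the later
# attention blocks make them diverge based on context.
```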
27:48 - D.B.
Each of those vertical vectors is a list of how far each word is on each of the dimensions of the space, right?
27:56 - V.W.
So imagine you were trying to send a message to the past so that the winter of AI would have never occurred. And you could only send it in four words, this message to tell M.M. to rewrite the Perceptrons book, to not be so mean about XOR gates.
28:14 - D.B.
And so what you would say is predict the next word.
28:19 - V.W.
It's like Fourier told us to look in the frequency domain instead of the time domain. And modern times have said, all you need to do is try to predict the next word and you will naturally arrive at this kind of thing. Because this predicting the next word is a total subset, an insignificant subset, of the actual leverage we've obtained by trying to do it. Because it turns out if you try to predict the next word, you end up
28:51 - D.B.
enabling thinking and emergent properties to occur. And who would have thunk? Anybody else? Right. But you should think of the primary goal of this network that it flows through as being to enable each one of those vectors to soak
29:10 - Unidentified Speaker
up a meaning that's much more rich and specific than what mere individual words could represent. The network can only process a fixed number of vectors at a time, known as its context size. For GPT-3, it was trained with a context size of 2048, so the data flowing through the network always looks like this array of 2048 columns, each of which has 12,000 dimensions. This context size limits how much text the transformer can incorporate when it's making a prediction of the next word. So the context is the number of tokens? Under consideration? Yes.
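(Editor's note: for concreteness, the array shape the video describes, using GPT-3's published numbers; anything beyond the context size simply doesn't fit.)

```python
# One column of d_embed numbers per token position, up to the context size.
context_size = 2048      # maximum number of tokens attended to at once
d_embed      = 12288     # dimensions per token vector

print((d_embed, context_size))   # shape of the array flowing through the network
print(d_embed * context_size)    # ~25 million numbers per forward pass
# A prompt longer than context_size tokens can't fit; that truncation is the
# "losing the thread" limit discussed below.
```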
29:43 - V.W.
And that's why the Anthropic products with their 100,000-token context lengths are incredibly more useful. And even the Facebook LLM, which is very useful in some respects, has a relatively small context length, which limits its applicability when you're trying to understand a specific idea in the context of a non-training piece of information like a large PDF or an academic paper or something like that.
30:12 - D.B.
Well, I wonder how it handles... I mean, okay, so it has a context of 2000 tokens, or maybe more if it's Claude.ai or whatever it is, but what about the syntactic relationship between those tokens? That makes a difference in terms of...
30:30 - Multiple Speakers
It makes a huge difference in the utility and the scope of the answer, and whether the answer you get is vividly specific to what you're doing.
30:43 - E.G.
Now, a token really isn't a whole word. It's a part of a word; a token is a subset of something that it uses to understand.
30:56 - V.W.
It's a semantic fragment like apostrophe s or pound, bang, ampersand, up arrow for profanity. You know, It's sort of the whole world in terms of four-letter words.
31:10 - Multiple Speakers
Reached would be reach plus past tense.
31:14 - D.B.
Underneath would be under plus neath, probably.
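(Editor's note: a quick way to check such splits, assuming the tiktoken package is installed; the exact pieces are tokenizer-dependent, so the "under" plus "neath" guess above may or may not match.)

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")        # the GPT-2/GPT-3 byte-pair-encoding vocabulary
for word in ["reached", "underneath"]:
    ids = enc.encode(word)                 # token ids for the word
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", ids, pieces)         # shows how the word is split into subword pieces
```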
31:18 - E.G.
When you look at something like LLaMA 3, the 405-billion-parameter one, it has a context length of 128K. That's why I think LLaMA 3 has really changed the whole landscape.
31:34 - V.W.
But the input token length that Llama accepts from the user is relatively small if you compare it with the other LLMs, if I'm not mistaken. This is one thing that surprised me, because I tallied not only the size of their training set but also the size of the context length that the user could provide, and I found it to be a better metric for the utility of the LLM than the size of the training set.
32:03 - E.G.
Are you talking about LLAMA 3? Because it has one of the largest token sizes.
32:09 - V.W.
I will go dig. You know, these things change week to week, too.
32:15 - A.B.
Yeah, right now it changes. So I have a question. So on that visual with the matrix at the attention layer, I don't know if you... Is it possible to go back a few seconds to freeze on that? Because I was wondering. Yeah, I think it was right there. Yeah. So where each token is vertically represented, then as I go down, each of those numbers is reflective of a token that's within one of the adjacent words in the context size. Is that right?
32:55 - E.G.
The adjacent word, I think, is the other layers leading into it. So this would be a token either in each column, because these arrays are literally that. They're multidimensional arrays, and how they change based on the intonation, around the word, not the word itself, but around the word. Because if you just had a context where all words were just in one plane, this array would be too huge for it to manage. So that's why it actually has multiple slices it goes through and does its stuff all in little chunks.
33:50 - A.B.
to feed it through. Right, right.
33:52 - Multiple Speakers
So like, I can't remember what the first word was on the like, where the 5.5 was, but right, so that's a that's a word in the input, right?
34:00 - A.B.
And then this is basically saying down the list, like, you know, whatever the not, I don't know what the, you know, the 9.2, or the 9.8, right, saying, hey, that word based on the context, that's where you need to pay attention. Like, this is the attention layer part that we're, we're looking at now. Right.
34:16 - D.B.
So I thought that the 5.5 word or token, you know, was separated from the 9.8 token by two tokens. Correct.
34:27 - A.B.
And then the 9.8 is another token in this context, but based on the historical training, right, that higher number is what it'll pay attention to, right? But am I thinking about this the right way?
34:48 - E.G.
Are you thinking that these numbers are static? Because these numbers aren't static. These are derived based on other pieces of information.
34:59 - A.B.
Right, right. But OK. So I guess that's what I was trying to get to. It's like, OK, so in the first array, what are these numbers being used for? To predict the next word in the sequence at the very last step, right? So I thought that these were associating, kind of like, hey, this is the token, and then the probabilities here, based on the numbers below, are what it's going to pay attention to. But I'm probably misunderstanding.
35:39 - E.G.
So if you take a look at "king" and you just searched on that word "king" without any other information, it's going to easily be able to identify it, because that's already been trained. So what they're going to do then is add another term to "king": we're going to say "of rock" versus "of England." And it's going to change based on how it goes through, trying to... you want to go from the H.VIII direction to the Elvis direction.
36:14 - Multiple Speakers
Go your example.
36:15 - Unidentified Speaker
Right.
36:16 - E.G.
And that's how it does it.
36:18 - Unidentified Speaker
Artist.
36:19 - Multiple Speakers
I put that graph in the chat and I noticed that it's based on LLaMA 2, which is considerably out of date. So we'll need to update that with more recent figures.
36:33 - D.B.
This looks like this array of 2048 columns, each of which has 12,000 dimensions. This context size limits how much text the transformer can incorporate when it's making a prediction of the next word.
36:48 - Unidentified Speaker
This is why long conversations with certain chatbots, like the early versions of ChatGPT, often gave the feeling of the bot kind of losing the thread of conversation as you continued too long.
37:02 - V.W.
We'll go into that... Huh? Uh, these LLMs forgetting what they were doing, like a doddering old man like me, have remained. You've just gotten a little more... But you can quickly run to the end of even the most recent incarnation of ChatGPT o1 Pro; it'll eventually lose track of what it was doing and you'll have to begin re-educating it. And in the process, you're stepping on the context that was also important. And so there's some maximum complexity that you can get it to do, even if you're paying 200 bucks a month. And also, by the way, waiting up to five minutes per query to get it to bloat out the next logjam. And so the $200 a month is now coming into serious question for me, because if I can only query this thing every five minutes instead of every five seconds, then I'm rate limited no matter how many resources they give me in size of LLM. If it only grunts out a response every five minutes, I'm dead in the water and may as well be using a smaller LLM that runs faster. Example: Grok runs almost instantly compared to any of the GPT series LLMs. Grok being the new Twitter bot that E.M. set up for Twitter, and that I was using because it had the ability to look at sentiment analysis over the whole Twitter posting space. But I digress. Dr.
38:34 - E.G.
Brilliant, do you remember back in the day when we had 1K to work in, and we had to swap out stuff so that we could do our work in it? We'd have to do it in chunks and piece it together. And when we went to 4K, it was, wow, we can do all of this together. This is a manifestation of that. It does not have that type of bucket where it can hold enough information to have that history.
39:05 - D.B.
So like while we used to have, you know, 10K floppy disks and anything bigger than that would have to be broken up.
39:15 - Unidentified Speaker
Exactly.
39:16 - E.G.
So you could draw an analogy, couldn't you?
39:19 - D.B.
Yeah.
39:20 - V.W.
And one of the criticisms of LSTMs, long short-term memories, and RNNs, recurrent neural networks that are used for time series data, is that they highly prioritize recent contextual information but allow far-away information to drop off in importance. And it turns out that's a really big deal in not making the thing very useful. So the LLMs presumably opened up the gates by allowing us to have effectively infinite context lengths, only they turn out not to be; they're not effectively infinite, they're kind of very predictably short. And thus they limit the utility of the LLM. And if you're interested in predicting the future on this, you can gauge a Moore's law for LLMs. The rate is how many times you can say something to it before it forgets what you were talking about and you have to start over. That's the new metric for Moore's law: not how many transistors you can fit, but how many tokens you can fit in a conversation.
40:25 - Unidentified Speaker
All right.
40:26 - D.B.
Well, anything else, anyone? I was just going to comment on what V. said. Oh yeah, go for it. I agree with what he said. That's exactly what we're dealing with now.
40:42 - D.D.
Now that we've kind of made this breakthrough, we're looking at Moore's Law on the LLM. And that's kind of scary, because it's fast.
40:56 - D.B.
Well, is there a limit to the context that would be useful?
41:03 - V.W.
Like all human knowledge about historical documents ever written and burned in the libraries of Alexandria.
41:13 - D.D.
I agree. Yes. The sum of all human data. That is the limit.
41:19 - V.W.
And right now we're just trained with recent data that we believe to be factually correct. But historically, there's a lot of data that could be trained to provide us cultural information that we've lost access to.
41:31 - Multiple Speakers
I mean, it's almost like we've got a senile machine where it could just remember what it was just told a few moments ago, but you come back and it's like, who are you?
41:43 - V.W.
It's like a senile savant. It's like it's smart for a few conversations and then you've got to buy him another cup of coffee at Starbucks and start over.
41:55 - D.B.
All right. Remember, the desired output is a probability distribution over all tokens that might come next. For example, if the very last word is professor and the context includes words like Harry Potter and immediately preceding we see least favorite teacher, and also if you give me some leeway by letting me pretend that tokens simply look
42:18 - Unidentified Speaker
like full words, then a well-trained network that had built up knowledge of Harry Potter would presumably assign a high number to the word Snape. This involves two different steps. The first one is to use another matrix that maps the very last vector in that context to a list of 50,000 values, one for each token in the vocabulary. Then there's a function that normalizes this into a probability distribution. It's called Softmax, and we'll talk more about it in just a second. But before that, it might seem a little bit weird to only use this last embedding to make a prediction, when after all, in that last step, there are thousands of other vectors in the layer just sitting there with their own context-rich meanings. This has to do with the fact that in the training process, it turns out to be much more efficient if you use each one of those vectors in the final layer to simultaneously make a prediction for what would come immediately after it. There's a lot more to be said about training later on, but I just want to call that out right now. This matrix is called the unembedding matrix, and we give it the label WU. Again, like all the weight matrices we see, its entries begin at random, but they are learned during the training process.
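(Editor's note: a minimal sketch of that last step, with sizes shrunk from GPT-3's 50,257 by 12,288 for readability; the matrix here is random rather than trained.)

```python
import numpy as np

# The final context vector is multiplied by the unembedding matrix W_U to get
# one score (logit) per vocabulary token; softmax turns scores into probabilities.
vocab_size, d_embed = 6, 4
rng = np.random.default_rng(0)

W_U = rng.normal(size=(vocab_size, d_embed))   # learned in training; random here
last_vector = rng.normal(size=d_embed)         # context-rich vector for the final position

logits = W_U @ last_vector

def softmax(x):
    x = x - x.max()              # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

probs = softmax(logits)
print(probs, probs.sum())        # a probability distribution summing to 1
```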
43:28 - V.W.
Okay, the unembedding matrix. So now imagine you put a phrase in, and there were two words of equal probability but considerably different meaning. So your answer could have this small perturbation initially that after several cycles could lead you into very different places in the knowledge space for the response. And so this brittleness with respect to small changes in the dot products seems to be pretty important. So not only did computers get viruses, and then LLMs got hallucinations, but now LLMs are forgetting stuff. And so we're slowly reproducing the human condition. And then we have this: the longer the conversation is, the less certain you are what other conversations in parallel universes could have been had if you'd only varied the inflection or tone of your voice slightly.
44:32 - D.B.
All right.
44:33 - E.G.
I agree with what V. said, except for slowly.
44:37 - D.B.
It's in the eye of the beholder, I guess.
44:40 - V.W.
All right.
44:41 - D.B.
Adding to our total parameter count, this unembedding matrix has one row for each word in the vocabulary. And each row has the same number of elements as the embedding dimension.
44:54 - Unidentified Speaker
It's very similar to the embedding matrix, just with the order swapped. So it adds another 617 million parameters to the network, meaning our count so far is a little over a billion. A small, but not wholly insignificant fraction of the 175 billion that we'll end up with in total. As the very last mini-lesson for this chapter, I want to talk more about the softmax... Okay, so I think we're at a... We're just sort of at the chapter...
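(Editor's note: the running tally the video describes, as simple arithmetic using the GPT-3 numbers.)

```python
# Running parameter tally after the unembedding matrix.
embedding    = 50_257 * 12_288     # 617,558,016
unembedding  = 12_288 * 50_257     # same count, with the dimension order swapped
total_so_far = embedding + unembedding
print(f"{total_so_far:,}")                     # 1,235,116,032 (~1.2 billion)
print(f"{total_so_far / 175e9:.2%} of 175B")   # still under 1% of the full model
```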
45:26 - D.B.
So we'll start here, and that comes on the right here.
45:40 - Unidentified Speaker
We're about 22:20. All right. Let's see. Escape.
46:21 - V.W.
We got up to 22:15 again, and then we forgot what we were doing.
46:29 - Multiple Speakers
So, we are our own worst enemy.
46:33 - D.B.
Well, I think one more session. It's only another five minutes or so. We'll get through it next time. All right.
46:44 - Unidentified Speaker
Focus on the chat. Chapter 5 video. All right, anything else anyone wants to mention before we call it a day?
47:07 - D.B.
All right, well, I'm trying to see.
47:16 - R.S.
You said you were going to have some of the grad students give some short presentations, D.?
47:23 - D.B.
Well, they're not here.
47:25 - R.S.
Yeah, I understand when the semester starts. Yeah. Yep. All right. Well, thanks, everyone.
47:31 - D.B.
I'll see you all next time. All right. Thank you.
47:36 - R.S.
Bye now.
47:37 - V.W.
See you guys.
Fri, Jan 3, 2025
4:00 - D.B.
Well, why don't we get started? Welcome back, everyone. Even though most people aren't really back, back. Classes don't start for a couple of weeks. But anyway, this meeting is not even really a university activity, right? It just sort of happens to be a bunch of people mostly affiliated with the university doing it. So we're not linked to a school calendar. Anyway, so there's none of these master's students are here today yet, at least. But they're going to be writing books or equivalent websites on some topic using AIs this semester as their projects. And I thought it would be interesting for them to give us a one-minute update each week and we can talk about it as needed or not. I'm just kind of curious about what the experiences will be when someone tries to write a book using an AI. I know people do it. Anyone else has a project they'd like to help supervise like that? Anything AI? Related, just let me know. I'll put out the word and probably find a student for you. J.'s not here. V., do you have any idea if J.'s going to be coming back here, or is he just...
5:39 - V.W.
J. is an enigma that I greatly appreciate, but I do not fully understand, so I am unable to fathom the depths of that. He was coming to the gym, and that's where I came my office hours for walking around the track and talking about projects and technical endeavors. But I have not seen him lately. Of course, for three weeks, we've all been incognito at the gym. But I had my first workout yesterday. He wasn't there, although a number of the other people that are regulars were there. So, yeah, I'm curious. He's been up to really interesting things. And I really hope for the best. You know, I tried to specifically design a project that would suit him his writing talent, his science fiction interest. And I had hoped that we'd be able to rewrite your Back to the Future book. But he was feeling overcommitted at the time. And I didn't want to really take it on myself and deprive him of an opportunity. I thought what might be most suitable for his background. So that made me sorry. But J. had been through some really rough times lately. And so I wanted to make sure I was giving him the complete benefit of the doubt. And that's all I know.
6:55 - Unidentified Speaker
Okay.
6:56 - D.B.
Well, if he does come back at some point, I'd like to kind of, he had a website with a bunch of like very, very elaborate agent based prompts and I, and they were there and I thought people would be like to hear him and try.
7:11 - Unidentified Speaker
Yeah.
7:11 - Multiple Speakers
He's a rising star for sure. And I think he underestimates his own ability to contribute because now that that we can all program in English, the English majors are making a very serious comeback.
7:23 - V.W.
Yeah. All right. Anything else anyone would like to bring up? Because if not, then we can.
7:30 - D.B.
I mean, we started this chapter five, quote unquote, video a while, it was a while ago, we got up to minute 1550, and we actually did that twice. We we sort of took a big break. It didn't finish it. We looked at part of it, took a long break, came back, redid the first part. So this would be, like, if we were to start from the beginning, it would be the third time. I'm willing to do it, or we could just start from minute 1550 on this video. What do you all think? You want to start from the beginning and redo the
8:09 - V.W.
Maybe we could pick up, if you want, where we left off. Through 3Blue1Brown videos in other areas. He has the most fantastic topology video I've ever seen that completely motivates the discipline, defines it as unique and distinct from other branches of mathematics, like linear algebra, calculus, or category theory. And he's just an incredible illustrator and animator, and someone we could all be inspired by.
8:37 - D.B.
OK, well, this does not look It does it. Skip. Here we go. OK, but I need to unshare, change the sharing parameters to be video friendly, and then reshare again. So unshare, share, optimize for video. Video clip there. Okay, this will hopefully work. All right, and where did I say we were at? We went up to 1550. Right back here.
9:14 - V.W.
Look at the, when you hovered your mouse over the slider there, it shows, those are the number of areas that are replayed in this video. That is a spectacular out of engagement for a video of this type to have. And it shows you just how heavily it's being subscribed to. It's got 139,000 likes, and that's out of probably, what, one and a half million views. So incredible.
9:43 - D.B.
I wonder, you know, why is it that the peaks in this, do y'all see that graph below with the hills and valleys?
9:51 - V.W.
Yeah, the peaks are the times that are most often replayed.
9:55 - D.B.
So this is a plot of engagement versus time.
9:58 - V.W.
And if somebody didn't understand something, or they thought it was very fascinating, they'll go back. And that's the replay map of the whole population of a million some odd people playing this video.
10:08 - D.B.
So if you watch it, and then you go back 10 seconds to catch. Right. And secondly, it adds like two to the peak instead of one or something like that. Right.
10:17 - V.W.
And so it also tells you if you were pressed for time, or you wanted to cover the high points, you could just go over the peaks, and then back as necessary.
10:26 - D.B.
Of course, it's continuity is so good.
10:30 - V.W.
Every word is valuable.
10:33 - D.B.
Well, we're going to add one to everything after 1550. We will be that drop in the bucket. Yeah. All right. So anyway, here we go. That flows through it and go step by step.
10:53 - Unidentified Speaker
There are many different kinds of models that you can build using transformers. The point is, it looks like during training the model found it advantageous to choose embeddings such that one direction in this space encodes gender information. There's one guitar pattern that's not talked about on YouTube, and none of the pros will tell you about it, but they... Another example is that if you take the embedding of Italy and you subtract embedding of Germany, and then you add that to the embedding of Hitler, you get something very close to the embedding of Mussolini. Any comments on that? The male, female, or this one?
11:33 - V.W.
It's just the semantic directions have been a real great new way to think for me. That in these very high dimensional spaces, there are these multiplicity of directions that have to do with, in this case, dictator-ness or Axis powers, wartuneness, et cetera.
11:50 - D.B.
I find the concept that they use embeddings, just that foundation, helps me either understand or maybe mistakenly think about how good that is for doing things like asking it to find the word for X. You say, well, I need a word that expresses this kind of nuance, blah, blah, blah. You give it a paragraph, and your paragraph hopefully will embed it exactly the same point as that word. And if you ask it to find the word, it'll find the word.
12:27 - V.W.
It absolutely excels it that I was having to name a weird part on this aircraft I'm building to my right, and I had to really rack my brain. And then I went to it and it gave me a very appropriate, generally understood industrial term for the specific part that I was working with. And so that's very good when you're communicating with other people about what it is you're up to. And so I also think that this directionality, sorry, E., is that we're getting the emergent properties from the fact that we all have all these semantic directions. We're in a semantic vector space where meanings of different things point in all kinds of different directions. So that causes the emergent properties as a side effect, which we find the most enjoyable about LLM.
13:14 - E.G.
Here we're just looking at binary interpretations. When we start looking at things like prints, it'll pull out like the monarchy in England. But then you say musical artist, it'll pull out something else. But it is nice how it tends to adapt to that. So that way it almost has an annealing process to say, OK, as we get more information, the annealing process will hone in on exactly what you're looking for.
13:48 - V.W.
Yeah, like, show me the monarchs of artificial intelligence. Show me the sultans of swing. You know, these kind of directions that we immediately assent to as people is new for us to have as a reasoning tool, you know, in an automated sense. Okay. Anyone else?
14:11 - D.B.
It's as if the model learned to associate some directions with Italian-ness, and others with World War II Axis leaders.
14:18 - Unidentified Speaker
Maybe my favorite example in this vein is how in some models, if you take the difference between Germany and Japan, and you add it to sushi, you end up very close to Broadwurst.
14:29 - V.W.
Also, in playing this game of finding nearest neighbors, I was very pleased to see how close Cat was to both Beast and Monster. One bit of mathematical intuition that's helpful to have in mind, especially for the next chapter, is how the dot product of two vectors can be thought of as a way to measure how well they align.
14:50 - Unidentified Speaker
Computationally, dot products involve multiplying all the corresponding components and then adding the results, which is good since so much of our computation has to look like weighted sums. Geometrically, the dot product is positive when vectors point in similar directions, it's zero if they're perpendicular, and it's negative whenever they point opposite directions. For example, let's say you were playing with this model and you hypothesize that the embedding of cats minus cat might represent a sort of plurality direction in this space. To test this, I'm going to take this vector and compute its dot product against the embeddings of certain singular nouns and compare it to the dot products with the corresponding plural nouns. If you play around with this, you'll notice that the plural ones do indeed seem to consistently give higher values than the singular ones.
15:39 - D.B.
Okay, I don't get that. Why would the plural ones give higher values? Or does anyone else have any thoughts on this? Indicating that they align It's also fun how if you take this dot product with the embeddings of the words 1, 2, 3, and so on, they give increasing values, so it's as if we can quantitatively measure how plural the
16:17 - Unidentified Speaker
model finds a given word.
16:18 - V.W.
Again, the specifics for how words get embedded is learned using data.
16:22 - Unidentified Speaker
This embedding matrix, whose columns tell us what happens to each word, is the first pile of weights in our model. And using the GPT-3 numbers, the vocabulary size specifically is 50,257, And again, technically this consists not of words per se, but of tokens. And the embedding dimension is 12,288. Multiplying those tells us this consists of about 617 million weights. Let's go ahead and add this to a running tally, remembering that by the end we should count up to 175 billion.
16:52 - D.B.
Any thoughts or comments?
16:59 - D.B.
So in those tables, the 12,288 is what, rows? And the 50,000 is columns? Anyway, it makes a matrix. And for embedding, it seems like 12,000 is a little low if you're trying to be able to have a separate value for every meaning that could exist.
17:34 - Unidentified Speaker
In the case of transformers, you really want to think of the vectors in this embedding space as not merely representing individual words. For one thing, they also encode information about the position of that word, which we'll talk about later. But more importantly, you should think of them as having the capacity to soak in context. A vector that started its life as the embedding of the word king, for example, might progressively get tugged and pulled by various blocks in this network so that by the end it points in a much more specific and nuanced direction that somehow encodes that it was a king who lived in Scotland and who had achieved his post after murdering the previous king.
18:19 - D.B.
So that suggests that all those king locations are on the same King... Monarchy axis or something. You know, an axis, like if it's got 12,000, I mean, is it 12,000 axes then?
18:38 - E.G.
Yeah, but I think king in and of itself would have one axis. And as you start adding another Symantic layer as you saw the different layers. King would be one they geography in Scotland would send it in a different direction.
19:08 - Multiple Speakers
OK, so if you if you go back, you'll actually see those layers.
19:17 - E.G.
Basically. Augment. Yeah.
19:19 - V.W.
The more tokens you have, the more you have this notion of this notionality space. And the more words you have, the more specific you're being about the notion you're trying to represent. And therefore, it's likely as you become more specific, there will be fewer other things connected to that. So you will eventually converge as your phrase increases in the number of tokens to more and more specific things. But then that's by the fact that some very wordy types of phrases can refer themselves to lots of things. Like all the molecules that wiggle is still a pretty big set.
19:57 - D.B.
Well, so all these different kinds of kings, you know, the one that lived in Scotland, the one that had a murder predecessor, the one that's, you know, lived in France, the one that lived in England, right? They're all going to be pretty high. They're all probably have the same value almost on the king axis. But they'll have way different values on the Scotland axis versus the France axis and so on.
20:23 - Unidentified Speaker
Right.
20:24 - V.W.
There's the king notion that lives in the dictionary, like the Oxford dictionary, and there's the king notion that lives in the Wikipedia for H.VIII. And the H.VIII notion of a king has associated with it these geographic and time and rise to power associations, while the king that lives in the dictionary has only the limited scope of title and role.
20:51 - M.M.
Remember for all of this, you have to have all of this content in the training. So if you don't have it in the training, you cannot have it in this vector space. Right.
21:05 - V.W.
And those vectors won't tug on each other because they won't be there.
21:10 - M.M.
Exactly. So everything is related. With the training of whatever content you have it, and embedding is just part. But I actually sent you the small videos of simple embedding of three-word sentence, five-word sentence, because this is complicated. This is very huge multidimensional space.
21:31 - D.B.
Yeah, this one doesn't show that all these different kings here, there's four of them, are actually in the same place. On the king axis.
21:42 - A.B.
Yeah.
21:42 - V.W.
In fact, this is pejorative because he's showing this in a three-dimensional space when it's really like a 27,000 dimensional space.
21:50 - M.M.
Exactly.
21:50 - V.W.
And it's helpful for us to think of these vector displacements in a geometric world that we live in, but the actual thing that's going on is much richer and more complex.
22:02 - M.M.
Yeah. If it's trained with all of this content. All right.
22:06 - D.B.
Well, V., what about, or anybody, What about a number of years ago, there was WordNet, which was an attempt by a guy named M., a very famous cognitive psychologist. He called it WordNet, which was basically a neural network or a semantic network of words.
22:28 - V.W.
That was like the precursor to BERT, wasn't it?
22:33 - M.M.
Exactly. Simple neural networks can do this job. I show you the videos with this simple multi-layer perceptron is doing this job.
22:44 - D.B.
Yes, you're correct. Could they incorporate WordNet into this system? Another way to state that would be is that WordNet is an ancestor of this system.
22:57 - V.W.
So it wouldn't be a matter of incorporating it, it would be a matter that this thing we're looking at evolved from WordNet and BERT and the early large language model attempts and culminated with the transformer that was originally designed simply to be a good translator between words and phrases and meanings. And then these emergent meanings popped out and lo and behold, the AI winter had thawed.
23:26 - D.B.
Well, couldn't it be that, I mean, isn't it sort of reinventing the wheel to sort of throw away which was elaborately generated partly by hand, I would guess, in favor of just sort of brute force processing of text from the net.
23:44 - Multiple Speakers
Who says they threw it away? Maybe there's still pieces of the WordNet and BERT code in what we see now.
23:53 - M.M.
It is inside, but you add the positioning coding, you add attention mechanism, you add some things. But the main idea, like you mentioned, is war net. It is. It is. All right.
24:07 - D.B.
So, here's another related question. So, there's this effort, PSYCH, C-Y-C, which was D.L.'s lifetime project.
24:14 - V.W.
He died a few years ago, it turns out.
24:18 - D.B.
But he was trying to get a- Codify all human knowledge.
24:23 - Multiple Speakers
Well, he was trying to come up with a database of common sense, physical, maybe not just physical, including common sense, physical knowledge. And the DOD was all over him to, you know, let us use this for our intelligence purposes.
24:39 - Unidentified Speaker
Yeah.
24:40 - D.B.
So why not? Since it was so elaborately developed by hand, you know, people were curating this database of common-sense facts, you know, like: a smaller cup can fit in a bigger cup. Okay. That's common-sense physical knowledge.
24:57 - V.W.
T.W., blocks-world type things.
24:59 - D.B.
But on steroids, because T.W. was a micro-worlds guy, as everyone was in those days. And he was trying to scale up, but he didn't realize that you had to scale up like this. So he did it by hand. His people did it by hand. But couldn't you take his database and kind of combine it with something like this to get an LLM with common-sense physical knowledge? That'd be better.
25:26 - V.W.
I would prefer to look at it as one LLM eating another, like a big fish eats a smaller fish. And the little fish is inside the big fish. And we go from very specialized fish to very big, multi-capable fish.
25:40 - D.B.
Well, how do you get an LLM or ChatGPT to be the big fish that eats Cyc?
25:46 - V.W.
So you could take Cyc and you could run it side by side with ChatGPT 4.0 or Pro or whatever, and you could see how they perform, and are there things that the smaller, earlier model does worse or better or the same as the newer one? That's a pretty interesting thought.
26:04 - D.B.
So you could take a query and somehow expand it into a query for ChatGPT and another related, you know, a parallel query for...
26:12 - V.W.
Right. And you could race them and you could say...
26:15 - Multiple Speakers
I got another one for you. When you take a look at how we learn things, we build on what's been built before.
26:24 - E.G.
We synthesize what is useful for us. We incorporate that and we throw away what isn't.
26:35 - D.B.
Aren't you explaining what these models are doing?
26:40 - V.W.
And there was another one for antibiotics that was really good at understanding antibiotic resistance in context. It was these domain-specific expert systems, as they were termed at the time, that did a very strict subset of the kind of thing we're doing now, that we would just openly engage in a chat about.
27:06 - D.B.
All right. Anyone else have a comment? All right, we'll continue. And who's being described in Shakespearean language?
27:15 - Unidentified Speaker
Think about your own understanding of a given word, the meaning of that word is clearly informed by the surroundings. And sometimes this includes context from a long distance away. So in putting together a model that has the ability to predict what word comes next, the goal is to somehow empower it to incorporate context efficiently. To be clear, in that very first step, when you create the array of vectors based on the input text, each one of those is simply plucked out of the embedding matrix. So initially, each one can only encode the meaning of a single word without any input from its surroundings.
27:48 - D.B.
Each of those vertical vectors is a list of how far each word is on each of the dimensions of the space, right?
27:56 - V.W.
So imagine you were trying to send a message to the past so that the winter of AI would have never occurred. And you could only send it in four words, this message to tell M.M. to rewrite the Perceptron's book, to not be so mean about XOR gates.
28:14 - D.B.
And so what you would say is predict the next word.
28:19 - V.W.
It's like Fourier told us to look in the frequency domain instead of the time domain. And modern times have said, all you need to do is try to predict the next word and you will naturally arrive at this kind of thing. Because this predicting the next word is a total subset, an insignificant subset, of the actual leverage we've obtained by trying to do it. Because it turns out if you try to predict the next word, you end up
28:51 - D.B.
enabling thinking and emergent properties to occur, and who would have thunk it, anybody else, right? But you should think of the primary goal of this network that it flows through as being to enable each one of those vectors to soak
29:10 - Unidentified Speaker
up a meaning that's much more rich and specific than what mere individual words could represent. The network can only process a fixed number of vectors at a time, known as its context size. For GPT-3, it was trained with a context size of 2048, so the data flowing through the network always looks like this array of 2048 columns, each of which has 12,000 dimensions. This context size limits how much text the transformer can incorporate when it's making a prediction of the next word. So the context is the number of tokens? Under consideration? Yes.
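For concreteness, here is a minimal numpy sketch of the step just described: pluck one vector per input token out of an embedding matrix and keep only what fits in a fixed context window. The sizes below are toy values chosen for illustration, with GPT-3's real figures noted only in the comments.

    import numpy as np

    # Toy sizes; GPT-3 itself uses a ~50,000-token vocabulary,
    # 12,288-dimensional embeddings, and a 2048-token context window.
    vocab_size, d_model, context_size = 1_000, 64, 16

    rng = np.random.default_rng(0)
    W_E = rng.normal(size=(vocab_size, d_model))       # the embedding matrix

    token_ids = rng.integers(0, vocab_size, size=40)   # pretend input text, longer than the window
    token_ids = token_ids[-context_size:]              # only the most recent tokens fit

    X = W_E[token_ids]   # one vector per token, "plucked out" with no context mixed in yet
    print(X.shape)       # (16, 64) here; the 2048-column array described in the video for GPT-3

Everything after this step (attention, MLP layers) exists to let those initially context-free vectors influence one another.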
29:43 - V.W.
And that's why the Anthropic products, with their 100,000-token context lengths, are incredibly more useful. And even the Facebook LLM, which is very useful in some respects, has a relatively small context length, which limits its applicability when you're trying to understand a specific idea in the context of a non-training piece of information like a large PDF or an academic paper or something like that.
30:12 - D.B.
Well, I wonder how it handles... I mean, okay, so it has a context of 2000 tokens, or maybe more if it's Claude.ai or whatever it is, but what about the syntactic relationship between those tokens? That makes a difference in terms of...
30:30 - Multiple Speakers
It makes a huge difference in the utility and the scope of the answer, and whether the answer you get is vividly specific to what you're doing.
30:43 - E.G.
Now, a token really isn't a whole word. It's a part of a word; a token is a subset of something that it uses to understand.
30:56 - V.W.
It's a semantic fragment, like apostrophe-s, or pound, bang, ampersand, up arrow for profanity. You know, it's sort of the whole world in terms of four-letter words.
31:10 - Multiple Speakers
Reached would be reach plus past tense.
31:14 - D.B.
Underneath would be under plus neath, probably.
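A quick way to see this subword behavior is with the tiktoken library (an assumption here that it is installed; any BPE tokenizer would do). The exact splits depend on the tokenizer's vocabulary, so the output is illustrative rather than a claim about any particular model.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")      # a real tokenizer encoding name
    for word in ["reached", "underneath", "persimmon tree"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]     # the subword fragments the model actually sees
        print(word, "->", pieces)

Common words often come out as a single token, while rarer ones get broken into a few reusable fragments.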
31:18 - E.G.
When you look at something like LLaMA 3, the 405-billion-parameter one, it has a context length of 128K. That's why I think LLaMA 3 has really changed the whole landscape.
31:34 - V.W.
But the input token length that LLaMA accepts from the user is relatively small if you compare it with the other LLMs, if I'm not mistaken. This is one thing that surprised me, because I tallied not only the size of their training set but also the size of the context length that the user could provide, and I found it to be a better metric for the utility of the LLM than necessarily the size of the training set.
32:03 - E.G.
Are you talking about LLaMA 3? Because it has one of the largest token sizes.
32:09 - V.W.
I will go dig. You know, these things change week to week, too.
32:15 - A.B.
Yeah, right now it changes. So I have a question. So on that visual with the matrix at the attention layer, I don't know if you... Is it possible to go back a few seconds to freeze on that? Because I was wondering. Yeah, I think it was right there. Yeah. So where each token is vertically represented, then as I go down, each of those numbers is reflective of a token that's within one of the adjacent words in the context size. Is that right?
32:55 - E.G.
The adjacent word, I think, is the other layers leading into it. So this would be a token in each column, because these arrays are literally that. They're multidimensional arrays, and they change based on the information around the word, not the word itself. Because if you just had a context where all words were just in one plane, this array would be too huge for it to manage. So that's why it actually has multiple slices it goes through and does its stuff all in little chunks.
33:50 - A.B.
to feed it through. Right, right.
33:52 - Multiple Speakers
So, like, I can't remember what the first word was, the one where the 5.5 was, but right, so that's a word in the input, right?
34:00 - A.B.
And then this is basically saying, down the list, like, you know, whatever the... I don't know, the 9.2 or the 9.8, right, saying: hey, that word, based on the context, that's where you need to pay attention. Like, this is the attention layer part that we're looking at now. Right.
34:16 - D.B.
So I thought that the 5.5 word or token, you know, was separated from the 9.8 token by two tokens. Correct.
34:27 - A.B.
And then, but the 9.8 is another token in this context, but based on the historical training, right, that's where, like, that higher number is what it'll pay attention to, right? Am I thinking about this the right way?
34:48 - E.G.
Are you thinking that these numbers are static? Because these numbers aren't static. These are derived based on other pieces of information.
34:59 - A.B.
Right, right. But OK. So I guess that's what I was trying to get to. It's like, OK, so in the first array of tokens, what are these numbers being used for? To predict, like, the very last... like, to predict the next word in the sequence at the very end, right? So I thought that these were associating, kind of like, hey, this is the token, and then the numbers below here are not probabilities; this is what it's going to pay attention to. But I'm probably misunderstanding.
35:39 - E.G.
So if you take a look at King, and you just searched on that word King without any other information, it's going to easily be able to identify it, because that's already been pre-trained. So what they're going to do then is add another term to King, and we're going to say: of rock. Versus: of England. And it's going to change based on how it goes through, trying to... you want to go from the H.VIII direction to the Elvis direction.
36:14 - Multiple Speakers
Go your example.
36:15 - Unidentified Speaker
Right.
36:16 - E.G.
And that's how it does it.
36:18 - Unidentified Speaker
Artist.
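To make the numbers in that attention matrix less mysterious, here is a minimal single-head attention sketch in numpy. The vectors and weight matrices are made up (random), not taken from any real model; the point is only that the softmaxed dot products between queries and keys play the role of the "where to pay attention" scores being discussed, and that they are computed on the fly rather than being static.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    tokens = ["king", "of", "England"]
    E = np.array([[0.9, 0.1, 0.0, 0.0],    # toy embedding for "king"
                  [0.0, 0.0, 0.1, 0.0],    # "of"
                  [0.2, 0.8, 0.0, 0.3]])   # "England"

    rng = np.random.default_rng(1)
    W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))   # random stand-ins for learned weights

    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])            # raw scores, like the 5.5 / 9.8 numbers on screen
    weights = np.apply_along_axis(softmax, 1, scores) # each row normalized to sum to 1
    for t, row in zip(tokens, np.round(weights, 2)):
        print(t, "attends to", dict(zip(tokens, row)))
    updated = weights @ V                             # "king" nudged toward its England-flavored meaning

Change the third token from "England" to "rock" and the attention weights, and therefore the updated "king" vector, come out different.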
36:19 - Multiple Speakers
I put that graph in the chat and I noticed that it's based on LLaMA 2, which is considerably out of date. So we'll need to update that with more recent figures.
36:33 - D.B.
This looks like this array of 2048 columns, each of which has 12,000 dimensions. This context size limits how much text the transformer can incorporate when it's making a prediction of the next word.
36:48 - Unidentified Speaker
This is why long conversations with certain chatbots, like the early versions of ChatGPT, often gave the feeling of the bot kind of losing the thread of conversation as you continued too long.
37:02 - V.W.
That has remained over time, but skipping ahead: these LLMs forgetting what they were doing, like a doddering old man like me, have remained. You've just gotten a little more... but you can quickly run to the end of even the most recent incarnation, ChatGPT o1 Pro; it'll eventually lose track of what it was doing and you'll have to begin reeducating it. And in the process, you're stepping on the context that was also important. And so there's some maximum complexity that you can get it to do, even if you're paying 200 bucks a month. And also, by the way, waiting up to five minutes per query to get it to bloat out the next logjam. And so the $200 a month is now coming into serious question for me, because if I can only query this thing every five minutes instead of every five seconds, then I'm rate limited no matter how many resources they give me in terms of size of LLM. If it only grunts out a response every five minutes, I'm dead in the water and may as well be using a smaller LLM that runs faster. Example: Grok runs almost instantly compared to any of the GPT-series LLMs, Grok being the new Twitter bot that E.M. set up for Twitter, and that I was using because it had the ability to look at sentiment analysis over the whole Twitter posting space. But I digress.
38:34 - E.G.
D., do you remember back in the day when we had 1K to work in and we had to swap stuff out so that we could do our work in it? We'd have to do it in chunks and piece it together. And when we went to 4K, it was, wow, we can do all of this together. This is a manifestation of that. It does not have that type of bucket where it can hold enough information to have that history.
39:05 - D.B.
So like while we used to have, you know, 10K floppy disks and anything bigger than that would have to be broken up.
39:15 - Unidentified Speaker
Exactly.
39:16 - E.G.
So you could draw an analogy, couldn't you?
39:19 - D.B.
Yeah.
39:20 - V.W.
And one of the criticisms of LSTMs, long short-term memories, and RNNs, recurrent neural networks that are used for time-series data, is that they highly prioritize recent contextual information but allow far-away information to drop off in importance. And it turns out that's a really big deal in making the thing not very useful. So the LLMs presumably opened up the gates by allowing us to have effectively infinite context lengths, only it turns out they're not effectively infinite; they're kind of very predictably short. And thus they limit the utility of the LLM. And if you're interested in predicting the future on this, you can gauge a Moore's law for LLMs. The rate is how many times you can say something to it before it forgets what you were talking about and you have to start over. That's the new metric for Moore's law: not how many transistors you can fit, but how many tokens you can fit in a conversation.
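A hedged sketch of why long chats "lose the thread": a hypothetical helper (not any vendor's actual API) that keeps only the most recent turns that fit in a fixed token budget, which is roughly what a hard context-length limit forces once the conversation outgrows the window.

    def fit_to_context(messages, max_tokens, count_tokens=lambda m: len(m.split())):
        """Keep only the newest messages that fit in a fixed token budget."""
        kept, used = [], 0
        for msg in reversed(messages):      # walk from newest to oldest
            cost = count_tokens(msg)        # crude word count as a stand-in for real tokenization
            if used + cost > max_tokens:
                break                       # everything older than this is effectively "forgotten"
            kept.append(msg)
            used += cost
        return list(reversed(kept))

    history = ["turn {}: some earlier remark".format(i) for i in range(1, 101)]
    print(fit_to_context(history, max_tokens=30))   # only the last few turns survive

Growing max_tokens is exactly the "Moore's law for LLMs" metric being proposed: how much conversation fits before the oldest turns fall off the end.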
40:25 - Unidentified Speaker
All right.
40:26 - D.B.
Well, anything else, anyone? I was just going to comment on what V. said. Oh yeah, go for it. I agree with what he said. That's exactly what we're dealing with now.
40:42 - D.D.
Now that we've kind of made this breakthrough, we're looking at Moore's Law on the LLM. And that's kind of scary, because it's fast.
40:56 - D.B.
Well, is there a limit to the context that would be useful?
41:03 - V.W.
Like all human knowledge about historical documents ever written and burned in the libraries of Alexandria.
41:13 - D.D.
I agree. Yes. The sum of all human data. That is the limit.
41:19 - V.W.
And right now we're just trained with recent data that we believe to be factually correct. But historically, there's a lot of data that could be trained to provide us cultural information that we've lost access to.
41:31 - Multiple Speakers
I mean, it's almost like we've got a senile machine where it could just remember what it was just told a few moments ago, but you come back and it's like, who are you?
41:43 - V.W.
It's like a senile savant. It's like it's smart for a few conversations and then you've got to buy him another cup of coffee at Starbucks and start over.
41:55 - D.B.
All right. Remember, the desired output is a probability distribution over all tokens that might come next. For example, if the very last word is professor and the context includes words like Harry Potter and immediately preceding we see least favorite teacher, and also if you give me some leeway by letting me pretend that tokens simply look
42:18 - Unidentified Speaker
like full words, then a well-trained network that had built up knowledge of Harry Potter would presumably assign a high number to the word Snape. This involves two different steps. The first one is to use another matrix that maps the very last vector in that context to a list of 50,000 values, one for each token in the vocabulary. Then there's a function that normalizes this into a probability distribution. It's called Softmax, and we'll talk more about it in just a second. But before that, it might seem a little bit weird to only use this last embedding to make a prediction, when after all, in that last step, there are thousands of other vectors in the layer just sitting there with their own context-rich meanings. This has to do with the fact that in the training process, it turns out to be much more efficient if you use each one of those vectors in the final layer to simultaneously make a prediction for what would come immediately after it. There's a lot more to be said about training later on, but I just want to call that out right now. This matrix is called the unembedding matrix, and we give it the label WU. Again, like all the weight matrices we see, its entries begin at random, but they are learned during the training process.
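As a minimal sketch of that last step, the numpy snippet below maps a final context vector through an unembedding matrix to one score per vocabulary token and then softmaxes the scores into a probability distribution. Sizes are toy values and the weights are random stand-ins, not trained parameters.

    import numpy as np

    def softmax(logits):
        e = np.exp(logits - logits.max())
        return e / e.sum()

    d_model, vocab_size = 8, 20                     # toy; roughly 12,288 and ~50,000 for GPT-3
    rng = np.random.default_rng(2)
    W_U = rng.normal(size=(vocab_size, d_model))    # the unembedding matrix, learned in real training

    last_vector = rng.normal(size=d_model)          # context-rich vector at the final position
    logits = W_U @ last_vector                      # one raw score per vocabulary token
    probs = softmax(logits)                         # normalized into a probability distribution
    print(probs.argmax(), round(float(probs.max()), 3), round(float(probs.sum()), 3))

In the Harry Potter example, a trained W_U and a good final vector would push the probability mass onto the "Snape" token.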
43:28 - V.W.
Okay, the unembedding matrix. So now imagine you put a phrase in, and there were two words of equal probability but considerably different meaning. So your answer could have this small perturbation initially that, after several cycles, could lead you into very different places in the knowledge space for the response. And so this brittleness with respect to small changes in the dot products seems to be pretty important. So not only did computers get viruses, and then LLMs got hallucinations, and now LLMs are forgetting stuff, so we're slowly reproducing the human condition. And then we have this: the longer the conversation is, the less certain you are what other conversations in parallel universes could have been had if you'd only varied the inflection or tone of your voice slightly.
44:32 - D.B.
All right.
44:33 - E.G.
I agree with what V. said, except for slowly.
44:37 - D.B.
It's in the eye of the beholder, I guess.
44:40 - V.W.
All right.
44:41 - D.B.
Keeping score on our total parameter count, this unembedding matrix has one row for each word in the vocabulary. And each row has the same number of elements as the embedding dimension.
44:54 - Unidentified Speaker
It's very similar to the embedding matrix, just with the order swapped. So it adds another 617 million parameters to the network, meaning our count so far is a little over a billion. A small, but not wholly insignificant fraction of the 175 billion that we'll end up with in total. As the very last mini-lesson for this chapter, I want to talk more about the softmax... Okay, so I think we're at a... We're just sort of at the chapter...
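A quick check of that parameter bookkeeping, using commonly cited GPT-3 figures as an assumption rather than exact model specs:

    vocab_size, d_model = 50_257, 12_288        # assumed GPT-3 vocabulary size and embedding dimension
    unembedding_params = vocab_size * d_model
    print(f"{unembedding_params:,}")            # 617,558,016 -- roughly the 617 million in the video
    print(f"{2 * unembedding_params:,}")        # embedding + unembedding: a little over a billion so far

That leaves the remaining ~174 billion of GPT-3's 175 billion parameters to the attention and MLP blocks covered in the later chapters.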
45:26 - D.B.
So we'll start here, and that comes on the right here.
45:40 - Unidentified Speaker
We're about 22:20. All right. Let's see. Escape.
46:21 - V.W.
We got up to 22:15 again, and then we forgot what we were doing.
46:29 - Multiple Speakers
So, we are our own worst enemy.
46:33 - D.B.
Well, I think one more session. It's only another five minutes or so. We'll get through it next time. All right.
46:44 - Unidentified Speaker
Focus on the chat. Chapter 5 video. All right, anything else anyone wants to mention before we call it a day?
47:07 - D.B.
All right, well, I'm trying to see.
47:16 - R.S.
You said you were going to have some of the grad students give some short presentations, D.?
47:23 - D.B.
Well, they're not here.
47:25 - R.S.
Yeah, I understand when the semester starts. Yeah. Yep. All right. Well, thanks, everyone.
47:31 - D.B.
I'll see you all next time. All right. Thank you.
47:36 - R.S.
Bye now.
47:37 - V.W.
See you guys.