AI Discussion Group
Fri, Jun 13, 2025
0:30 - D. B.
Hi, everyone. Happy Friday.
0:37 - E. G.
Ain't that the truth.
1:46 - D. B.
All right, well, let's see what we got for today.
2:13 - Unidentified Speaker
So if anyone's interested, on Wednesday, there will be a Zoom meeting.
2:18 - D. B.
It's about best practices for crafting prompts, and then practicing them. So yeah, it looks like a little workshop. It's not going to be, you know... these are not high-tech folks, they're just users. So it should be kind of a gentle introduction. I might go. Anyone planning on going?
2:42 - Unidentified Speaker
OK.
2:49 - D. B.
And again, if anyone has an idea for a master's student project in AI, just let me know and we can have the student tell us how it goes each week. And similarly, if you have any questions you'd like the group to discuss, let me know. And Y. talked about leading a discussion at some point about expanding to GroupMe or something like that.
3:28 - R. S.
Can you move that back up so I can see the URL for that meeting on Wednesday?
3:33 - E. G.
Hang on a second.
3:44 - Unidentified Speaker
Okay, what was your question earlier?
4:01 - R. S.
Can you just move that up, your window up there so I can write down the URL for the Zoom meeting on Wednesday?
4:18 - Unidentified Speaker
Okay, just hold on a minute. Got it.
4:24 - R. S.
8 7 2 0 0 1. 18, 90, 42. Okay. All right. One, one, not two ones. Okay.
4:35 - D. B.
I can barely hear you. I don't know if it's my speaker or something. I'm in front of my speaker.
4:46 - R. S.
Can you hear me better now? Yeah.
4:50 - D. B.#+#R. R.
You could actually just copy paste and put it in the chat if, to be accurate, Dr.
4:59 - D. B.
Burleigh? Yeah, I could.
5:01 - R. R.
And also anybody can get to these minutes just online.
5:07 - R. S.
That's correct. Yeah. You can put it in the chat. Okay, got it.
5:25 - Unidentified Speaker
Okay.
5:26 - D. B.
And also, if anyone reads an article, they can tell us about it.
5:33 - D. B.#+#E. G.
Anyone read an article recently? Yeah. I was reading about recursive AI, actually.
5:40 - E. G.
Recursive AI at that point is our singularity, where AI is actually rewriting itself faster than developers can, and it can get out of hand fast. So I was just following up on that. I think that would be a great discussion point, because when we talk about AI singularity or recursive AI at that point, we're talking about the genesis of AI outside of human control.
6:15 - D. B.
Yeah. I hadn't heard it called recursive AI before. The concept's been around for a while, sort of self-improving AI. You know, if we can make something smarter than ourselves, then why couldn't it make something smarter than itself? And then you get into a spiral, a sort of self-reinforcing spiral. A singularity at that point. That's the singularity, yeah. If you can find an article, a specific article or something like that, I'd be happy to put it in the list of items. If I can find it. Yeah, you can just send it to me.
6:50 - E. G.
It just came up on my news feed.
6:52 - R. R.
Is that from S. A. that he wrote on June 10th? I believe.
6:57 - E. G.
S. was involved in it. OK, yeah, but I don't know if it was from him.
7:02 - D. B.#+#E. G.
Maybe we have the same news feed or something, because I saw he came out with some kind of a statement about that. But you know, he's not.
7:11 - Unidentified Speaker
He's not.
7:12 - D. B.
You know, the guru on this topic. And it's been out there for decades. I mean, have you ever heard of the science fiction writer V. V.?
7:28 - D. B.#+#E. G.
Yeah.
7:28 - D. B.
He wrote an essay about it in the 90s saying this might happen.
7:33 - Unidentified Speaker
And he wasn't even the first. There was a statistician, I think, named Goode.
7:40 - D. B.
Forgot his first name. His last name was Good, and he wrote about it even earlier than that, the singularity. But now maybe it's going to happen, you know?
7:53 - Unidentified Speaker
OK, I got an email.
7:55 - D. B.
And the only thing I can suggest is just let's just read it. So I'll let you read it. And then we can talk about it.
9:04 - Unidentified Speaker
Any thoughts or comments?
9:50 - D. B.
Well, all right. Any comments? So anyway, I thought it sounded intriguing, you know, like give it a paper and it writes code to do something like that.
10:04 - Unidentified Speaker
I didn't quite understand it. But anyway, I got in touch with this guy.
10:10 - D. B.
He sent me this email. And I'm going to meet with him on Monday by Zoom. And I thought maybe he'd be willing to schedule a demo for this group.
10:22 - R. R.
That'd be awesome. Yeah. Be cool to see.
10:25 - Unidentified Speaker
Yeah.
10:26 - D. B.
The problem is, when I went to his calendar where you have to sign up to talk to him, I realized that his slots were from 11 p.m. to 11 a.m. So I think he's probably not in the United States. Either he's got really weird hours or he's not based in the United States at all. And of course, our meeting time is not between 11 p.m. and 11 a.m. So I don't know. I'm going to talk to him on Monday. We'll see. If he doesn't want to do a demo at 4 p.m. Central time, maybe he's in Australia or something, I was thinking of scheduling one in the morning and, you know, just letting anybody who wants to go to it, including us, but it wouldn't be during our regular meeting time. I could send it out to the faculty of the university. Maybe people out there want to attend. Meanwhile, if anyone wants to meet with him individually, or anything like that, you can. So anyway, I'd like to see what he's got to offer. I don't quite understand the email very well, but I'd like to see the demo.
11:42 - E. G.
Yeah, I'd like to see the context parameters for the papers for it to generate code.
11:48 - D. B.
Yeah, I mean, he's got some kind of a pipeline that's pretty interesting. I'd just like to see what it's all about.
11:56 - R. S.
Yeah, so I really don't understand it. Trying to put the papers that we wrote into some pipeline to reproduce other people's work, to try out new ideas? I don't understand.
12:11 - D. B.
What does that mean? That's what it looks like to me. He turns research papers into executable code. That's what he says.
12:21 - Unidentified Speaker
I don't get it. But you upload the PDF, and everything else is automatic.
12:29 - E. G.
And that's why I want to ask him what type of papers, what the context of the papers is, because not all papers can produce code. Well, that's true. He also says something about...
12:44 - D. B.
Especially in finance, he says.
12:46 - R. R.
Finance, yeah. Maybe he has trained his engine to do that. It'll be nice to have a demo. It would. He did specifically mention the paper that I was involved in on Moore's law.
13:06 - Unidentified Speaker
So that's not a finance paper for sure.
13:16 - R. R.
Well, you know, I don't, I don't understand what he's doing.
13:20 - R. S.
Is he making a repository of other people's work in, in similar areas? I don't know.
13:25 - D. B.
I kind of wish he could do the demo at four o'clock, but like I said, if it's in the middle of the night for him, it won't work that way. But I'll just meet with him on Monday, see what he's got, see what I think, and then get back to you like next week or something.
13:50 - R. S.
In other words, he's out of the country.
13:54 - D. B.
He's like in England or something.
13:56 - R. S.
Well, I don't know where he is or maybe he just has odd hours, you know.
14:02 - D. D.
I don't know. It says cheers. So to me, that's Down Under. What's that website there? The demo is right there. Look here at cal dot... That's a calendar for setting up a demo.
14:18 - E. G.
That's just a calendar? Yeah.
14:22 - D. B.
This is the website.
14:25 - R. S.
Can you look at that link?
14:42 - D. B.
And the calendar... let's look at the calendar. If you just pick a random day, the calendar starts at 12 a.m. to 11:30... oh, 12:30 a.m. But then it actually starts at 11 p.m., not 12 a.m. So his appointments go from 11 p.m. to 11 a.m. So that doesn't sound like the United States to me, unless he's just really a night owl. But, um, I just went to his atlas research.io website and I have my AVG web shield saying multiple web threats secured.
15:38 - V. W.
We've blocked a threat URL blacklist or HTTP labs from being downloaded. Got it. So, um, that's a little strange. I don't know.
15:47 - D. B.
I mean, I didn't get, I don't get that doing this here. I'm not going to the calendar.
15:55 - V. W.
I'm just going to what he said. It was his demo.
15:59 - D. B.
Oh, the go to dashboard here.
16:02 - V. W.
I'll send you the picture there.
16:11 - D. B.
All right. Well, it might be a false positive.
16:20 - Unidentified Speaker
You can share your screen. Okay. There's the URL.
16:28 - V. W.
And, uh, my usual practice is not to go any further.
16:38 - D. B.
Yeah.
16:40 - V. W.
So I'm going to go ahead. I'm going to stop there. And if someone else is bolder than I am, go to it.
16:50 - D. B.#+#V. W.
I already just went there. I hope I didn't.
16:54 - D. B.
It might be a false positive. Who knows, these things happen.
16:59 - E. G.
Maybe it's controlling where we go now.
17:05 - D. B.
Yeah, he could be not what he says. He could be a state actor, you know?
17:12 - V. W.
Oh my gosh, he's really pierced the veil now.
17:16 - D. B.
Intended to entrap scientists in America, because we're so important. We're a critical national resource.
17:22 - V. W.
When I first skimmed the idea, I thought, this is similar to... I'll often ask for an HTML5/CSS/JavaScript demo for some body of work that I've been doing. And I'll get one and it's illuminating. I did one on group theory and physics and Minkowski space-time. And I did a harmonic oscillator. And then I recently did a complex harmonic oscillator, which shows mass, length, and time, along with stiffness and damping, being complex numbers instead of just real numbers. And it was pretty interesting. So I get the idea that if you have a body of work and you'd like some code to demonstrate it, that's a pretty quick thing to do now, especially with Claude 4; it's making really nice code. Okay, anything else before we go on to the Chapter 6 video?
18:25 - D. B.
All right, well let's do that.
18:29 - R. S.
Wednesday at noon. Is that true, Dan?
18:33 - D. B.
I don't remember. Yeah. I got it. Okay, so I'm going to set up the screen here. You all can hear that, right? Yes. All right, here we go. In the last chapter, you and I started to step through the internal workings of a transformer. This is one of the key pieces of technology inside large language models and a lot of other tools in the modern wave of AI. It first hit the scene in a now famous 2017 paper called Attention is All You Need. And in this chapter, you and I will dig into what this attention mechanism is, visualizing how it processes data. As a quick recap, here's the important context I want you to have in mind. The goal of the model that you and I are studying is to take in a piece of text and predict what word comes next. The input text is broken up into little pieces that we call tokens. And these are very often words or pieces of words, but just to make the examples in this video easier for you and me to think about, let's simplify by pretending the tokens are always just words. The first step in a transformer is to associate each token with a high-dimensional vector, what we call its embedding. Any questions or comments so far? Embedding. Okay, I'll continue. The most important idea I want you to have in mind is how directions in this high-dimensional space of all possible embeddings can correspond with semantic meaning. In the last chapter, we saw an example for how direction can correspond to gender, in the sense that adding a certain step in this space can take you from the embedding of a masculine noun to the embedding of the corresponding feminine noun. Okay, any questions? So I have a question. He talks about direction, but what about distance? Does it matter? The vector has a length; does that matter?
20:49 - E. G.
I think it does, because as he's saying here, notice the vector. It's not identical, but it's similar in length between the endpoints. And when I was watching the video earlier, it looks like that's one of the continuity effects.
21:09 - D. B.
Maybe the vectors are all of length one.
21:13 - V. W.
I thought the vectors were normalized by the time this part of the process came around. I could be wrong though.
21:21 - D. B.
I mean, I'm not, I don't see how the yellow vector could, would necessarily.
21:27 - E. G.
The yellow vector is going, like you said, from masculine to feminine. So if we have a term like man, and we ask for the feminine of that, uh, what they're saying is the vector from masculine to feminine would be similar across relationships like aunt and uncle.
21:51 - V. W.
King, queen. King, queen.
21:54 - E. G.
Exactly. Okay. That's just one example.
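A minimal sketch of that direction idea, using made-up three-dimensional vectors rather than real model embeddings (real embeddings have thousands of dimensions and are learned from data). It also touches on the length question raised earlier: cosine similarity compares direction only, so vector length drops out.

```python
# Toy illustration of "directions carry meaning" in an embedding space.
# These 3-D vectors are made up for the example; real model embeddings
# have thousands of dimensions and are learned from data.
import numpy as np

emb = {
    "man":   np.array([1.0, 0.2, 0.1]),
    "woman": np.array([1.0, 1.2, 0.1]),
    "king":  np.array([2.0, 0.3, 1.0]),
    "queen": np.array([2.0, 1.3, 1.0]),
}

def cosine(a, b):
    """Cosine similarity: compares direction only, ignoring vector length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

masc_to_fem_1 = emb["woman"] - emb["man"]
masc_to_fem_2 = emb["queen"] - emb["king"]

# The two offsets point in (nearly) the same direction, which is what the
# yellow "gender" arrow in the video depicts.
print(cosine(masc_to_fem_1, masc_to_fem_2))   # close to 1.0

# The classic analogy: king - man + woman lands near queen.
print(cosine(emb["king"] - emb["man"] + emb["woman"], emb["queen"]))
```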
21:57 - D. B.
You could imagine how many other directions in this high-dimensional space could correspond to numerous other aspects of a word's meaning. The aim of a transformer is to progressively adjust these embeddings so that they don't merely encode an individual word, but instead they bake in some much, much richer contextual meaning. I should say up front that a lot of people find the attention mechanism, this key piece in a transformer, very confusing, so don't worry if it takes some time for things to sink in. I think that before we dive into the computational details and all the matrix multiplications, it's worth thinking about a couple of examples for the kind of behavior that we want attention to enable. Consider the phrases American shrew mole, one mole of carbon dioxide, and take a biopsy of the mole. You and I know that the word mole has different meanings in each one of these, based on the context. But after the first step of a transformer, the one that breaks up the text and associates each token with a vector, the vector that's associated with mole would be the same in all three of these cases. Thoughts?
23:06 - E. G.
Yeah, that's when we talked, I think it was a couple of weeks ago, when we talked about contextualizing the verbs.
23:21 - Unidentified Speaker
OK. Because this initial token embedding is effectively a lookup with no reference to the context.
23:31 - D. B.
It's only in the next step of the transformer that the surrounding embeddings have the chance to pass information into this one. The picture you might have in mind is that there are multiple distinct directions in this embedding space, encoding the multiple distinct meanings of the word mole, and that a well-trained attention block calculates what you need to add to the generic embedding to move it to one of these more specific directions. Okay, so if the undisambiguated word mole has a space, a spot in the embedding, what is that spot?
24:18 - E. G.
It could be a placeholder to say, OK, this is what we're dealing with.
24:27 - D. D.
Go ahead. I love it. The embedding is like its definition. That's just what it means. It's just a mole. And then whenever the attention mechanism changes weights, it makes different moles.
24:46 - D. B.#+#D. D.
So now, depending on the words that are around the mole, it's going to be able to say, oh, well, this mole is closer to that vector.
24:56 - D. D.
So this is... they're talking about a mole that's on your skin, you know? Like, the vector going up to the mole on the lip will be closer to cancer or something like that. As opposed to the original mole, which no longer exists anymore except in the embedding lookup vectors. I think that's what he just said: the definition, just what that word means compared to all the other words. Well, what the word means is this sort of strange combination of three different meanings. So could you take... In the transformer, yes, but not in the embedding space.
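A schematic sketch of the point in this exchange, with made-up shapes and random numbers standing in for a real model: the initial embedding is a pure table lookup, so "mole" gets exactly the same vector in every sentence, and the attention block's job can be pictured as adding a context-dependent correction to that generic vector.

```python
# Schematic sketch (not a real model): the first step is a pure table lookup,
# so the token "mole" gets the same vector in every sentence.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"one": 0, "mole": 1, "of": 2, "carbon": 3, "dioxide": 4,
         "take": 5, "a": 6, "biopsy": 7, "the": 8}
d_model = 8                                  # tiny stand-in for 12,288
E = rng.normal(size=(len(vocab), d_model))   # embedding lookup table

sent_a = ["one", "mole", "of", "carbon", "dioxide"]
sent_b = ["take", "a", "biopsy", "of", "the", "mole"]

emb_a = E[vocab[sent_a[1]]]    # lookup for "mole" in sentence A
emb_b = E[vocab[sent_b[-1]]]   # lookup for "mole" in sentence B
print(np.allclose(emb_a, emb_b))   # True: no context has been used yet

# The picture from the video: a well-trained attention block computes a
# context-dependent correction and adds it to the generic embedding,
# nudging it toward "unit of measurement" vs. "skin blemish".
delta_from_context = rng.normal(size=d_model)   # stand-in for attention output
contextualized = emb_a + delta_from_context
```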
25:43 - D. B.
Yeah. Well, could you take a random assortment of meanings that don't really connect with each other? Like, you know, rat, up, and idea?
25:55 - Unidentified Speaker
Yeah.
25:55 - D. D.
If you put garbage into the transformer, you'll get garbage out.
26:00 - D. B.
I promise you.
26:02 - Unidentified Speaker
Yeah.
26:02 - D. D.
That's the way all this stuff works. Garbage in, garbage out. Yeah. You've got to put something coherent in, or you're going to get some incoherent mess on the other side.
26:15 - V. W.
But you can explore new ideas and concepts that wouldn't have previously been covered like the ratness of an idea. Now we have a kind of metaphor.
26:26 - D. B.
And it's the emergent properties that are coming from this metaphor process.
26:31 - V. W.
And that's why this notion of predicting the next word, which seems so simple, turns out to give us such rich answers: because we're contextualizing these things.
26:42 - D. B.
We can imagine what a phrase like a metaphor like the ratness of an idea might mean. It's not such an...
26:51 - V. W.
It actually could make some sense.
26:53 - D. B.
So therefore, it must have a point in the embedding space that sort of has a meaning.
27:04 - Unidentified Speaker
What if embeddings are good for...
27:08 - V. W.
But these embeddings represent combinations of ideas which may not have appeared before that can nonetheless be reasoned with and answers generated to.
27:21 - D. B.
Okay. As a function of the context. To take another example, consider the embedding of the word tower. This is presumably some very generic, non-specific direction in the space, associated with lots of other large, tall nouns. If this word was immediately preceded by Eiffel, you could imagine wanting the mechanism to update this vector so that it points in a direction that more specifically encodes the Eiffel Tower, maybe correlated with vectors associated with Paris and France and things made of steel. If it was also preceded by the word miniature, then the vector should be updated so that it no longer correlates with large, tall things. More generally than just refining the meaning of a word, the attention block allows the model to move information encoded in one embedding to that of another, potentially ones that are quite far away, and potentially with information that's much richer than just a single word. What we saw in the last chapter was how after all of the vectors flow through the network, including many different attention blocks, the computation that you perform to produce a prediction of the next token is entirely a function of the last vector in the sequence. So imagine, for example, that the text you input is most of an entire mystery novel, all the way up to a point near the end, which reads, therefore the murderer was. If the model's going to accurately predict the next word, that final vector in the sequence, which began its life simply embedding the word was, will have to have been updated by all of the attention blocks to represent much, much more than any individual word, somehow encoding all of the information from the full context window that's relevant to predicting the next word.
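A minimal sketch of the "everything funnels into the last vector" point, with random numbers standing in for a trained model: whatever the final position's vector has absorbed from the context is all the model uses to score the next token, here via an assumed unembedding matrix and a softmax.

```python
# Minimal sketch of "prediction is a function of the last vector only".
# Shapes are illustrative; a real model's unembedding matrix maps
# d_model (e.g. 12,288) to the full vocabulary size.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
seq_len, d_model, vocab_size = 6, 8, 50

# H stands for the sequence of vectors *after* all attention blocks;
# by that point the last vector has absorbed context from the whole window.
H = rng.normal(size=(seq_len, d_model))
W_unembed = rng.normal(size=(vocab_size, d_model))

last_vector = H[-1]                      # e.g. the vector that began as "was"
next_token_probs = softmax(W_unembed @ last_vector)
print(next_token_probs.shape)            # (vocab_size,) — one probability per token
```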
29:10 - Unidentified Speaker
Comments?
29:11 - V. W.
The incredible power of having the last word. In fact, isn't, uh... I'm not familiar with it, but, uh... could you say that one more time?
29:28 - E. G.
My daughters have taken four years of Latin in school, and I think Latin is contextual in that it's the final words that put together the context of the statement. That's why I think Plato was such a good orator, that he'd keep going on until the last piece to put everything into context.
29:58 - D. B.
So this vector shown here somehow represents the entire story up to that point. Can I just go over that little snip one more time?
30:11 - D. D.
Sure.
30:12 - D. B.
I don't know where to go back to. But I'll go back a bit. ...embedding of the word tower. This is presumably some very generic, non-specific direction in the space associated with lots of other large, tall nouns. If this word was immediately preceded by Eiffel, you could imagine wanting the mechanism to update this vector so that it points in a direction that more specifically encodes the Eiffel Tower, maybe correlated with vectors associated with Paris, and things made of steel. If it was also preceded by the word miniature, then the vector should be updated even further, so that it no longer correlates with large, tall things. More generally than just refining the meaning of a word, the attention block allows the model to move information encoded in one embedding to that of another, potentially ones that are quite far away, and potentially with information that's much richer than just a single word. What we saw in the last chapter was how after all of the vectors flow through the network, including many different attention blocks, the computation that you perform to produce a prediction of the next token is entirely a function of the last vector in the sequence. So imagine, for example, that the text you input is most of an entire mystery novel, all the way up to a point near the end, which reads, therefore the murderer was. If the model is going to accurately predict the next word, that final vector in the sequence, which began its life simply embedding the word was, will have to have been updated by all of the attention blocks to represent much, much more than any individual word, somehow encoding all of the information from the full context window that's relevant to predicting the next word.
31:59 - D. D.
What do you think, Daniel?
32:01 - Unidentified Speaker
Heavy.
32:02 - V. W.
I think there's a little glitch in there because I don't think was being the next word. If we had just had therefore the murderer and we didn't have was, we still might have gotten the name of the murderer. So it's that local context recursively applied backwards that's enriching the meaning of the most recent words.
32:29 - D. B.#+#D. D.
Well, I think that, I mean, with "therefore the murderer," the system probably would pick "was" as the next word, right? It might. That's what I was thinking. That's pretty good. Or it might pick "is," which is the same, you know. Yeah, that's interesting. But in that case it's just correcting your grammar and not telling you the most important thing you read the book for, you know. That's interesting, because it's got to do both, right?
33:07 - D. B.
It's got to pay attention to grammar in a sense. I mean, when you use ChatGPT to get output, it's usually fairly grammatical. It's always correcting mine. Yeah. Spelling. All right. Any other comments so far?
33:26 - V. W.
All right.
33:27 - D. B.
To step through the computation, though, let's take a much simpler example. Imagine that the input includes the phrase, a fluffy blue creature roamed the verdant forest. And for the moment, suppose that the only type of update that we care about is having the adjectives adjust the meanings of their corresponding nouns. What I'm about to describe is what we would call a single head of attention, and later we will see how the attention block consists of many different heads run in parallel. Again, the initial embedding for each word is some high-dimensional vector that only encodes the meaning of that particular word with no context. Actually, that's not quite true. They also encode the position of the word. There's a lot more to say about the specific way the positions are encoded, but right now, all you need to know is that the entries of this vector are enough to tell you both what the word is and where it exists in the context. Comments?
34:23 - V. W.
We've discussed this a lot in the past, this being a language-specific thing: French will put adjectives after, while English will put adjectives before. And that positional encoding is what takes care of that working in both languages.
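The video only says that position is folded into the embedding's entries, without naming a scheme. As one concrete possibility, here is a sketch of the sinusoidal positional encoding from the Attention is All You Need paper mentioned earlier; many newer models instead learn positional embeddings or use other schemes, so treat this as an illustration only.

```python
# One concrete way position can be folded into the embedding entries:
# the sinusoidal encoding from "Attention Is All You Need".
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even entries: sine
    pe[:, 1::2] = np.cos(angles)             # odd entries: cosine
    return pe

seq_len, d_model = 8, 16                     # toy sizes
token_embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))
pe = sinusoidal_positions(seq_len, d_model)

# Adding the two means each vector now says both *what* the word is
# and *where* it sits in the context.
inputs_to_attention = token_embeddings + pe
```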
34:42 - D. B.
So these vectors only have 12,288 elements. And yet, see, the English language has a lot more than 12,288 words, not to mention a lot more than 12,288 stories, right? Contexts. But I guess each element in the vector can have any value. How many word vectors do I have in my head?
35:16 - D. D.
Can I come up with 12,288?
35:19 - D. D.#+#E. G.
Yeah, OK.
35:20 - D. B.
So there's a lot more than 12,288 vectors for sure.
35:26 - E. G.
Well, if you take a look at it just for context purposes: the neurons in your brain, trying to find something specific, follow the same train of thought. We store information in our brain in locations, and we access it through the synapses to find it. And what we do with those synapses is provide context to the terms, like father, mother, male, female. When we go to synthesize the information, we're actually going through and applying transformative information to the context of the word to provide meaning. That's what that's doing.
36:21 - V. W.
How many ways are there to arrange 12,288 things? Well, it would be that factorial, which is an extremely large number.
36:36 - D. B.
Well, the concept, or the thought... when you start bringing up real brains and real neurons, to me, the fact that something like a transformer works so well at simulating intelligence suggests that actual intelligence might be kind of like a transformer.
37:09 - V. W.
It might just be predicting the next word, as we've joked often.
37:16 - E. G.
Well, I just looked up. I mean, humans have, on average, 86 billion neurons.
37:23 - D. D.
We know what we're going to say, though, before we say it many times.
37:32 - E. G.
But if we're trying... that's because we know what we want to say in the sentence. But if we're trying to synthesize information that's given to us and predict what the next thing is, we do a similar approach. And in truth, that makes sense, because when we program something, we program it in a way that we understand.
38:02 - D. D.
So AI is really just one of those annoying things that wants to finish your sentence.
38:17 - E. G.
I would agree, yes, but I think that it's hamstrung based on the way that we synthesize information. And before you guys joined, we were talking about recursive AI and the AI singularity, where AI is going to be improving itself. I think it was Google Translate: when it does the translation... it was initially coded by engineers, but when they looked at the Google Translate code that it used to do the translations, the developers didn't understand it. I'll see if I can find the article. But it was in a language that the program understood, but we didn't. All right. Well, let's continue.
39:23 - D. B.
Let's go ahead and denote these embeddings with the letter E. The goal is to have a series of computations produce a new refined set of embeddings where, for example, those corresponding to the nouns have ingested the meaning from their corresponding adjectives. And playing the deep learning game, we want most of the computations involved to look like matrix-vector products, where the matrices are full of tunable weights, things that the model will learn based on data. To be clear, I'm making up this example of adjectives updating nouns just to illustrate the type of behavior that you could imagine an attention head doing. As with so much deep learning, the true behavior is much harder to parse, because it's based on tweaking and tuning a huge number of parameters to minimize some cost function. It's just that as we step through all of the different matrices filled with parameters that are involved in this process, it's really helpful to have an imagined example of something that it could be doing to help keep it all more concrete. All right, well he's got fluffy and blue, and he's got e2, e3, and e4 converging to define e'4. That's adjectives affecting a noun, and then the same with e7 and e8. And notice those two lines of vectors are completely connected.
40:54 - V. W.
So there's 64 arcs between them. No, they're not completely connected. Well, every one is connected to every other one.
41:08 - E. G.
You see every arc connected.
41:11 - V. W.
Even E1 to E1 prime.
41:15 - D. B.#+#V. W.
It's very faint, but on my screen it shows up as a faint gray line.
41:19 - D. B.#+#E. G.
Oh, I see. I'm looking at the... I didn't see some of those.
41:23 - D. B.
Do you see E1 connecting to E'2? Yes, I do. It's a very faint line.
41:28 - E. G.
Oh, it's not on my screen.
41:29 - D. B.
Yeah, I see it too. Okay, because there's a matrix involved and it's just... Well, I put a zero. You could have a very small number and it doesn't have to be zero. For the first step of this process, you might imagine each noun, like a creature, asking the question, hey, are there any adjectives sitting in front of me? And for the words fluffy and blue to each be able to answer, yeah, I'm an adjective.
42:07 - D. B.
That question is somehow encoded as yet another vector, another list of numbers, which we call the query for this word. This query vector, though, has a much smaller dimension than the embedding vector, say 128. What's going on here? Anyone?
42:23 - Unidentified Speaker
Shall we continue?
42:26 - V. W.
Well, just to say that the query vector is a projection into a 128-dimensional subspace of the original dimension. Presumably this is done for efficiency reasons.
42:48 - D. B.
Computing this query looks like taking a certain matrix, which I'll label WQ, and multiplying it by the embedding. Compressing things a bit, let's write that query vector as Q, and then anytime you see me put a matrix next to an arrow like this one, it's meant to represent that multiplying this matrix by the vector at the arrow's start gives you the vector at the arrow's end. OK. What is W sub Q? That's the matrix of the...
43:23 - E. G.
Yeah, it's right there, it's.
43:28 - D. D.
So W sub Q would be the matrix.
43:35 - E. G.
The Q is a vector for E4 and WQ.
43:43 - D. B.
So WQ has 128 rows or whatever it was. Or 11,000 rows.
43:50 - R. S.
Dan, I have to leave in a few minutes, so. All right. Is that all right?
44:01 - D. B.
Yep. OK, thank you.
44:03 - Unidentified Speaker
Go back.
44:05 - D. D.
I thought that's, I thought that. That's what they were talking about.
44:11 - D. B.
Yeah. Go back just a little bit.
44:14 - D. D.
For the first step of this process, you might imagine each noun, like a creature, asking the question, hey, are there any adjectives sitting in front of me?
44:24 - D. B.
And for the words fluffy and blue to each be able to answer, yeah, I'm an adjective and I'm in that position. That question is somehow encoded as yet another vector, another list of numbers, which we call the query for this word. This query vector, though, has a much smaller dimension than the embedding vector, say 128. Computing this query looks like taking a certain matrix, which I'll label WQ, and multiplying it by the embedding. Compressing things a bit, let's write that query vector as Q, and then anytime you see me put a matrix next to an arrow like this one, it's meant to represent that multiplying this matrix by the vector at the arrow's start gives you the vector at the arrow's end. In this case, you multiply this matrix by all of the embeddings in the context, producing one query vector for each token. The entries of this matrix are parameters of the model, which means the true behavior is learned from data, and in practice, what this matrix does in a particular attention head is challenging to parse. Okay, so every embedding leads to a query vector. What does the query vector say? It says, are there any adjectives in front of me? That's what he's saying.
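A small sketch of the query step just read, with tiny stand-in dimensions (the video's example uses a 12,288-dimensional embedding and a 128-dimensional query space): one learned matrix WQ, multiplied against every embedding in the context, yields one query vector per token.

```python
# Sketch of the query step: one learned matrix W_Q, applied to every
# embedding in the context, gives one query vector per token.
# Real shapes in the video's example: W_Q is 128 x 12,288; here they're tiny.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_query = 8, 12, 4         # stand-ins for 8 tokens, 12,288, 128

E = rng.normal(size=(seq_len, d_model))      # one embedding per token (rows)
W_Q = rng.normal(size=(d_query, d_model))    # learned parameters, tuned by training

Q = E @ W_Q.T                                # queries: one row per token
print(Q.shape)                               # (seq_len, d_query) -> (8, 4)
```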
45:50 - V. W.
Are there any words specializing the rendering of my meaning in front of me, so that for example you know which mole I am, or which fluffy creature.
46:07 - D. B.
If you have a whole murder mystery ending in "the murderer," then the next word is probably "was" or "is." The word... Hmm. No, all right, they're not talking about finding the next word here.
46:32 - Unidentified Speaker
Yeah, it's a little bit.
46:35 - D. D.
He's talking about whether the next word is an adjective, is what he was saying. So I'm not 100% clear.
46:46 - D. B.
Yeah, so he's generated a
46:50 - V. W.
a list of query vectors instead of a list of embedding vectors. And those query vectors in this case are answering very specific questions about the structure of the language. Like, where are the adjectives in front of me if we're talking in English?
47:09 - D. D.
Is it a yes or no question? Because that's what it says.
47:14 - V. W.
But it's continuously real valued in the resulting query vectors. There's flavors and priorities and levels of emphases in the resulting query vector. So you could choose the most likely of those.
47:31 - D. B.
But for our sake, imagining an example that we might hope that it would learn, we'll suppose that this query matrix maps the embeddings of nouns to certain directions in this smaller query space that somehow encodes the notion of looking for adjectives in preceding positions. As to what it does to other embeddings, who knows? Maybe it simultaneously tries to accomplish some other goal with those. Right now, we're laser focused on the nouns. At the same time, associated with this is a second matrix called the key matrix, which you also multiply by every one of the embeddings. This produces a second sequence of vectors that we call the keys. Conceptually, you want to think of the keys as potentially answering the queries. This key matrix is also full of tunable parameters, and just like the query matrix, it maps the embedding vectors to that same smaller dimensional space. You think of the keys as matching the queries whenever they closely align with each other. In our example, you would imagine that the key matrix maps the adjectives, like fluffy and blue, to vectors that are closely aligned with the query produced by the word creature. To measure how well each key matches each query, you compute a dot product between each possible key-query pair. I like to visualize a grid full of a bunch of dots, where the bigger dots correspond to the larger dot products, the places where the keys and queries align.
48:59 - V. W.
This is where the term cosine similarity comes from. So you can take two abstracts, and we recently did this: how similar is the abstract of this paper to the abstract of this other paper? And if they have a high cosine similarity, we know we probably don't have to read both papers. And the term cosine similarity simply comes from the fact that we're doing a dot product of two vectors to see if we get a hit.
49:34 - Unidentified Speaker
I think that's right.
49:38 - V. W.
So if one of the vectors is zero, so if one of the elements of the vector has a zero in it, it won't have any influence on that particular selection. It's really a selection mechanism, selecting one thing against another. And if both are unity, then they select to the maximum degree possible.
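A quick sketch of that abstract-comparison use of cosine similarity. The embed() function here is a hypothetical stand-in (a crude bag-of-words hash), not any particular model; in practice you would call whatever sentence-embedding model you use. Note that inside an attention head the score is the scaled dot product itself rather than the cosine, but the "alignment means relevance" idea is the same.

```python
# Cosine similarity as a "did we get a hit?" test between two vectors,
# e.g. embeddings of two paper abstracts. The embed() function below is a
# stand-in; in practice you'd call whatever sentence-embedding model you use.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed(text):
    # Hypothetical placeholder: hash words into a fixed-size bag-of-words vector.
    v = np.zeros(64)
    for word in text.lower().split():
        v[hash(word) % 64] += 1.0
    return v

abstract_1 = "We model attention patterns in transformer language models."
abstract_2 = "A study of attention mechanisms inside transformer models."
print(cosine_similarity(embed(abstract_1), embed(abstract_2)))  # high -> similar
```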
49:56 - D. B.
So the key and the query are both the same number of dimensions because otherwise you couldn't take a dot product.
50:03 - V. W.
Exactly. And you can do that, but that's a mess.
50:08 - D. B.
Where if the keys produced by Fluffy and Blue really do align closely with the query produced by Creature, then the dot products in these two spots would be some large positive numbers. In the lingo, machine learning people would say that this means the embeddings of Fluffy and Blue attend to the embedding of Creature. Okay. That's what attention means then. Dot products are big.
50:38 - D. D.
Yeah, it means fluffy and blue are together.
50:42 - V. W.
That's a big statement right there for understanding a tension mechanism.
50:47 - D. B.
By contrast, the dot product between the key for some other word, like "the," and the query for "creature" would be some small or negative value that reflects that these are unrelated to each other. So we have this grid of values, and each can be any real number from negative infinity to infinity, giving us a score for how relevant each word is to updating the meaning of every other word. The way we're about to use these scores is to take a certain weighted sum along each column, weighted by the relevance. So instead of having values range from negative infinity to infinity, what we want is for the numbers in these columns to be between 0 and 1, and for each column to add up to 1, as if they were a probability distribution. If you're coming in from the last chapter, you know what we need to do then. We compute a softmax along each one of these columns to normalize the values. In our picture, after you apply softmax to all of the columns, we'll fill in the grid with these normalized values. At this point you're safe to think about each column as giving weights according to how relevant the word on the left is to the corresponding word at the top. We call this grid an attention pattern. Now if you look at the original transformer paper, there's a really compact way that they write this all down. Here the variables Q and K represent the full arrays of query and key vectors respectively, those little vectors you get by multiplying the embeddings by the query and the key matrices. This expression up in the numerator is a really compact way to represent the grid of all possible dot products between pairs of keys and queries. A small technical detail that I didn't mention is that for numerical stability, it happens to be helpful to divide all of these values by the square root of the dimension in that key-query space. Then this softmax that's wrapped around the full expression is meant to be understood to apply column by column. As to that V term, we'll talk about it in just a second. Before that, there's one other technical detail that so far I have skipped. Okay, that's a good place to stop. I think it's about 11:08. I see we're right at the shift in segments, right about 11:08, right about here. Any comments so far before we quit? Okay, well then I'm going to go ahead and note where we got up to, to make sure we don't miss anything.
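A compact sketch of the attention pattern as just described, i.e. the softmax(QKᵀ/√dₖ) grid from the original transformer paper: dot products between every key-query pair, scaled by the square root of the key-query dimension, then normalized so each query's scores sum to 1. The value (V) step and masking come later in the video, so they are left out here; all shapes and numbers are toy stand-ins.

```python
# Sketch of the attention pattern just described: dot products between every
# key/query pair, scaled by sqrt(d_k), then a softmax so each query's scores
# become weights that sum to 1. The value (V) step and masking are covered
# later in the video, so they're left out here.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 8, 12, 4              # tiny stand-ins

E   = rng.normal(size=(seq_len, d_model))     # embeddings, one row per token
W_Q = rng.normal(size=(d_k, d_model))         # learned query matrix
W_K = rng.normal(size=(d_k, d_model))         # learned key matrix

Q = E @ W_Q.T                                 # one query per token
K = E @ W_K.T                                 # one key per token

scores = Q @ K.T / np.sqrt(d_k)               # grid of all key-query dot products
attention_pattern = softmax(scores, axis=-1)  # normalize over keys for each query
                                              # (the "per column" softmax in the
                                              # video's grid orientation)

print(attention_pattern.shape)                # (seq_len, seq_len)
print(attention_pattern.sum(axis=-1))         # each query's weights sum to 1
```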
53:38 - Unidentified Speaker
OK, folks. Any last comments before we adjourn? Hope you guys have a good weekend. All right.
53:55 - D. B.
Same to you.
53:58 - D. D.
Take care.
54:00 - D. B.
Bye, guys.
54:01 - D. D.
Bye, everyone.