Artificial Intelligence Study Group
Contacts: jdberleant@ualr.edu and mgmilanova@ualr.edu
Agenda & Minutes (170th meeting, July 11, 2025)
Table of Contents
* Agenda and minutes
* Appendix: Transcript (when available)
Agenda and Minutes
- Announcements, updates, questions, etc. as time allows: none.
- DD has generously agreed to do a demo on generating the transcripts of these meetings. Here is one of the problems he encountered and can discuss:
- When I [...] went to ChatGPT [I] discovered it changed models and I had to import my prompts. The model settings were lost and the new model's context window was too short. I changed to an older model and the model made up new entries for the transcript. I adjusted the temperature and got it figured out. It has been an interesting week...
- EG and DD are finishing slides surveying different ML models.
- VW will demo his wind tunnel system soon.
- If anyone has an idea for an MS project where the student reports to us for a few minutes each week for discussion and feedback - a student could likely be recruited! Let me know.
- JH suggests a project in which AI is used to help students adjust their resumes to match key terms in job descriptions, to help their resumes bubble to the top when many resumes are screened early in the hiring process.
- We discussed book projects but those aren't the only possibilities.
- VW had some specific AI-related topics that need books about them.
- DD suggests having a student do something related to Mark Windsor's presentation. He might like to be involved, but this would not be absolutely necessary.
- Any questions you'd like to bring up for discussion, just let me know.
- Anyone read an article recently they can tell us about next time?
- Any other updates or announcements?
- Hoping for a summary/review of the book at some point from [ebsherwin@ualr], who wrote:
Greetings all,
In a recent session on working with AI, Dr. Brian Berry (VP Research and Dean of Grad School) recommended this book:
The AI-Driven Leader: Harnessing AI to Make Faster, Smarter Decisions
by Geoff Woods
I just bought it based on his recommendation and if anyone is interested will gladly meet to talk about the book. Nothing "heavy duty" just an accountability group.
If you have read the book already and if the group forms, you are welcome to join the discussion.
I'll wait till Monday morning before I start reading -- so if you do not see this message immediately, do reach out!
Best,
- Chapter 7 video, https://www.youtube.com/watch?v=9-Jl0dxWQs8. We finished it.
- Here is the latest on future readings and viewings
- Let me know of anything you'd like to have us evaluate for a fuller reading.
- https://transformer-circuits.pub/2025/attribution-graphs/biology.html.
- https://arxiv.org/pdf/2001.08361. 5/30/25: eval was 4.
- We can evaluate https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10718663 for reading & discussion.
- popular-physicsprize2024-2.pdf got an evaluation of 5.0 for a detailed reading.
- https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-refusals
- https://venturebeat.com/ai/anthropic-flips-the-script-on-ai-in-education-claude-learning-mode-makes-students-do-the-thinking
- https://transformer-circuits.pub/2025/attribution-graphs/methods.html (Biology of Large Language Models)
- https://www.forbes.com/sites/robtoews/2024/12/22/10-ai-predictions-for-2025/
- Prompt engineering course:
https://apps.cognitiveclass.ai/learning/course/course-v1:IBMSkillsNetwork+AI0117EN+v1/home
- Neural Networks, Deep Learning: the basics of neural networks and the math behind how they learn, https://www.3blue1brown.com/topics/neural-networks
- LangChain free tutorial, https://www.youtube.com/@LangChain/videos
- Chapter 6 recommends material by Andrej Karpathy, https://www.youtube.com/@AndrejKarpathy/videos for learning more.
- Chapter 6 recommends material by Chris Olah, https://www.youtube.com/results?search_query=chris+olah
- Chapter 6 recommended https://www.youtube.com/c/VCubingX for relevant material, in particular https://www.youtube.com/watch?v=1il-s4mgNdI
- Chapter 6 recommended Art of the Problem, in particular https://www.youtube.com/watch?v=OFS90-FX6pg
- LLMs and the singularity: https://philpapers.org/go.pl?id=ISHLLM&u=https%3A%2F%2Fphilpapers.org%2Farchive%2FISHLLM.pdf (summarized at: https://poe.com/s/WuYyhuciNwlFuSR0SVEt). 6/7/24: vote was 4 3/7. We read the abstract. We could start it any time. We could even spend some time on this and some time on something else in the same meeting.
- Schedule back burner "when possible" items:
- TE is in the informal campus faculty AI discussion group. SL: "I've been asked to lead the DCSTEM College AI Ad Hoc Committee. ... We’ll discuss AI’s role in our curriculum, how to integrate AI literacy into courses, and strategies for guiding students on responsible AI use."
- Anyone read an article recently they can tell us about?
- If anyone else has a project they would like to help supervise, let me know.
- (2/14/25) An ad hoc group, organized by ES, is forming on campus for people to discuss AI and the teaching of diverse subjects. It would be interesting to hear from someone in that group at some point to see what people are thinking and doing regarding AI and their teaching activities.
- The campus has assigned a group to participate in the AAC&U AI Institute's activity "AI Pedagogy in the Curriculum." IU is on it and may be able to provide updates now and then.
Appendix: Transcript
Fri, Jul 11, 2025
0:15 - Conference Room (D. B.) - Speaker 2
Hi, everyone.
0:16 - Unidentified Speaker
Hello.
0:16 - Unidentified Speaker
Hello.
0:17 - M. M.
Hello, everybody. Hello. Happy Friday. All right.
0:20 - Conference Room (D. B.) - Speaker 2
I got to find this Chapter 7 video. Did you get the message from V.?
0:28 - Unidentified Speaker
Yeah.
0:29 - Unidentified Speaker
Yeah.
0:31 - M. M.
That's why I have to. Yeah.
0:34 - Unidentified Speaker
Oh, V.'s not doing his thing today.
0:38 - Unidentified Speaker
No.
0:39 - E. G.
Oh, he had a kid issue. Crash.
0:42 - Unidentified Speaker
Yeah.
0:43 - M. M.
Yeah, he's not, doesn't have a computer. So this is the reason. E., I sent you the link for Nvidia resource. I saw that. Yeah, for the, any kind of educational unit can request hardware. Yeah, so. Grant, we just apply. I know people already received some grants. So unfortunately, V. cannot present today.
1:29 - Unidentified Speaker
Okay.
1:30 - Unidentified Speaker
Okay, I found it.
1:34 - Conference Room (D. B.) - Speaker 1
Yeah, so we need to when your kid is hurting and there's nothing you can do about it.
1:53 - Unidentified Speaker
Where?
1:54 - Unidentified Speaker
Inside?
2:45 - Conference Room (D. B.) - Speaker 2
Okay, I found the video. We'll get to it. Let me go back and share my screen. OK, so V. was going to demo his wind tunnel AI system today. A terrible system crash, and he's trying to figure out what's going on. And he'll get to it soon, but not today. And D. generously agreed to do a demo for how he processes the transcripts for these meetings using AI to anonymize and so on. And there's some interesting observations about that. D., do that next week?
3:51 - D. D.
Yeah, that'll be fine. OK.
3:55 - Conference Room (D. B.) - Speaker 2
And as always, we're collecting ideas for master's student projects. So currently, we have a number of them. So of course, there's the book project, one that we did last semester, which we can do better in the future, as we talked about. And also, J. has a project designed to have students adjust their resumes to match job descriptions. So when companies post job descriptions, they use keywords, and they probably use those same keywords to retrieve relevant resumes. So if you don't use the right keywords, even though you're highly qualified, your resume won't be pulled. So the project is to fix that problem. I hope I have that right, J. OK. And D. also has suggested having a student do something related to M. W.'s presentation a few weeks ago. And so if you recall that, it was pretty interesting. His presentation was on a system for turning academic papers into code. If the paper described an algorithm and some data, as I understand it, it would generate synthetic data and code to analyze it. So it was pretty interesting. And so we'll see if we get a student interested in that. And then V. had some other AI-related topics that he wants books written about. We're back to the book project here. We should probably make that a sub.
5:46 - E. G.
D. D. and I are almost done with our slide presentation on machine models.
5:54 - Conference Room (D. B.) - Speaker 2
Ah, OK. Let me put that here.
5:58 - E. G.
I started drafting the talking points to it. Slides. There's some discussion on our side that we need to flesh out, but we should have that done in a week or two.
6:16 - Conference Room (D. B.) - Speaker 2
Okay, slides on what was it again?
6:20 - E. G.
Different types of machine learning models.
6:24 - D. D.
Like a kind of like a survey, maybe. Okay. Yeah, probably, you know, really good for, you know, people that don't know very much about machine learning, and for undergraduates. Cool. Yeah, I mean, I don't know.
6:47 - Conference Room (D. B.) - Speaker 2
Did you mention that last week? Or? I don't remember.
6:51 - E. G.
And we mentioned it about a month ago.
6:54 - Conference Room (D. B.) - Speaker 2
Oh, OK.
6:55 - E. G.
But it's been a back burner item with everything else going on. So as we have time, we'll put cycles to it. OK, cool.
7:06 - Conference Room (D. B.) - Speaker 2
All right, that'll be good. And if you're interested, I could put out a notice. Since it's sort of tutorial style, I could put out a notice and maybe get some visitors or something. I don't know that too many undergraduates are available in the summer or whatever, but you know, people are available and might as well make it available to them if they, if.
7:30 - D. D.
Well, we could consider our first run a test run.
7:33 - Unidentified Speaker
Okay, we'll do.
7:34 - Conference Room (D. B.) - Speaker 2
That's probably a better, better choice.
7:36 - Unidentified Speaker
Okay, no problem.
7:37 - Conference Room (D. B.) - Speaker 2
Any questions you'd like to bring up for discussion? Just let me know. I had one a few days ago, but I didn't write it down, so I forgot what it was; it would have been an interesting discussion. Maybe I'll think of it again and add it. If you have any questions, discussion questions, just random questions that intrigue you about AI you want some input on, just let me know and I'll put them on the agenda. Any other updates or announcements, anyone? Okay. And hopefully at some point, Dr. S. will give us a review of the book; she has a little discussion group that she arranged for this summer. I don't know how many people signed up, but even if nobody signs up, she's still gonna read it herself. And that's the name of the book. And she generously was willing to give us her take on the book at some point. She's in the psychology department. So it's not like, hyper-technical computer science stuff. All right, well, with that, we'll go to our Chapter 7 video. This is the last video in the series, and after that, we'll go on to read some abstracts and initial minutes of different videos, whatever, and evaluate them. We've got something here that was evaluated at a four, but that's a little old at this point. Evaluated as a five. That's the maximum. I think that was J. H.'s Nobel Prize winner speech transcript. But anyway, we probably should rethink some of these evaluations since it's been so long, and then we'll decide what to read or view. With that, let's go to the video. And this is: if you feed a large language model the phrase, 'M. J. plays the sport of...' and you have it predict what comes next, that kind of thing? I don't know. But anyway, we'll do our usual. I'll just play it for a minute or two, and then we'll discuss and continue on like that. Let me just make sure the audio is good. I'm just going to play it for a couple of seconds and ask you if you can hear it, and then we'll do it for real.
10:35 - Conference Room (D. B.) - Speaker 2
If you feed a large language model, the first... You all hear that pretty well?
10:41 - D. D.
I can.
10:42 - Conference Room (D. B.) - Speaker 2
You can or you can't? I can.
10:45 - Unidentified Speaker
OK, well, good.
10:46 - Conference Room (D. B.) - Speaker 2
All right, well, let's do it for real then.
10:49 - Conference Room (D. B.) - Speaker 1
If you feed a large language model the phrase, M. J. plays the sport of blank, and you have it predict what comes next, and it correctly predicts basketball, this would suggest that somewhere inside its hundreds of billions of parameters, it's baked in knowledge about a specific person and his specific sport. And I think in general, anyone who's played around with one of these models.
11:13 - Conference Room (D. B.) - Speaker 2
Any comments so far, or questions?
11:16 - Unidentified Speaker
Discussion?
11:17 - Conference Room (D. B.) - Speaker 1
Has the clear sense that it's memorized tons and tons of facts. So a reasonable question you could ask is, how exactly does that work? And where do those facts live? Last December, a few researchers from Google DeepMind posted about work on this question, and they were using the specific example of matching athletes to their sports. And although a full mechanistic understanding of how facts are stored remains unsolved, they had some interesting partial results, including the very general high-level conclusion that the facts seem to live inside a specific part of these networks known fancifully as the multi-layer perceptrons, or MLPs for short.
12:06 - Unidentified Speaker
Any comments, questions?
12:08 - Conference Room (D. B.) - Speaker 2
So to me, it's fascinating that they don't know. I mean, you ask an AI what sport M. J. plays, and it's going to tell you basketball, I assume. But nobody really knows for sure how they know it. It's this weird phenomenon of an emergent property that was not designed in, it just sort of happens by itself. All right.
12:41 - Conference Room (D. B.) - Speaker 1
In the last couple of chapters, you and I have been digging into the details behind transformers, the architecture underlying large language models, and also underlying a lot of other modern AI In the most recent chapter, we were focusing on a piece called attention, and the next step for you and me is to dig into the details of what happens inside these multilayer perceptrons, which make up the other big portion of the network. The computation here is actually relatively simple, especially when you compare it to attention. It boils down essentially to a pair of matrix multiplications with a simple something in between. However, interpreting what these computations are is exceedingly challenging. Our main goal here is to step through the computations and make them memorable, but I'd like to do it in the context of showing a specific example of how one of these blocks could, at least in principle, store a concrete fact. Specifically, it'll be storing the fact that M. J. plays basketball. I should mention the layout here is inspired by a conversation I had with one of those DeepMind researchers, N. N. For the most part, I will assume that you've either watched last two chapters, or otherwise you have a basic sense for what a transformer is. But refreshers never hurt, so here's the quick reminder of the overall flow. You and I have been studying a model that's trained to take in a piece of text and predict what comes next. That input text is first broken into a bunch of tokens, which means little chunks that are typically words or little pieces of words. And each token is associated with a high-dimensional vector, which is to say long list of numbers. This sequence of vectors then repeatedly passes through two kinds of operation. Attention, which allows the vectors to pass information between one another, and then the multilayered perceptrons, the thing that we're going to dig into today. And also there's a certain normalization step in between. After the sequence of vectors has flowed through many, many different iterations of both of these blocks, by the end, the hope is that each vector has soaked up enough information, both from the context, all of the other words in the input, and also from the general knowledge that was baked into the model weights through training.
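To make the flow just described a bit more concrete, here is a minimal Python sketch. It is not the video's code or the actual GPT-3 implementation; the sizes, names, and placeholder blocks are illustrative assumptions. One vector per token passes through alternating attention and MLP blocks, with each block's output added back onto the stream:

import numpy as np

d_model = 64       # toy embedding size (GPT-3 uses 12,288)
n_layers = 4       # toy layer count (GPT-3 uses 96)
n_tokens = 10

x = np.random.randn(n_tokens, d_model)   # one vector per token

def attention_update(x):
    # placeholder for attention: lets vectors pass information to one another
    return 0.01 * np.random.randn(*x.shape)

def mlp_update(x):
    # placeholder for the multilayer perceptron discussed in this chapter
    return 0.01 * np.random.randn(*x.shape)

for _ in range(n_layers):
    x = x + attention_update(x)   # each block's result is added back on
    x = x + mlp_update(x)         # (normalization steps omitted here)
# by the end, each row of x should carry enough to predict the next token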
14:59 - Conference Room (D. B.) - Speaker 2
Any questions so far? I saw in this diagram that it looked like perceptron blocks and attention blocks were alternating. Did I see that right?
15:14 - Unidentified Speaker
Yeah, it appears so.
15:21 - E. G.
OK, not sure why.
15:36 - Conference Room (D. B.) - Speaker 2
All right, well, look at these blocks.
15:39 - Conference Room (D. B.) - Speaker 1
By the end, the hope is that each vector has soaked up enough information, both from the context, all of the other words in the input, and also from the general knowledge that was baked into the model weights through training, that it can be used to make a prediction of what token comes next. One of the key ideas that I want you to have in your mind is that all of these vectors live in a very, very high dimensional space, and when you think about that space, different directions can encode different kinds of meaning. So a very classic example that I like to refer back to is how if you look at the embedding of woman and subtract the embedding of man, and you take that little step and you add it to another masculine noun, something like uncle, you land somewhere very, very close to the corresponding feminine noun. In this sense, this particular direction encodes gender information. The idea is that many other distinct directions in this super high-dimensional space could correspond to other features that the model might want to represent. In a transformer, these vectors don't merely encode the meaning of a single word, though. As they flow through the network, they imbibe a much richer meaning based on all the context around them, and also based on the model's knowledge. Ultimately, each one needs to encode something far, far beyond the meaning of a single word, since it needs to be sufficient to predict what will come next. We've already seen how attention blocks let you incorporate context, but a majority of the model parameters actually live inside the MLP blocks, and one thought for what they might be doing is that they offer extra capacity to store facts. Like I said, the lesson here is going to center on the concrete toy example of how exactly it could store the fact that M. J. plays basketball. Now this toy example is going to require that you and I make a couple of assumptions about that high-dimensional space. First, we'll suppose that one of the directions represents the idea of a first name M., and then another nearly perpendicular direction represents the idea of the last name J., and then yet a third direction will represent the idea of basketball. So specifically what I mean by this is if you look in the network and you pluck out one of the vectors being processed, if its dot product with this first name M. direction is 1, That's what it would mean for the vector to be encoding the idea of a person with that first name. Otherwise, that dot product would be 0, or negative, meaning the vector doesn't really align with that direction. And for simplicity, let's completely ignore the very reasonable question of what it might mean if that dot product was bigger than 1. Similarly, its dot product with these other directions would tell you whether it represents the last name J., or basketball. So let's say a vector is meant to represent the full name, M. J. Then its dot product with both of these directions would have to be 1. Since the text, M. J., spans two different tokens, this would also mean we have to assume that an earlier attention block has successfully passed information to the second of these two vectors, so as to ensure that it can encode both names. With all of those as the assumptions, let's now dive into the meat of the lesson. What happens inside a multi-layer perceptron. You might think of this sequence of vectors flowing into the block, and remember, each vector was originally associated with one of the tokens from the input text. 
What's going to happen is that each individual vector from that sequence goes through a short series of operations, we'll unpack them in just a moment, and at the end we'll get another vector with the same dimension. That other vector is going to get added to the original one that flowed in, and that sum is the result flowing out. This sequence of operations is something you apply to every vector in the sequence, associated with every token in the input, and it all happens in parallel. In particular, the vectors don't talk to each other in this step, they're all kind of doing their own thing. And for you and me, that actually makes it a lot simpler, because it means if we understand what happens to just one of the vectors through this block, we effectively understand what happens to all of them. When I say this block is going to encode the fact that M. J. plays basketball, what I mean is that if a vector flows in that encodes first name M. and last name J., then this sequence of computations will produce something that includes that direction basketball, which is what we'll add onto the vector in that position. The first step of this process looks like multiplying that vector by a very big matrix. No surprises there, this is deep learning. And this matrix, like all of the other ones we've seen, is filled with model parameters that are learned from data, which you might think of as a bunch of knobs and dials that get tweaked and tuned to determine what the model behavior is. Now one nice way to think about matrix multiplication is to imagine each row of that matrix as being its own vector, and taking a bunch of dot products between those rows and the vector being processed, which I'll label as e for embedding. For example, suppose that very first row happened to equal this first name M. direction that we're presuming exists. That would mean that the first component in this output, this dot product right here, would be 1 if that vector encodes the first name M., and 0 or negative otherwise.
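A tiny numerical illustration of this "rows as dot products" idea, under the video's assumption that "first name M" and "last name J" are perpendicular directions in the embedding space; the vectors and the single row of the up-projection matrix below are made up for illustration:

import numpy as np

d = 8
dir_m = np.zeros(d); dir_m[0] = 1.0    # assumed "first name M" direction
dir_j = np.zeros(d); dir_j[1] = 1.0    # assumed "last name J" direction

e = dir_m + dir_j                      # embedding encoding the full name
first_row = dir_m + dir_j              # suppose the first row of W_up learned M + J

print(first_row @ e)        # 2.0 for the full name
print(first_row @ dir_m)    # 1.0 if only one of the two names is encoded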
21:11 - Conference Room (D. B.) - Speaker 2
Even more fun, think about what it would mean if that. Any questions out there?
21:32 - Unidentified Speaker
Okay.
21:34 - Conference Room (D. B.) - Speaker 1
First row was the first name M. plus last name J. direction. And for simplicity, let me go ahead and write that down as m plus j. Then taking a dot product with this embedding e, things distribute really nicely, so it looks like m dot e plus j dot e. And notice how that means the ultimate value would be 2 if the vector encodes the full name M. J., and otherwise it would be 1 or something smaller than 1. And that's just one row in this matrix. You might think of other rows as in parallel.
22:11 - Conference Room (D. B.) - Speaker 2
You all know what dot product is? You take a dot product of two vectors?
22:17 - E. G.
Yeah, it's linear algebra.
22:19 - Conference Room (D. B.) - Speaker 2
Well, how do you do it?
22:21 - E. G.
Actually, I've got a program. What I'll do is I'll post it.
22:27 - Conference Room (D. B.) - Speaker 2
If you have two vectors with, let's say, five elements each, you multiply the first elements of each vector, then the second elements, then the third, fourth, and fifth, and then you add up the products. That's the dot product. So you multiply corresponding elements and then add up all the results. And then you get a number.
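In Python, the same calculation just described, with two made-up five-element vectors:

a = [1, 2, 3, 4, 5]
b = [2, 0, 1, 3, 1]
dot = sum(x * y for x, y in zip(a, b))   # multiply corresponding elements, then add
print(dot)                               # 1*2 + 2*0 + 3*1 + 4*3 + 5*1 = 22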
22:59 - Conference Room (D. B.) - Speaker 1
Asking some other kinds of questions, probing at some other sorts of features of the vector being processed. Very often this step also involves adding another vector to the output, which is full of model parameters learned from data. This other vector is known as the bias. For our example, I want you to imagine that the value of this bias in that very first component is negative one, meaning our final output looks like that relevant dot product, but minus one. You might very reasonably ask why I would want you to assume that the model has learned this, and in a moment you'll see why it's very clean and nice if we have a value here which is positive if and only if our vector encodes the full name M. J., and otherwise it's zero or negative. The total number of rows in this matrix, which is something like the number of questions being asked in the case of GPT-3, whose numbers we've been following, is just under 50,000. In fact, it's exactly four times the number of dimensions in this embedding space. That's a design choice, you could make it more, you could make it less, but having a clean multiple tends to be friendly for hardware. Since this matrix full of weights maps us into a higher dimensional space, I'm going to give it the shorthand w up. I'll continue labeling the vector we're processing as e, and let's label this bias vector as b up and put that all back down in the diagram. At this point, a problem is that this operation is purely linear, but language is a very non-linear process. If the entry that we're measuring is high for M. plus J., it would also necessarily be somewhat triggered by M. plus P. and also A. plus J.
24:37 - Conference Room (D. B.) - Speaker 2
Any questions, comments, discussion points? See, I thought if you were going to add J. to M., that white arrow on the bottom should be that kind of grayed out arrow on top. In other words, you add the vectors end to end: at the arrowhead of the first vector, you add the tail of the second vector, if you're going to add vectors. So I'm confused. It looks like you end up in the same place on the graph, on the graphic there, whether you do it his way or my way. Oh well.
25:25 - Unidentified Speaker
Here we go.
25:26 - Conference Room (D. B.) - Speaker 1
Despite those being unrelated conceptually, what you really want is a simple yes or no for the full name. So the next step is to pass this large intermediate vector through a very simple nonlinear function. A common choice is one that takes all of the negative values and maps them to zero and leaves all of the positive values unchanged. And continuing with the deep learning tradition of overly fancy names, this very simple function is often called the rectified linear unit, or ReLU for short. Here's what the graph looks like, So taking our imagined example where this first entry of the intermediate vector is 1 if and only if the full name is M. J., and 0 or negative otherwise, after you pass it through the ReLU you end up with a very clean value where all of the 0 and negative values just get clipped to 0. So this output would be 1 for the full name M. J. and 0 otherwise.
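A one-line version of the ReLU just described, plus a couple of sample values:

def relu(x):
    return max(0.0, x)   # negative values clip to zero, positive values pass through

print(relu(-1.3))  # 0.0
print(relu(0.7))   # 0.7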
26:27 - Conference Room (D. B.) - Speaker 2
In other words, it very directly mimics the behavior of an AND gate. Anything, anyone?
26:35 - M. M.
Why do we want to get rid of them? To speed up calculations. You're losing a little bit of something, because newer models use different activation functions that give a small value even for negative inputs, so the gradient doesn't go to zero. The gradient will become zero if this is zero, you know. So it's speeding up the calculation; you're losing a few nodes, neurons, from the architecture, but it's increasing the speed.
27:22 - D. D.
And it's computational.
27:24 - Unidentified Speaker
Yeah.
27:24 - D. D.
It's computationally easier too.
27:26 - Unidentified Speaker
Yeah.
27:27 - D. D.
I mean, there's a lot of them, you know, thousands, hundreds of thousands of computations.
27:37 - M. M.
Yes. So removing some of them, dropping some of the neurons, is okay.
27:45 - E. G.
I dropped a file in the chat because, one, I couldn't remember all of the vector mathematics, so I just created a file with it. But it actually goes through identifying orthogonal parallelism. Well, all right.
28:10 - Conference Room (D. B.) - Speaker 2
So computationally, better to just drop all the negative numbers. You don't have to do as many multiplications or whatever, right? So why not drop any positive value that's less than 0.1, and that way you drop some more? Make those zero, then you can drop some more computations with that.
28:39 - M. M.
Isn't that what Softmax does? Exactly. So different activation functions can give different results. Like E. mentioned, if you want Softmax, you can do this. So it depends on what is best for the model: transformers are doing ReLU, most of the CNNs are doing ReLU, but the new models, these diffusion models, are using different activation functions. Yeah, you can use whatever activation function.
29:17 - D. D.
That ReLU activation function has a problem with disappearing numbers or something. Computationally, there's something else that I would have to look up. But that particular activation function there needs them to be positive.
29:35 - M. M.
But this GELU that they mention in the text, this is the new one where negative numbers actually have some value, a small value, but still a value.
29:50 - Unidentified Speaker
So.
29:50 - Unidentified Speaker
OK.
29:51 - Conference Room (D. B.) - Speaker 1
Often models will use a slightly modified function that's called the GELU, which has the same basic shape, it's just a bit smoother, but for our purposes it's a little bit cleaner if we only think about the ReLU. Also, when you hear people refer to the neurons of a transformer, they're talking about these values right here. Whenever you see that common neural network picture with a layer of dots and a bunch of lines connecting to the previous layer, which we had earlier in this series, that's typically meant to convey this combination of a linear step, a matrix multiplication, followed by some simple term-wise nonlinear function like a ReLU. You would say that this neuron is active whenever this value is positive and that it's inactive if that value is zero.
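For comparison, here is ReLU next to the commonly used tanh approximation of GELU (the smoother variant just mentioned); note how GELU gives small nonzero outputs, and nonzero gradients, for mildly negative inputs, which connects to the point raised above about the gradient not going to zero. This is a sketch for illustration, not code from the video:

import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # widely used tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.3f}  gelu={gelu(x):.3f}")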
30:40 - Conference Room (D. B.) - Speaker 2
Well, my comment here at this point is, if you're trying to use a real neuron as a sort of analogy, you're trying to model... Well, neurons can only be positively activated, and they may fire if they're sufficiently positively activated. But negative activations, I don't know if there's a way to negatively activate a neuron or not, but if it doesn't fire, it doesn't fire. And it doesn't matter how much you tell it not to fire. It's just the same thing, right? All negative values behave the same way.
31:26 - Unidentified Speaker
Unless you go into politics.
31:28 - E. G.
Next step looks very similar to the first one.
31:31 - Conference Room (D. B.) - Speaker 1
You multiply by a very large matrix and you add on a certain bias term. In this case, the number of dimensions in the output is back down to the size of that embedding space. So I'm going to go ahead and call this the down-projection matrix. And this time, instead of thinking of things by row, it's actually nicer to think of it column by column. You see, another way that you can hold matrix multiplication in your head is to imagine taking each column of the matrix and multiplying it by the corresponding term in the vector that it's processing, and adding together all of those rescaled columns. The reason it's nicer to think about this way is because here the columns have the same dimension as the embedding space, so we can think of them as directions in that space. For instance, we will imagine that the model has learned to make that first column into this basketball direction that we suppose exists. What that would mean is that when the relevant neuron in that first position is active, we'll be adding this column to the final result. But if that neuron was inactive, if that number was zero, then this would have no effect. And it doesn't just have to be basketball. The model could also make into this column many other features that it wants to associate with something that has the full name M. J. And at the same time, all of the other columns in this matrix are telling you what will be added to the final result if the corresponding neuron is active. And if you have a bias in this case, it's something that you're just adding every single time, regardless of the neuron values. You might wonder what's that doing. As with all parameter-filled objects here, it's kind of hard to say exactly. Maybe there's some bookkeeping that the network needs to do. But you can feel free to ignore it for now. Making our notation a little more compact again, I'll call this big matrix W down, and similarly call that bias vector B down, and put that back into our diagram. Like I previewed earlier, what you do with this final result is add it to the vector that flowed into the block at that position, and that gets you this final result. So for example, if the vector flowing in encoded both first name M. and last name J., then because this sequence of operations will trigger that AND gate, it will add on the basketball direction, so what pops out will encode all of those together. And remember, this is a process happening to every one of those vectors in parallel. In particular, taking the GPT-3 numbers, it means that this block doesn't just have 50,000 neurons in it, it has 50,000 times the number of tokens in the input. So, that is the entire operation. Two matrix products, each with a bias added, and a simple clipping function in between. Any of you who watched the earlier videos of this series will recognize this structure as the most basic kind of neural network that we studied there. In that example, it was trained to recognize handwritten digits. Over here, in the context of a transformer for a large language model, this is one piece in a larger architecture, and any attempt to interpret what exactly it's doing heavily intertwined with the idea of encoding information into vectors of a high-dimensional embedding space. That is the core lesson, but I do want to step back and reflect on two different things, the first of which is a kind of bookkeeping, and the second of which involves a very thought-provoking fact about higher dimensions that I actually didn't know until I dug into transformers. 
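Putting the whole block together, here is a minimal Python sketch of the computation described above: up-projection with bias, ReLU, down-projection with bias, then the result added back to the input vector. The sizes follow the "four times the embedding dimension" convention mentioned in the video, and the weights here are random placeholders rather than learned values:

import numpy as np

d_model = 48
d_hidden = 4 * d_model
rng = np.random.default_rng(0)

W_up = rng.standard_normal((d_hidden, d_model)) * 0.02
b_up = np.zeros(d_hidden)
W_down = rng.standard_normal((d_model, d_hidden)) * 0.02
b_down = np.zeros(d_model)

def mlp(e):
    h = np.maximum(0.0, W_up @ e + b_up)   # "questions" plus ReLU (the AND-gate step)
    return e + (W_down @ h + b_down)       # add learned directions back onto e

e = rng.standard_normal(d_model)
print(mlp(e).shape)   # same dimension as the input, as described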
In the last two chapters, you and I started counting up the total number of parameters in GPT-3 and seeing exactly where they live. So let's quickly finish up the game here. I already mentioned how this up projection matrix has just under 50,000 rows, and that each row matches the size of the embedding space, which for GPT-3 is 12,288. Multiplying those together, it gives us 604 million parameters just for that matrix. And the down projection has the same number of parameters just with a transposed shape. So together, they give about 1.2 billion parameters. The bias vector also accounts for a couple more parameters, but it's a trivial proportion of the total, so I'm not even going to show it. In GPT-3, this sequence of embedding vectors flows through not one, but 96 distinct MLPs, so the total number of parameters devoted to all of these blocks adds up to about 116 billion. This is around two-thirds of the total parameters in the network, and when you add it to everything that we had before for the attention blocks, the embedding, and the unembedding, you do indeed get that grand total of 175 billion as advertised. It's probably worth mentioning there's another set of parameters associated with those normalization steps that this explanation has skipped over, but like the bias vector, they account for a very trivial proportion of the total. As to that second point of reflection, you might be wondering if this central toy example we've been spending so much time on reflects how facts are actually stored in real large language models. It is true that the rows of that first matrix can be thought of as directions in this embedding space, and that means the activation of each neuron tells you how much a given vector aligns with some specific direction. It's also true that the columns of that second matrix tell you what will be added to the result if that neuron is active. Both of those are just mathematical facts. However, the evidence does suggest that individual neurons very rarely represent a single clean feature like M. J. And there may actually be a very good reason this is the case, related to an idea floating around interpretability researchers these days known as superposition. This is a hypothesis that might help to explain both why the models are especially hard to interpret, and also why they scale surprisingly well. The basic idea is that if you have an n-dimensional space and you want to represent a bunch of different features using directions that are all perpendicular to one another in that space, you know, that way if you add a component in one direction it doesn't influence any of the other directions, then the maximum number of vectors you can fit is only n, the number of dimensions. To a mathematician, actually, this is the definition of dimension. But where it gets interesting is if you relax that constraint a little bit and you tolerate some noise. Say you allow those features to be represented by vectors that aren't exactly perpendicular, they're just nearly perpendicular, maybe between 89 and 91 degrees apart. If we were in two or three dimensions, this makes no difference. That gives you hardly any extra wiggle room to fit more vectors in, which makes it all the more counterintuitive that for higher dimensions, the answer changes dramatically. I can give you a really quick and dirty illustration of this using some scrappy Python. It's going to create a list of 100 dimensional vectors, each one initialized randomly. 
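The parameter arithmetic in that passage, checked in a few lines using the GPT-3 numbers as quoted there:

d_model = 12_288
d_hidden = 4 * d_model            # 49,152 -- "just under 50,000"
per_matrix = d_hidden * d_model   # about 604 million
per_block = 2 * per_matrix        # up- and down-projection: about 1.2 billion
total_mlp = 96 * per_block        # 96 MLP blocks: about 116 billion
print(per_matrix, per_block, total_mlp)
# 603979776  1207959552  115964116992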
And this list is going to contain 10,000 distinct vectors, so 100 times as many vectors as there are dimensions. This plot right here shows the distribution of angles between pairs of these vectors. So because they started at random, those angles could be anything from 0 to 180 degrees. But you'll notice that already, even for random vectors, there's this heavy bias for things to be closer to 90 degrees. Then what I'm going to do is run a certain optimization process that iteratively nudges all of these vectors so that they try to become more perpendicular to one another. After repeating this many different times, here's what the distribution of angles looks like, We have to actually zoom in on it here, because all of the possible angles between pairs of vectors sit inside this narrow range between 89 and 91 degrees. In general, a consequence of something known as the Johnson-Lindenstrauss lemma is that the number of vectors you can cram into a space that are nearly perpendicular like this grows exponentially with the number of dimensions. This is very significant for large language models, which might benefit from associating independent ideas with nearly perpendicular directions. It means that it's possible for it to store many, many more ideas than are dimensions in the space that it's allotted. This might partially explain why model performance seems to scale so well with size. A space that has 10 times as many dimensions can store way, way more than 10 times as many independent ideas. And this is relevant not just to that embedding space where the vectors flowing through the model live, but also to that vector full of neurons in the middle of that multilayer perceptron that we just studied. That is to say, at the sizes of GPT it might not just be probing at 50,000 features, but if it instead leveraged this enormous added capacity by using nearly perpendicular directions of the space, it could be probing at many, many more features of the vector being processed. But if it was doing that, what it means is that individual features aren't going to be visible as a single neuron lighting up. It would have to look like some specific combination of neurons instead, a superposition. For any of you curious to more, a key relevant search term here is sparse autoencoder, which is a tool that some of the interpretability people use to try to extract what the true features are, even if they're very superimposed on all these neurons. I'll link to a couple really great anthropic posts all about this. At this point, we haven't touched every detail of a transformer, but you and I have hit the most important points. The main thing that I want to cover in a next chapter is the training process. On the one hand, the short answer answer for how training works is that it's all backpropagation, and we covered backpropagation in a separate context with earlier chapters in the series. But there is more to discuss, like the specific cost function used for language models, the idea of fine-tuning using reinforcement learning with human feedback, and the notion of scaling laws. Quick note for the active followers among you, there are a number of non-machine learning related videos that I'm excited to sink my teeth into before I make that next chapter, so it might be a but do it'll come in due time.
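A rough, scaled-down reproduction of the "nearly perpendicular vectors" experiment described above. The sizes are smaller than the video's 10,000 vectors in 100 dimensions so it runs quickly, and the simple penalty-based nudging is an assumption, not necessarily the exact optimization used in the video; the point is just that the pairwise angles get pushed toward 90 degrees even though there are far more vectors than dimensions:

import numpy as np

rng = np.random.default_rng(0)
n_dims, n_vecs = 50, 500                  # many more vectors than dimensions
V = rng.standard_normal((n_vecs, n_dims))
V /= np.linalg.norm(V, axis=1, keepdims=True)

def pairwise_angles(V):
    cos = np.clip(V @ V.T, -1.0, 1.0)
    off = cos[~np.eye(len(V), dtype=bool)]   # ignore each vector with itself
    return np.degrees(np.arccos(off))

a = pairwise_angles(V)
print("before:", round(a.min(), 1), "to", round(a.max(), 1), "degrees")

lr = 0.01
for _ in range(2000):
    cos = V @ V.T
    np.fill_diagonal(cos, 0.0)
    V -= lr * (cos @ V)                   # push pairwise dot products toward zero
    V /= np.linalg.norm(V, axis=1, keepdims=True)

a = pairwise_angles(V)
print("after: ", round(a.min(), 1), "to", round(a.max(), 1), "degrees")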
41:45 - Unidentified Speaker
All right.
41:47 - Conference Room (D. B.) - Speaker 2
Many people contribute to this.
41:52 - M. M.
It's really very good, but a lot of people are working.
42:03 - Unidentified Speaker
So many.
42:05 - Conference Room (D. B.) - Speaker 2
OK. Any other comments or questions? I'm just sort of, I guess that's sort of flabbergasted by this phenomenon where this technology has these emergent properties that even the inventors didn't really anticipate. And also they don't even know how it stores data. Like if you want to store that M. J. plays basketball in a database, you'd know how to do it. You'd have a key, right? A record with a key would probably be the string M. J. There would be a column for, you know, sport played or profession or something like that. And in the cell in that column would be, you know, basketball. That's how you'd sort in a database and compare sizes.
43:02 - D. D.
Go ahead. It tokenizes it, and then the token is known by its weights. So it's got its own way of, you know, storing it, but it's not, you know, it's not something that we would read like a database, but it's stored in the weights.
43:25 - M. M.
Yeah, but it can be converted from one storage format to another storage format. This is also possible. You know, another...
43:37 - Conference Room (D. B.) - Speaker 2
That'd be a good project.
43:39 - M. M.
What D. asked: if you want to do the database, you can convert it to a database. That would be a really good project.
43:49 - D. D.
Something that you could get a hold of, like one of the smaller BERT models on Hugging Face, and change it over to a database.
44:02 - D. D.
It had to be a small one.
44:10 - Unidentified Speaker
Yep. Different concept.
44:14 - Conference Room (D. B.) - Speaker 2
Whoops, sorry about that. Right, well, we finished it. Whoops.
44:27 - Unidentified Speaker
Yep.
44:29 - Unidentified Speaker
What did I do?
44:35 - Conference Room (D. B.) - Speaker 2
You know, another question that kind of has a long history in scientific investigation is the question of how humans and animals and brains store memory. And nobody's ever really figured it out, but maybe this is sort of how it's done. It's not stored in any particular neuron. It's just sort of stored as a... Maybe this is something for experimental psychology as well. Okay, I guess we're at a good stopping point. We finished the Chapter 7 video. Next time, D. will step us through the process by which he gets these transcripts from Read.AI. So Read.AI gives the transcript, and then he processes it in an interesting way to anonymize it and so on. This is last week's transcript. I didn't get rid of that. And soon we will start a new reading or potentially video, depending on what people want, since we're finished with this series. All right, folks. Well, thanks for joining in, and we'll see you next time. Thanks, guys.
46:28 - D. D.
As always, thank you.