Friday, June 20, 2025

6/20/25: Guest speaker with demo - Mark Windsor

Artificial Intelligence Study Group

Welcome! We meet from 4:00-4:45 p.m. Central Time on Fridays. Anyone can join. Feel free to attend any or all sessions, or ask to be removed from the invite list; we have no wish to send unneeded emails, of which we all certainly get too many.
Contacts: jdberleant@ualr.edu and mgmilanova@ualr.edu

Agenda & Minutes (167th meeting, June 20, 2025)

  • Today:
    • Guest speaker with demo: Mark Windsor of Atlas Research (https://atlas-research.io). His system converts academic papers into code, integrating AI and Jupyter Notebooks.

Friday, June 13, 2025

6/13/25: Turning papers into code? And chapter 6 video

Artificial Intelligence Study Group

Welcome! We meet from 4:00-4:45 p.m. Central Time on Fridays. Anyone can join. Feel free to attend any or all sessions, or ask to be removed from the invite list; we have no wish to send unneeded emails, of which we all certainly get too many.
Contacts: jdberleant@ualr.edu and mgmilanova@ualr.edu

Agenda & Minutes (166th meeting, June 13, 2025)

Table of Contents
* Agenda and minutes
* Appendix: Transcript (when available)

Agenda and Minutes
  • Announcements, updates, questions, etc. as time allows. 
  • Elisabeth Sherwin writes:

    Wednesday, June 18, 12-1pm, we will have a session with Brad Sims. He will teach us about the best practice for crafting prompts and we will then practice it, talk about our results and learn about refining the prompts.


    Zoom link for Teaching with AI meeting: https://ualr-edu.zoom.us/j/87200189042

     

  • If anyone has an idea for an MS project where the student reports to us for a few minutes each week for discussion and feedback - a student could likely be recruited! Let me know .... 
    • We discussed book projects but those aren't the only possibilities.
    • VW had some specific AI-related topics that need books about them.
  • Any questions you'd like to bring up for discussion, just let me know.
    • GS would like to compose some questions about agentic AI for discussion soon, presenting and/or guiding the discussion.
  • Soon: YP would like to lead a discussion or presentation on expanding our discussions to group.me or a similar platform. Any time should be fine. 
  • Anyone read an article recently they can tell us about next time?
  • Any other updates or announcements?
  • From: markwindsorr@atlas-research.io    
    Loved reading your paper (Moore's law, Wright's law and the Countdown to Exponential Space). I'm Mark, a solo developer and founder of Atlas Research.
         I've built an AI integrated Jupyter Notebook that reproduces research papers into executable code, all you need to do is upload the PDF and the AI pipeline does the work. Been reaching out to some authors of popular papers on Arxiv, wanted to ask if it was ok to have your paper in my library that can be displayed in a pdf/markdown/LaTeX viewer alongside the notebook in the app.
         If you're interested, there's a beta at https://atlas-research.io. If you need more credits to claude or open ai, or want a feature built to solve any of your problems, please please reach out. (Will bend over backwards to help). Always available to have a chat too if you'd like to know more. Many researcher/authors' have been using my pipeline to reproduce other peoples work to try out new ideas, especially in finance. Can book here: https://cal.com/mark-windsorr/atlas-research-demo if interested
         Cheers,
    • Options: schedule a demo with the AI Discussion Group (though timing may be a problem), schedule a demo with anyone interested at a more workable time, record a demo with permission and play it back to the AI group, or meet with him individually. Time slots in his calendar seem to be between about 11 pm and 11 am CDT.
  • Chapter 6 video, https://www.youtube.com/watch?v=eMlx5fFNoYc. We got up to 11:07.
  • Here is the latest on future readings and viewings
    • Let me know of anything you'd like to have us evaluate for a fuller reading.
    • https://transformer-circuits.pub/2025/attribution-graphs/biology.html.
    • https://arxiv.org/pdf/2001.08361. 5/30/25: eval was 4.
    • We can evaluate https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10718663 for reading & discussion.
    • popular-physicsprize2024-2.pdf got an evaluation of 5.0 for a detailed reading.
    • https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-refusals
    • https://venturebeat.com/ai/anthropic-flips-the-script-on-ai-in-education-claude-learning-mode-makes-students-do-the-thinking
    • https://transformer-circuits.pub/2025/attribution-graphs/methods.html
      (Biology of Large Language Models)
    • We can work through chapter 7: https://www.youtube.com/watch?v=9-Jl0dxWQs8
    • https://www.forbes.com/sites/robtoews/2024/12/22/10-ai-predictions-for-2025/
    • Prompt engineering course:
      https://apps.cognitiveclass.ai/learning/course/course-v1:IBMSkillsNetwork+AI0117EN+v1/home
  • Schedule back burner "when possible" items:
    • TE is in the informal campus faculty AI discussion group. SL: "I've been asked to lead the DCSTEM College AI Ad Hoc Committee. ... We’ll discuss AI’s role in our curriculum, how to integrate AI literacy into courses, and strategies for guiding students on responsible AI use."
    • Anyone read an article recently they can tell us about?
    • If anyone else has a project they would like to help supervise, let me know.
    • (2/14/25) An ad hoc group is being formed on campus by ES for people to discuss AI and the teaching of diverse subjects. It would be interesting to hear from someone in that group at some point to see what people are thinking and doing regarding AIs and their teaching activities.
    • The campus has assigned a group to participate in the AAC&U AI Institute's activity "AI Pedagogy in the Curriculum." IU is on it and may be able to provide updates now and then. 
Appendix: Transcript 

Friday, June 6, 2025

6/6/25: Finish chapter 6 video

 Artificial Intelligence Study Group

Welcome! We meet from 4:00-4:45 p.m. Central Time on Fridays. Anyone can join. Feel free to attend any or all sessions, or ask to be removed from the invite list; we have no wish to send unneeded emails, of which we all certainly get too many.
Contacts: jdberleant@ualr.edu and mgmilanova@ualr.edu

Agenda & Minutes (165th meeting, June 6, 2025)

Table of Contents
* Agenda and minutes
* Appendix: Transcript (when available)

Agenda and Minutes
  • Announcements, updates, questions, etc. as time allows. 
  • Marla Johnson writes:

    Coding for Wellness AI Hackathon

    Friday, June 13th will be a team's lucky day and they will take away over $2000 in a cash prize at the Coding for Wellness AI Hackathon.
        You can learn more at this link. https://ualr.edu/news/2025/06/05/ai-hackathon-pitch. Go there to reserve your space at the free event. 
         Our participants will walk away with several class hours in AI and certificates from NVIDIA. Dr. Brian Berry and I are going to these classes -- they are SO good! Students will also learn how to work together as a team, how to manage a product, how to validate their solutions with customers, how to build a pitch deck using AI, etc. 
         I hope to see you there! 
    Marla
     
    ...and... 
     
    We will have 40 students and 40 volunteers in the Library and EIT building next week, June 9 - 13. You may see them around especially at 3 pm when they have mental health breaks including a drumming circle, art therapy, dancing, and singing bowls thanks to UA Little Rock Counseling Services and the Central Arkansas Veteran's Hospital System. 
          I hope many of you cheer our high school and college students and register (for free) to see our big finale of pitches and demos for a $2100 prize. Sign up here: https://forms.gle/LWjxuhPkChoTzVSe8
  • Elisabeth Sherwin writes:

    Greetings all,


    Wednesday, June 18, 12-1pm, we will have a session with Brad Sims. He will teach us about the best practice for crafting prompts and we will then practice it, talk about our results and learn about refining the prompts.


    Zoom link for Teaching with AI meeting: https://ualr-edu.zoom.us/j/87200189042

     

  • If anyone has an idea for an MS project where the student reports to us for a few minutes each week for discussion and feedback - a student could likely be recruited! Let me know .... 
    • We discussed book projects but those aren't the only possibilities.
    • VW had some specific AI-related topics that need books about them.
  • Any questions you'd like to bring up for discussion, just let me know.
    • GS would like to compose some questions about agentic AI for discussion soon, presenting and/or guiding the discussion.
  • Soon: YP would like to lead a discussion or presentation on expanding our discussions to group.me or a similar platform. Perhaps on 6/13/25 or TBD.
  • Anyone read an article recently they can tell us about next time?
  • Any other updates or announcements?
  • We finished the Chapter 6 video, https://www.youtube.com/watch?v=eMlx5fFNoYc.
  • Here is the latest on future readings and viewings
    • Let me know of anything you'd like to have us evaluate for a fuller reading.
    • https://transformer-circuits.pub/2025/attribution-graphs/biology.html.
    • https://arxiv.org/pdf/2001.08361. 5/30/25: eval was 4.
    • We can evaluate https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10718663 for reading & discussion.
    • popular-physicsprize2024-2.pdf got an evaluation of 5.0 for a detailed reading.
    • https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-refusals
    • https://venturebeat.com/ai/anthropic-flips-the-script-on-ai-in-education-claude-learning-mode-makes-students-do-the-thinking
    • https://transformer-circuits.pub/2025/attribution-graphs/methods.html
      (Biology of Large Language Models)
    • We can work through chapter 7: https://www.youtube.com/watch?v=9-Jl0dxWQs8
    • https://www.forbes.com/sites/robtoews/2024/12/22/10-ai-predictions-for-2025/
    • Prompt engineering course:
      https://apps.cognitiveclass.ai/learning/course/course-v1:IBMSkillsNetwork+AI0117EN+v1/home
  • Schedule back burner "when possible" items:
    • TE is in the informal campus faculty AI discussion group. SL: "I've been asked to lead the DCSTEM College AI Ad Hoc Committee. ... We’ll discuss AI’s role in our curriculum, how to integrate AI literacy into courses, and strategies for guiding students on responsible AI use."
    • Anyone read an article recently they can tell us about?
    • If anyone else has a project they would like to help supervise, let me know.
    • (2/14/25) An ad hoc group is being formed on campus by ES for people to discuss AI and the teaching of diverse subjects. It would be interesting to hear from someone in that group at some point to see what people are thinking and doing regarding AIs and their teaching activities.
    • The campus has assigned a group to participate in the AAC&U AI Institute's activity "AI Pedagogy in the Curriculum." IU is on it and may be able to provide updates now and then. 
Appendix: Transcript 
 
AI Discussion Group
Fri, Jun 6, 2025

0:09 - D. B.
Hi, everyone.

0:13 - Unidentified Speaker
All right. Can you all hear me? Yes, we can.

0:36 - E. G.
And I can hear you, so we're good.

0:38 - D. B.
All right. I'm going to go ahead and share my screen.

1:25 - Unidentified Speaker
All right, well, here we are. Welcome, everyone. So let's see.

1:35 - M. M.
Dan, I sent you the challenges that we're going to do next week is this AI hackathon. Yeah, so some of you asked me to upload the challenges and probably... Okay, yeah, I didn't put them here because it's like 10 pages long, but are you involved?

1:58 - D. B.+M. M.
Well, let me just kind of give people a rundown of what it is, and then...

2:04 - M. M.
Oh, yeah, I'm involved, of course. Actually, it was my suggestion, when I talk with Marla, and Marla accepts the idea very well, and she says, oh, we need this. And now so many participants, so many people are interested. So we will have actually more than 40 people that participate in teams, additional 40 volunteers, maybe around 10 companies that are supporting the event. So she actually advertised event and probably the TV station, the company that is here, our local TV. Wow, cool.

2:56 - D. B.
Will come, yes, to record it.

3:01 - M. M.
So she's wonderful. Advertising the event. And yeah.

3:05 - D. B.+M. M.
This is her, by the way, M. J. She's been to this meeting a couple of times.

3:13 - M. M.
Yes, yes, yes. But she's extremely busy right now organizing all of this very, very well organized event. So the challenge is you can share or you can upload it. I can go ahead.

3:30 - D. B.+M. M.
Mental health is a hackathon about the mental health and interesting topics.

3:39 - M. M.
They are coming from veterans, they are coming from Blue Cross, mostly it's using agenting AI, conversational AI, creating, yes, these are the challenges. Creating bots. My students, myself, my team actually is giving two NVIDIA certificates. One is in large language models and RAC. Together the certificate is this one, but we also teach people a JNTKI. This is one certificate. That is for both, we have actually high school too. So high school and graduate students. So the first one is not coding, but just explain the, you can show the second one and third, five different topics. This is all next week, right? Yes, this is all this next week. We prepare a lot of resources for the students, for students and mostly, yeah, speak easy. This is one of the interesting also. Digital Companion for Social Inheritance. Everything is related with mental health. So the topics are selected, as you can see, this is from Children's Hospital, selected. So they came up- There's going to be teenagers are going to be involved, the high schoolers are going to be involved in this hackathon challenge? Yes, yes. The hackathon, yeah, so we can report more who is the winner next Friday. For now, and they have a good price actually, $2,000 and more. For the winner. Yeah, with the competition judges. Like I say, for one of the certificates, I'll say this is another one from mental health clinics. That one's interesting.

6:09 - E. G.
This one?

6:11 - D. B.
Yeah, it's a veteran.

6:14 - E. G.
The hard part is a lot of veterans don't take into consideration mental health. They think of it as a weakness.

6:29 - D. B.
I understand that PTSD is pretty hard to treat. It is.

6:37 - E. G.
It's also hard to diagnose. It masks itself. It has to be in a situation to present itself. It's like a landmine. Is nothing more than a landmine that needs an event for it to explode.

6:58 - M. M.
We think that with all of this, E., instead of people saying we have resources here, but we have resources in different folder that is for the participants. We think that people can express the feelings and concerns better with some kind of digital avatar instead of, yes, instead of talking with people, okay? Exactly. What people will say and stuff. So what we teach the students how to use this retrieval augmented generation tool, put particular documents, okay, where they start learning how to use large language models, variety. Actually, it's a very intensive program, so probably I should give you the program, but we will do online. So for all of you, we will have it online. We're just experimenting here in person. But what I want to say, yes, this is our Avatars, they are very helping people to communicate, to feel not alone, to feel understand it, and with the information is very guided with variety of resources additional to what ChatGPT can give your open source resources. You can show the last one is very good. This is why I. actually present to you guys here. The storytelling, okay? Storytelling is text to image and text to video, okay? And I actually heard on the TV that for veterans also, they want to share the stories and they like this storytelling. So this is actually not only for veterans, so not only for mental health, but we are applying for kids also, for language education. We participate in Hackathon in NVIDIA competition with this project and I. like it. And I think that he mentioned here is for kids, but it's not only. But anyway, I'm telling you, this is very, very interesting project. All of them.

9:29 - D. B.+M. M.
So take a look.

9:32 - M. M.
and we will do for you online. All right, so yeah, so Dr.

9:39 - D. B.
M.'s involved in it, M. J.'s arranging it, and it's a mental health hackathon, and this is going to be next week, has a prize of $2,100. Yeah. And oh, they've got 40 students, 40 volunteers, and oh, okay, you can go there. They'll be in the library and EIT building next week. And she suggests going around 3 p.m. When they have mental health hackathons, they have mental health breaks, including stuff.

10:15 - M. M.
Yeah, if you're around, you're welcome. John, are you involved in this?

10:21 - J. C.
I am not. Nope.

10:23 - Unidentified Speaker
No.

10:23 - M. M.
Yeah. I know Marla has been very busy, though.

10:28 - J. C.+M. M.
Yeah. So.

10:29 - M. M.
But if you have time, stop by here, OK? We will be happy to see you. And next Friday is the official ceremony, award ceremony. Yeah, let's see.

10:42 - D. B.+M. M.
There's something about Friday. When is it?

10:46 - M. M.
They're going to have an award ceremony, right?

10:49 - D. B.
Yeah. Demo and pitch competition is going to be 3 to 5 on next Friday?

10:57 - D. B.+M. M.
Yes, exactly.

10:58 - M. M.
This is open to everybody and actually, they will probably present some demos or something competition. So it's very, very interesting. I told you I participate in NVIDIA and was nice and IWS also competition. It's very interesting because the teams, they present the work. Marla doesn't expect. Product, you know about designing the product and actually J. recommends your books. I mean the books that you recommend think like a human, you know Speak like a human remember that J. C. actually is my mentor if you don't know him. He's a very very famous Businessman looks like this is going to overlap with our meeting. I mean, are you going to be there in the auditorium, Fanny? Yeah, it's true.

12:02 - D. B.
This is true. How many people here are planning on going to that, would rather, or are planning to go to this besides F.?

12:12 - M. M.
Daniel can come, probably. OK.

12:14 - D. B.
Well, I'm just thinking we could cancel the meeting next week. But I think we won't because, you people aren't going to be at the demo.

12:25 - M. M.
But then you guys can tell us what happened the following week. We can record it. We can record it.

12:34 - D. B.
Marla usually record it.

12:36 - M. M.
But she's not going to live stream it, is she? I don't know. Probably not.

12:43 - D. B.
If we do it with Blackboard, we can do it. Right. Well, anyways. Yeah, I cannot.

12:51 - D. B.+M. M.
I look forward to hearing more about it.

12:55 - D. B.
And there's another. So you can go to that if you're in town. Go to that next Friday. And then there's another thing going on a couple of weeks in the future. On the 18th, Wednesday, there's going to be a workshop on prompts. It's going to be for people who are not AI specialists or computer folks. It's going to be for teachers at UALR. So a kind of general introduction to prompt engineering.

13:34 - D. D.
It's probably going to be really good. Even if we fancy ourselves experts, there's probably something we can learn there.

13:43 - Unidentified Speaker
Yeah.

13:43 - D. B.
Definitely. Definitely. So anyway, you can register at this. Link, if you like, and go to it. It's going to be on by Zoom, right? I don't know.

13:55 - M. M.
That would be good. It's good.

13:58 - D. B.
It's Zoom. Yes, it's Zoom. And T. E., who's here sometimes, goes to those. They have weekly meetings during the semester on how to teach with AI, and he's been going to them. Except they're not having them during the summer. Okay, as always, I'll just mention that if you have an idea for an MS master's project you wanna see sort of jointly as a group supervised, let me know and I'll try to recruit a student. We tried it for a book project and it worked out pretty well, I thought. And also, if you have any questions you'd like to bring up for discussion, let me know and I'll put them on the agenda. We did that last time. We had an interesting question on rubrics for evaluating master's type AI projects. And we had a discussion about that. And Y. said something about potentially leading a discussion about expanding this group to group me, something like that. So he's certainly welcome to do that. Anyone read any articles? Want to brief us on next time? If so, let me know. Any other announcements or updates or anything like that? Well, if not, let's go to our chapter. We've been sort of just been on the back burner for a while, this chapter six video. And we're up to time 1308. We'll start there as soon as I work my windows properly. Let me go to here. Here it is.

15:45 - E. G.
OK, I'm going to go like that.

15:48 - D. B.
And I'm just going to play a minute or two of it. And then we'll stop and discuss it. And we'll keep on going like that. So here we go. I'm going to turn up the volume a little bit.

16:09 - Unidentified Speaker
Okay, great.

16:10 - D. B.
Computing this pattern lets the model deduce which words are relevant to which other words. Can you all hear that okay, by the way?

16:17 - E. G.
Yes, sir. Okay.

16:18 - D. B.
Now you need to actually update the embeddings, allowing words to pass information to whichever other words they're relevant to. For example, you want the embedding of fluffy to somehow cause a change to creature that moves it to a different part of this 12,000 dimensional embedding space. That more specifically encodes a fluffy creature. Comments or questions? Discussion points? Okay. What I'm going to do here is first show you the most straightforward way that you could do this, though there's a slight way that this gets modified in the context of multi-headed attention. This most straightforward way would be to use a third matrix, what we call the value matrix, which you multiply by the embedding of that first word, for example, fluffy. The result of this is what you would call a value vector. And this is something that you add to the embedding of the second word. In this case, something you add to the embedding of creature.
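
A minimal NumPy sketch of the value-update step just described: the value matrix maps the embedding of "fluffy" to a value vector, which (scaled by the attention weight) is added to the embedding of "creature". The dimensions and variable names are illustrative only; GPT-3's actual embedding dimension is 12,288.

    import numpy as np

    d_model = 512                             # small for illustration (GPT-3 uses 12,288)
    rng = np.random.default_rng(0)

    W_V = rng.normal(size=(d_model, d_model)) * 0.01   # value matrix (full-rank form)
    e_fluffy = rng.normal(size=d_model)       # context-free embedding of "fluffy"
    e_creature = rng.normal(size=d_model)     # context-free embedding of "creature"

    v_fluffy = W_V @ e_fluffy                 # value vector produced from "fluffy"
    attn_weight = 0.9                         # attention of "creature" on "fluffy" (illustrative)
    e_creature_refined = e_creature + attn_weight * v_fluffy   # nudged toward "fluffy creature"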

17:22 - E. G.
Comments or questions? Well, I do. And this is something I was thinking about because I've already watched this and these vectors are pointing to a point in space, as he said, 10,000 dimensions here. But as I was thinking about it, and Dr. M., I'd like to get your opinion on this, is rather than going in a point and looking for a vector in space, wouldn't it more be a cloud? So you're kind of looking like an area in between all of these bound areas.

17:58 - M. M.
Yeah, but they create the cloud. This is what is initial creation of the cloud. You're just creating the cloud now.

18:07 - E. G.
But in here, it's looking at changing a vector to point over to this other place. Is it pointing to a place or is it pointing to an area in space in this particular case?

18:22 - M. M.
It's pointing in the area. You're right. It's pointing in the area. But it's pointing with the influence of content information. So this is the most important, you know? So it's not taking just individual words, but it's taking the content words, like a fluffy, together with the creature. And it's going to the creature. So fluffy can be something else. It's another cloud. It's going to the cloud that are creatures or animals. Or is going to the cloud, you're correct, because all of this is not individual points. There are multiple points, depending from the dimensionality, 10 million or more, you know, dimension. Thank you for explaining that. I was interpreting it as more of a redirection of a vector.

19:17 - Unidentified Speaker
Well, what is W sub V?

19:20 - D. B.
That's the weights, the weight-

19:24 - M. M.
These are the weights currently.

19:24 - M. M.
current value of the weights, current situation of the neural network. For example, you have, I like this cat, but in one moment is coming, I like the drop too. So where do you go? To the cats or drop with like, So after one sentence is coming another sentence that probably completely the content. And the neural network is taking one only word and predict the weights. Another word and predict the weights. But the transformer is doing more. It's going back and forward, you know, to capture all this content and to create. This is why this vector space will be dynamic. It's not fixed in, you know, put the They are going in the right direction. No, they are flexible. They are moving until the whole training is done.

20:29 - D. B.
Okay, so the vertical vector there is fluffy, right? And then we're doing a matrix multiplication with W's. The weights of the neural network is processing fluffy to give us fluffy creature. To give you just current value, this current vector, okay, this value is the current vector, but like I say, it's not stable until the end of the training of the neural network.

20:59 - M. M.
So this is just the initial step that is, and it's like this, you know, it's a training like many hours, depending from the corpus, the amount of text that needs to It's the terms that are annealing out the final product.

21:22 - Unidentified Speaker
So you start with creature, that's a huge corpus of data.

21:29 - E. G.
Fluffy creature, you're reducing it down. Blue fluffy creature, reducing it down further.

21:39 - M. M.
Correct. Yeah, yeah. Concentrate or allocate precisely where this vector will go and to train the model and the model is trained, it's okay. All right, here we go.

21:52 - D. B.
So this value vector lives in the same very high dimensional space as the embeddings. When you multiply this value matrix by the embedding of a word, you might think of it as saying, if this word is relevant to adjusting the meaning of something else, what exactly should be added to the embedding of that something else in order to reflect this? Looking back in our diagram, let's set aside all of the keys and the queries, since after you compute the attention pattern, you're done with those. Then you're going to take this value matrix and multiply it by every one of those embeddings to produce a sequence of value vectors. You might think of these value vectors as being kind of associated with the corresponding keys. For each column in this diagram, you multiply each of the value vectors by the corresponding weight in that column. For example, here, under the embedding of creature, you would be adding large proportions of the value vectors for fluffy and blue, while all of the other value vectors get zeroed out, or at least nearly zeroed out. And then finally, the way to actually update the embedding associated with this column, previously encoding some context-free meaning of creature, you add together all of these rescaled values in the column, producing a change that you want to add that I'll label delta e, and then you add that to the original embedding. Hopefully, what results is a more refined vector encoding the more contextually rich meaning, like that of a fluffy blue creature. Okay, any comments or questions or anything? All right. And of course, you don't just do this to one embedding, you apply the same weighted sum across all of the columns in this picture, producing a sequence of changes. Adding all of those changes to the corresponding embeddings produces a full sequence of more refined embeddings popping out of the attention block. Zooming out, this whole process is what you would describe as a single head of attention. As I've described things so far, this process is parameterized by three distinct matrices, filled with tunable parameters, the key, the query, and the value. I want to take a moment to continue what we started in the last chapter with a scorekeeping where we count up the total number of model parameters using the numbers from GPT-3. These key and query matrices each have 12,288 columns, matching the embedding dimension, and 128 rows, matching the dimension of that smaller key query space. This gives us an additional 1.5 million parameters for each one. If you look at that value matrix by contrast, the way I've described things so far would suggest that it's a square matrix that has 12,288 columns and 12,288 rows, since both its inputs and its outputs live in this very large embedding space. If true, that would mean about 150 million added parameters. And to be clear, you could do that. You could devote orders of magnitude more parameters to the value map than to the key and query. But in practice, it is much more efficient if instead you make it so that the number of parameters devoted to this value map is the same as the number devoted to the key and the query. This is especially relevant in the setting of running multiple attention heads in parallel. The way this looks is that the value map is factored as a product of two smaller matrices. 
Conceptually, I would still encourage you to think about the overall linear map, one with inputs and outputs both in this larger embedding space, for example taking the embedding of blue to this blueness direction that you would add to nouns. It's just that it's broken up into two separate steps. The first matrix on the right here has a smaller number of rows, typically the same size as the key query space. What this means is you can think of it as mapping the large embedding vectors down to a much smaller space. This is not the conventional naming, but I'm going to call this the value-down matrix. The second matrix maps from the smaller space back up to the embedding space, producing the vectors that you use to make the actual updates. I'm going to call this one the value-up matrix, which, again, is not conventional. The way that you would see this written in most papers looks a little different. I'll talk about it in a minute. In my opinion, it tends to make things a little more conceptually confusing. To throw in linear algebra jargon here, what we're basically doing is constraining the overall value map to be a low-rank transformation. Turning back to the parameter count, all four of these matrices have the same size. Adding them all up, we get about 6.3 million parameters for one attention head.
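
The parameter arithmetic quoted here can be checked directly; a short Python sketch using the GPT-3 figures mentioned in the video (embedding dimension 12,288, key/query dimension 128, and a value map factored into two matrices of that same size):

    d_embed, d_kq = 12_288, 128

    per_matrix = d_embed * d_kq               # 1,572,864, i.e. about 1.5 million
    per_head = 4 * per_matrix                 # key + query + value-down + value-up: ~6.3 million
    square_value = d_embed * d_embed          # ~151 million if the value map were left square

    print(per_matrix, per_head, square_value)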

OK. Well, we certainly heard the idea of the number of parameters in a neural network. Any comments or questions? Do I continue? As a quick side note, to be a little more accurate, everything described so far is about what you would call a self-attention head to distinguish it from a variation that comes up in other models that's called cross-attention. This isn't relevant to our GPT example, but if you're curious, cross-attention involves models that process two distinct types of data, like text in one language and text in another language that's part of an ongoing generation of a translation, or maybe audio input of speech and an ongoing transcription. A cross-attention head looks almost identical, the only difference is that the key and query maps act on different data sets. In a model doing translation, for example, the keys might come from one language, while the queries come from another, and the attention pattern could describe which words from one language correspond to which words in another. And in this setting, there would typically be no masking, since there's not really any notion of later tokens affecting earlier ones. Staying focused on self-attention, though, if you understood everything so far, and if you were to stop here, you would come away with the essence of what attention really is. All that's really left to us is to lay out the sense in which you do this many, many different times. In our central example, we focused on adjectives updating nouns. But of course, there are lots of different ways that context can influence the meaning of a word. If the words "they crashed the" preceded the word "car", it has implications for the shape and the structure of that car, and a lot of associations might be updated. If the word wizard is anywhere in the same passage as Harry, it suggests that this might be referring to Harry Potter, whereas if instead the words Queen, Sussex, and William were in that passage, then perhaps the embedding of Harry should instead be updated to refer to the prince. For every different type of contextual updating that you might imagine, the parameters of these key and query matrices would be different to capture the different attention patterns, and the parameters of our value map would be different based on what should be added to the embeddings. Well, anything? All right, continue. And again, in practice, the true behavior of these maps is much more difficult to interpret, where the weights are set to do whatever the model needs them to do to best accomplish its goal of predicting the next token. As I said before, everything we described is a single head of attention, and a full attention block inside a transformer consists of what's called multi-headed attention, where you run a lot of these operations in parallel, each with its own distinct key, query, and value maps. GPT-3, for example, uses 96 attention heads inside each block. Considering that each one is already a bit confusing, it's certainly a lot to hold in your head. Just to spell it all out very explicitly, this means you have 96 distinct key and query matrices, producing 96 distinct attention patterns. Then each head has its own distinct value matrices used to produce 96 sequences of value vectors. These are all added together using the corresponding attention patterns as weights. What this means is that for each position in the context, each token, every one of these heads produces a proposed change to be added to the embedding in that position. 
So what you do is you sum together all of those proposed changes, one for each head, and you add the result to the original embedding of that position. This entire sum here would be one slice of what's outputted from this multi-headed attention block, a single one of those refined embeddings that pops out the other end of it. Again, this is a lot to think about, so don't worry at all if it takes some time to sink in. The overall idea is that by running many distinct heads in parallel, you're giving the model the capacity to learn many distinct ways that context can...
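
A minimal sketch of the multi-head aggregation just described: every head proposes a change (a "delta e") for each position, the proposals are summed, and the sum is added to the original embedding. The head count follows the GPT-3 figure quoted above; the other sizes are shrunk for illustration.

    import numpy as np

    n_heads, seq_len, d_model = 96, 4, 512    # d_model shrunk for illustration (GPT-3: 12,288)
    rng = np.random.default_rng(1)

    embeddings = rng.normal(size=(seq_len, d_model))                    # one embedding per token
    head_deltas = rng.normal(size=(n_heads, seq_len, d_model)) * 0.01   # proposed change per head

    refined = embeddings + head_deltas.sum(axis=0)   # sum the proposals, add to the originals
    print(refined.shape)                             # (seq_len, d_model)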

30:37 - E. G.
I got a question here. Sorry, what? I got a question here. It keeps using 96 heads in GPT-3. One of the things that I tested out was the parallelism. Now, I put a model that fully fit in my video card And it was able to address requests, prompts, 20 times faster, 30 times faster than a larger model or the same model where I had to actually use physical memory of my computer. How is that it doesn't add up?

31:28 - E. G.
Well, this is training the model.

31:39 - D. D.
This is when they train it, not when you're inferring it.

31:47 - D. D.+E. G.
OK. Exactly.

31:49 - M. M.
Daniel, I'll answer your question. It is the training and it's a huge model. So if you use something else and perform better in your case, yeah. In many cases, actually, the small models can perform better.

32:11 - E. G.
No, it was just, it's not the small or the large. It was operating in the video card or operating on system memory.

32:23 - M. M.
Yeah, but it's inference, but we're talking about training here, like...

32:30 - E. G.
Okay. All right.

32:32 - D. B.
Just the training.

32:34 - M. M.
Yeah. Any more comments? Questions? Well, we don't know exactly, or should we, okay, for all of the models, how they are trained. We have some information and different embeddings even. There is a table, I think I have it here, what the different models are using. Okay. All right.

33:01 - D. B.
There's one added slightly annoying thing that I should really mention for any of you who go on to read more about transformers. You remember how I said that the value map is factored out into these two distinct matrices, which I labeled as the value-down and the value-up matrices. The way that I framed things would suggest that you see this pair of matrices inside each attention head, and you could absolutely implement it this way. That would be a valid design. But the way that you see this written in papers, and the way that it's implemented in practice, looks a little different. All of these value-up matrices for each head appear stapled together in one giant matrix that we call the output matrix, with the entire multi-headed attention block. And when you see people refer to the value matrix for a given attention head, they're typically only referring to this first step, the one that I was labeling as the value down projection into the smaller space. For the curious among you, I've left an on-screen note about it. It's one of those details that runs the risk of distracting from the main conceptual points, but I do want to call it out just so that you know if you read about this in other sources. Setting aside all the technical nuances, in the preview from the last chapter we saw how data flowing through a transformer doesn't just flow through a single attention block. For one thing, it also goes through these other operations called multi-layer perceptrons. We'll talk more about those in the next chapter. And then it repeatedly goes through many, many copies of both of these operations. What this means is that after a given word imbibes some of its context, there are many more chances for this more nuanced embedding to be influenced by its more nuanced surroundings. The further down the network you go, with each embedding taking in more and more meaning from all the other embeddings, which themselves are getting more and more nuanced, the hope is that there's the capacity to encode higher-level and more abstract ideas about a given input beyond just descriptors and grammatical structure. Things like sentiment and tone and whether it's a poem and what underlying scientific truths are relevant to the piece and things like that. Turning back one more time, GPT-3 includes 96 distinct layers, so the total number of key, query, and value parameters is multiplied by another 96, which brings the total sum to just under 58 billion distinct parameters devoted to all of the attention heads. That is a lot, to be sure, but it's only about a third of the 175 billion that are in the network in total. So even though attention gets all of the attention, the majority of parameters come from the blocks sitting in between these steps. In the next chapter, you and I will talk more about those other blocks, and also a lot more about the training process. Any comments or questions?
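
The closing arithmetic can also be verified with the figures quoted in the video (roughly 6.3 million attention parameters per head, 96 heads per block, 96 layers, 175 billion parameters in total):

    per_head = 4 * 12_288 * 128                       # ~6.3 million (key, query, value-down, value-up)
    n_heads, n_layers = 96, 96

    attention_total = per_head * n_heads * n_layers   # 57,982,058,496, i.e. just under 58 billion
    total_params = 175_000_000_000
    print(attention_total, round(attention_total / total_params, 2))   # roughly one third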

35:53 - Unidentified Speaker
All right, well, I'll continue.

35:55 - D. B.
A big part of the story for the success of the attention mechanism is not so much any specific kind of behavior that it enables, but the fact that it's extremely parallelizable. Meaning that you can run a huge number of computations in a short time using GPUs. Given that one of the big lessons about deep learning in the last decade or two has been that scale alone seems to give huge qualitative improvements in model performance, there's a huge advantage to parallelizable architectures that let you do this. If you want to learn more about this stuff, I've left lots of links in the description. In particular, anything produced by A. K. or C. O. tend to be pure gold. In this video, I wanted to just jump into attention in its current form, but if you're curious about more of the history for how we got here and how you might reinvent this idea for yourself, my friend V. just put up a couple videos giving a lot more of that motivation. Also, B. C. from the channel The Art of the Problem has a really nice video about the history of large language models. All right.

36:59 - R. S.
Any last All right, so. All right.

37:10 - D. B.
Let's say we finished.

37:15 - J. C.
So I guess my question is, for those of you that are in the field, what does this really mean? We've kind of jumped from, I guess, where I came in, you had the graduate student who was assuming that the AI models were a tool. And now this is into the, how do you make such a tool? Is that even right? It's been a long time I had linear algebra or matrix theory.

38:04 - D. B.
So this, just to give a little bit of background, this is the sixth video of seven by this guy who does sort of tutorials in YouTube using these sophisticated animations. So, you know, I was a little lost. I picked up a little more, I think, in the earlier videos. You know, at some point you can't just watch videos, you have to work examples on paper or program them or something. Anyone else have any comments?

38:41 - E. G.
I could sum it up. The AIs or large language models, using them is like sausage. It's like eating a sausage. We're learning how the sausage is made, what goes into it, what proportion, And that's what this has done. And based on what I'm hearing with some of the new anthropic models coming out, this may be even dated. Because a lot of the models now are, well, we haven't reached it yet, but AI Singularity, where the models are able to build better models themselves outside of human interaction. But better by what definitions?

39:28 - J. C.
Probably, I think, the one that we most use, accuracy.

39:33 - E. G.
One that I like to leverage is being a programmer for 40 years. One of the things I use is I put in a question on how to code something. I had a junior who did a whole website just using Claude AI but it is the worst most convoluted spaghetti code you'll ever come across. I think that that if you take a look at the output from it and the quality of it as it starts to improve, the code generated is more robust, more thorough, more compartmentalized, single responsibility, class-structured applications, then yes, it's growing right now. And some of the code that you get from these don't even work.

40:37 - M. M.
Did you try this copilot?

40:40 - E. G.
I have tried copilot, and it's good.

40:46 - Unidentified Speaker
Yeah.

40:46 - M. M.
But it still hallucinates.

40:48 - E. G.
It assumes that that's because the thing is, I will say. Build the method, build the function using test driven development, I want to test harness so that way I could test the function, then I'll ask it run the test to see if it passes. Does it pass? No. What's wrong with the code?

41:10 - M. M.
The code is incorrect. Fix the code.

41:14 - E. G.
and it'll come back with a fix. I say, now run the test. Does the code work? No. It doesn't know to say, OK, keep testing back and forth until you get something that works on the test harness.

41:28 - J. C.
If you tell it to do that, wouldn't it?

41:32 - E. G.
It keeps coming back with incorrect code because you'll ask it, does it work?

41:37 - J. C.
And the thing is, I'll already know ahead of time whether or not it'll But if you told it to iterate until it passed the test, to code and test and adapt and then, I mean, basically the systems development cycle or whatever. On small, simple things, yes.
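
The loop being discussed here, asking a model for code, running the tests, and feeding failures back until everything passes, might be sketched roughly as below. ask_model is a placeholder for whatever LLM call is used (it is not a real library function), and the round limit is arbitrary.

    import subprocess

    def ask_model(prompt: str) -> str:
        """Placeholder for an LLM call (e.g. an API request); returns generated source code."""
        raise NotImplementedError

    def run_tests() -> tuple[bool, str]:
        """Run the test suite and return (passed, output)."""
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

    def code_test_fix(task: str, max_rounds: int = 5) -> bool:
        prompt = task
        for _ in range(max_rounds):
            with open("solution.py", "w") as f:
                f.write(ask_model(prompt))        # write the model's latest attempt
            passed, output = run_tests()
            if passed:
                return True
            prompt = task + "\n\nThe tests failed with:\n" + output + "\nFix the code."
        return False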

42:00 - E. G.
On complex, say if I'm using like an object relational manager, an ORM, and I'm asking it to do some type of interpretation I do a lot of stuff with delegates and Lambda functions, link queries, where I use a predicate builder. It gets confused. Recursion, it has issue with.

42:26 - D. B.+E. G.
I give some, I teach a little bit of programming in one of my courses I taught last semester.

42:36 - D. B.
And the students, they want to use AI to do it. Then to improve the program from week to week, the AI bogs down and the ones who are too reliant on AI start losing points because they don't know how to fix the code and the AI can't do it. My experience is that you can use AI to write a function or to write different functions, then I can go and link the functions together and have them call each other properly. But you can't just give it a complicated problem statement and have it write a complete program. The students bogged down because they couldn't. For example, I asked them to build a dictionary of HTML commands that demonstrated the HTML commands. Then another homework later, after they made the dictionary long enough, I say, okay, alphabetize it. Well, it was too long. The AI wouldn't alphabetize it. Then some students, some of them just couldn't do it.

43:42 - J. C.
Yeah.

43:42 - D. B.
If you asked it to alphabetize a list of 10 HTML demo commands, sure, it could do it. But maybe not 30.

43:56 - Unidentified Speaker
Unless you pay them. Maybe you pay.

44:00 - D. B.
So you can use it as a programming assistant. If you don't know how to use certain commands in a language, when you need a simple function to be written, it'll do it. And then you don't have to learn the details of those commands. But you still have to get the function working and then connect it to others by yourself, in my experience.

44:31 - Unidentified Speaker
All right, well, so we finished chapters.

44:34 - D. B.
I guess we'll do chapter seven, and then we'll move on to reading some other paper, paragraph at a time, or watching another video a minute at a time, or whatever. But I guess since we're up to through chapter six, we might as well do chapter seven, and maybe we'll get started on that next week. Any other thoughts or comments? Before we adjourn. All right, let's see.

45:02 - J. C.
Just think how many billion people drive cars without knowing thermodynamics. Yeah, me, I'm one of them. Yeah, and we have a lot of accidents, but probably not because we don't understand thermodynamics.

45:25 - E. G.
I think it's more because we don't understand physics specific in general, not a specific thermodynamics. I mean we used to ride horses and we didn't know biology.

45:38 - J. C.
We're good at doing things we don't know.

45:42 - E. G.
Oh yeah, I think that that's the cornerstone of the human race.

45:48 - M. M.
But today there was a video how they monitor the health of horses with AI, the movement of the horse, okay? They monitor and predict, I have this study, probably J. knows about this study, of movements of people, okay? It's actually one-dimensional or multi-dimensional signal. So now they make it for the horse race, prediction of if the animal is hurt or whatever, they change the pattern of movements with AI.

46:45 - Unidentified Speaker
It's very good.

46:47 - D. B.
Well, we’ve certainly covered a lot today. Thanks everyone for the lively discussion. Any final thoughts or comments before we wrap up?

46:55 - Unidentified Speaker
No, I'm good. Thanks for the meeting.

46:58 - M. M.
It was great to learn about the upcoming hackathon and all the other updates. Looking forward to next week’s session.

47:05 - J. C.
Same here. Have a good week, everyone.

47:08 - E. G.
Take care, everybody.

47:10 - D. B.
Alright, take care, everyone. See you next week.