Friday, July 25, 2025

7/25/25: Evaluate some potential readings

Artificial Intelligence Study Group

Welcome! We meet from 4:00-4:45 p.m. Central Time on Fridays. Anyone can join. Feel free to attend any or all sessions, or ask to be removed from the invite list, as we have no wish to send unneeded emails, of which we all certainly get too many. 
Contacts: jdberleant@ualr.edu and mgmilanova@ualr.edu

Agenda & Minutes (171st meeting, July 25, 2025)

Table of Contents
* Agenda and minutes
* Appendix: Transcript (when available)

Agenda and Minutes
  • Announcements, updates, questions, etc. as time allows: none.
  • DD has generously agreed to do a demo on generating the transcripts of these meetings. Here is one of the problems he encountered and can discuss:
    • When I [...] went to ChatGPT [I] discovered it changed models and I had to import my prompts. The model settings were lost and the new model's context window was too short. I changed to an older model and the model made up new entries for the transcript. I adjusted the temperature and got it figured out. It has been an interesting week... 
  • EG and DD are working on slides surveying different ML models.
  • VW will demo his wind tunnel system soon. 
  • If anyone has an idea for an MS project where the student reports to us for a few minutes each week for discussion and feedback, a student could likely be recruited! Let me know.
    • JH suggests a project in which AI is used to help students adjust their resumes to match key terms in job descriptions, to help their resumes bubble to the top when the many resumes are screened early in the hiring process.
    • JC suggested: social media platforms use AI, the notorious "algorithms," to decide what to present to users. Suggestion: a social media cockpit from which users can say what sorts of things they want; screen-scrape the user's feeds from social media outputs to find the right items. Might overlap with COSMOS. The project could be adapted to either tech-savvy CS students or application-oriented IS or IQ students.
    • We discussed book projects but those aren't the only possibilities.
      • VW had some specific AI-related topics that need books about them.  
    • DD suggests having a student do something related to Mark Windsor's presentation. He might like to be involved, but this would not be absolutely necessary.
      • markwindsorr@atlas-research.io writes on 7/14/2025:
        Our research PDF processing and text-to-notebook workflows are now in beta and ready for you to try.
        You can now:
        - Upload research papers (PDF) or paste in an arXiv link and get executable notebooks
        - Generate notebook workflows from text prompts
        - Run everything directly in our shared Jupyter environment
        This is an early beta, so expect some rough edges - but we're excited to get your feedback on what's working and what needs improvement.
        Best, Mark
        P.S. Found a bug or have suggestions? Hit reply - we read every response during beta.
        Log In Here: https://atlas-research.io
  • Any questions you'd like to bring up for discussion, just let me know.
  • Anyone read an article recently they can tell us about next time?
  • Any other updates or announcements?
  • Hoping for a summary/review of the book at some point from [ebsherwin@ualr], who wrote: 
    Greetings all, 
      In a recent session on working with AI, Dr. Brian Berry (VP of Research and Dean of the Graduate School) recommended this book:
      The AI-Driven Leader: Harnessing AI to Make Faster, Smarter Decisions
    by Geoff Woods
      I just bought it based on his recommendation, and if anyone is interested I will gladly meet to talk about the book. Nothing "heavy duty," just an accountability group.
       If you have read the book already and if the group forms, you are welcome to join the discussion.
      I'll wait till Monday morning before I start reading -- so if you do not see this message immediately, do reach out!
       Best,
  • Here is the latest on future readings and viewings
    • Let me know of anything you'd like to have us evaluate for a fuller reading.
    • https://transformer-circuits.pub/2025/attribution-graphs/biology.html. 7/25/25: eval was 4.5 (over 4 people).
    • https://arxiv.org/pdf/2001.08361. 5/30/25: eval was 4. 7/25/25: vote was 2.5.
    • We can evaluate https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10718663 for reading & discussion. 7/25/25: vote was 3.25 (over 4 people).
The meeting ended here. 
  • Schedule back burner "when possible" items:
    • TE is in the informal campus faculty AI discussion group. SL: "I've been asked to lead the DCSTEM College AI Ad Hoc Committee. ... We’ll discuss AI’s role in our curriculum, how to integrate AI literacy into courses, and strategies for guiding students on responsible AI use."
    • Anyone read an article recently they can tell us about?
    • If anyone else has a project they would like to help supervise, let me know.
    • (2/14/25) An ad hoc group is forming on campus for people to discuss AI and teaching of diverse subjects by ES. It would be interesting to hear from someone in that group at some point to see what people are thinking and doing regarding AIs and their teaching activities.
    • The campus has assigned a group to participate in the AAC&U AI Institute's activity "AI Pedagogy in the Curriculum." IU is on it and may be able to provide updates now and then. 
Appendix: Transcript 

Artificial Intelligence Study Group
Fri, Jul 25, 2025

0:32 - Unidentified Speaker
Oh.

1:05 - D. B.
All right, well it's 5:01. We'll give folks another minute because I think D. was tentatively going to present something today, but things have been sort of relaxed in terms of the schedule, so he may actually not even show up. I don't know. We'll give him another minute. OK, well, I'm just going to assume that we have a small group. And one of the nice things about the design of these meetings is that there's always a fallback, which will be to do some readings and/or video viewings. And today we can check a few abstracts or whatever and kind of evaluate different next readings. So we'll see about D. next time or some other future time. E., you and D. are gonna work on some slides surveying different ML models. Is that kind of on the back burner? Should I move that from sort of definite plans to sort of future plans, or what do you think?

2:56 - E. G.
Well, it's tough enough plans, but it's future. We've already got the slides. I've given him some verbiage. My goal was to help D. along the path of better understanding models, which models to use and when. Okay. All right.

3:12 - D. B.
Well, then I'll leave it on this list, which means we'll check back on it at each meeting until either it happens or we decide to move it into the back burner category. Similarly, V. has offered to demo his wind tunnel system, but he's not here. And as always, if anyone has ideas for MS projects, I'll add them to the list here. We've got a list of a few. And when the semester starts and I can start getting students who need projects, I'll suggest that they come to a meeting and we'll talk about the projects with them and see if they want to do something. If anyone has any other master's projects they want to propose, just let me know, and I'll add it to the list.

4:02 - Unidentified Speaker
OK.

4:03 - J. C.
I have an idea. Yeah. I don't know whether you want it now or write it up and send it to you.

4:12 - D. B.
Why don't you give me a hint, and then you can write it up later.

4:18 - J. C.
OK. I was at a meeting. At Cosmos earlier this week, and talking about social media. So my, my thought was that all the social media companies use AI use algorithms to decide what to present to you. And they decide for their purposes. And so every day, I see people frustrated by what Facebook is feeding them or, you know, or some other media and it crossed my mind, what if you turned it around? What if you had a social media cockpit and you used AI to only get the feeds that you wanted? Facebook would be out there, the AI would look at it, and if you say you don't want political things, you only want things from your friends, or you only want things on three topics, that's what you get. And you'd get it from whatever social media you subscribe to. Maybe it's Facebook and LinkedIn, and you get the feeds you want from those. And I think it would be, I love the idea.

5:37 - E. G.
I think that would be wonderful, but they would have to make their feeds available via an API; otherwise you're building screen scrapers that would have to walk the pages.

5:51 - E. G.
I've built screen scrapers before, and they are an ugly way to do it.

5:58 - J. C.
Yeah, and there are APIs, but they change. Not saying it would be easy, but I've been trying to work with Copilot on screen scrapers and have had occasional success. So maybe a new world on screen scrapers too with AI assistance. And that might be, I mean, that would be a related topic that'd be fairly narrow. You know, like, can you use AI to help you write a book? Can you use AI to help you screen scrape, to help you go through page after page?

6:42 - D. B.
So screen scrape social media outputs to find the right stuff.

6:48 - J. C.
But I think if I think of it on my machine, it would have my sign-on to Facebook, my sign-on to LinkedIn. And so it could do anything that I would do on Facebook or LinkedIn and just throw away the junk. You know, not present me with... my favorite this last week is, for some reason, I am getting push-up bra ads for older women. And then the other thing I've been getting, I can't remember all the jargon, but high-performance, oil-fed, screw-driven air compressors. And it's always got this whole set of keywords. You know, it's not just air compressors, it's this lengthier subset of air compressors. And somehow, between that and push-up bras, that's defining me this week.

8:01 - D. B.
Okay. Well, I think...

8:03 - Unidentified Speaker
Go ahead.

8:04 - D. B.
Yeah, I tried to take a couple of notes here, so feel free to send me an update.

8:14 - J. C.
But it might make a good overlap between the Cosmos people and other efforts.

8:22 - E. G.
OK, I just had Claude and OpenAI. And they said they can't help with scraping or LinkedIn directly as this would violate their terms of service and could raise privacy and legal issues.

8:41 - E. G.
There are things you can do.

8:43 - D. B.
For example, if you don't want a work, actual working system, you can develop a concept where you paste in the pages and show that it can, in principle, scrape them properly or something like that. There's a lot of variations that would make this a project, not necessarily a marketable product, but an investigation.
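
The "cockpit" idea discussed above can be sketched as a minimal keyword filter over feed items. This is purely illustrative: the function and data are hypothetical, and a real system would need legitimate feed access (for example, a platform API), since scraping can violate terms of service, as noted above.

```python
# Hypothetical sketch of the "social media cockpit" filter: given posts
# already exported from a feed, keep only items that match the user's
# wanted topics and drop anything touching a blocked topic.
# All names and data here are illustrative.

def filter_feed(posts, wanted_topics, blocked_topics):
    """Return posts mentioning a wanted topic and no blocked topic."""
    kept = []
    for post in posts:
        text = post["text"].lower()
        if any(topic in text for topic in blocked_topics):
            continue  # user said: never show this kind of item
        if any(topic in text for topic in wanted_topics):
            kept.append(post)
    return kept

posts = [
    {"author": "friend", "text": "Photos from our hiking trip"},
    {"author": "ad",     "text": "High-performance air compressors on sale"},
    {"author": "friend", "text": "Political rant about the election"},
]
kept = filter_feed(posts, wanted_topics=["hiking", "photos"],
                   blocked_topics=["political", "sale"])
print([p["author"] for p in kept])  # only the hiking post survives
```

A student project could swap the keyword test for an LLM call that classifies each post against the user's stated preferences; the surrounding structure stays the same.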

9:05 - J. C.
In the long run, somebody could develop an AI. Again, you're not going to be looking at anything that I don't get to look at. It's going to look at my Facebook page, my Facebook feed. It's not taking every page in Facebook and using it somehow to train an AI or, you know, figure out how to blow up motorboats or bicycles or something.

9:39 - D. B.
Does this sound to you like something that would be sort of most suitable for a tech-savvy grad student, like a computer science student, or maybe somebody who's kind of application-oriented, like an information science student? I think you could do it either way.

10:04 - J. C.
First off, I've been using PowerShell, which I know nothing about, but seems to be pretty handy and Copilot has successfully coached me through using it. So I think you have to be less tech savvy in some areas these days. You can focus on the application and get the code or the right tool suggestion from elsewhere. OK.

10:41 - D. B.
Well, one thing that we could do is if I start getting students who want to do projects, I can suggest they come to these meetings And we can show them this list and try to explain and see.

10:59 - J. C.
I think that'd be sort of a marketplace of ideas would be useful and interesting. Yeah, we can see what they want to do.

11:08 - D. B.
We are probably going to have fewer grad students going forward than we've had in the past, but hopefully some of them will be wanting to do projects. And when they come to me asking for projects, I always sort of don't know what to say. I'll say, are you free at 4 PM on Fridays? That'll be my first question. OK. Anything else on this MS project activity? All right. Let me just save the page so we don't lose that. Any other questions? If you ever have any questions you want to bring up for general discussion, just let me know and I'll add them to the list. E. S., the psychology professor, notified me that she finished reading the book. So at some point, I'll get her here to give us a review. That was this book.

12:26 - Unidentified Speaker
All right, well, we're sort of in between.

12:28 - D. B.
We finished all those videos, and we're sort of figuring out what to do next. So I thought we could go through the process of reading a few abstracts or maybe a few clips from different videos or both and evaluating them. So here's one of them. Honestly, I don't even remember what this is, but I figure we'll just take a look at the first paragraph of the abstract, talk about it, and vote on it. So I'm going to bring it up here. And let's see what this is.

13:12 - Unidentified Speaker
Here is a paragraph.

13:15 - D. B.
And let's just read from here to here, and then we can decide. We'll just see what we think of it. Any comments or thoughts?

14:08 - Unidentified Speaker
I think we are not that far off.

14:14 - E. G.
I mean, if you take a look at human, and I don't mean to go against anybody's senses, but human evolution over the hundreds of thousands of years, we're seeing the same thing in large language models in a far more compressed time frame.

14:44 - D. B.
Yeah, considering the human brain or animal brains for that matter, I mean, this is happening fast.

14:54 - E. G.
I mean, look where AI was five years ago. Large language models. Granted, we had feed-forward neural nets. We had recurrent neural nets. We had neural nets. Now, with large language models, if you take a look at it, it's neural nets in multidimensional layers with intermediary pieces for aggregators.

15:23 - D. B.
All right. Any other thoughts on this? Do you want to evaluate it with kind of a vote, or do you want to read another paragraph?

15:39 - E. G.
All right, we'll read another paragraph.

15:51 - D. B.
Sheesh, I can't highlight anymore.

15:58 - Unidentified Speaker
One more time.

16:01 - Unidentified Speaker
Didn't work.

16:02 - Unidentified Speaker
All right.

16:04 - Unidentified Speaker
This is the paragraph.

16:08 - D. B.
Oh, there it

16:11 - D. B.
All right. We'll read that.

16:22 - Unidentified Speaker
Comments or questions?

16:36 - J. C.
What tool would you use? I mean, there have been tools in computer science that let you monitor what code in a system actually got executed and how often. Well, I don't think it's tools, but it's paradigms.

16:58 - E. G.
When computers first came out, they had one processor that could do one thing at a time. Then the advent of parallel processing occurred. We had a processor with multiple or a unit with multiple processors on it so it could run things concurrently. We now have AI chips that have tens of thousands of processors. I think those are the tools. Now, in biology, we're not governing biology. We're monitoring it. In AI, we're not monitoring it only. We're governing it.

17:47 - Unidentified Speaker
All right.

17:47 - E. G.
So I think this would be a great paper to read.

17:54 - D. B.
OK. I'm just pleased by the site.

18:00 - J. C.
there. And am I right that the chips are analog for some parts of AI processes? Are what?

18:13 - D. B.
Chips are what?

18:15 - J. C.
That they do analog computing.

18:19 - Unidentified Speaker
Digital.

18:19 - E. G.
Analog would be wavelength. It is digital.

18:28 - D. B.
They use these things called ReLU, rectified linear, what does ReLU stand for? Rectified linear something, to decide, you know, to make decisions. But they do it.

18:50 - J. C.
But for instance, Greg, processing? Is that not analog?

18:54 - D. B.
I mean, I think, you know, at the basic level, it's using digital circuits. But actually, you know, like we saw with these videos, when you're doing these calculations and finding what the probability is of a given word, it's effectively an analog number, right? I mean, in the sense... Any number along a scale. I'm just curious.

19:26 - J. C.
I'm old enough that when I took Fortran and then took a computer science course that was everything about computers taught then, we used analog computers as one of our units of work.

19:44 - E. G.
It was the same with me. I worked on... my first computer had vacuum tubes.

19:52 - Unidentified Speaker
Yeah.

19:52 - D. B.
I mean, neurons in a brain are sort of somewhat analog in some ways.

19:59 - J. C.
Yeah. And I guess I had had an understanding that maybe not the heart of large language models, but that related to new chips and stuff, was that voice and images and other things were partially analog processing, maybe just to digitize them, you know, for input or for output. Yeah.
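
A small sketch of the point being made in this exchange: the hardware is digital, but the quantities flowing through a model (activations, word probabilities) are effectively continuous. ReLU stands for "rectified linear unit"; it zeroes out negative activations, and softmax turns raw scores into probabilities. The numbers below are made up.

```python
# Digital circuits, effectively analog quantities: ReLU clips negative
# activations to zero, and softmax maps raw scores to a continuous
# probability distribution. Toy values only.
import math

def relu(x):
    return max(0.0, x)

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

activations = [relu(x) for x in [-1.2, 0.5, 2.0]]  # [0.0, 0.5, 2.0]
probs = softmax(activations)
print(probs)  # three probabilities that sum to 1
```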

20:28 - D. B.
All right.

20:29 - D. B.
Any other thoughts anyone wants to mention before we evaluate it? All right. Well, here's how we've done it in the past. So, you know, the one-to-five-star thing for rating Amazon products or rating courses in university, typically. So we're going to say that one means you definitely don't want to read this together, five means you definitely do want to read it together, three means you don't know, and then two and four are sort of leaning one way or the other. So if you can go to your chat window and just...

21:11 - R. S.
What's the title again of this paper?

21:14 - D. B.
Hang on let me just, I gotta, all right let's see what the title is.

21:22 - Unidentified Speaker
Okay, and when was this paper published?

21:26 - D. B.
I don't know. It's... well, it's listed as 2025.

21:33 - E. G.
But it's on Claude 3.5, so it's only going to be at the most a couple of years old.

21:42 - J. C.
Well, what does the Haiku mean in that, Claude 3.5 Haiku?

21:47 - D. B.
Isn't that one of the... so they have multiple models. And I think maybe one of the leaner models is called Haiku.

22:01 - J. C.
Yeah.

22:01 - J. C.
Okay.

22:02 - D. B.
Oh, it says it's Anthropic's lightweight production model. So it's the more efficient but less powerful model, perhaps. All right, well, yeah. So go ahead in the chat, just put in your number from 1 to 5, and we'll go from there. OK, we've got one vote.

22:39 - J. C.
Now, maybe the biology is changing on that side too, because we're dinking around with that to modify people to avoid birth defects and so forth. D., are you a member on this or not?

23:00 - R. S.
Are you voting on this or not?

23:04 - D. B.
You want me to?

23:06 - R. S.
Yeah.

23:07 - R. S.
Because we have less people today. Yeah.

23:10 - D. B.
I'll give it a four.

23:12 - Unidentified Speaker
So that's an average.

23:13 - D. B.
We've got two fives and two fours. That's an average of 4.5. I'll mention A. is here. I just jumped in to see you guys because I've missed so many meetings and wanted to know what you were up to.

23:28 - H. J.
Oh, yeah, yeah.

23:29 - D. B.
We just read three paragraphs of an article and we were voting on whether we want to read it in more detail.

23:37 - H. J.
Yeah, so I think that's, I think I don't even deserve to vote.

23:41 - D. B.
All right, well, we'll do another article and you can vote on that one. No, it's okay. I mean, that's the next plan anyway, since today is kind of a sort of evaluating-multiple-possibilities day, as it turns out. All right, well, let's then go to the next article. And that's this one. We looked at this back in May, but AI is moving with the speed of light, so I think we should probably reevaluate. Maybe we can average the two evaluations or something like that. So let's go to here. And this looks like an arXiv preprint, written by a bunch of people in the AI field, at OpenAI, called Scaling Laws for Neural Language Models. Any questions or thoughts or comments on the title?

25:25 - R. S.
What are scaling laws? The size of the neural network?

25:31 - D. B.
Maybe, I don't know. It's changing the size of something. The abstract has a little bit more in it.

25:45 - Unidentified Speaker
All right, let's take a look.

25:48 - Unidentified Speaker
I want to scroll down a little bit, yeah.

25:53 - Unidentified Speaker
Is this a newer article?

25:57 - H. J.
What year is this one from 2020.

26:03 - Unidentified Speaker
2020.

26:04 - D. B.
Because this has been a really big topic. I mean, in computing, it's always a question, especially in AI; even 40 years ago, AI scaling was always a key research question. Anyway, let's take a look. Let's read through the abstract, and then we'll see if there's any discussion or questions. Thanks. All right, this is dense. I'm thinking we should go through it sentence by sentence. But before we do, does anyone have any questions or comments on the whole thing? All right, let's go through it one sentence at a time. All right, let's start with that sentence. Comments or questions? Well, I'm baffled: what is cross-entropy loss? Does anybody know? Anyone into looking it up? Anybody?

28:01 - E. G.
Log loss is a measure used in machine learning to evaluate performance of classification models.

28:08 - D. B.
What is it again?

28:10 - E. G.
It's also known as log loss. It's a measure used in machine learning to evaluate the performance of a classification model by quantifying the difference between the prediction probabilities and the actual labels. Sounds like it's a version of R-squared.

28:32 - D. B.
Yeah, so it's just a measure of how well the thing classifies, does classification.

28:39 - D. B.
Expected versus the anticipated.

28:41 - D. B.
All right, so we don't really need to know what it is technically; it's just a measure of how well the classifier works.
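
The definition read out above can be made concrete: cross-entropy (log) loss quantifies the gap between predicted probabilities and actual labels, and for a single example it reduces to the negative log of the probability assigned to the true class. A minimal sketch:

```python
# Cross-entropy (log) loss as just described: it quantifies the difference
# between a classifier's predicted probabilities and the actual labels.
# For one example it is the negative log of the probability the model
# assigned to the true class; a perfect prediction gives a loss of 0.
import math

def cross_entropy(p_true_class):
    """Loss for a single example."""
    return -math.log(p_true_class)

print(cross_entropy(0.9))  # confident and correct: small loss
print(cross_entropy(0.1))  # confident and wrong: large loss
```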

28:52 - Unidentified Speaker
OK.

28:53 - D. B.
Do you all know D. R.?

28:57 - H. J.
Sorry, do you all know D. R.?

29:02 - Unidentified Speaker
No, I don't.

29:04 - H. J.
Yeah, he's a PhD candidate at UA Little Rock and he's been working on and even published and now is getting ready to launch a product that basically does this. It analyzes your AI tools to see whether or not they have a lot of entropy, to see if they're still targeted for high quality output. Low error rates. So I just think this stuff is moving so fast. I don't know how. You guys are trying to figure out if you want these to be topics for this group?

29:58 - D. B.
Whether we want to read this paper in more detail, yeah.

30:02 - H. J.
Yeah. I think that reading anything that is older like this is not going to be very useful. But I'm not the technical person who'd be reading these things anyway. But I do think that that suggests that D. R. would be a great person to get in here to talk about his work on compliance, because what he's doing is, you know, he's taking away the black-boxness of AI programming, specifically in the area of regulation and compliance.

30:40 - D. B.
Uh-huh. That'd be great if he wants to present his project. Yeah. I can certainly tell him about it.

30:48 - H. J.
I think it'd be good. Yeah. Do you mind doing that?

30:52 - D. B.
I was going to send him an email.

30:55 - H. J.
I'll send him an email right now.

30:58 - D. B.
OK. Sounds great.

30:59 - H. J.
Yeah. I don't know that I would be helpful in evaluating these more technical discussions.

31:05 - H. J.
Well, I mean, the evaluation.

31:07 - D. B.
is of whether to read them in more detail, and if the audience, you know, to the degree that the audience has people in it, you know, like you, who are not technical, it would be perfectly legitimate to say you're not interested in it.

31:26 - E. G.
Yeah, okay. I think the only benefit is, like Moore's law, to see whether or not what they postulated came to fruition. Are we actually seeing that type of performance and convergence?

31:42 - D. B.
Yeah, I think the question of how much AI is improving over time is really interesting. This is not a paper about that, but maybe it is. As the independent variable when you're talking about how fast the field is improving?

32:08 - E. G.
Well, that... and we'll get to that sentence, but that's in the last sentence, where it talks about optimally compute-efficient training: training very large models on relatively modest data and stopping significantly before convergence, by identifying where they're not getting a large accuracy increase from continued training. That goes back to a couple of sentences before, where you're overfitting, and you're now not improving the model but degrading it.
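
The scaling-law relationship under discussion can be sketched as a power law in model size: loss falls predictably, with diminishing returns, as the parameter count grows. The constants below are illustrative stand-ins, not the paper's fitted values.

```python
# Sketch of the scaling-law shape discussed above: test loss falling as a
# power law in parameter count, L(N) = (N_c / N) ** alpha. The constants
# here are illustrative stand-ins, not the paper's fitted values.

def loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in (1e6, 1e8, 1e10):
    print(f"{n:.0e} params -> loss {loss(n):.3f}")
```

Each factor-of-100 increase in parameters buys a smaller and smaller drop in loss, which is the "diminishing but predictable returns" shape the abstract describes.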

32:54 - D. B.
Yeah, right. Any other thoughts related to this particular sentence? All right, well, let's do the next one.

33:20 - Unidentified Speaker
Any comments?

33:35 - J. C.
We are rating this one?

33:38 - D. B.
Yes.

33:39 - Unidentified Speaker
Okay.

33:45 - D. B.
All right, let's look. The next one. Any comments or questions on this one?

34:15 - J. C.
I'm not seeing a new one.

34:19 - D. B.
Oh, I just, I'm highlighting the sentence.

34:22 - J. C.
Oh, just the sentences.

34:24 - J. C.
I see.

34:25 - D. B.
Yeah. All right, to me, this is suggesting, you know, you can look at how increasing the model size slows down the training speed, or how overfitting changes as the model size increases or the data increases. I mean, you more ordinarily think of a model as being better if it's bigger, right? But then it's also more subject to overfitting. Any other thoughts or comments? All right, next sentence.

35:32 - Unidentified Speaker
Questions or comments on that one?

35:44 - E. G.
I think it falls in the yes, of course category. Identifying the relationship for a fixed budget, yes. I mean, that falls in the, like, doing a study, do teenage boys spend a lot of time thinking about girls or sports.

36:13 - D. B.
It's a classic computing problem: how you trade off one thing against another to optimize based on fixed resources, right? Memory versus time in algorithms; electric power versus some measure of computing performance for battery-operated devices. All right, let's try the last sentence. Any questions or comments? Is this saying that the bigger the model, the less data you need? Well, OK. That's what I'm reading.

37:45 - E. G.
Yeah.

37:46 - D. B.
Any other thoughts before we evaluate it? If not, go ahead and type in your number. Again, one means you definitely don't want to read this paper together, five means you definitely do, three means you don't lean either way, and two and four are in between. Yes or no?

38:11 - R. S.
Can you show us the title of the paper again, D.?

38:29 - Unidentified Speaker
All right.

38:35 - D. B.
Two votes? Surely we can get more than two votes. All right. M., you're more than welcome to vote, but it's not required. All right, I'm going to give it a four. So we got, let's see: two, one, two, three, four. The average is 2.5. That'd be pretty harsh, 2.5. All right. Okay, well, I think we have time to probably do one more. Let's try this one. This is another paper, in the IEEE digital library. Don't know which one it was.

39:57 - J. C.
Oh, this is by my What is the date of publication of this?

40:10 - R. S.
What year is this?

40:14 - D. B.
I'm not sure. It's on the bottom there.

40:20 - R. S.
October 2024. Yeah.

40:23 - D. B.
Several months old.

40:26 - R. S.
Can I see the title again?

40:34 - D. B.
Let's see how big I can get it so we can read it. I'm going to go up to 29. Okay, here's the paragraph. It's a PDF, so it's harder to highlight. Let's read it and then talk about it. Any comments or questions? Shall we read another little bit before we decide? All right, well, if nobody wants to read any more of it, we'll go ahead and evaluate it now.

42:23 - J. C.
I'd be interested in the next paragraph.

42:27 - Unidentified Speaker
OK.

42:28 - D. B.
All right, well, I think we can probably read to here, which would be a question and part of an answer. So let's start with... I can fit this all on one page.

42:53 - Unidentified Speaker
Okay.

42:54 - D. B.
Let's start. Comments or questions? I mean, to me, isn't this what intelligent, maybe not particularly benign, but intelligent human actors do all the time? I believe so.

43:57 - E. G.
In fact, there's been arguments that a lot of the AI systems, LLMs, have safeguards to prevent this sort of activity.

44:12 - Unidentified Speaker
All right.

44:14 - D. B.
Any other comments? All right, well, then let's go ahead and... Sorry, I'm just trying to make it bigger. There we go. Comments or questions?

45:11 - Unidentified Speaker
All right.

45:13 - D. B.
We'll finish his answer for the first question. One. Anything?

45:48 - Unidentified Speaker
That's interesting.

45:49 - Unidentified Speaker
Yeah.

45:50 - J. C.
Let's see how much more there is here.

45:56 - D. B.
OK, so another paragraph, but it's a half paragraph on each page, so we'll have to live with that. All right. Comments? Yeah. Sorry, I didn't understand. You were muffled when you said you were meeting somebody or... Yeah, I was on the phone with my Zoom meeting right now.

46:48 - Unidentified Speaker
One of my papers got accepted. I was on the phone with the editor.

46:56 - D. B.
All right. Any comments on that? All right. Let's read the rest of the paragraph.

47:04 - Unidentified Speaker
Any comments?

47:08 - D. B.
Well, my comment is that basically we're talking about what happens when the AIs get smarter than people, and this is one thing, you know, only one of many things, that would become sort of problematic. All right, well, let's go ahead and evaluate it. Again, it's one to five. Five means you definitely want to read the whole thing. One means you definitely don't. Hello? R., any input?

48:27 - Unidentified Speaker
Give it a four.

48:31 - D. B.
All right, we've got two threes and a four. R., are you going to vote? All right. Well, I think we're at a good stopping point. And maybe next week, or next time we have time to do this, we'll read a few more and then pick the best one for more detailed reading. Any last thoughts before we adjourn?

49:45 - Unidentified Speaker
D., I did vote.

49:47 - R. S.
Oh, what did you vote?

49:49 - D. B.
What was your vote? Three. So it's 3.25.

49:53 - R. S.
I had my audio off, so I guess I said it multiple times.

50:01 - D. B.
OK. OK. Anything else anyone wants to bring up before we adjourn? I hope you have better attendance next week.

50:14 - R. S.
Yeah.

50:15 - D. B.
Well, sometimes when you have a small group you can tune things to the interests better, and it gives us a little more influence on what we read next, right? Okay. All right. Thank you.

50:30 - R. S.
All right.

50:31 - D. B.
Take care, everyone. Have a good weekend.

Friday, July 11, 2025

7/11/25: Do chapter 7 video

Artificial Intelligence Study Group

Welcome! We meet from 4:00-4:45 p.m. Central Time on Fridays. Anyone can join. Feel free to attend any or all sessions, or ask to be removed from the invite list, as we have no wish to send unneeded emails, of which we all certainly get too many. 
Contacts: jdberleant@ualr.edu and mgmilanova@ualr.edu

Agenda & Minutes (170th meeting, July 11, 2025)

Table of Contents
* Agenda and minutes
* Appendix: Transcript (when available)

Agenda and Minutes
  • Announcements, updates, questions, etc. as time allows: none.
  • DD has generously agreed to do a demo on generating the transcripts of these meetings. Here is one of the problems he encountered and can discuss:
    • When I [...] went to ChatGPT [I] discovered it changed models and I had to import my prompts. The model settings were lost and the new model's context window was too short. I changed to an older model and the model made up new entries for the transcript. I adjusted the temperature and got it figured out. It has been an interesting week... 
  • EG and DD are finishing slides surveying different ML models.
  • VW will demo his wind tunnel system soon. 
  • If anyone has an idea for an MS project where the student reports to us for a few minutes each week for discussion and feedback - a student could likely be recruited! Let me know
    • JH suggests a project in which AI is used to help students adjust their resumes to match key terms in job descriptions, to help their resumes bubble to the top when the many resumes are screened early in the hiring process.
    • We discussed book projects but those aren't the only possibilities.
      • VW had some specific AI-related topics that need books about them.  
    • DD suggests having a student do something related to Mark Windsor's presentation. He might like to be involved, but this would not be absolutely necessary. 
  • Any questions you'd like to bring up for discussion, just let me know.
  • Anyone read an article recently they can tell us about next time?
  • Any other updates or announcements?
  • Hoping for a summary/review of the book at some point from [ebsherwin@ualr], who wrote: 
    Greetings all, 
      In a recent session on working with AI, Dr. Brian Berry (VP Research and Dean of the Grad School) recommended this book:
      The AI-Driven Leader: Harnessing AI to Make Faster, Smarter Decisions
    by Geoff Woods
      I just bought it based on his recommendation and if anyone is interested will gladly meet to talk about the book. Nothing "heavy duty" just an accountability group.
       If you have read the book already and if the group forms, you are welcome to join the discussion.
      I'll wait till Monday morning before I start reading -- so if you do not see this message immediately, do reach out!
       Best,
  • Chapter 7 video, https://www.youtube.com/watch?v=9-Jl0dxWQs8. We finished it.
  • Here is the latest on future readings and viewings
  • Schedule back burner "when possible" items:
    • TE is in the informal campus faculty AI discussion group. SL: "I've been asked to lead the DCSTEM College AI Ad Hoc Committee. ... We’ll discuss AI’s role in our curriculum, how to integrate AI literacy into courses, and strategies for guiding students on responsible AI use."
    • Anyone read an article recently they can tell us about?
    • If anyone else has a project they would like to help supervise, let me know.
    • (2/14/25) An ad hoc group is forming on campus for people to discuss AI and teaching of diverse subjects by ES. It would be interesting to hear from someone in that group at some point to see what people are thinking and doing regarding AIs and their teaching activities.
    • The campus has assigned a group to participate in the AAC&U AI Institute's activity "AI Pedagogy in the Curriculum." IU is on it and may be able to provide updates now and then. 
Appendix: Transcript 

Artificial Intelligence Study Group
Fri, Jul 11, 2025

0:15 - Conference Room (D. B.) - Speaker 2
Hi, everyone.

0:16 - Unidentified Speaker
Hello.

0:16 - Unidentified Speaker
Hello.

0:17 - M. M.
Hello, everybody. Hello. Happy Friday. All right.

0:20 - Conference Room (D. B.) - Speaker 2
I got to find this Chapter 7 video. Did you get the message from V.?

0:28 - Unidentified Speaker
Yeah.

0:29 - Unidentified Speaker
Yeah.

0:31 - M. M.
That's why I have to. Yeah.

0:34 - Unidentified Speaker
Oh, V.'s not V.'s not doing his thing today.

0:38 - Unidentified Speaker
No.

0:39 - E. G.
Oh, he had a kid issue. Crash.

0:42 - Unidentified Speaker
Yeah.

0:43 - M. M.
Yeah, he's not... he doesn't have a computer. So this is the reason. E., I sent you the link for the Nvidia resource. I saw that. Yeah, any kind of educational unit can request hardware. Yeah, so... it's a grant; we just apply. I know people have already received some grants. So unfortunately, V. cannot present today.

1:29 - Unidentified Speaker
Okay.

1:30 - Unidentified Speaker
Okay, I found it.

1:34 - Conference Room (D. B.) - Speaker 1
Yeah, so... it's hard when your kid is hurting and there's nothing you can do about it.

1:53 - Unidentified Speaker
Where?

1:54 - Unidentified Speaker
Inside?

2:45 - Conference Room (D. B.) - Speaker 2
Okay, I found the video. We'll get to it. Let me go back and share my screen. OK, so V. was going to demo his wind tunnel AI system today. A terrible system crash, and he's trying to figure out what's going on. And he'll get to it soon, but not today. And D. generously agreed to do a demo for how he processes the transcripts for these meetings using AI to anonymize and so on. And there's some interesting observations about that. D., do that next week?

3:51 - D. D.
Yeah, that'll be fine. OK.

3:55 - Conference Room (D. B.) - Speaker 2
And as always, we're collecting ideas for master's student projects. So currently, we have a number of them. So of course, there's the book project, one that we did last semester, which we can do better in the future, as we talked about. And also, J. has a project designed to have students adjust their resumes to match job descriptions. So when companies post job descriptions, they use keywords, and they probably use those same keywords to retrieve relevant resumes. So if you don't use the right keywords, even though you're highly qualified, your resume won't be pulled. So the project is to fix that problem. I hope I have that right, J. OK. And D. also has suggested having a student do something related to M. W.'s presentation a few weeks ago. And so if you recall that, it was pretty interesting. His presentation was on a system for turning academic papers into code. If the paper described an algorithm and some data, I guess it would generate, as I understand it, it would generate synthetic data and code to analyze it. So it was pretty interesting. And so we'll see if we get a student interested in that. And then V. had some other topics that he wants books written about different AI-related topics. We're back to the book project here. We should probably make that a sub.

5:46 - E. G.
D. D. and I are almost done with our slide presentation on machine models.

5:54 - Conference Room (D. B.) - Speaker 2
Ah, OK. Let me put that here.

5:58 - E. G.
I started drafting the talking points to the slides. There's some discussion on our side that we need to flesh out, but we should have that done in a week or two.

6:16 - Conference Room (D. B.) - Speaker 2
Okay, slides on what was it again?

6:20 - E. G.
Different types of machine learning models.

6:24 - D. D.
Like a kind of like a survey, maybe. Okay. Yeah, probably, you know, really good for, you know, people that don't know very much about machine learning, and for undergraduates. Cool. Yeah, I mean, I don't know.

6:47 - Conference Room (D. B.) - Speaker 2
Did you mention that last week? Or? I don't remember.

6:51 - E. G.
And we mentioned it about a month ago.

6:54 - Conference Room (D. B.) - Speaker 2
Oh, OK.

6:55 - E. G.
But it's been a back burner item with everything else going on. So as we have time, we'll put cycles to it. OK, cool.

7:06 - Conference Room (D. B.) - Speaker 2
All right, that'll be good. And if you're interested, I could put out a notice. Since it's sort of tutorial style, I could put out a notice and maybe get some visitors or something. I don't know that too many undergraduates are available in the summer or whatever, but you know, people are available and might as well make it available to them if they, if...

7:30 - D. D.
Well, we could consider our first run a test run.

7:33 - Unidentified Speaker
Okay, we'll do.

7:34 - Conference Room (D. B.) - Speaker 2
That's probably a better, better choice.

7:36 - Unidentified Speaker
Okay, no problem.

7:37 - Conference Room (D. B.) - Speaker 2
Any questions you'd like to bring up for discussion? Just let me know. I had one, I didn't write it down a few days ago, so I forgot what it was, but it would have been an interesting discussion. Maybe I'll think of it again and add it. If you have any questions, discussion questions, just random questions that intrigue you about AI you want some input on, just let me know and I'll put them on the agenda. Any other updates or announcements, anyone? Okay. And hopefully at some point, Dr. S. will give us a review of the book; she has a little discussion group that she arranged for this summer. I don't know how many people signed up, but even if nobody signs up, she's still gonna read it herself. And that's the name of the book. And she generously was willing to give us her take on the book at some point. She's in the psychology department. So it's not like hyper-technical computer science stuff. All right, well, with that, we'll go to our Chapter 7 video. This is the last video in the series, and after that, we'll go on to read some abstracts and initial minutes of different videos, whatever, and evaluate them. We've got something here that was evaluated at a four, but that's a little old at this point. Evaluated as a five. That's the maximum. I think that was J. H.'s Nobel Prize winner speech transcript. But anyway, we probably should rethink some of these evaluations since it's been so long, and then we'll decide what to read or view. With that, let's go to the video. And this is the one: if you feed a large language model the phrase, M. J. plays the sport of... and you have it predict, what kind of thing? I don't know. But anyway, we'll do our usual. I'll just play it for a minute or two, and then we'll discuss and continue on like that. Let me just make sure the audio is good. I'm just going to play it for a couple of seconds and ask you if you can hear it, and then we'll do it for real.

10:35 - Conference Room (D. B.) - Speaker 2
If you feed a large language model, the first... You all hear that pretty well?

10:41 - D. D.
I can.

10:42 - Conference Room (D. B.) - Speaker 2
You can or you can't? I can.

10:45 - Unidentified Speaker
OK, well, good.

10:46 - Conference Room (D. B.) - Speaker 2
All right, well, let's do it for real then.

10:49 - Conference Room (D. B.) - Speaker 1
If you feed a large language model the phrase, M. J. plays the sport of blank, and you have it predict what comes next, and it correctly predicts basketball, this would suggest that somewhere inside its hundreds of billions of parameters, it's baked in knowledge about a specific person and his specific sport. And I think in general, anyone who's played around with one of these models.

11:13 - Conference Room (D. B.) - Speaker 2
Any comments so far, or questions?

11:16 - Unidentified Speaker
Discussion?

11:17 - Conference Room (D. B.) - Speaker 1
As the clear sense that it's memorized tons and tons of facts. So a reasonable question you could ask is, how exactly does that work? And where do those facts live? Last December, a few researchers from Google DeepMind posted about work on this question, and they were using the specific example of matching athletes to their sports. And although a full mechanistic understanding of how facts are stored remains unsolved, they had some interesting partial results, including the very general high-level conclusion that the facts seem to live inside a specific part of these networks, known fancifully as the multi-layer perceptrons, or MLPs for short.

12:06 - Unidentified Speaker
Any comments, questions?

12:08 - Conference Room (D. B.) - Speaker 2
So to me, it's fascinating that they don't know. I mean, you ask an AI what sport M. J. plays, and it's going to tell you basketball, I assume. But nobody really knows for sure how they know it. It's this weird phenomenon of an emergent property that was not designed in, it just sort of happens by itself. All right.

12:41 - Conference Room (D. B.) - Speaker 1
In the last couple of chapters, you and I have been digging into the details behind transformers, the architecture underlying large language models, and also underlying a lot of other modern AI. In the most recent chapter, we were focusing on a piece called attention, and the next step for you and me is to dig into the details of what happens inside these multilayer perceptrons, which make up the other big portion of the network. The computation here is actually relatively simple, especially when you compare it to attention. It boils down essentially to a pair of matrix multiplications with a simple something in between. However, interpreting what these computations are doing is exceedingly challenging. Our main goal here is to step through the computations and make them memorable, but I'd like to do it in the context of showing a specific example of how one of these blocks could, at least in principle, store a concrete fact. Specifically, it'll be storing the fact that M. J. plays basketball. I should mention the layout here is inspired by a conversation I had with one of those DeepMind researchers, N. N. For the most part, I will assume that you've either watched the last two chapters, or otherwise you have a basic sense for what a transformer is. But refreshers never hurt, so here's the quick reminder of the overall flow. You and I have been studying a model that's trained to take in a piece of text and predict what comes next. That input text is first broken into a bunch of tokens, which means little chunks that are typically words or little pieces of words. And each token is associated with a high-dimensional vector, which is to say a long list of numbers. This sequence of vectors then repeatedly passes through two kinds of operation: attention, which allows the vectors to pass information between one another, and then the multilayer perceptrons, the thing that we're going to dig into today. And also there's a certain normalization step in between. 
After the sequence of vectors has flowed through many, many different iterations of both of these blocks, by the end, the hope is that each vector has soaked up enough information, both from the context, all of the other words in the input, and also from the general knowledge that was baked into the model weights through training.
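The overall flow the narrator describes (one vector per token, then alternating attention and MLP blocks whose outputs are added back onto the sequence) can be sketched very loosely. The sizes, the uniform "attention" mixing, and the stand-in MLP below are illustrative assumptions, nothing like the real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 4, 8, 3  # toy sizes, nothing like GPT-3's

# One vector per token; in a real model these come from an embedding table.
x = rng.normal(size=(n_tokens, d_model))

def toy_attention(x):
    # Stand-in for attention: just lets the vectors mix information uniformly.
    mix = np.ones((len(x), len(x))) / len(x)
    return mix @ x

def toy_mlp(x):
    # Stand-in for the MLP block: applied to each vector independently.
    return np.maximum(x, 0.0)

for _ in range(n_layers):
    x = x + toy_attention(x)  # attention block plus residual addition
    x = x + toy_mlp(x)        # MLP block plus residual addition

print(x.shape)  # the sequence keeps its shape: (4, 8)
```

The point of the sketch is only the shape of the computation: each block's output is added to what flowed in, and the sequence of vectors keeps its dimensions all the way through.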

14:59 - Conference Room (D. B.) - Speaker 2
Any questions so far? I saw in this diagram that it looked like perceptron blocks and attention blocks were alternating. Did I see that right?

15:14 - Unidentified Speaker
Yeah, it appears so.

15:21 - E. G.
OK, not sure why.

15:36 - Conference Room (D. B.) - Speaker 2
All right, well, look at these blocks.

15:39 - Conference Room (D. B.) - Speaker 1
By the end, the hope is that each vector has soaked up enough information, both from the context, all of the other words in the input, and also from the general knowledge that was baked into the model weights through training, that it can be used to make a prediction of what token comes next. One of the key ideas that I want you to have in your mind is that all of these vectors live in a very, very high dimensional space, and when you think about that space, different directions can encode different kinds of meaning. So a very classic example that I like to refer back to is how if you look at the embedding of woman and subtract the embedding of man, and you take that little step and you add it to another masculine noun, something like uncle, you land somewhere very, very close to the corresponding feminine noun. In this sense, this particular direction encodes gender information. The idea is that many other distinct directions in this super high-dimensional space could correspond to other features that the model might want to represent. In a transformer, these vectors don't merely encode the meaning of a single word, though. As they flow through the network, they imbibe a much richer meaning based on all the context around them, and also based on the model's knowledge. Ultimately, each one needs to encode something far, far beyond the meaning of a single word, since it needs to be sufficient to predict what will come next. We've already seen how attention blocks let you incorporate context, but a majority of the model parameters actually live inside the MLP blocks, and one thought for what they might be doing is that they offer extra capacity to store facts. Like I said, the lesson here is going to center on the concrete toy example of how exactly it could store the fact that M. J. plays basketball. Now this toy example is going to require that you and I make a couple of assumptions about that high-dimensional space. 
First, we'll suppose that one of the directions represents the idea of a first name M., and then another nearly perpendicular direction represents the idea of the last name J., and then yet a third direction will represent the idea of basketball. So specifically what I mean by this is if you look in the network and you pluck out one of the vectors being processed, if its dot product with this first name M. direction is 1, that's what it would mean for the vector to be encoding the idea of a person with that first name. Otherwise, that dot product would be 0, or negative, meaning the vector doesn't really align with that direction. And for simplicity, let's completely ignore the very reasonable question of what it might mean if that dot product was bigger than 1. Similarly, its dot product with these other directions would tell you whether it represents the last name J., or basketball. So let's say a vector is meant to represent the full name, M. J. Then its dot product with both of these directions would have to be 1. Since the text, M. J., spans two different tokens, this would also mean we have to assume that an earlier attention block has successfully passed information to the second of these two vectors, so as to ensure that it can encode both names. With all of those as the assumptions, let's now dive into the meat of the lesson: what happens inside a multi-layer perceptron. You might think of this sequence of vectors flowing into the block, and remember, each vector was originally associated with one of the tokens from the input text. What's going to happen is that each individual vector from that sequence goes through a short series of operations, we'll unpack them in just a moment, and at the end we'll get another vector with the same dimension. That other vector is going to get added to the original one that flowed in, and that sum is the result flowing out. 
This sequence of operations is something you apply to every vector in the sequence, associated with every token in the input, and it all happens in parallel. In particular, the vectors don't talk to each other in this step, they're all kind of doing their own thing. And for you and me, that actually makes it a lot simpler, because it means if we understand what happens to just one of the vectors through this block, we effectively understand what happens to all of them. When I say this block is going to encode the fact that M. J. plays basketball, what I mean is that if a vector flows in that encodes first name M. and last name J., then this sequence of computations will produce something that includes that direction basketball, which is what we'll add on the vector in that position. The first step of this process looks like multiplying that vector by a very big matrix. No surprises there, this is deep learning. And this matrix, like all of the other ones we've seen, is filled with model parameters that are learned from data, which you might think of as a bunch of knobs and dials that get tweaked and tuned to determine what the model behavior is. Now one nice way to think about matrix multiplication is to imagine each row of that matrix as being its own vector, and taking a bunch of dot products between those rows and the vector being processed, which I'll label as e for embedding. For example, suppose that very first row happened to equal this firstname-m direction that we're presuming exists. That would mean that the first component in this output, this dot product right here, would be 1 if that vector encodes the firstname-m, and 0 or negative otherwise.

21:11 - Conference Room (D. B.) - Speaker 2
Even more fun, think about what it would mean if that... Any questions out there?

21:32 - Unidentified Speaker
Okay.

21:34 - Conference Room (D. B.) - Speaker 1
First row was the first name M. plus last name J. direction. And for simplicity, let me go ahead and write that down as m plus j. Then taking a dot product with this embedding e, things distribute really nicely, so it looks like m dot e plus j dot e. And notice how that means the ultimate value would be 2 if the vector encodes the full name M. J., and otherwise it would be 1 or something smaller than 1. And that's just one row in this matrix. You might think of other rows as in parallel.
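The arithmetic in this toy example is easy to check directly. In the sketch below, the particular directions chosen for m and j are illustrative assumptions (the video only supposes that two such nearly perpendicular directions exist):

```python
import numpy as np

d = 16  # toy embedding dimension

# Illustrative assumption: two perpendicular feature directions, as in the video.
m = np.zeros(d); m[0] = 1.0  # "first name M." direction
j = np.zeros(d); j[1] = 1.0  # "last name J." direction

row = m + j   # the matrix row considered in the video
bias = -1.0   # the bias value the video asks us to assume

def pre_activation(e):
    # Dot product of the row with the embedding, plus the bias.
    return row @ e + bias

e_full_name = m + j  # vector encoding both names
e_first_only = m     # vector encoding only the first name

print(pre_activation(e_full_name))   # 2 - 1 = 1.0, positive only for the full name
print(pre_activation(e_first_only))  # 1 - 1 = 0.0
```

So after the bias of negative one, the value is positive exactly when both name features are present, which is what sets up the AND-gate behavior discussed a little later.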

22:11 - Conference Room (D. B.) - Speaker 2
You all know what dot product is? You take a dot product of two vectors?

22:17 - E. G.
Yeah, it's linear algebra.

22:19 - Conference Room (D. B.) - Speaker 2
Well, how do you do it?

22:21 - E. G.
Actually, I've got a program. What I'll do is I'll post it.

22:27 - Conference Room (D. B.) - Speaker 2
If you have two vectors with, let's say, five elements each, you multiply the first elements of each vector, then you multiply the second elements, then the third, the fourth, and the fifth, and then you add up the products. That's the dot product. So you multiply corresponding elements and then add up all the results. And then you get a number.
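D. B.'s five-element recipe, written out as code:

```python
# Dot product exactly as described: multiply corresponding elements, then add up.
a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9, 10]

dot = sum(x * y for x, y in zip(a, b))
print(dot)  # 1*6 + 2*7 + 3*8 + 4*9 + 5*10 = 130
```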

22:59 - Conference Room (D. B.) - Speaker 1
Asking some other kinds of questions, probing at some other sorts of features of the vector being processed. Very often this step also involves adding another vector to the output, which is full of model parameters learned from data. This other vector is known as the bias. For our example, I want you to imagine that the value of this bias in that very first component is negative one, meaning our final output looks like that relevant dot product, but minus one. You might very reasonably ask why I would want you to assume that the model has learned this, and in a moment you'll see why it's very clean and nice if we have a value here which is positive if and only if our vector encodes the full name M. J., and otherwise it's zero or negative. The total number of rows in this matrix, which is something like the number of questions being asked, in the case of GPT-3, whose numbers we've been following, is just under 50,000. In fact, it's exactly four times the number of dimensions in this embedding space. That's a design choice, you could make it more, you could make it less, but having a clean multiple tends to be friendly for hardware. Since this matrix full of weights maps us into a higher dimensional space, I'm going to give it the shorthand W up. I'll continue labeling the vector we're processing as e, and let's label this bias vector as b up and put that all back down in the diagram. At this point, a problem is that this operation is purely linear, but language is a very non-linear process. If the entry that we're measuring is high for M. plus J., it would also necessarily be somewhat triggered by M. plus P. and also A. plus J.

24:37 - Conference Room (D. B.) - Speaker 2
Any questions, comments, discussion points? See, I thought if you were going to add J. to M., that white arrow on the bottom should be that kind of grayed-out arrow on top. In other words, you add vectors end to end: at the arrowhead of the first vector, you place the tail of the second vector if you're going to add vectors. So I'm confused. It looks like you end up in the same place on the graphic there, whether you do it his way or my way. Oh well.

25:25 - Unidentified Speaker
Here we go.

25:26 - Conference Room (D. B.) - Speaker 1
Despite those being unrelated conceptually, what you really want is a simple yes or no for the full name. So the next step is to pass this large intermediate vector through a very simple nonlinear function. A common choice is one that takes all of the negative values and maps them to zero and leaves all of the positive values unchanged. And continuing with the deep learning tradition of overly fancy names, this very simple function is often called the rectified linear unit, or ReLU for short. Here's what the graph looks like. So taking our imagined example where this first entry of the intermediate vector is 1 if and only if the full name is M. J., and 0 or negative otherwise, after you pass it through the ReLU you end up with a very clean value where all of the 0 and negative values just get clipped to 0. So this output would be 1 for the full name M. J. and 0 otherwise.
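A minimal ReLU, matching the description (negatives clipped to zero, positives unchanged):

```python
def relu(x):
    # Negative values map to zero; positive values pass through unchanged.
    return max(0.0, x)

print([relu(v) for v in [-2.0, -0.5, 0.0, 1.0]])  # [0.0, 0.0, 0.0, 1.0]
```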

26:27 - Conference Room (D. B.) - Speaker 2
In other words, it very directly mimics the behavior of an AND gate. Anything, anyone?

26:35 - M. M.
Why do we want to get rid of the negative values? To speed up calculations. You're losing a little bit of something, because newer models use different activation functions that, even for the negative values, give a small value, so as not to make the gradient descent zero. The gradient will become zero if this is zero, you know. So it's speeding up the calculation; you're losing a little bit, some nodes, neurons from the architecture, but it's increasing the speed.

27:22 - D. D.
And it's computational.

27:24 - Unidentified Speaker
Yeah.

27:24 - D. D.
It's computationally easier too.

27:26 - Unidentified Speaker
Yeah.

27:27 - D. D.
I mean, there's a lot of them, you know, thousands, hundreds of thousands of computations.

27:37 - M. M.
Yes. So, removing some of them, dropping some of the neurons, is okay.

27:45 - E. G.
I dropped a file in the chat because, one, I couldn't remember all of the vector mathematics, so I just created a file with it. But it actually goes through identifying orthogonal parallelism. Well, all right.

28:10 - Conference Room (D. B.) - Speaker 2
So computationally, better to just drop all the negative numbers. You don't have to do as many multiplications or whatever, right? So why not drop any positive value that's less than 0.1, and that way you drop some more? Make those zero, then you can drop some more computations with that.

28:39 - M. M.
Isn't that what Softmax does? Exactly. So different functions, different activation functions, give different results. Like E. mentioned, if you want Softmax, you can do this. So it depends what's best for the model. The transformer is doing ReLU, most of the CNNs are doing ReLU, but the new models, these diffusion models, they're using different activation functions. Yeah, you can do whatever activation function.
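A quick sketch of the distinction being discussed: ReLU zeroes out negative inputs entirely (so their gradient is exactly zero), while a "leaky" variant, one of the activations that give negatives a small value, keeps a small signal. The 0.01 slope here is a common but arbitrary choice, not something from the video:

```python
def relu(x):
    # Standard ReLU: negatives are zeroed out entirely.
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky variant: negatives keep a small value, so the gradient isn't exactly zero.
    return x if x > 0 else alpha * x

print(relu(-2.0), leaky_relu(-2.0), leaky_relu(3.0))  # 0.0 -0.02 3.0
```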

29:17 - D. D.
That ReLU activation function has a problem with disappearing numbers or something. Computation... there's something else that I would have to look up. But that particular activation function there needs them to be positive.

29:35 - M. M.
But this GELU that they mention in the text, this is the new one where actually negative numbers have some value, a small value, but still a value.

29:50 - Unidentified Speaker
So.

29:50 - Unidentified Speaker
OK.

29:51 - Conference Room (D. B.) - Speaker 1
Often models will use a slightly modified function that's called the GELU, which has the same basic shape, it's just a bit smoother, but for our purposes it's a little bit cleaner if we only think about the ReLU. Also, when you hear people refer to the neurons of a transformer, they're talking about these values right here. Whenever you see that common neural network picture with a layer of dots and a bunch of lines connecting to the previous layer, which we had earlier in this series, that's typically meant to convey this combination of a linear step, a matrix multiplication, followed by some simple term-wise nonlinear function like a ReLU. You would say that this neuron is active whenever this value is positive and that it's inactive if that value is zero.

30:40 - Conference Room (D. B.) - Speaker 2
Well, my comment here at this point is, if you're trying to use a real neuron as a sort of analogy for what you're trying to model... Well, neurons can only be positively activated, and they may fire if they're sufficiently positively activated. But negative activations, I don't know if there's a way to negatively activate a neuron or not, but if it doesn't fire, it doesn't fire. And it doesn't matter how much you tell it not to fire. It's just the same thing, right? All negative values behave the same way.

31:26 - Unidentified Speaker
Unless you go into politics.

31:28 - E. G.
Next step looks very similar to the first one.

31:31 - Conference Room (D. B.) - Speaker 1
You multiply by a very large matrix and you add on a certain bias term. In this case, the number of dimensions in the output is back down to the size of that embedding space. So I'm going to go ahead and call this the down-projection matrix. And this time, instead of thinking of things by row, it's actually nicer to think of it column by column. You see, another way that you can hold matrix multiplication in your head is to imagine taking each column of the matrix and multiplying it by the corresponding term in the vector that it's processing, and adding together all of those rescaled columns. The reason it's nicer to think about it this way is because here the columns have the same dimension as the embedding space, so we can think of them as directions in that space. For instance, we will imagine that the model has learned to make that first column into this basketball direction that we suppose exists. What that would mean is that when the relevant neuron in that first position is active, we'll be adding this column to the final result. But if that neuron was inactive, if that number was zero, then this would have no effect. And it doesn't just have to be basketball. The model could also bake into this column many other features that it wants to associate with something that has the full name M. J. And at the same time, all of the other columns in this matrix are telling you what will be added to the final result if the corresponding neuron is active. And the bias in this case is something that you're just adding every single time, regardless of the neuron values. You might wonder what that's doing. As with all parameter-filled objects here, it's kind of hard to say exactly. Maybe there's some bookkeeping that the network needs to do. But you can feel free to ignore it for now. Making our notation a little more compact again, I'll call this big matrix W down, and similarly call that bias vector B down, and put that back into our diagram. 
Like I previewed earlier, what you do with this final result is add it to the vector that flowed into the block at that position, and that gets you this final result. So for example, if the vector flowing in encoded both first name M. and last name J., then because this sequence of operations will trigger that AND gate, it will add on the basketball direction, so what pops out will encode all of those together. And remember, this is a process happening to every one of those vectors in parallel. In particular, taking the GPT-3 numbers, it means that this block doesn't just have 50,000 neurons in it, it has 50,000 times the number of tokens in the input. So, that is the entire operation. Two matrix products, each with a bias added, and a simple clipping function in between. Any of you who watched the earlier videos of this series will recognize this structure as the most basic kind of neural network that we studied there. In that example, it was trained to recognize handwritten digits. Over here, in the context of a transformer for a large language model, this is one piece in a larger architecture, and any attempt to interpret what exactly it's doing is heavily intertwined with the idea of encoding information into vectors of a high-dimensional embedding space. That is the core lesson, but I do want to step back and reflect on two different things, the first of which is a kind of bookkeeping, and the second of which involves a very thought-provoking fact about higher dimensions that I actually didn't know until I dug into transformers. In the last two chapters, you and I started counting up the total number of parameters in GPT-3 and seeing exactly where they live. So let's quickly finish up the game here. I already mentioned how this up projection matrix has just under 50,000 rows, and that each row matches the size of the embedding space, which for GPT-3 is 12,288. Multiplying those together, it gives us 604 million parameters just for that matrix. 
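The whole operation as narrated (up-projection plus bias, ReLU, down-projection plus bias, then add the result back onto the input vector) fits in a few lines. The toy sizes and random weights below are stand-ins for learned parameters, not anything from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
d_hidden = 4 * d_model  # the same 4x ratio mentioned for GPT-3 (12288 -> 49152)

# Random stand-ins for the learned parameters W_up, b_up, W_down, b_down.
W_up = rng.normal(size=(d_hidden, d_model))
b_up = rng.normal(size=d_hidden)
W_down = rng.normal(size=(d_model, d_hidden))
b_down = rng.normal(size=d_model)

def mlp_block(e):
    # Up-project and add bias, clip with ReLU, down-project and add bias,
    # then add the result back onto the vector that flowed in.
    h = np.maximum(W_up @ e + b_up, 0.0)
    return e + (W_down @ h + b_down)

e = rng.normal(size=d_model)
out = mlp_block(e)
print(out.shape)  # same dimension as the input: (8,)
```

In a transformer this function is applied to every vector in the sequence independently and in parallel, which is why understanding it for one vector is enough.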
And the down projection has the same number of parameters just with a transposed shape. So together, they give about 1.2 billion parameters. The bias vector also accounts for a couple more parameters, but it's a trivial proportion of the total, so I'm not even going to show it. In GPT-3, this sequence of embedding vectors flows through not one, but 96 distinct MLPs, so the total number of parameters devoted to all of these blocks adds up to about 116 billion. This is around two-thirds of the total parameters in the network, and when you add it to everything that we had before for the attention blocks, the embedding, and the unembedding, you do indeed get that grand total of 175 billion as advertised. It's probably worth mentioning there's another set of parameters associated with those normalization steps that this explanation has skipped over, but like the bias vector, they account for a very trivial proportion of the total. As to that second point of reflection, you might be wondering if this central toy example we've been spending so much time on reflects how facts are actually stored in real large language models. It is true that the rows of that first matrix can be thought of as directions in this embedding space, and that means the activation of each neuron tells you how much a given vector aligns with some specific direction. It's also true that the columns of that second matrix tell you what will be added to the result if that neuron is active. Both of those are just mathematical facts. However, the evidence does suggest that individual neurons very rarely represent a single clean feature like M. J. And there may actually be a very good reason this is the case, related to an idea floating around interpretability researchers these days known as superposition. This is a hypothesis that might help to explain both why the models are especially hard to interpret, and also why they scale surprisingly well. 
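The parameter counting in the last two paragraphs is just a few multiplications, which can be replayed directly:

```python
# Reproducing the GPT-3 MLP parameter count from the narration.
d_embed = 12_288          # embedding dimension
d_hidden = 4 * d_embed    # 49,152 neurons per MLP block
n_blocks = 96             # number of distinct MLPs the vectors flow through

up = d_hidden * d_embed       # up-projection: ~604 million parameters
down = d_embed * d_hidden     # down-projection: same count, transposed shape
per_block = up + down         # ~1.2 billion per block
total = per_block * n_blocks  # ~116 billion across all MLP blocks

print(f"{up / 1e6:.0f}M per projection, {total / 1e9:.0f}B across all MLPs")
# → 604M per projection, 116B across all MLPs
```

The bias vectors (49,152 plus 12,288 entries per block) are omitted here, matching the narration's point that they are a trivial proportion of the total.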
The basic idea is that if you have an n-dimensional space and you want to represent a bunch of different features using directions that are all perpendicular to one another in that space, you know, that way if you add a component in one direction it doesn't influence any of the other directions, then the maximum number of vectors you can fit is only n, the number of dimensions. To a mathematician, actually, this is the definition of dimension. But where it gets interesting is if you relax that constraint a little bit and you tolerate some noise. Say you allow those features to be represented by vectors that aren't exactly perpendicular, they're just nearly perpendicular, maybe between 89 and 91 degrees apart. If we were in two or three dimensions, this makes almost no difference; it gives you hardly any extra wiggle room to fit more vectors in, which makes it all the more counterintuitive that for higher dimensions, the answer changes dramatically. I can give you a really quick and dirty illustration of this using some scrappy Python. It's going to create a list of 100-dimensional vectors, each one initialized randomly. And this list is going to contain 10,000 distinct vectors, so 100 times as many vectors as there are dimensions. This plot right here shows the distribution of angles between pairs of these vectors. So because they started at random, those angles could be anything from 0 to 180 degrees. But you'll notice that already, even for random vectors, there's this heavy bias for things to be closer to 90 degrees. Then what I'm going to do is run a certain optimization process that iteratively nudges all of these vectors so that they try to become more perpendicular to one another. After repeating this many different times, here's what the distribution of angles looks like. We have to actually zoom in on it here, because all of the possible angles between pairs of vectors sit inside this narrow range between 89 and 91 degrees.
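A scaled-down sketch of that "scrappy Python" experiment is below. It uses far fewer vectors and dimensions than the narration's 10,000-in-100 so it runs quickly, which means the final angles spread wider than the 89-to-91-degree band quoted above; the optimization step is one plausible choice (projected gradient descent on the sum of squared pairwise dot products), not necessarily the exact one used in the video:

```python
import numpy as np

dims, count, steps, lr = 30, 300, 300, 0.01  # 10x more vectors than dimensions
rng = np.random.default_rng(0)

# Random unit vectors, one per row.
V = rng.standard_normal((count, dims))
V /= np.linalg.norm(V, axis=1, keepdims=True)

pairs = np.triu_indices(count, k=1)
cos0 = (V @ V.T)[pairs]  # initial pairwise cosines, already biased toward 0

for _ in range(steps):
    G = V @ V.T                  # all pairwise dot products
    np.fill_diagonal(G, 0.0)     # ignore each vector's dot with itself
    V -= lr * (G @ V)            # gradient step pushing overlaps toward zero
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # renormalize to unit length

cos = (V @ V.T)[pairs]
angles = np.degrees(np.arccos(np.clip(cos, -1, 1)))
print(angles.min(), angles.max())  # the angle range tightens around 90 degrees
```

Even at this small scale, the worst pairwise overlap shrinks substantially, so 300 vectors end up mutually close to perpendicular in only 30 dimensions.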
In general, a consequence of something known as the Johnson-Lindenstrauss lemma is that the number of vectors you can cram into a space that are nearly perpendicular like this grows exponentially with the number of dimensions. This is very significant for large language models, which might benefit from associating independent ideas with nearly perpendicular directions. It means that it's possible for it to store many, many more ideas than there are dimensions in the space that it's allotted. This might partially explain why model performance seems to scale so well with size. A space that has 10 times as many dimensions can store way, way more than 10 times as many independent ideas. And this is relevant not just to that embedding space where the vectors flowing through the model live, but also to that vector full of neurons in the middle of that multilayer perceptron that we just studied. That is to say, at the size of GPT-3, it might not just be probing at 50,000 features, but if it instead leveraged this enormous added capacity by using nearly perpendicular directions of the space, it could be probing at many, many more features of the vector being processed. But if it was doing that, what it means is that individual features aren't going to be visible as a single neuron lighting up. It would have to look like some specific combination of neurons instead, a superposition. For any of you curious to learn more, a key relevant search term here is sparse autoencoder, which is a tool that some of the interpretability people use to try to extract what the true features are, even if they're heavily superimposed across all these neurons. I'll link to a couple really great Anthropic posts all about this. At this point, we haven't touched every detail of a transformer, but you and I have hit the most important points. The main thing that I want to cover in the next chapter is the training process.
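The geometric fact behind that capacity claim can be illustrated without any optimization at all: the typical overlap between two random unit vectors shrinks like 1/sqrt(d), so independent directions interfere less and less as the dimension grows. This toy check is not the Johnson-Lindenstrauss lemma itself, just the high-dimensional near-perpendicularity it rests on:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_overlap(d, trials=200):
    """Average |dot product| between two random unit vectors in d dimensions."""
    total = 0.0
    for _ in range(trials):
        a = rng.standard_normal(d); a /= np.linalg.norm(a)
        b = rng.standard_normal(d); b /= np.linalg.norm(b)
        total += abs(a @ b)
    return total / trials

for d in (10, 100, 1000):
    print(d, mean_overlap(d))  # shrinks roughly like 1/sqrt(d)
```

In 10 dimensions random directions overlap noticeably; by 1,000 dimensions they are nearly perpendicular by default, which is the wiggle room that superposition exploits.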
On the one hand, the short answer for how training works is that it's all backpropagation, and we covered backpropagation in a separate context in earlier chapters of the series. But there is more to discuss, like the specific cost function used for language models, the idea of fine-tuning using reinforcement learning with human feedback, and the notion of scaling laws. Quick note for the active followers among you, there are a number of non-machine learning related videos that I'm excited to sink my teeth into before I make that next chapter, so it might be a while, but it'll come in due time.

41:45 - Unidentified Speaker
All right.

41:47 - Conference Room (D. B.) - Speaker 2
Many people contribute to this.

41:52 - M. M.
It's really very good, but a lot of people are working.

42:03 - Unidentified Speaker
So many.

42:05 - Conference Room (D. B.) - Speaker 2
OK. Any other comments or questions? I guess I'm sort of flabbergasted by this phenomenon where this technology has these emergent properties that even the inventors didn't really anticipate. And also they don't even know how it stores data. Like if you want to store that M. J. plays basketball in a database, you'd know how to do it. You'd have a key, right? A record with a key, which would probably be the string M. J. There would be a column for, you know, sport played or profession or something like that. And in the cell in that column would be, you know, basketball. That's how you'd store it in a database, and you could compare sizes.
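[The "database way" described in this turn, spelled out as a minimal sketch using Python's built-in sqlite3 module; table and column names are made up to match the example:]

```python
import sqlite3

# An explicit key and an explicit column: the fact is directly readable.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE facts (name TEXT PRIMARY KEY, sport TEXT)")
con.execute("INSERT INTO facts VALUES (?, ?)", ("M. J.", "basketball"))

row = con.execute("SELECT sport FROM facts WHERE name = ?", ("M. J.",)).fetchone()
print(row[0])  # → basketball
```

The contrast with the video's picture is that a transformer has no such lookup key; the same fact is smeared across matrix weights and only surfaces as a direction added to a vector.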

43:02 - D. D.
Go ahead. It tokenizes it, and then the token is known by its weights. So it's got its own way of storing it, but it's not something that we would read like a database. It's stored in the weights.

43:25 - M. M.
Yeah, but it can be converted from one storage format to another. This is also possible. You know, another...

43:37 - Conference Room (D. B.) - Speaker 2
That'd be a good project.

43:39 - M. M.
What D. said: if you want the database, you can convert it to a database. That would be a really good project.

43:49 - D. D.
Something that you could get a hold of, like one of the smaller BERT models on Hugging Face, and change it over to a database.

44:02 - D. D.
It had to be a small one.

44:10 - Unidentified Speaker
Yep. Different concept.

44:14 - Conference Room (D. B.) - Speaker 2
Whoops, sorry about that. Right, well, we finished it. Whoops.

44:27 - Unidentified Speaker
Yep.

44:29 - Unidentified Speaker
What did I do?

44:35 - Conference Room (D. B.) - Speaker 2
You know, another question that kind of has a long history in scientific investigation is the question of how humans and animals and brains store memory. And nobody's ever really figured it out, but maybe this is sort of how it's done. It's not stored in any particular neuron. It's just sort of stored as a... Maybe this is something for experimental psychology as well. Okay, I guess we're at a good stopping point. We finished the Chapter 7 video. Next time, D. will step us through the process by which he gets these transcripts from Read.AI. So Read.AI gives the transcript, and then he processes it in an interesting way to anonymize it and so on. This is last week's transcript. I didn't get rid of that. And soon we will start a new reading or potentially video, depending on what people want, since we're finished with this series. All right, folks. Well, thanks for joining in, and we'll see you next time. Thanks, guys.

46:28 - D. D.
As always, thank you.