Friday, January 31, 2025

1/31/25: AA and MSM presentations on agent-based AI (RAG and video generation)

  Machine Learning Study Group

Welcome! We meet from 4:00-4:45 p.m. Central Time. Anyone can join. Feel free to attend any or all sessions, or ask to be removed from the invite list as we have no wish to send unneeded emails of which we all certainly get too many. 
Contacts: jdberleant@ualr.edu and mgmilanova@ualr.edu

Agenda & Minutes  (148th meeting, Jan. 31, 2025)

Table of Contents
* Agenda and minutes
* Transcript (when available)

Agenda and minutes
  • MM's student AA presented his work on an agent-based architecture for RAG. Then MSM presented an agent-based architecture for creating a story with accompanying images and turning it into a video. 
The meeting ended here.
  • Announcements, updates, questions, presentations, etc.
  • Recall the masters project that some students are doing and need our suggestions about:
    1. Suppose a generative AI like ChatGPT or Claude.ai was used to write a book or content-focused website about a simply stated task, like "how to scramble an egg," "how to plant and care for a persimmon tree," "how to check and change the oil in your car," or any other question like that. Interact with an AI to collaboratively write a book or an informationally near-equivalent website about it!
      • BI: Maybe something like "Public health policy." Not present today.
      • LG: Thinking of changing to "How to plan for retirement." 
        • Looking at CrewAI multi-agent tool, http://crewai.com, but hard to customize, now looking at LangChain platform which federates different AIs. They call it an "orchestration" tool.
        • MM has students who are leveraging agents and LG could consult with them
      • ET: Gardening (veggies, herbs in particular). Specifically, growing vegetables from seeds. Using ChatGPT to expand parts of a response. Use followup questions to expand each smaller thing. Planning to try another AI to compare with ChatGPT. These AIs can generate pictures too, if you ask them. But these images aren't necessarily perfect.
  • Anything else anyone would like to bring up? 
  • We are up to 13:05 in the Chapter 6 video, https://www.youtube.com/watch?v=eMlx5fFNoYc and can start there.
  • Schedule back burner "when possible" items:
    • If anyone else has a project they would like to help supervise, let me know.
    • JK proposes complex prompts, etc. (https://drive.google.com/drive/u/0/folders/1uuG4P7puw8w2Cm_S5opis2t0_NF6gBCZ).
    • The campus has assigned a group to participate in the AAC&U AI Institute's activity "AI Pedagogy in the Curriculum." IU is on it and may be able to provide updates when available, every now and then but not every week.
      • 1/31/25: There is also an on-campus discussion group about AI in teaching being formed by ebsherwin@ualr.edu.
  • Here is the latest on future readings and viewings

Transcript:


ML Discussion Group  
Fri, Jan 31, 2025

0:07 - Unidentified Speaker
Hi, everyone. Hello.

1:41 - D. B.
Well, is there anything on the schedule in particular today? I know that Dr. M.'s students were considering telling us about LangChain and CrewAI.

2:00 - M. M.
Yeah, I want to introduce A., our new PhD student who is coming to our group. I appreciate that, S., yeah, S. A. actually is joining us. I really appreciate the participation of all of you. Welcome, S. Glad to see you again.

2:27 - Multiple Speakers
What did you say? I said, welcome to S.

2:32 - D. B.
Glad to see you again. Yeah, yeah, yeah.

2:37 - M. M.
Thank you so much.

2:39 - Multiple Speakers
Thank you for the invitation.

2:41 - M. M.
I'm so glad to see you. Yes, thank you for coming. We organized with D., with D., we organized this blog and discussion group every Friday at four o'clock. Please come and join us when you are available.

2:58 - Multiple Speakers
Yeah, if anyone who's here and has not been here before would like to be added to the calendar entry, just let me know.

3:08 - D. B.
I'll put your email on the calendar entry, and it'll be on your Google Calendar. Do you want me to give their emails, R.'s email?

3:18 - M. M.
Well, I mean, if she wants to be on the list, I guess.

3:23 - D. B.
I mean, you want to just give it to me now, or they can contact me later? Yeah, please can you admit Okay, what's your email address?

3:37 - S. A.
S-N-W-R. S-N-W-R.

3:59 - M. M.
Particularly A. and S., who are here right now, they participate in several hackathons and competitions. Well, they're well trained in large language models and computer vision and multi-modal techniques, generative AI. So for one of the projects that we have with the Census Bureau with Dr. T., we suggested working with multi-agents, and the idea was very well accepted. So how do we do it, then? I think A. will make a presentation.

4:47 - D. B.
Yeah, that'd be great.

4:49 - M. M.
If you give the access to, Do you want to share a screen?

4:56 - D. B.
I can unshare my screen if you can share or whatever.

5:01 - Multiple Speakers
Yeah, I can share my screen, Dr. B.

5:05 - Unidentified Speaker
Okay.

5:06 - M. M.
Okay, so they will, please feel free, because it's a discussion group, so feel free. It's not so official, official presentation. Feel free to ask questions and of course, Your feedback is very important. There are many, many multi-agent techniques. Probably, A. can share one of the links that we have. But first, he will start with his experience and how he's using and currently how he's using generative AI. And after that, we will show you some video. OK? So, A.

5:48 - A. A.
Yeah. OK, so good afternoon, everyone. I'm A. I'm currently working on this entity resolution project for the U.S. Census Bureau. So this entity resolution project is like matching records between multiple data sets. So if we have some records that represent the same real-world entity, we directly match them across multiple data sets. The problem is not a lot of machine learning or deep learning algorithms are able to do that perfectly. But when the LLMs and the multi-agent systems were introduced, we thought that we would extend this research and try it out to check whether it does a good job or not. So for this, I'll be going through the basics of multi-agent systems and what we added to the whole application. And then I'll be showing a small code snippet on how I did that using CrewAI. So to get started, the multi-agent systems are equipped with their own large language models. So usually if we have multiple tasks that we need to do, the single-agent large language model system finds it very difficult because the users' queries are too much for it to handle. So multi-agent systems do a good job where we can segregate some tasks and we'll be able to combine the outputs of different agents and give it back to the user. This actually increases the reliability of the results. I've seen that myself. So the important thing here is we need to give them specific instructions so that they can do the tasks on their own. The main part with this multi-agent system is that the prompts that we give should be precise and up to the mark. Otherwise, it makes mistakes along the way. So let me show you how the multi-agent system works. The diagram here shows us that first of all, we'll be sending the query. The user will give the query to the system. So basically what it does is the parsers might contain our data sets, which may be a PDF file or an Excel or a CSV file. And then the third part is we'll do the embeddings on these documents and create a vector database. So the main reason why we create a vector database is that it helps the LLM models look for the semantic meaning of a particular word, so that it would be easier for the LLMs to identify the words which have a similar meaning and join them together. And then it goes to the multi-agent system, where the agents have specific instructions on what to do. So in this particular diagram, we have three agents. We have a ranker agent, the reader, and an orchestrator. Each has a different task. The ranker agent will rank the retrieved documents, the reader will read through them and check, and the orchestrator will summarize the whole document and give it back to the user as a whole response. So this is how a multi-agent system works, but we did not use the same multi-agent systems. We did a different thing with that. I'll be getting back to it.
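For reference, here is a schematic sketch of the ranker / reader / orchestrator split A. describes. It is illustrative only, not the project's code: ask_llm() is a placeholder for whatever chat-model client is actually used, and the prompts and the top-5 cutoff are assumptions.

def ask_llm(prompt: str) -> str:
    # Placeholder: plug in whatever chat-model client the project actually uses.
    raise NotImplementedError

def ranker(query: str, chunks: list[str]) -> list[str]:
    # Rank the retrieved chunks by relevance to the query and keep the top few.
    ranked = ask_llm("Rank these chunks by relevance to: " + query + "\n\n" + "\n---\n".join(chunks))
    return ranked.splitlines()[:5]

def reader(query: str, chunks: list[str]) -> str:
    # Read the surviving chunks and pull out only the facts the query needs.
    return ask_llm("Answer '" + query + "' using only this context:\n" + "\n---\n".join(chunks))

def orchestrator(query: str, chunks: list[str]) -> str:
    # Coordinate the other two agents and hand the user one combined response.
    notes = reader(query, ranker(query, chunks))
    return ask_llm("Summarize this into a final answer to '" + query + "':\n" + notes)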

10:02 - A. A.
The main thing is, even though multi-agent systems are good at doing subtasks, the problem is that once we input a larger file, like we'll be inputting multiple PDFs or CSV files, which will be like a 1 GB CSV file or 1 GB PDF file, it becomes harder for the LLMs to loop through all the data inside that and give us a result. It takes a lot of time and sometimes it will hallucinate. So because of that problem, we added a retrieval augmented generation system. So this retrieval augmented generation system, what it does is, first of all, we do the embeddings from the data which was given to us, and then it will convert them into vectors. The vectors usually capture the semantic meaning of the document, so that it would be easier for the LLM to combine the words later on. But once the vectorization has been done, the data goes to a Retrieval Augmented Generation system so that the LLM can interact with the Retrieval Augmented Generation system and say, hey, I only need this particular data from the data set. So the RAG system, what it does is it only takes that particular data which is relevant to the context and gives it to the LLM. Now the LLM does not have to loop through all the data which is in the dataset to give us the answer. So by doing this, it reduces a lot of time and increases the reliability of the answers which are provided by the multi-agent system. So that's why we are using a RAG system. Moving on, in our particular project that we are working on, we are using an improvised version of a RAG system which is called a multi-query RAG system. So usually when this was initially released, we were using a single-query retriever RAG system, where the LLM asks the RAG system for this particular information from the document and it gives only that information. But the problem is the prompts that we give are directly passed by the LLM to the RAG system. So there is a high chance that the RAG system does not give additional relevant information on that. So a lot of data is being lost in that particular process. So they came up with a solution, which is a multi-query RAG system. So what it does is, once the LLM gives the query, this particular system comes equipped with another LLM inside it. So what that LLM does is take the original query which is given to the RAG system and use it to generate multiple queries. I don't know exactly how many generated queries it executes, but the minimum it executes is like five. So these generated queries are then given to the RAG system once again, so that, since we are giving like five or six different queries at the same time, the RAG system will give different results, but they would be in the same context. So combining all these results, it will give us a broader context for the LLM to work with, so that minimal data is lost in this process. So the context we are getting is much higher than using a single-query retrieval RAG system. And then these data are given to the multi-agent system to do the task. So I have tried this on our project with the single-query and the multi-query retriever, and the multi-query retriever does an even better job by inputting a larger context and giving us a much better result than a single-query retriever RAG system. 
So, yeah, I highly suggest the multi-query RAG system; it is better at capturing additional relevant information, and it would be more accurate than a single-query RAG system.
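A minimal sketch of that multi-query pattern, assuming LangChain's MultiQueryRetriever with OpenAI models and a Chroma vector store. The imports are real packages, but the model name, sample chunks, and query are assumptions for illustration, and the retriever API differs slightly across LangChain versions:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_query import MultiQueryRetriever

chunks = ["record text 1", "record text 2"]  # chunked text from the parsed CSV/PDF files
vectordb = Chroma.from_texts(chunks, embedding=OpenAIEmbeddings())
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# The retriever asks the LLM to rewrite the user's query into several variants,
# runs each variant against the vector store, and returns the union of the hits.
retriever = MultiQueryRetriever.from_llm(retriever=vectordb.as_retriever(), llm=llm)
docs = retriever.invoke("Which records refer to the same real-world entity?")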

15:10 - A. A.
Yeah, so let me just stop my screen for a moment and I'll share a small code snippet and I'll explain that in just a second. Yeah, so here, right now, for our entity resolution task, we are using an agent framework. So in this agent framework we will assign roles to multiple agents. So here, this first agent is a direct record linkage agent. So the task for this agent is that it will look through multiple data sets, and if a record directly matches with another data set, or matches by address, it will match it. So we have multiple agents doing tasks. And in this agents file, we'll just define the role and the goal for that particular agent. And in the task, we will give them precise instructions on what to do. We'll also provide them with examples so that they can understand the context really well and then do the task with much more preciseness. So in CrewAI, it is very much easier to add the agents and allocate them specific tasks. So we just need to write these prompts and it is much easier to implement it with CrewAI. But while executing this project, the problem with CrewAI is that it is very much better at using text as an input and generating its own answers. But when we use our own data set and try to do a specific task by allowing these agents to read our data and give us the output, it is not doing a good job. Over time, these agents are combining themselves together and then they are trying to repeat themselves over and over again. So, CrewAI is not really working in this particular task as we expected. So, we are currently moving towards LangChain. Specifically, they have a library called LangGraph, which is an agent architecture as well, which contains nodes and edges. So it is supposed to be much better than CrewAI, but the problem is it is a bit harder to implement with LangChain. CrewAI offers a much easier implementation where we can just plug and play, but LangChain is a bit more complicated to execute, though it is much better than CrewAI. That's what we know so far. Right now we are trying to change the whole system to use the LangChain multi-agent system, and we need to check its results as well. I hope it is better than CrewAI.
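A hedged sketch of the CrewAI role / goal / task pattern A. describes. The role text, task wording, and placeholder records are illustrative, not the project's actual prompts, and CrewAI's constructor arguments vary a bit between versions:

from crewai import Agent, Task, Crew

linkage_agent = Agent(
    role="Direct record linkage agent",
    goal="Match records across data sets that refer to the same real-world entity",
    backstory="You compare names and addresses and link records that clearly match.",
)

linkage_task = Task(
    description=(
        "Given the records below, return pairs that refer to the same entity, "
        "matching directly on name or on address.\n{records}"
    ),
    expected_output="A list of (record_id_a, record_id_b) pairs, each with a short reason.",
    agent=linkage_agent,
)

crew = Crew(agents=[linkage_agent], tasks=[linkage_task])
result = crew.kickoff(inputs={"records": "...parsed records go here..."})
print(result)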

18:49 - Unidentified Speaker
Yep.

18:49 - Multiple Speakers
A. and Dr. M. should we wait for questions until the presentation is over?

18:55 - A. A.
or can we ask questions during the presentation? Yeah, I'm actually done with my presentation. So I'm open for questions right now.

19:08 - Unidentified Speaker
Yeah.

19:09 - Y. P.
Okay. Yeah, the second.

19:11 - A. A.
Yeah. Yeah.

19:12 - Unidentified Speaker
Yes.

19:13 - Multiple Speakers
So then should I go ahead and ask questions?

19:18 - Y. P.
Yes, please. Okay. So I'll start with your last slide. CrewAI versus LangChain. And you mentioned you're finding something difficult or something is not easy. I don't remember the exact words, but you had some difficulties. I would like to know what difficulty you're facing in LangChain. So that if there is anything I could do to help you out.

19:46 - Multiple Speakers
Yeah, I'd not say difficulty. I'm like, I've not implemented yet.

19:50 - A. A.
I was just saying that CrewAI is much easier to implement than LangChain. LangChain might be a bit complicated. That's what I said. If you face any difficulty, let me know.

20:02 - Y. P.
Either me or somebody can help you out if at all you need any help. Yeah, sure, sure.

20:10 - Multiple Speakers
My name is Y. P. or people call me Y.

20:14 - Y. P.
So I wanted to ask you. Thanks a lot.

20:17 - Multiple Speakers
I have a couple of more questions.

20:21 - Y. P.
if I can ask. So, when you're saying agents, agents in CrewAI, agents in another framework, actually, let me start with the first question. When it comes to your use case, are you putting all the data in this framework? Or do you have kind of a segregation process? Yeah. Sorry.

20:52 - A. A.
Yeah, I'm actually getting the data and converting them to chunks. And then those chunks go into the embedding process. And then it goes to the vectors.

21:06 - Y. P.
Yeah. Not that well, I'll explain the other way. So census data is large data, and there are multiple data sets. And see, if you think from processing standpoint, and you were yourself saying that, the traditional rule-based system will work much better than the machine learning or pre-GPT, non-GPT AI engines. And then GPT AI engines will consume the maximum processing power or energy and all that. So when you are doing that, like the way you are thinking of multi-agents within this, have you thought of, now I know you are experimenting might be only on RAG and agents, but is there a scope, Dr. M, also that might be a question for you, where actually you are saying processing-wise, we have eliminated records that really do not need to go through this, where there is clear match, so we are removing that, and then remaining data sets only where it cannot be matched or something, you are putting it for learning. Yeah. So are you doing that or no? Or have you thought of doing that?

22:26 - M. M.
Yes, but he mentioned that the first step is a direct match. This is what you're talking about. Something that is really obvious, we don't continue investigating more. So, we will do this step. We remove what is really obvious and start digging, you know, deeper. Yeah.
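A small sketch of that direct-match-first idea, purely as an illustration: normalize the obvious keys, peel off exact matches with pandas, and send only the leftover ambiguous records on to the LLM agents. The file names and column names are assumptions, not the actual Census data layout:

import pandas as pd

def normalize(col: pd.Series) -> pd.Series:
    # Lowercase and strip whitespace so trivially identical values compare equal.
    return col.fillna("").astype(str).str.lower().str.strip()

a = pd.read_csv("dataset_a.csv")   # assumed file and column names, for illustration only
b = pd.read_csv("dataset_b.csv")
for df in (a, b):
    df["match_key"] = normalize(df["name"]) + "|" + normalize(df["address"])

obvious = a.merge(b, on="match_key", suffixes=("_a", "_b"))     # direct matches
remaining_a = a[~a["match_key"].isin(obvious["match_key"])]     # still ambiguous
remaining_b = b[~b["match_key"].isin(obvious["match_key"])]
# remaining_a / remaining_b are what would go on to the multi-agent system.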

22:48 - A. A.
I mean, the thing is, the census data that we got, it's like a very dirty data set. So, in the column called names, there will be the address. In the address, there will be some identification numbers. Pretty jumbled. So what we are trying to do is that we are trying to automate it without any rule-based approach using LLMs alone.

23:18 - Y. P.
So that's what we are trying to do.

23:21 - Multiple Speakers
Thank you for the explanation.

23:24 - M. M.
Yeah, we actually published several papers using transformers, you know. This is something new with multi-agents, but we actually proposed and already proved that large language models, and particularly transformers and explainable transformers, give us very, very good results. So we already use them for the census, and we're actually the first university that suggested to them, you know, to use the large language models. They rejected it so many times, but right now they are asking us for the code and they have approved our work already. So I want to share again, D., with you probably, the new code that I have for this semester, for everybody that is in our group, to use NVIDIA courses for free. Because every semester they change the code. We have a new code right now. Well, because this is how A. and several of my students are using Llama. Yeah, so we have actually another presentation. And A., you want to show the

24:46 - A. A.
I think, yeah, S. will show that right now.

24:51 - Multiple Speakers
S. will show another too? Yeah.

24:55 - M. M.
Yeah, I like, we have another. Presentation which S. can show.

25:01 - M. S. M.
I wanted to show the n8n platform from YouTube, as we have not started working with n8n yet. I think it's better if we show everyone the YouTube video that explains how good n8n is in regards to codeless programming for AI agents. I love this one, but do you want to show the storytelling first? The storytelling? I think it's better if we show the storytelling on another day. Why, we have so many people, oh my goodness. You have

25:39 - M. M.
the video. Please prepare the video, the multi-agent one, then you can go ahead with that. I'll show the storytelling.

25:49 - Multiple Speakers
Let me just grab the video Okay, grab the storytelling So when you share a screen to show the video, you need to checkmark that it's optimized for video or something like that.

26:05 - D. B.
In Zoom, there's a checkmark. Otherwise, the video won't come out. So when you go to share the screen, there's a special place to check that it's a video.

26:18 - M. M.
Can you see it?

26:20 - Unidentified Speaker
Yeah.

26:22 - M. M.
This is one of the recommended. I have a list of 10 most popular, and this framework is actually without code. Using Telegram, which is an app you can use on your computer.

26:36 - D. B.
No, it's not coming. I'm not seeing a video.

26:40 - Y. P.
Yeah.

26:41 - Multiple Speakers
I think it's in another screen or document.

26:44 - D. B.
I see your Zoom launch meeting screen, so it's the wrong window. Or the wrong tab.

26:52 - Unidentified Speaker
Or if you send, okay, there you go, you're sharing. That looks like a video, yes. Yeah, this is the video.

27:06 - M. M.
And S., please prepare our video, it's very good. Agents are taking over right now. So in this video, I'm going to show you how you can automate anything. Can you hear it? I hope so. Yeah. Yes, ma'am. Yeah, yeah, we can hear. I'm going to be showing you a couple of examples of how you can add tools to an AI agent, have a conversation with a bot using Telegram, which is an app you can use on your computer, on your phone, on the go. And this is multimodal. So I can type into it, I can speak into it, whether I'm on my phone, on my desktop, and I can give it requests like it's my and it can go out and do things for me. Let me show you a couple of examples of how this actually works, because as you can see, I have a bunch of things connected. I have Google Calendar, I have Airtable, I have Gmail. Things are all over the place, but this AI agent can manage your tools and then decide which one to use on its own. So you give it basic natural language queries, and it comes back after using all your tools with all of your info, and it actually gives you a good response, schedules things for you, can email people, update your CRM, search your CRM, do so many things. Now this is just the tip of the iceberg. If I go to my Google Calendar, as you can see, tomorrow I have a discussion with J. and AI brainstorming. Discussion with J. is two to three, AI brainstorming eight to nine. If I ask Telegram something like, what's on the schedule for tomorrow? And I could send that off, use very natural, small language, because AI is really good at interpreting that and understanding that. What it's going to do is return schedule that I just showed you in a matter of seconds. As you can see, here's your schedule for tomorrow. Morning routine, discussion with J., AI brainstorming. We go to my calendar, this is exactly what I've got tomorrow. Morning routine, discussion with J., AI brainstorming. I could now add an event or something like that. I could be on the go, remember. I could just use my voice. I could say, can you add an event around 11 p.m. to 11:30 p.m. that goes over AI agent discussions with C. You know, horrible. I'm talking even horribly to it. And it's going to parse that, transcribe the recording, and then create an event for me tomorrow after I gave my schedule. In a matter of seconds, guys. So it says 11 to 11:30 p.m. agent discussions with C. tomorrow scheduled. Go down here, boom, AI agent discussions with C. scheduled. I could then ask it questions about certain people. So maybe C. is in my CRM. I could say, return info about C. to me from my CRM. And I could, you know, have it do multiple things for me. So now I'm even going deeper down the rabbit hole. Event's been created, and then I asked for information about C., and now it's giving me things like his name, his email, his company name, notes about him. So now with his email, I can even, you know, use my emailing tool in order to email him the event details. So things are moving very quickly. I created this thing in 45 minutes, but it might be more difficult for you. So that's why I'm here in order to teach you. And in this video, you're going to learn lessons that are much greater than just, you know, putting together a template that you forget about. You're going to be learning things like how to set up a telegram trigger, how to craft amazing instructions like this. And look, watching a video of me building all this, it's going to help you. 
You're going to learn some stuff, but if you want to learn in a more structured way, you want just a full guide so that you can whip things up like this in 45 minutes for any use case that you desire, then I highly, highly recommend joining our AI Foundations community. This network is hyper-focused on building AI agents right now, and we've also just released a full course on agent building within n8n. And this course literally takes you from beginner to pro in n8n, literally setting up your workspace all the way to crafting agentic systems like the one that I'm about to build you today. Now, it's important to have the knowledge in order to do this on your own, and in order to thrive in a world full of agents, you're gonna wanna know how to build these. So that's why we made this in-depth course for you to actually learn the tactics of building agents rather than just relying on other people giving you tutorials and templates. We want to give you the actual way to build them. And that's what this course offers. We go in-depth on so many different topics and also the community, the network, the live calls, everything you get in here is just going to excel your AI agent building abilities. So I'll leave a link in the top comment and the description. Yeah, I think that, yeah, we just grab this template and continue. Uh, do you want me to share the video? Here we have it, a prompt helping us create our... Yes, yeah, sure. Yeah, so I think that this is free, and n8n is free. I tried it a little bit.

32:00 - Unidentified Speaker
I like it because it's no code. No code, good. So maybe S. is ready.

32:09 - M. M.
I can stop sharing and S. can share, but I can show you the, uh, thank you, the video. And I have from Medium a lot of links. They give us the best frameworks for multi-agents, but this is the video. It's long, so I just show you. I'm excited. I need some help for the calendar, at least somebody to help me with the calendar, the reminders and stuff like this. And it can do whatever job you want, you know, summarize papers, search for a ticket for you, whatever you want. S., are you ready to show, please?

33:15 - M. S. M.
Yeah, Dr. M., I'm ready. I can show the video that me and my friend, basically I., created for the NVIDIA AI competition. So we used the multi-agent NVIDIA tools and NVIDIA Guardrails to create a storytelling project. So basically the storytelling was done by agents, and then the scripts were derived from the storytelling. I used those to create images that go along with the story. And at the end, I have parsed all the images together and then used a text-to-audio tool to basically generate audio for the file. So if you guys want, I can show the video which I. posted on his LinkedIn. So I can show you the video. We basically, this was our project. Let me share. So this is my friend I. He's basically presenting this and we worked on this together. Everyone, this is me, I., presenting our entry for the NVIDIA AI Generative Contest, and our project today is a story production crew, which basically generates videos of children's stories from just a text prompt. Here you can see I'm putting in an input for a story about a monkey, a dog, and a dragon. And you can see our LLMs will generate a story from this prompt. So what we do here basically is we create three different separate types of agents, one responsible for detailing the scene, the characters, and then writing the scripts and the story. So from the story, we generate all the other information and then we pass that on through NVIDIA Guardrails and we generate both images and text. And then the text is converted to audio. And then we combine the generated images and the audio to make a video. And here you can see we got the output, and soon we'll demo the video that was produced. So enjoy. Meet M., a friendly dog. During one of his explorations, they quickly become friends and enjoy playing hide-and-seek. M. tells M. about a legendary dragon that lives deep in the forest. This sparks M.'s curiosity, and, determined to meet the dragon despite M.'s initial hesitation, they embark on an adventure together. They eventually find the dragon, E., who has emerald scales and a fiery mane. Though E.'s initial roar is intimidating, M. and M. stand their ground and explain their desire to be friends. Touched by their bravery, E. reveals that he used to have many friends, but humans grew to fear him over time. E., delighted to find new friends who aren't afraid of him, offers to teach them about fire breathing, wingspans, and the importance of friendship. The three of them go on exciting adventures, exploring caves, chasing butterflies, and having campfires. Through their journey, M. and M. learn that true friendship can bridge any difference, and E. finds joy in their companionship.
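For reference, a schematic sketch of the pipeline S. describes: agents write the story and per-scene scripts, an image model renders each scene, a text-to-audio tool narrates it, and the clips are stitched into a video. The three generate_* functions are placeholders for the project's actual CrewAI/NVIDIA, Stable Diffusion, and text-to-speech calls, which were not shown in the meeting:

from dataclasses import dataclass

@dataclass
class Scene:
    script: str        # narration text for this scene
    image_prompt: str  # visual description handed to the image model

def generate_story(prompt: str) -> list[Scene]:
    # Placeholder for the agent crew (scene generator, character designer, scriptwriter).
    raise NotImplementedError

def generate_image(image_prompt: str):
    # Placeholder for the Stable Diffusion (or DALL-E / Midjourney) call.
    raise NotImplementedError

def synthesize_audio(script: str) -> bytes:
    # Placeholder for the text-to-audio tool.
    raise NotImplementedError

def build_clips(prompt: str):
    # One (image, narration) pair per scene; a video tool such as moviepy
    # would then concatenate them into the finished movie.
    return [(generate_image(s.image_prompt), synthesize_audio(s.script))
            for s in generate_story(prompt)]

# build_clips("a story about a monkey, a dog, and a dragon")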

37:06 - Unidentified Speaker
OK.

37:07 - M. M.
Thank you. Thank you. OK.

37:09 - Multiple Speakers
So the problems that I'm facing now currently with this project, so they can be divided into three parts. So the first would be parsing.

37:21 - M. S. M.
So the parsing part is, for each individual story segment, I basically have to compile it separately and the code gives us different images. So I'm still working on a system so that you don't have to compile each segment of the text to get an image; instead, we get all the images together. And the second part of the problem that I'm facing is the inconsistency in image generation. As you've seen before, in the images, the breed of the dog, the color of the dog, or the background, or any other characters, they don't seem to be consistent. So there was a lot of manual processing involved when I actually fixed it. And also I am trying to figure out how we can keep this image generation consistent across all the images, because I'm still facing this problem and I haven't found, or maybe I haven't looked hard enough, let's just say I haven't found one simple solution which I can follow to solve this problem. And the third one would be, as you've seen, this was generated by Stable Diffusion, and this was done in the code manually. I would also like to apply this text in DALL-E for image generation and compare what happens if we use it in the DALL-E platform and the Midjourney platform, and what happens if we don't use those platforms but instead use the API keys for the project. So these three are the challenges I'm now facing and currently working with. Thank you so much, Dr. M., for giving me the opportunity to show this video.

39:24 - M. M.
Sure, sure. And we want help from all of you in this project, you know, the agents can do multiple things, like I showed you. So any ideas that you have, I mean, all kinds of companies can benefit from this.

39:42 - D. B.
I have a couple of questions. So how much code, actual, you know, code did you have to write to create that movie?

39:53 - M. S. M.
There was quite a bit of coding involved. I mean, for the image generation, there was the normal stable diffusion, your run-of-the-mill stable diffusion coding that I got from a GitHub project and I repurposed it and also I did some of my own coding for that.

40:13 - D. B.
Okay, how many lines would you estimate?

40:16 - M. S. M.
That was more than at least 70, 80, not more than 150.

40:21 - D. B.
Okay, well, that's not that much code for generating a whole movie, right?

40:26 - M. S. M.
I mean, it's not generating a whole movie per se, Dr. B. It's basically generating images. And keeping consistency in those images is what I'm facing problems now. As you can see, there was a lot of manual tweaking involved. And also, at one point, I wasn't getting one of the images right. So I had to use the, you know, the website the DALL-E website to basically fix it. So the DALL-E website has this one, this markup tool where you can mark in the image and it will fix the image for you however you want. Like there was one part of the image that the background of the image that the forest, the forest wasn't coming up right as much as good as I want. So I had to do that manually. So you can see it's still a work in progress and there's still a lot of, you know, tinkering that needs to be done in this case.

41:21 - D. B.
So I mean, did you have one agent generate a story and another agent request images for each segment of the story? Exactly, exactly. And did anyone else have any questions about this? I mean, I don't want to overwhelm, ask all the questions.

41:40 - Y. P.
Ben, I'm always a man with questions. If you have questions. You have a bunch of questions. No, no, I was like I have a man always with questions, but if you have more questions, finish them. I can.

41:57 - D. B.
I can hold my question if we can go back. You had it.

42:02 - M. M.
You showed a diagram showing the agents' structure. Yeah, yeah, and for every agent the prompt is in natural language. We're using CrewAI in this case, and OpenAI. And the only thing with NVIDIA is these Guardrails, but Guardrails is just about the security, that the text that the agent generates is not something to hurt people or hurt a child or something. You know, this is the control, the security of the text. Can you go to the diagram real quick? He has three different agents, because the initial query is just a story about these characters. So everything else is computer generated. S., can you please show the three agents?

42:57 - M. S. M.
Sure, Fany.

42:59 - Multiple Speakers
So everything is computer generated. We didn't put any words. Any. And it's generating. I think one is generating the script. We have a good description.

43:13 - M. M.
Okay, yeah. So the scene generator, character, and the scriptwriter. See, they are the three. So the scene, I think that has all the text. Character description, how they look like, you know? And a script writer is, that they exchange. So every time to create the new image, we're using the new text, the dialogue.

43:47 - D. B.
Yeah.

43:47 - M. M.
So we have an audio and image and combining the work that S. is doing is actually generating the images when we will come to every scene, every frame, what people, the dialogue, the text that they talk. I have the whole document, but I think I., not many people participate. I. presents the whole document, what is included, how the text is generated in each part. And we like it so much. And actually, it can be a game. It can be a game. There are games right now like this.

44:44 - Unidentified Speaker
Any questions? We almost finished, yeah. Yeah, hi.

44:48 - Y. P.
The question that I had was around the inconsistency that you were finding on images.

44:56 - Unidentified Speaker
Yes.

44:56 - Y. P.
Now, I have not personally played with video creations and this kind of effort. My work mainly has been on the data side, but similar to the earlier presentations, is there a possibility of building a RAG model? And have you thought of putting these images the way you want somewhere and call a capability or these images each time when you're doing saying that, hey, you have done this, but I would like you to use this. So is there a possibility of controlling these inconsistencies using a RAG framework or similar framework?

45:43 - M. S. M.
Yeah, that is a great advice. Actually, we are trying to do that. I mean, we're still trying to work or make a workaround. As you know, this there wasn't much RAG involved here. There were agents, but not a lot of RAG. So I think we are trying to work around a frame where we can apply RAG into it.

46:11 - M. M.
Well, to copy, to save, like you suggest: maybe the first generated appearance of these characters, we can copy and save it, and give one agent the task to keep this description, to do exactly this role, to keep the description and use exactly this image, exactly this appearance. We can do this, and this is a great suggestion. Yeah, kind of like a feedback loop, correct. And I have one more, I

46:52 - Y. P.
did not, I mean, I saw the video, but I was driving while I was watching it. But there is also a possibility where you might have to create one, two, three, four, and tag those images, tag those images with

47:10 - Unidentified Speaker
background or script or something.

47:12 - Y. P.
So that when you're calling that, because one image perhaps may not be enough, because the color can change, the expression can change, and so on, you might have to create literally a database of either, hey, this is the script, or, this is the background, this is what I want, something like that. I mean, that was running in my mind, but maybe I'm going too deep. First you have to find the framework and then you have to suggest how you're going to build the framework, but those are some of the questions or thoughts that came to my mind and I wanted to share with you all. Thank you.

47:52 - M. S. M.
That is actually a great discussion also. But when we're using Stable Diffusion or any kind of image generation code segment, there's this preconceived notion that the images we are going to get after generation come from a model that's already basically trained on thousands of images. So in our mind, we did not have the idea to keep a database, but yeah, that can be done. But then I think the workaround will be a bit time-consuming. For a small project, it would be time-consuming, but it is great advice if we actually can do that. I'll have to look more into it. Thank you so much for your suggestion.
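A tiny sketch of the character-consistency idea discussed above: save each character's first approved description and prepend it to every scene's image prompt, a lightweight form of the feedback loop mentioned by M. The helper names and the sample descriptions (apart from the dragon's emerald scales and fiery mane, which come from the story) are made up for illustration:

character_sheet: dict[str, str] = {}

def register_character(name: str, description: str) -> None:
    # Save the first approved appearance; later scenes reuse it verbatim.
    character_sheet.setdefault(name, description)

def scene_prompt(scene_text: str) -> str:
    # Prepend every saved description so the image model keeps appearances consistent.
    bible = " ".join(f"{name}: {desc}." for name, desc in character_sheet.items())
    return f"{bible} Scene: {scene_text}. Keep every character exactly as described."

register_character("M. the dog", "a small friendly brown dog with a red collar")
register_character("E. the dragon", "a dragon with emerald scales and a fiery mane")
print(scene_prompt("the three friends share a campfire deep in the forest"))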

48:44 - Multiple Speakers
Actually, I. extended this. I think that he put multiple different solutions.

48:50 - M. M.
So we can invite I. if it's interesting for the group to give us the feedback, what is new right now. Yeah, why not?

49:01 - D. B.
Yeah, this is the initial step.

49:04 - M. M.
This was six months ago, I think, or more. So it's not new. So we have a new update right now.

49:15 - Unidentified Speaker
Wow.

49:15 - D. B.
Okay. Yeah. Great. Well, thanks. Uh, thank you all for attending and thank you to the presenters for bringing that information. Uh, that was quite interesting. Um, any last comments before we adjourn?

49:35 - D. D.
Thank you guys. Thank you.

49:37 - M. M.
Great presentation. Yeah. Thank you.

49:40 - D. B.
All right. You all have a good weekend.

49:48 - Unidentified Speaker
See you next week. Thank you. See you next week. Thank you. Thank you. Bye. Thank you all.



Friday, January 24, 2025

1/24/2025: Presentation by GS

  Machine Learning Study Group

Welcome! We meet from 4:00-4:45 p.m. Central Time. Anyone can join. Feel free to attend any or all sessions, or ask to be removed from the invite list as we have no wish to send unneeded emails of which we all certainly get too many. 
Contacts: jdberleant@ualr.edu and mgmilanova@ualr.edu

Agenda & Minutes  (147th meeting, Jan. 24, 2025)

Table of Contents
* Agenda and minutes
* Transcript (when available)

Agenda and minutes
  • Announcements, updates, questions, presentations, etc.
    1. Schedule:
      1. For Jan. 24
        1. GS will rehearse his pre-defense slide presentation on extracting medical concepts from randomized controlled experiment writeups.
      2. Next time: briefing on today's seminar presenter: Reino Virrankoski, Aalto University (Finland), spoke at the PhD seminar on Security-Related 5G/6G/FutureG Research Activities at Aalto University.
      3. Next time: We might have MM's student(s) A. and/or I. explain LangChain and CrewAI multi-agent tools.
      4. Next time: Updates and guidance from/for masters students using AI to write books or websites.
  • Recall the masters project that some students are doing and need our suggestions about:
    1. Suppose a generative AI like ChatGPT or Claude.ai was used to write a book or content-focused website about a simply stated task, like "how to scramble an egg," "how to plant and care for a persimmon tree," "how to check and change the oil in your car," or any other question like that. Interact with an AI to collaboratively write a book or an informationally near-equivalent website about it!
      • BI: Maybe something like "Public health policy." Not present today.
      • LG: Thinking of changing to "How to plan for retirement." 
        • Looking at CrewAI multi-agent tool, http://crewai.com, but hard to customize, now looking at LangChain platform which federates different AIs. They call it an "orchestration" tool.
        • MM has students who are leveraging agents and LG could consult with them
      • ET: Gardening (veggies, herbs in particular). Specifically, growing vegetables from seeds. Using ChatGPT to expand parts of a response. Use followup questions to expand each smaller thing. Planning to try another AI to compare with ChatGPT. These AIs can generate pictures too, if you ask them. But these images aren't necessarily perfect.
  • Could any of MM's students tell us what they are doing with agents?
  • Anything else anyone would like to bring up? 
  • We started the Chapter 6 video! We got up to 13:05.

Transcript:

ML discussion group 
 

0:16 - V. W.
Hey there.

0:18 - D. B.
Hello. Just getting logged in here.

2:13 - M. M.
D., are you there? Yes, yes.

2:16 - D. B.
Okay, so I got myself a little bind here. I promised G., he could rehearse his pre-defense slide presentation next week, and then a couple hours later, I promised you that this guy V. can present. So I'm wondering if he could, could he go by the following week?

2:39 - M. M.
No. R. is coming just for next Friday, and probably he needs to go back to Finland.

2:47 - D. B.
Oh, he's actually here. Well, okay.

2:49 - M. M.
Well, It will be in the seminar, so maybe we can invite our people to join the seminar. Actually, V. and I, C., several of us, we work with R. V., do you remember? Yes, our brother.

3:10 - Multiple Speakers
Absolutely. Absolutely, yes. He's a multi-talented contributor.

3:14 - M. M.
Very, very, very talented and very good group of people that he has, but then we will invite our, this group of people we can invite for the seminar. Yeah, yeah, okay.

3:28 - D. B.
So I, what I'll need to do is get a link to the seminar.

3:34 - M. M.
Yes, please.

3:35 - D. B.
I'll send out an email to the calendar list and I'll put it in the minutes and everything. So, so I'm going to say.

3:44 - M. M.
The seminar is if the people are here in person, in 217 I think room, and online. So you give the invitation that we get from L. But V. and my people, my students, B., if you are available, please join us for the presentation in person. Or online, but everybody is invited. 24th? 24th.

4:18 - Multiple Speakers
Next Friday? Yes. Yes, I can be there, thank you.

4:23 - D. B.
Okay, so I need to find out what the link is. Did you probably, someone sent me the link?

4:33 - M. M.
Let me see if I just have it here.

4:37 - D. B.
They're a very good team, very strong, I don't Okay, here's the link, right?

4:46 - M. M.
Yeah. Encourage R. R. wants to go to F because V. knows that they have more drone research than us, but I still believe that we can do the collaboration. Okay for my mom My people they switch. They say that this crew AI it's Easy to use but not so efficient Change they switch it to different platform different tools And they are learning right now. So the Yeah, we've done Suggests to give the presentation maybe yeah, not next week after the next week I will ask my students to share experience with you guys using multi-agent Right now they are so popular very popular people are using them 217.

5:59 - D. B.
Let's give you 11. Yeah, 217. Yeah, we've got this here. Good. Good to see you, man. OK. And I probably should send, I'll just send out a calendar. I'll send out a, you know, from the calendar, it'll let me send an email to everybody too. I'll send out a reminder about this. And so next week, G. will give his pre-defense rehearsal. And then earlier in the day, V. will give his presentation online and in person. And we're all invited. You might have to click on Install button. If you haven't used Collaborate before, it'll want to install something. But that's how it goes. Okay, so I don't have to eat crow. Get away with something there. All right, yeah. Yeah, not necessary.

7:08 - M. M.
Okay, that leaves open January 31st, and you were thinking your students might talk about LangChain?

7:19 - Unidentified Speaker
Yes.

7:19 - D. B.
You can put them, yes. Okay.

7:24 - M. M.
No, better CrewAI, so maybe you can put both, CrewAI.

7:34 - D. B.
CrewAI? Yeah. And how do you spell that?

7:43 - L. G.
Is that right?

7:45 - M. M.
Yeah, that's correct.

7:48 - D. B.
Correct. OK.

7:50 - Unidentified Speaker
Yeah.

7:51 - M. M.
All right. That sounds good.

7:56 - D. B.
OK, sounds good. Yeah, we're good. That settles the schedule. So again, next week we'll have a rehearsal. G. is extracting medical concepts from write-ups, basically texts describing randomized controlled experiments, which are a type of, sort of, high-quality medical research result. Okay. Today, let's see. So we'll go through our usual thing here. So we've got some master's students who have decided to use AI to see what happens, to see how it works, to write a book or equivalent website on the topic of their choice. And I thought we could just go around to those students and get any questions they have or status updates, you know, whatever we can do to help and hear what you're doing. So I see L., I see you on the screen. Why don't you, I guess it's L. and E. So L., why don't you start things off? How are things going in your project?

9:36 - L. G.
Okay, so this week, this month, what I've done so far: I started the week looking at CrewAI, and it looks okay, but it didn't give a lot of customization for agents. So, like, you could have agents, but it was really hard to customize, like, how the agents would work. And so Dr. M., I hope I pronounced her name correctly, she suggested that I look at LangChain, and I downloaded and played with it just this morning. So I haven't done any extreme amount of work with it. But what it is, it's kind of like a, they describe it as orchestration software. So there's a lot of different things, like you can do RAG with it, you can do agents with it, you can do chatbots with it, but it's meant to allow you to build AI applications in a platform that, you know, is kind of not centric to the AI itself. And you can also use multiple AIs. So I don't see the gentleman here, but I believe it was E. who was talking about, hey, maybe we could get some answers from one AI and check it with the other one. All right. Wonderful. And it does. It was, it was E.

11:00 - V. W.
He's in the lower left-hand corner.

11:02 - Multiple Speakers
I just probably so many people here this week. I can't see him.

11:06 - L. G.
It does that actually very well. So you get your API keys from each of the AIs, and it allows you to kind of use agents in a way that you could say, hey, ChatGPT, I need this, and then run over here, C., can you do this other thing? And so it's kind of like the controller software he was talking about in a traditional system, but it's more about orchestration for developing AI applications. So I only played with it for about two or three hours.
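A minimal sketch of that "ask one AI, check it with another" pattern using LangChain's chat wrappers for two providers. The packages are real, but the model identifiers, question, and review prompt are assumptions for illustration:

from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

gpt = ChatOpenAI(model="gpt-4o-mini")
claude = ChatAnthropic(model="claude-3-5-sonnet-20241022")

question = "List the key steps in planning for retirement."
draft = gpt.invoke(question).content           # first AI drafts the answer
review = claude.invoke(                        # second AI checks the draft
    f"Question: {question}\n\nDraft answer:\n{draft}\n\n"
    "Point out anything missing, wrong, or worth expanding."
).content
print(review)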

11:36 - D. B.
I think it's very promising, though, for what we're trying to accomplish.

11:41 - L. G.
OK, so they use the term orchestration?

11:44 - D. B.
Yeah, that's the way they describe their software. OK, cool.

11:48 - E. G.
All right, sounds good. Orchestration is usually the way that you handle the interactivity of any application.

11:57 - D. B.
Yeah, it's a good word. I'll just mention, for anyone who wants to check the minutes, there's a link. All right, and let's see, E., how are things going with you?

12:44 - E. T.
Well, so far, I've got the major milestones for the project, for the gardening. So first I asked the general question about what do I do for gardening from seeds, you know, vegetables from seeds, and it gave me the basic things like choosing vegetables, determining what type of soil you have, and the place you are planning to garden. Then I started extracting more information by expanding each step. Like, how do I choose vegetables? And it started explaining: well, you need to first find your climate, the climate in the area you're living in or planning to do your gardening. Then it started expanding how to choose the soil type, and my next question was, how do I determine the soil type? And it gave me some options, like you can use some lab tests, or you can do some DIY tests. And then I expanded those, like how do I do a DIY test to determine the soil type? And it started explaining those: you can do these, and these, and these. So it's going pretty well. As L. said, I'm planning to use another AI to compare the answers of each. So far, I've been using ChatGPT. So I didn't put it into action yet, but it would be interesting to compare the response from one AI to another and see which one would give the best results.
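A short sketch of the expand-each-step workflow E. describes, assuming the openai Python client with an OPENAI_API_KEY in the environment; the model name and prompts are placeholders, not what E. actually typed:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

outline = ask("Give me a numbered list of the main steps for growing vegetables from seed.")
sections = []
for step in [line for line in outline.splitlines() if line.strip()]:
    # Expand each outline step into its own section, mirroring the follow-up questions.
    sections.append(ask(f"Expand this step into a detailed how-to section:\n{step}"))
draft = "\n\n".join(sections)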

14:33 - D. B.
OK. Yeah, cool. You know, you sort of know that you can always ask it to expand, but does it ever reach a point where that method gives you undesirable results? I mean, if it was writing a book, all you need to do is just ask it to expand and expand and expand. I don't know.

14:55 - Multiple Speakers
That's sort of one of the questions I'm sort of wondering what will happen.

15:00 - E. T.
I know that expand.

15:01 - D. B.
Sometimes instead of expanding, you might need to look at alternatives. So instead of expanding one short instruction into a longer account, you might say, well, the real answer might be, well, you do this for this kind of herb and that for that kind of an herb.

15:19 - E. T.
And then you end up with alternatives instead of expansions.

15:23 - D. B.
Anyway, I'm curious to see how it'll turn out. By the way, I planted some prickly pear cactus seeds and they're germinating.

15:32 - E. T.
That's amazing. I'll try that. I'll try that for next week. Have you considered maybe using, asking for it to create a picture or a diagram as part of it too?

15:45 - L. G.
Like a multimodal approach where maybe it could create a picture or a diagram or a flow or something that could also add to it.

15:56 - E. T.
That's a great idea.

15:58 - Unidentified Speaker
I tried creating diagrams.

16:00 - E. T.
Well, on the pictures, if I want to add some sort of notes or writing on it, it makes some spelling mistakes. That's what I've encountered so far. On the diagrams, it creates some type of diagrams, but they're not exactly the point that I want to see. Like, not exactly the thing I'm expecting ChatGPT to give me, you know, not the perfect one. But going deeper into those things will be interesting, to see how the results will turn out.

16:41 - D. B.
Yeah, it's, you know, a few months ago, these AIs wouldn't do it; you had to go to a special AI, a special image-generating AI, to get an image. Now, ChatGPT will do it. It'll only do a couple before it says, well, you've got to try again in 24 hours. But yeah.

17:03 - V. W.
I've made a lot of attempts to render both static and dynamic biochemical pathways using the various image generation programs. And they fail spectacularly. It's as though you had a genius child that did not yet know how to draw relationships correctly. And so I'm disappointed because I see the potential for gathering all the facts, but the knowledge of how to do intelligent graphic design that people can parse and understand relationships, we're still falling short of that. And if you gain an advantage in that area, I'd sure be interested in hearing it because there's all kinds of pathways in life and in biochemistry that are helpful to understand. And, you know, seed and gardening are the, of that for us, so.

17:51 - D. B.
You could argue that, you know, they don't really understand text. They just pick the most likely next word and they do something similar with images. And so there's a lack of sort of deeper understanding.

18:05 - V. W.
Yeah, they don't understand the two-dimensional coherence of even a planar image in terms of what relates to what for specific entities in a diagram. Midjourney understands how to go through all the world's art and extract from using a GAN, the sort of composite picture of all art ever drawn, surrounded by that textual framework, but it's a completely different mechanism to draw a, even for a program to draw a UML chart or any kind of a depiction of active relationships or knowledge maps or those kinds of graphs.

18:45 - D. B.
Okay. All right. I keep hoping J. will show up and tell us about his agent-based prompt engineering. So I'm just going to leave this here for a couple more weeks, and if he still doesn't show up, I'll just end up deleting it, I guess. Similarly, we could send him an email of appreciation.

19:06 - V. W.
Maybe he'll come back.

19:07 - D. B.
Well, I did send him an email. It wasn't exactly an email of appreciation. Maybe I should have. Butter him up a little bit or something.

19:17 - V. W.
He has a lot on his plate, so I think maybe encouragement would help.

19:21 - D. B.
Okay. Can you do it?

19:23 - V. W.
I'll do it. All right.

19:25 - D. B.
Because I already did one, so I didn't pitch it the way you recommended.

19:30 - L. G.
Maybe I could do it too, because he'd be surprised. Someone he doesn't know, contact him when he's really dying to hear about his research. There you go.

19:40 - V. W.
Yeah. What's his email?

19:43 - D. B.
Let's see. OK.

19:47 - V. W.
I'll send him an email, too. I think it'd be kind of funny.

19:59 - L. G.
But if I got a response, I'll be glad.

20:07 - Unidentified Speaker
Yeah. OK.

20:08 - D. B.
You got that noted down? Yes.

20:13 - Unidentified Speaker
OK. Back to here.

20:15 - D. B.
Also, I. has not been here in a few weeks. I'm curious to know what the campus's official representation to the cross-university study group has to say. But she may not, I may have driven her away by demanding she report. I'll let that go for another couple of weeks before I delete that as well. Anything else, anyone else would like to bring up before we go to start on a video?

20:58 - G. S.
I wanted to present my research next week, if possible. I missed all of you.

21:06 - D. B.
We got you up for next week. Oh, you got me in?

21:10 - G. S.
Oh, I'm sorry.

21:11 - D. B.
I got a bit late. Sorry.

21:13 - G. S.
Yeah, there was a little bit of a confusion, but it all worked out. Oh, OK. Got it.

21:19 - D. B.
Fine. Thank you so much. Thank you. Good. All right. Yeah, so last time we finished Chapter 5. And it turns out it goes up to Chapter 7. So we're going to work on chapter 6 today. Anything else, anyone, before we do that?

21:43 - Unidentified Speaker
OK. All right, I'm going to get this going.

21:57 - D. B.
Is this only two and a half minutes? Oh, I see. We've got to skip this. OK. All right. What I need to do is stop sharing, and then I'll share again, but optimize for video. So stop sharing. Share screen.

22:17 - Unidentified Speaker
Optimize for video. Share. In the last chapter, you and I started to step through the internal workings of a transformer.

22:31 - V. W.
This is one of the key pieces of technology inside large language models, and a lot of other tools in modern wave of AI.

22:48 - Unidentified Speaker
It first hit the scene in a now-famous 2017 paper called Attention is All You Need, and in this chapter, you and I will dig into what this attention mechanism is, visualizing how it processes data.

23:02 - D. B.
Any comments, just on the basis of that intro, before we go on? We tried reading that paper once or twice, got bogged down pretty quick. But I remember we got bogged down on an equation that this guy has covered in one of his videos. I don't know if we remember it, but we might want to try that paper again. Anyway, here we go. As a quick recap, here's the important context I want you to have in mind.

23:29 - Unidentified Speaker
The goal of the model that you and I are studying is to take in a piece of text and predict what word comes next. The input text is broken up into little pieces that we call tokens, and these are very often words or pieces of words. But just to make the examples in this video easier for you and me to think about, let's simplify by pretending that tokens are always just words. The first step in a transformer is to associate each token with a high-dimensional vector, what we call its embedding. Now the most important idea I want you to have in mind is how directions in this high-dimensional space of all possible embeddings can correspond with semantic meanings. In the last chapter we saw an example of how a direction can correspond to gender, in the sense that adding a certain step in this space can take you from the embedding of a masculine noun to the embedding of the corresponding feminine noun. That's just one example; you could imagine how many other directions in this high-dimensional space could correspond to numerous other aspects of a word's meaning. The aim of a transformer is to progressively adjust these embeddings so that they don't merely encode an individual word, but instead they bake in some much, much richer contextual meaning. I should say up front that... All right. Any...
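The gender-direction example above can be tried directly with pretrained word vectors. A sketch using gensim and a small GloVe model (the model name is just one convenient choice, it downloads on first use, and the nearest-neighbor results are approximate):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # small pretrained GloVe model

# The classic offset: vector("king") - vector("man") + vector("woman") lands near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The same masculine-to-feminine direction applied to another pair.
print(vectors.most_similar(positive=["uncle", "woman"], negative=["man"], topn=3))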

24:45 - V. W.
Any comments so far? This is good stuff. So what he...

24:50 - D. B.
I mean, the end of this last arrow, that points to the point in the embedding whose semantic meaning is a king that lived in Scotland? Is that what I... Understand what I mean?

25:05 - V. W.
I think what's just as significant is the difference between these vectors, which establishes directions that correspond to some emergent thing that you want.

25:17 - D. B.
So if you give it a paragraph, and it goes through the entire paragraph, will that paragraph indicate a point in the embedding, in the multidimensional space?

25:31 - Multiple Speakers
In fact, it's finding metaphors.

25:33 - V. W.
So, King and Queen and niece and nephew, those are metaphorical sorts of things.

25:40 - A. B.
But it's also, it's probably like in this example, right? The language, like "doth" and so forth, is a Shakespearean reference, right? So that's Macbeth, essentially. That's what I was reading into it.

25:59 - Unidentified Speaker
Yeah. Okay.

26:00 - D. B.
Is to progressively adjust these embeddings so that they don't merely encode an individual word, but instead they bake in some much, much richer contextual meaning.

26:11 - Unidentified Speaker
I should say up front that a lot of people find the attention mechanism, this key piece in a transformer, very confusing, so don't worry if it takes time for things to sink in. I think that before we dive into the computational details and all the matrix multiplications, it's worth thinking about a couple examples for the kind of behavior that we want attention to enable. Consider the phrases American shrew mole, one mole of carbon dioxide, and take a biopsy of the mole. You and I know that the word mole has different meanings in each one of these, based on the context. But after the first step of a transformer, the one that breaks up the text and associates each token with a vector, the vector that's associated with mole would be the same in all three of these cases, because this initial token embedding is effectively a lookup table with no reference to the context. It's only in the next step of the transformer that the surrounding embeddings have the chance to pass information into this one. The picture you might have in mind is that there are multiple distinct directions in this embedding space, encoding the multiple distinct meanings of the word mole, and that a well-trained attention block calculates what you need to add to the generic

27:23 - D. B.
embedding to move it to one of these more specific directions. Thoughts or comments? Go ahead. Well, yeah, I think we talked last week and the week before about the temperature and so forth.

27:41 - A. B.
So then, depending on where it goes in this, you know, contextually, right? Like that temperature is also affecting, I guess, how creative or unique a response it might give.

27:57 - V. W.
Where temperature is defined as the likelihood of using the most probable outcome versus one of the lower-probability outcomes. So the higher the temperature, the more likely you are to jump around to less probable scenarios.

28:12 - D. B.
So the word mole has an embedding, but it's kind of a compromise location that sort of averages the three meanings of mole, I would guess. What else could it be? So each meaning of mole has an embedding, and then the word mole itself has some sort of average of those three.

28:34 - V. W.
Well, average implies a blurring of meaning, when in fact, I think the LLM has encoded in it the three distinct meanings of mole, and uses the context to determine which meaning is currently active.

28:48 - D. B.
All right. But in the example, it showed mole having a single embedding, a single column of numbers.

28:55 - M. B. L.
Well, it's using the scientific use of the word mole. So maybe if it was referring to the mammal mole, it might point somewhere else. We'd have to probably backtrack, though.

29:09 - D. B.
Yeah, I'm trying to find exactly. Okay, here's where it is. So if you look at the three mole...

29:18 - V. W.
So that embedding of mole is a vector. And he is showing those embeddings as the same, but depending on the context, the meaning will differ. Right.

29:32 - D. B.
So the context is the neighboring vectors, which will have to somehow be used to modify the mole vector. So pre-attention, the word mole has a single vector, it looks like to me. And then post-attention, that's where they're going to start changing the vector of mole to reflect its actual meaning in that context.

29:58 - E. G.
That's where I think it is, because, as it's stated, you start with the word mole, then you start applying information to it. If you're into programming, it's kind of like a decorator pattern. Okay, now that you've got it, you're going to start applying some information to it, some context to it.

30:25 - D. B.
It's a great example of an ambiguous word, mole. All right.

30:31 - M. M.
Any words like this? It's like, unless we hear the context, we don't know which kind of mole is being discussed.

30:43 - V. W.
I'll have a mole, please.
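A rough Python sketch of E. G.'s decorator-pattern analogy above, with invented toy vectors: the lookup table returns the same context-free "mole" vector every time, and context is layered on afterward (in the real model that shift would be computed by attention, not hand-written as it is here):

  import numpy as np

  # One context-free lookup-table embedding for the token "mole" (toy numbers).
  EMBEDDING_TABLE = {"mole": np.array([0.5, 0.5, 0.5])}

  def embed(token):
      # Pre-attention, the same vector comes back regardless of context.
      return EMBEDDING_TABLE[token].copy()

  def decorate_with_context(vector, context_shift):
      # "Decorating" the generic embedding with a context-dependent shift.
      return vector + context_shift

  animal_shift    = np.array([0.4, 0.0, 0.0])   # "American shrew mole"
  chemistry_shift = np.array([0.0, 0.4, 0.0])   # "one mole of carbon dioxide"
  medical_shift   = np.array([0.0, 0.0, 0.4])   # "take a biopsy of the mole"

  print(embed("mole"))                                        # identical starting point...
  print(decorate_with_context(embed("mole"), animal_shift))   # ...nudged toward the animal sense
  print(decorate_with_context(embed("mole"), chemistry_shift))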

30:46 - Multiple Speakers
That there are multiple distinct directions in this embedding space, encoding the multiple distinct meanings of the word mole, and that a well-trained attention block calculates what you need to add to the generic embedding to move it to one of these more specific directions as a function of the context. To take another example, consider the embedding of the word tower. This is presumably some very generic, non-specific direction in the space associated with lots of other large, tall nouns.

31:20 - V. W.
If this word was immediately preceded by Eiffel, you could imagine wanting the mechanism to update this vector so that it points in a direction that specifically encodes the Eiffel Tower, maybe correlated with vectors associated with Paris and France and things made of steel. And at that exact moment the metaphor has been created.

31:41 - E. G.
That's where I was talking about context.

31:43 - V. W.
So it's that difference, the difference between the original vector and the new one, that creates the opportunity for analogy at that instant, and that analogy is so multi-dimensional and so rich that we get the emergent property we enjoy with LLMs.

31:59 - E. G.
Now, if we take the word tower, if this person towers over another, we've got a different context of the term.

32:07 - V. W.
And even another part of speech, because we have a noun verb duality that's taking place and we have to be able to distinguish which is in operation.

32:19 - E. G.
Is it a verb, a noun?

32:21 - D. B.
I wonder if there's a, a vector that converts nouns to verbs, any noun to any verb.

32:28 - V. W.
That would be a great little experiment to do.

32:32 - D. B.
Mole plus verb verbization gives you, I don't know. Or gives you an adjective, molar concentration. Yeah, that'd be another vector to give you an adjective. So it looks like it's doing, when it says move one vector to another vector, he's talking about doing vector addition. I mean, you add two vectors to get the new vector. So it's addition of vectors, right? Is that dot product?

32:59 - V. W.
Yeah, no, it's vector addition. And a corresponding difference operator is embedded in there too. Because if you add a vector to get a new vector, presumably subtracting that vector gives you the original vector. And looking at the direction of the new vector gives you the refinement that has occurred as you became more specific.

33:20 - Multiple Speakers
Yeah. But arithmetically, are we talking about dot product, I wonder?

33:24 - V. W.
No, dot product is the component of one vector that lies in the direction of another, and the result is a scalar. With vector addition or subtraction, the types going in are vectors and the types coming out are vectors. So the dot product is a scalar that tells you what percentage of a vector lies in the direction of another, and that's all it gives you.
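A small numeric check of the distinction V. W. is drawing, using arbitrary toy vectors: addition and subtraction of vectors give another vector, while the dot product collapses two vectors to a single scalar (the component of one along the other):

  import numpy as np

  a = np.array([1.0, 2.0, 0.0])
  b = np.array([0.0, 1.0, 1.0])

  moved_vector = a + b              # vector addition: the result is another vector
  recovered    = moved_vector - b   # subtracting the same step gets you back to a
  alignment    = np.dot(a, b)       # dot product: the result is a single scalar

  # Dividing by the two lengths gives the cosine of the angle between them,
  # i.e. how much of one vector lies along the direction of the other.
  cosine = alignment / (np.linalg.norm(a) * np.linalg.norm(b))

  print(moved_vector, recovered, alignment, cosine)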

33:50 - Unidentified Speaker
OK.

33:50 - D. B.
All right. Any other comments, anyone? OK. If it was also preceded by the word miniature, then the vector should be updated even further so that it no longer correlates with large, tall things.

34:02 - Unidentified Speaker
More generally than just refining the meaning of a word, the attention block allows the model to move information encoded in one embedding to that of another, potentially ones that are quite far away and potentially with information that's much richer than just a single word. What we saw in the last chapter was how after all of the vectors flow through the network, including many different attention blocks, the computation that you perform to produce a prediction of the next token is entirely a function of the last vector in the sequence. So imagine, for example, that the text you input is most of an entire mystery novel, all the way up to a point near the end, which reads, therefore the murderer was. If the model is going to accurately predict the next word, that final vector in the sequence, which began its life simply embedding the word was, will have to have been updated by all of the attention blocks to represent much, much more than any individual word, somehow encoding all of the information from the full context window that's relevant to predicting the next word.

35:06 - V. W.
To step through the computations though, let's take a look. This shows that the power of a single word can be enormous when placed in a sufficiently complex context. So that was, is so much more powerful than somebody blurting out the single word was, because it carries with it all the information that preceded it in that context.

35:29 - D. B.
So every word, every subsequent word is kind of freighted with the collected meaning of its preceding context. That is a quotable quote.

35:39 - V. W.
You have got to write that down. Will be attributed to you in perpetuity. The freight of the previous words. That's fantastic. Any other comments?

35:50 - D. B.
Okay, let's look at a simpler example. Simpler example. Imagine that the input includes the phrase, a fluffy blue creature roamed the verdant forest.

36:00 - Unidentified Speaker
And for the moment, suppose that the only type of update that we care about is having the adjectives adjust the meanings of their corresponding nouns. What I'm about to describe is what we would call a single head of attention, and later we will see how the attention block consists of many different heads run in parallel. Again, the initial embedding for each word is some high-dimensional vector that only encodes the meaning of that particular word with no context. Actually, that's not quite true. They also encode the position of the word. There's a lot more to say about the specific way that positions are encoded, but right now all you need to know is that the entries of this vector are enough to tell you both what the word is and where it exists in the context. Let's go ahead and denote these embeddings with the letter E. The goal is to have a series of computations produce a new refined set of embeddings where, for example, those corresponding to the nouns have ingested the meaning from their corresponding adjectives. And playing the deep learning game, we want most of the computations involved to look like matrix-vector products where the matrices are full of tunable weights, things that the model will learn based on data. To be clear, I'm making up this example of adjectives updating nouns just to illustrate the type of behavior that you could imagine an attention head doing. As with so much deep learning, the true behavior is much harder to parse because it's based on tweaking and tuning a huge number of parameters to minimize some cost function. It's just that as we step through all of the different matrices filled with parameters that are involved in this process, I think it's really helpful to have an imagined example of something that it could be doing to help keep it all more concrete.
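A minimal sketch of the point that the initial embeddings E encode both the word and its position. The sinusoidal position formula below is the generic recipe from the Attention is All You Need paper, with tiny made-up sizes and random numbers standing in for learned weights; it is not the video's actual model:

  import numpy as np

  d_model = 8   # toy embedding size; GPT-style models use thousands of dimensions
  vocab = ["a", "fluffy", "blue", "creature", "roamed", "the", "verdant", "forest"]
  rng = np.random.default_rng(0)
  W_E = rng.normal(size=(len(vocab), d_model))   # learned token-embedding table (random stand-in)

  def positional_encoding(pos, d=d_model):
      # Standard sinusoidal positional encoding.
      i = np.arange(d)
      angles = pos / np.power(10000, (2 * (i // 2)) / d)
      return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

  tokens = ["a", "fluffy", "blue", "creature", "roamed", "the", "verdant", "forest"]
  E = np.stack([W_E[vocab.index(t)] + positional_encoding(p) for p, t in enumerate(tokens)])
  print(E.shape)   # (8 tokens, d_model): one vector per token, word and position baked in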

37:43 - E. G.
Any comments or thoughts? That further reinforces what we stated earlier: the context of the word, the whole ideation of what's going on with the term. This also goes for tokens like commas.

38:00 - V. W.
People fuss when you don't put commas in a sentence, because the meaning can change so dramatically depending on the phrasing that's provided by the comma. It's at a level below the chunking of a sentence, but above the chunking of simple pairs, twins, or triples of words.

38:22 - D. B.
So that kind of ties into a question I have, which is, since the grammatical structure, the syntactic structure of the preceding, maybe not the preceding paragraph, but the preceding phrase, really counts for a lot. It must somehow keep track of syntax. Like the fluffy blue creature, comma, which roamed the verdant forest. Well, it's got to know that which refers to fluffy blue creature. That's a syntactic analysis thing.

38:55 - V. W.
And that's a parse tree of the noun and verb phrases, in which you have the adverbs and the ad-nouns, which are adjectives, modifying their corresponding parts of speech. Yeah, but they're not talking about parse trees here, but maybe they should be. No, but they get it for free because of the way they're doing the construction. Because the parse tree is something that has been generalized as emerging from the structure of language. It's also interesting that the encoding includes the position of the word, and the position of the word is language dependent. So in French, we would say the table coffee, but in English we would say the coffee table, and LLMs don't seem to have a problem with adjusting the position of the word to compensate for language differences.

39:50 - E. G.
All I could think of was Yoda here. Yoda would have a ball.

39:55 - D. B.
Roamed the verdant forest, the fluffy blue creature did. Yes. Your nerd card has been renewed.

40:01 - V. W.
Yeah.

40:01 - E. G.
All right.

40:02 - Multiple Speakers
So it comes back to my question from earlier. I was just trying to put it together for myself.

40:10 - A. B.
So I apologize, I probably just don't understand, but we talked about the temperature last week; how does that relate here? Is it just essentially determining how smooth or sharp the... I guess what I'm saying is, in an example, if I tweaked the temperature higher or lower, could that then swing the context?

40:38 - V. W.
Could that then push me, you know, to an outcome that's giving me... It changes the outcome: the higher the temperature, the more likely you are to go to a less probable outcome. So if your temperature is zero, you'll get a reproducible, deterministic automaton that will give you the same answer for every prompt. But if your temperature is non-zero, greater than zero, then you will go to scenarios which are less probable. So instead of choosing the word that was 90% likely to be the next one, maybe it'll choose the one that was 87% likely to be the next one.

41:19 - A. B.
Right.

41:22 - A. B.
Because that's used in the softmax function, which is driving this attention layer, right? Right.

41:27 - V. W.
And we discussed the problems with the softmax function and the non-determinism and the complexity that you invoke by considering not only the most likely scenario, but also exploring less likely scenarios and asking, at the end of that process, how different is where I ended up from where I would have ended up if I'd had a lower temperature?

41:47 - A. B.
Yeah. So low temperature means that I'm going to fit very much in context, not going to deviate from it at all.

41:56 - V. W.
It's very thermodynamic.

41:57 - A. B.
If I have high temperature, then I could give a kind of a BS answer that's just not really considering the context at all so much.

42:08 - V. W.
Zero is crystalline perfection and reproducibility. Non-zero temperatures invoke the possibility of noise in the message.

42:15 - D. B.
The temperature doesn't affect the context. It only affects how the context is used. Like if the next word was, you know, the murderer was blank, well, maybe there's a 90% chance the next word is the butler, and a 10% chance the next word was the cook.

42:33 - A. B.
Well, at zero temperature, it'll always pick the butler.

42:36 - V. W.
At higher temperature, it might choose the cook sometimes.

42:39 - Unidentified Speaker
Excellent.

42:39 - A. B.
That makes sense. Thanks for helping put that together.
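A short Python sketch of the butler/cook example just discussed, using a temperature-scaled softmax over the made-up 90%/10% probabilities from D. B.'s example (toy numbers only; this is not how any particular chatbot is configured):

  import numpy as np

  def sample_next_word(logits, temperature, rng):
      if temperature == 0:
          return int(np.argmax(logits))      # deterministic: always pick the most likely word
      scaled = logits / temperature          # higher temperature flattens the distribution
      probs = np.exp(scaled - scaled.max())
      probs /= probs.sum()
      return int(rng.choice(len(logits), p=probs))

  words = ["the butler", "the cook"]
  logits = np.log(np.array([0.90, 0.10]))    # chosen so the split is 90% / 10% at temperature 1
  rng = np.random.default_rng(0)

  for T in [0.0, 1.0, 2.0]:
      picks = [words[sample_next_word(logits, T, rng)] for _ in range(1000)]
      print(T, picks.count("the cook") / 1000)   # "the cook" shows up more often as T rises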

42:42 - V. W.
Thanks. All right. Any other thoughts or questions so far? All right.

42:51 - Unidentified Speaker
The first step of this process, you might imagine each noun, like a creature, asking the question, hey, are there any adjectives sitting in front of me? And for the words fluffy and blue to each be able to answer, yeah, I'm an adjective and I'm in that position. That question is somehow encoded as yet another vector, another list of numbers, which we call the query for this word. This query vector, though, has a much smaller dimension than the embedding vector, say 128. Computing this query looks like taking a certain matrix, which I'll label WQ, and multiplying it by the embedding. Compressing things a bit, let's write that query vector as Q, and then anytime you see me put a matrix next to an arrow like this one, it's meant to represent that multiplying this matrix by the vector at the arrow's start gives you the vector at the arrow's end. In this case, you multiply this matrix by all of the embeddings in the context producing one query vector for each token.

43:49 - D. B.
The entries of this matrix are parameters of the model, which means the true behavior is learned from data, and in practice, what this matrix does in a particular attention head is challenging to parse.

44:02 - V. W.
The phrase challenging to parse at the end of that section is a kind of an interesting connection between the parse tree view of language and the embedding view of language.

44:14 - D. B.
So here we've got an embedding times a bunch of weights gives a query. Is that right? Now with this stuff up at the top, there's an embedding, you matrix multiply it by a weight matrix or weight vector, and you get a query. Here's the expanded form: embedding times weight equals query.
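A sketch of "embedding times weight equals query" with toy sizes: the query matrix W_Q maps each high-dimensional embedding down to the smaller query space (128, matching the size mentioned in the video). Random numbers stand in for the learned parameters:

  import numpy as np

  d_model, d_head = 512, 128    # toy model width; 128 matches the query dimension in the video
  n_tokens = 8
  rng = np.random.default_rng(1)

  E   = rng.normal(size=(n_tokens, d_model))   # one embedding per token (stand-in values)
  W_Q = rng.normal(size=(d_head, d_model))     # learned query matrix (random stand-in)

  Q = E @ W_Q.T                                # one query vector per token
  print(Q.shape)                               # (8, 128)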

44:46 - V. W.
And the complexity of the query being lower, 128, appears to be local in its scoping. So it's doing a more local scoping to try to determine the structure of the local newsflash that we're getting.

45:04 - D. B.
So an embedding is a meaning. It's a location in space. You multiply it by a weight vector to get a query. I guess I'm missing something on the broader...

45:18 - V. W.
The embedding isn't a meaning. The embedding is all possible meanings that we would then like to choose from to proceed with the meaning that is more correct.

45:30 - D. B.
Why are we getting a query at the end?

45:34 - M. M.
We need to change the weights.

45:36 - V. W.
Any adjectives in front of me? That will change what I currently mean. OK. Anything else, anyone?

45:45 - D. B.
All right. But for our sake, imagining an example that we might hope that it would learn, we'll suppose that this query method somehow encodes the notion of looking for adjectives in preceding positions.

46:02 - Unidentified Speaker
As to what it does to other embeddings, who knows? Maybe it simultaneously tries to accomplish some other goal with those. Right now, we're laser focused on the nouns. At the same time, associated with this is a second matrix called the key matrix, which you also multiply by every one of the embeddings. This produces a second sequence of vectors that we call the keys. Conceptually, you want to think of the keys as potentially answering the queries. This key matrix is also full of tunable parameters, and just like the query matrix, it maps the embedding vectors to that same smaller dimensional space. You think of the keys as matching the queries whenever they closely align with each other. In our example, you would imagine that the key matrix maps the adjectives, like fluffy and blue, to vectors that are closely aligned with the query produced by the word creature.

46:54 - V. W.
Yeah, and that alignment is measured by the dot product between those two vectors. If the dot product is one, the query matches the key. If the dot product is zero, it does not. OK. Anything else? OK. To measure how well each key matches each query, you compute a dot product between each possible key query pair.

47:24 - Unidentified Speaker
I like to visualize a grid full of a bunch of dots, where the bigger dots correspond to the larger dot products, the places where the keys and queries align. For our adjective-noun example, that would look a little more like this, where if the keys produced by Fluffy and Blue really do align closely with the query produced by Creature, then the dot products in these two spots would be some large positive numbers. In the lingo, machine learning people would say that this means the embeddings of Fluffy and Blue attend to the embedding of Creature. By contrast, the dot product between the key for some other word, like the, and the query for creature would be some small or negative value that reflects that these are unrelated to each other. So we have this grid of values that can be any real number from negative infinity to infinity, giving us a score for how relevant each word is to updating the meaning of every other word.
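Continuing the toy sketch: the key matrix W_K maps every embedding into the same small space, and the grid of query-key dot products scores how much each token attends to each other token. Again, random numbers stand in for learned weights:

  import numpy as np

  d_model, d_head, n_tokens = 512, 128, 8
  rng = np.random.default_rng(1)

  E   = rng.normal(size=(n_tokens, d_model))
  W_Q = rng.normal(size=(d_head, d_model))     # learned query matrix (random stand-in)
  W_K = rng.normal(size=(d_head, d_model))     # learned key matrix (random stand-in)

  Q = E @ W_Q.T                                # queries, one per token
  K = E @ W_K.T                                # keys, one per token

  scores = Q @ K.T                             # grid of all query-key dot products
  print(scores.shape)                          # (8, 8): relevance of each token to each other token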

48:22 - E. G.
The way we're about to use these scores is to take a certain weighted sum along each column weighted by the relevance. So instead of having values range from negative infinity to infinity, what we want is for the numbers in these columns to be between 0 and 1, and for each column to add up to 1, as if they were a

48:42 - Unidentified Speaker
probability distribution. If you're coming in from the last chapter, you know what we need to do then. We compute a softmax along each one of these columns to normalize the values. In our picture, after you apply softmax to all of the columns, we'll fill in the grid with these normalized values. At this point you're safe to think about each column as giving weights according to how relevant the word on the left is to the corresponding value at the top. We call this grid an attention pattern. Now if you look at the original transformer paper, there's a really compact way that they write this all down. Here the variables Q and K represent the full arrays of query and key vectors respectively, those little vectors you get by multiplying the embeddings by the query and the key matrices. This expression up in the numerator is a really compact way to represent the grid of all possible dot products between pairs of keys and queries. A small technical detail that I didn't mention is that for numerical stability it happens to be helpful to divide all of these values by the square root of the dimension in that key-query space. Then this softmax that's wrapped around the full expression is meant to be understood to apply column by column.
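And the normalization step just described, as a sketch: divide the score grid by the square root of the key/query dimension and apply softmax so each set of attention weights sums to one. (The video's grid has queries along the top and applies softmax per column; the equivalent rows-as-queries layout below is the one most code uses.)

  import numpy as np

  def attention_pattern(Q, K):
      d_k = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)                  # scale by sqrt of the key/query dimension
      scores -= scores.max(axis=-1, keepdims=True)     # subtract the max for numerical stability
      weights = np.exp(scores)
      weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row now sums to 1
      return weights

  rng = np.random.default_rng(1)
  Q = rng.normal(size=(8, 128))
  K = rng.normal(size=(8, 128))
  A = attention_pattern(Q, K)
  print(A.shape, A.sum(axis=-1))                       # (8, 8), each row of weights sums to 1.0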

49:56 - D. B.
As to that V term, we'll talk about it in just a second. Why is it the square root of the dimension?

50:08 - E. G.
Pardon? Man, you're the mathematician.

50:10 - D. B.
Why the square root of the dimension?

50:14 - E. G.
Not the sum.

50:20 - V. W.
What I was going to say is, in order to normalize these, in order for your dot products to come out between zero and one, you have to normalize your input space. Here, it looks like the normalization is taking place after the fact to guarantee that you are in a zero to one space, because a dot product is most meaningful when expressed between zero and one, because then you get the component of one vector that lies in the direction of another. And if the two vectors are identical, that's a 1.0. But depending on how you come in with your dot product, if you haven't divided each product by the square root of the sum of the squares of the components, i.e., normalizing them, you'll get numbers that do not fall in the range zero to one, but which you are always able to normalize later, provided you kept all the values around. I apologize if that sounded like... All right, any other thoughts? OK.
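One way to see why the square root of the dimension shows up, as a quick numerical experiment rather than anything stated in the video: for roughly unit-variance vector entries, the spread of a dot product of d-dimensional vectors grows like sqrt(d), so dividing by sqrt(d) keeps the scores in a range where the softmax stays well behaved:

  import numpy as np

  rng = np.random.default_rng(0)
  for d in [16, 128, 1024]:
      a = rng.normal(size=(10000, d))
      b = rng.normal(size=(10000, d))
      dots = (a * b).sum(axis=1)
      # Raw spread grows like sqrt(d); after dividing by sqrt(d) it stays near 1.
      print(d, dots.std(), (dots / np.sqrt(d)).std())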

51:17 - D. B.
Before that, there's one other technical detail that so far I've skipped. During the training process, when you run this model on a given text example, and all of the weights are slightly adjusted and tuned to either reward or punish it based on how high a probability it assigns to the true next word in the passage, it turns out to make the whole training process a lot more efficient if you simultaneously have it predict every possible next token following each initial subsequence of tokens in this passage. For example, with the phrase that we've been focusing on, it might also be predicting what words follow creature and what words follow the. This is really nice because it means what would otherwise be a single training example effectively acts as many. For the purposes of our attention pattern, it means that you never want to allow later words to influence earlier words, since otherwise they could kind of give away the answer for what comes next. What this means is that we want all of these spots here, the ones representing later tokens influencing earlier ones, to somehow be forced to be zero. The simplest thing you might think to do is to set them equal to zero. But if you did that, the columns wouldn't add up to one anymore. They wouldn't be normalized. So instead, a common way to do this is that before applying softmax, you set all of those entries to be negative infinity. If you do that, then after applying softmax, all of those get turned into zero, but the columns stay normalized. This process is called masking. There are versions of attention where you don't apply it, but in our GPT example, even though this is more relevant during the training phase than it would be, say, running it as a chatbot or something like that, you do always apply this masking to prevent later tokens from influencing earlier ones. Another fact that's worth reflecting on about this attention pattern is how its size is equal to the square of the context size. So this is why context size can be a really huge bottleneck for large language models, and scaling it up is non-trivial. As you might imagine, motivated by a desire for bigger and bigger context windows, recent years have seen some variations to the attention mechanism aimed at making context more scalable, but right here, you and I are staying focused on the basics. Take a look at this... Any comments or questions?
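A sketch of the masking step just described: set the entries where a later token would influence an earlier one to negative infinity before the softmax, so they become zero afterward while each distribution still sums to one. (With the rows-as-queries layout used here, that means masking out keys that come after the query.)

  import numpy as np

  def masked_attention_pattern(Q, K):
      d_k = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)
      n = scores.shape[0]
      # Forbid position i from attending to any position j > i (later tokens).
      mask = np.triu(np.ones((n, n), dtype=bool), k=1)
      scores = np.where(mask, -np.inf, scores)
      scores -= scores.max(axis=-1, keepdims=True)
      weights = np.exp(scores)                       # the -inf entries become exactly 0 here
      return weights / weights.sum(axis=-1, keepdims=True)

  rng = np.random.default_rng(1)
  A = masked_attention_pattern(rng.normal(size=(5, 16)), rng.normal(size=(5, 16)))
  print(np.round(A, 2))   # upper triangle is all zeros; each row still sums to 1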

53:32 - V. W.
Yeah, the notion of an earlier or a later token is very language-dependent. And the clarification comes when he says this is more in the training than in the inferencing phase. Per the coffee table example.

54:04 - Unidentified Speaker
Great. This pattern lets the model deduce which words are relevant to which other words. Now you need to actually update the embeddings, allowing words to pass information to whichever other words they're relevant to. For example, you want the embedding of fluffy to somehow cause a change to creature that moves it to a different part of this 12,000-dimensional embedding space that more specifically encodes a fluffy creature. What I'm going to do here is first show you the most straightforward way that you could do this, though there's a slight way that this gets modified in the context of multi-headed attention. This most straightforward way would be to use a third matrix, what we call the value matrix, which you multiply by the embedding of that first word, for example, fluffy. The result of this is what you would call a value vector, and this is something that you add to the embedding of the second word. In this case, something you add to the embedding of creature. So this value vector lives in the same very high-dimensional space as the embeddings. When you multiply this value matrix by the embedding of a word, you might think of it as saying, if this word is relevant to adjusting the meaning of something else, what exactly should be added to the embedding of that

55:16 - E. G.
something else in order to reflect this?
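And a sketch of the value step just described: multiply each embedding by a value matrix W_V, weight those value vectors by the attention pattern, and add the result back onto the embeddings. The sizes, the uniform attention pattern, and all the numbers below are toy stand-ins, not the video's actual parameters:

  import numpy as np

  d_model, n_tokens = 512, 8
  rng = np.random.default_rng(1)

  E   = rng.normal(size=(n_tokens, d_model))          # embeddings entering the attention head
  W_V = rng.normal(size=(d_model, d_model)) * 0.01    # learned value matrix (random stand-in)
  A   = np.full((n_tokens, n_tokens), 1.0 / n_tokens) # pretend attention pattern (uniform here)

  V       = E @ W_V.T    # one value vector per token, same size as the embeddings
  delta_E = A @ V        # attention-weighted sum of value vectors for each position
  E_new   = E + delta_E  # e.g. "creature" nudged toward "fluffy blue creature"

  print(E_new.shape)     # (8, 512): refined embeddings, same shape as before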

55:19 - D. B.
Any comments?

55:20 - E. G.
If that works, you should be able to say, or ask, What do I need to say or what does my prompt need to look like to make it look like this picture? Go from the picture backwards. All right. Any other comments, anyone?

55:44 - V. W.
So you'd associate that with a set of drawing actions that could be applied to specialize the drawing to become more specific to a fluffy creature than just a generic creature. Well, if A equals B and B equals C, A equals C.

56:05 - E. G.
So if we're able to go in and say that these words describe this picture, you should be able to pass it that picture to get the prompt or one very close to it. That assumes an invertibility between the description of a picture and the picture itself.

56:27 - V. W.
And one problem with that invertibility is that for a given description, there could be many, many pictures which satisfy that description. So unless you become incredibly specific, you won't get the picture you want. And this could affect the problem that we discussed earlier of being able to get accurate relational diagrams out of chatbots. And that work, TBD, is a really interesting area. All right, well, you know, we're kind of out of time.

57:00 - D. B.
So what I'm going to do is make a note that we're going to start from this breakpoint, about 13:05, next time we do this.

57:32 - Unidentified Speaker
And actually that'll be in a couple of weeks. Um, so yeah. All right.

57:37 - D. B.
So, um, so it's interesting. We got to that place in 13 minutes where we watched a piece of video and then we discussed it to increase our mutual understanding of it.

57:50 - V. W.
And that gives us this notion of shooting ratio. You shoot a Budweiser commercial 250 times until you get the commercial that you want. And so you have to, in the days of film, spend 250 times as much film as the one length of film that would be used in the ultimate commercial. In our case, we have like a 13-minute progress rate for a 40-minute presentation that's being given by G. S. And so that's our shooting ratio. That's our efficiency of learning. That's 13 parts in 40, which is how fast we as a corporate group are able to explain to our mutual satisfaction what's going on, which is a good number to know because it ultimately limits how much material we can cover in a fixed amount of time. OK. Well, yeah, we definitely are.

58:47 - D. B.
I mean, we don't want to just watch the whole video and then talk about it for sure. All right. So anyway, so next week we'll have G.'s pre-defense rehearsal. And then on the 31st, we'll probably have a discussion about LangChain and CrewAI. And then on February 7th, we'll start from minute 13:05. Or maybe we can go back a little bit and re-familiarize ourselves with some of the earlier stuff. We'll just have to see what people want to do. So thanks, everyone. And I'll see you all next time. Take it easy.

59:28 - Multiple Speakers
And our shooting ratio is 33% for this session.

59:35 - D. B.
All right.

