Machine Learning Study Group
|
Contacts: jdberleant@ualr.edu and mgmilanova@ualr.edu
Agenda & Minutes (148th meeting, Jan. 31, 2025)
Table of Contents
* Agenda and minutes
* Transcript (when available)
Agenda and minutes
- MM's student AA presented his work on an agent-based architecture for RAG. Then MSM presented an agent-based architecture for creating a story with accompanying images and turning it into a video.
- Announcements, updates, questions, presentations, etc.
- Recall the master's project that some students are doing and need our suggestions about:
- Suppose a generative AI like ChatGPT or Claude.ai was used to write a book or content-focused website about a simply stated task, like "how to scramble an egg," "how to plant and care for a persimmon tree," "how to check and change the oil in your car," or any other question like that. Interact with an AI to collaboratively write a book or an informationally near-equivalent website about it!
- BI: Maybe something like "Public health policy." Not present today.
- LG: Thinking of changing to "How to plan for retirement."
- Looking at the CrewAI multi-agent tool (http://crewai.com), but it is hard to customize; now looking at the LangChain platform, which federates different AIs. They call it an "orchestration" tool.
- MM has students who are leveraging agents, and LG could consult with them.
- ET: Gardening (veggies and herbs in particular), specifically growing vegetables from seeds. Using ChatGPT to expand parts of a response, with follow-up questions to expand each smaller item. Planning to try another AI to compare with ChatGPT. These AIs can generate pictures too, if you ask them, but the images aren't necessarily perfect.
- Anything else anyone would like to bring up?
- We are up to 13:05 in the Chapter 6 video, https://www.youtube.com/watch?v=eMlx5fFNoYc and can start there.
- Schedule back burner "when possible" items:
- If anyone else has a project they would like to help supervise, let me know.
- JK proposes complex prompts, etc. (https://drive.google.com/drive/u/0/folders/1uuG4P7puw8w2Cm_S5opis2t0_NF6gBCZ).
- The campus has assigned a group to participate in the AAC&U AI Institute's activity "AI Pedagogy in the Curriculum." IU is on it and may be able to provide occasional updates as they become available, though not every week.
- 1/31/25: There is also an on-campus discussion group about AI in teaching being formed by ebsherwin@ualr.edu.
- Here is the latest on future readings and viewings
- We can work through chapter 7: https://www.youtube.com/watch?v=9-Jl0dxWQs8
- https://www.forbes.com/sites/robtoews/2024/12/22/10-ai-predictions-for-2025/
- https://arxiv.org/pdf/2001.08361
- Computer scientists win Nobel prize in physics! https://www.nobelprize.org/uploads/2024/10/popular-physicsprize2024-2.pdf got an evaluation of 5.0 for a detailed reading.
- We can evaluate https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10718663 for reading & discussion.
- Chapter 6 recommends material by Andrej Karpathy, https://www.youtube.com/@AndrejKarpathy/videos for learning more.
- Chapter 6 recommends material by Chris Olah, https://www.youtube.com/results?search_query=chris+olah
- Chapter 6 recommended https://www.youtube.com/c/VCubingX for relevant material, in particular https://www.youtube.com/watch?v=1il-s4mgNdI
- Chapter 6 recommended Art of the Problem, in particular https://www.youtube.com/watch?v=OFS90-FX6pg
- LLMs and the singularity: https://philpapers.org/go.pl?id=ISHLLM&u=https%3A%2F%2Fphilpapers.org%2Farchive%2FISHLLM.pdf (summarized at: https://poe.com/s/WuYyhuciNwlFuSR0SVEt).
6/7/24: vote was 4 3/7. We read the abstract. We could start it any time. We could even spend some time on this and some time on something else in the same meeting.
Transcript:
ML Discussion Group
Fri, Jan 31, 2025
0:07 - Unidentified Speaker
Hi, everyone. Hello.
1:41 - D. B.
Well, is there anything on the schedule in particular today? I know that Dr. M.'s students were considering telling us about LangChain and CrewAI.
2:00 - M. M.
Yeah, I want to introduce A. is our new PhD student. That is coming to our group. I appreciate that S., yeah, S. A. actually is joining us. I really appreciate the participation of all of you. Welcome, S. Glad to see you again.
2:27 - Multiple Speakers
What did you say? I said, welcome to S.
2:32 - D. B.
Glad to see you again. Yeah, yeah, yeah.
2:37 - M. M.
Thank you so much.
2:39 - Multiple Speakers
Thank you for the invitation.
2:41 - M. M.
I'm so glad to see you. Yes, thank you for coming. We organized with D., with D., we organized this blog and discussion group every Friday at four o'clock. Please come and join us when you are available.
2:58 - Multiple Speakers
Yeah, if anyone who's here and has not been here before would like to be added to the calendar entry, just let me know.
3:08 - D. B.
I'll put your email on the calendar entry, and it'll be on your Google Calendar. Do you want me to give their emails, R.'s email?
3:18 - M. M.
Well, I mean, if she wants to be on the list, I guess.
3:23 - D. B.
I mean, you want to just give it to me now, or they can contact me later? Yeah, please can you admit Okay, what's your email address?
3:37 - S. A.
S-N-W-R. S-N-W-R.
3:59 - M. M.
particularly A. and S. that are here right now, they participate in several hackathons and competitions. Well, they're well trained for large language models and computer vision and multi-modal techniques, generative AI. So one of the that we have with Census Bureau with Dr. T., we suggest to work with multi-agents and was very well accepted the idea. So how do you do then? Can give, I think A. will make a presentation.
4:47 - D. B.
Yeah, that'd be great.
4:49 - M. M.
If you give the access to, Do you want to share a screen?
4:56 - D. B.
I can unshare my screen if you can share or whatever.
5:01 - Multiple Speakers
Yeah, I can share my screen, Dr. B.
5:05 - Unidentified Speaker
Okay.
5:06 - M. M.
Okay, so they will, please feel free, because it's a discussion group, so feel free. It's not so official, official presentation. Feel free to ask questions and of course, Your feedback is very important. There are many, many multi-agent techniques. Probably, A. can share one of the links that we have. But first, he will start with his experience and how he's using and currently how he's using generative AI. And after that, we will show you some video. OK? So, A.
5:48 - A. A.
Yeah. OK, so good afternoon, everyone. I'm A. I'm currently working in this entity resolution project for the U.S. Census Bureau. So this entity resolution project is like matching records between multiple data sets. So if we have some records that represent the same real-world entity, we directly match them across multiple data sets. The problem is not a lot of machine learning or deep learning algorithms are able to do that perfectly. But when the LLM and the multi-agent systems were introduced, we thought that we would extend this research and try it out to check whether it does a good job or not. So for this, I'll be going through the base of multi-agent system and what we added to the whole application. And then I'll be showing a small code snippet on how I did that using TrueAI. So to get started, so the multi-agent systems are equipped with their own large language models. So usually if we have like multiple tasks that we need to do, the single agent large language model system finds it very difficult because the use users' queries are too much for them to handle. So multi-agent systems do a good job where we can segregate some tasks and we'll be able to combine those outputs of different agents and give it back to the user. This actually increases the reliability of the results. Seen that myself. So the important thing here is we need to give them specific instructions so that they can do the tasks on their own. The main part with this multi-agent system is that the prompts that we give should be precise and up to the mark. Otherwise, it makes mistakes along the way. So let me show you how the multi-agent system works. The diagram here shows us that first of all, we'll be sending the query. The user will give the query to the system. So basically what it does is the parsers might contain our data sets, which may be a PDF file or an Excel or a CSV file. And then the third part is we'll do the embeddings on these documents and create a vector database. So the main reason why we create a vector database is that it helps the LLM models look for the semantic meaning for a particular word so that it would be easier for the LLMs to identify the words which have a similar meaning join them together. And then it goes to the multi-agent systems where they have specific instructions on what to do. So in this particular diagram, we have three agents. We have a ranker agent, the reader, and an orchestrator. Each has a different task. The ranker agent will rank the particular Retreat documents and the reader will read through it and check and the orchestrator will summarize the whole document and give it back to the user as a whole response. So this is how a multi-agent system works, but we did not use the same multi-agent systems. We did a different thing with that. I'll be getting back to it.
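[Editor's note: for illustration only, here is a minimal sketch of how a ranker → reader → orchestrator flow like the one in that diagram could be wired up as a graph of nodes, in the style of LangGraph (which comes up later in the presentation). The node bodies are stubs; the talk describes the agents' roles, not their code.]

```python
# Illustrative sketch only: the ranker/reader/orchestrator flow as a LangGraph
# state graph. Node bodies are placeholders, not the project's actual agents.
from typing import List, TypedDict

from langgraph.graph import END, START, StateGraph


class RAGState(TypedDict):
    query: str
    documents: List[str]
    answer: str


def ranker(state: RAGState) -> dict:
    # Placeholder ranking: a real agent would score documents against the query.
    return {"documents": sorted(state["documents"], key=len)}


def reader(state: RAGState) -> dict:
    # Placeholder reading: a real agent would extract candidate answers.
    return {"answer": " ".join(state["documents"][:3])}


def orchestrator(state: RAGState) -> dict:
    # Placeholder orchestration: a real agent would summarize for the user.
    return {"answer": "Summary: " + state["answer"]}


graph = StateGraph(RAGState)
graph.add_node("ranker", ranker)
graph.add_node("reader", reader)
graph.add_node("orchestrator", orchestrator)
graph.add_edge(START, "ranker")
graph.add_edge("ranker", "reader")
graph.add_edge("reader", "orchestrator")
graph.add_edge("orchestrator", END)

app = graph.compile()
result = app.invoke({"query": "link records for J. Smith", "documents": [], "answer": ""})
```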
10:02 - A. A.
The main thing is, even though multi-agent systems are good at doing subtasks, the problem is that once we input a larger file, like we'll be inputting multiple PDFs or CSV files, which will be like a 1 GB of CSV file or 1 GB of PDF file, so that it becomes harder for the LLMs to loop all the data inside that and give us a result, it takes a lot of time and sometimes it will hallucinate. So because of that problem, we added a retrieval augmented generation system. So this retrieval augmented generation system, what it does is, first of all, we do the embeddings from the data which was given to us, and then it will convert them into vectors. The vectors usually capture the semantic meaning of the document, so that it would be easier for the LLM to combine the words later on. But once the vectorization has been done, the data goes to a Retrieval Augmented Generation system so that the LLM can interact with the Retrieval Augmented Generation system and say like, hey, I only need this particular data from the data set. So the RAG system, what it does is it only takes that particular data which is relevant to the context and gives it to the LLM. Now the LLM does not have to loop through all the data which is in the dataset and give us the answer. So by doing this, it reduces a lot of time and increases the reliability of the answers which are provided by the multi-agent system. So that's why we are using a RAG system. Moving on, in our particular project that we are working on, we are using an improvised version of a RAG system which is called as a multi-query RAG system. So usually when this was initially released, we were using a single query retriever RAG system where the user, where the LLM asks the RAG system that they need this particular information from the document and it gives them the, gives only that information. But the problem is the prompts that we give is directly given by the LLM to the RAG system. So there is a high chance that the RAG system does not give additional relevant information on that. So a lot of data is being lost in that particular process. So they came up with a solution where it's a multi-query RAG system. So what it does is, once the LLM gives the query, this particular system comes equipped with another LLM system inside it. So what the LLM does is that the original query which is given by the RAG system. The LLM uses that query and generates multiple queries. I think, I don't know how many generated queries it executes, but the minimum generated query it executes is like five. So these generated queries are then given to the RAG system once again, so that since we are giving like five or six queries, different queries at the same time, the results will, the RAG system will give them different results, but it would be in the same context. So combining all these results, it will give us like a more, give us a broader context for the LLM to work with so that minimal data is lost in this process. So the context we are getting is much higher than using a single query retrieval RAG system. And then these data are given to the multi-agent system to do the task. So I have tried this on our project with the single query and the multi-query retriever, but the multi-query retriever does an even better job. Job by inputting a larger context and giving us a much more better result than a single query retriever RAG system. 
So, yeah, I highly suggest that if we use multi-query RAG system, it is better to capture additional information, additional relevant information, and it would be more accurate than a single query RAG system.
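[Editor's note: a minimal sketch of the single-query vs. multi-query retrieval step described above, using LangChain's MultiQueryRetriever. The sample records, collection name, and model choice are illustrative assumptions, not the project's actual configuration.]

```python
# Hedged sketch: single-query vs. multi-query RAG retrieval with LangChain.
# Sample records, collection name, and model choice are illustrative only.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

record_chunks = [
    "Record 17: name=J. Smith, address=123 Main St, Little Rock AR",
    "Record 94: name=John Smith, addr=123 Main Street, LR Arkansas",
]

# Embed the (already chunked) records and build the vector database.
vectordb = Chroma.from_texts(
    texts=record_chunks,
    embedding=OpenAIEmbeddings(),
    collection_name="census_records",
)

# Single-query retrieval: one query string goes straight to the vector store.
single_query_retriever = vectordb.as_retriever(search_kwargs={"k": 5})

# Multi-query retrieval: an inner LLM rewrites the query into several variants,
# retrieves documents for each, and returns the deduplicated union.
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
)

docs = multi_query_retriever.invoke("records that match 'J. Smith, 123 Main St'")
```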
15:10 - A. A.
Yeah, so let me just stop my screen for a moment and I'll share a small code snippet and I'll explain that in just a second. Yeah, so here, yeah, so right now we are for our entity resolution task, we are using an agent framework. So in this agent framework we will like, Assign roles for multiple agent. So here in this this first agent is a direct record linkage agent. So the task for this agent is that it will see look through multiple data sets and if it like directly. Directly matches with another data set or by address, it will match it. So we have like multiple agents doing tasks. And in this agent's file, we'll just define the role and their goal for those particular agent. And in the task, we will give them precise instructions on what to do. We'll also provide them with examples so that they can understand the context really well and then do the task with much more preciseness. So in Crue AI, it is very much easier to add the agents and allocate them specific tasks. So we just need to write these prompts and it is much easier to implement it with Crue AI. But while executing this project, The problem with Crue AI is that it is very much better in using text as an input and generating its own answers. But when we use our own data set and try to do a specific task by allowing these agents to read our data and give us the output, it is not doing a good job. Over time, it is like these agents are combining themselves together and then they are trying to repeat itself over and over again. So, Crue AI is not really working in this particular task as we expected. So, we are currently moving towards LangChain. Specifically, they have a library called LangGraph, which is an agent architecture as well, which contains the nodes and edges. So it is supposed to be much more better than TrueAI, but the problem is it is a bit harder to implement with LangChain. TrueAI offers a much easier implementation where we can just plug and play, but in LangChain, it is a bit complicated if we execute it, but it is much better than Crue AI. That's what we know. That's what we can know. Right now we are trying to change the whole system to use LangChain multi agent system so that we need to check their results as well. I hope it is better than Crue AI.
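[Editor's note: a minimal sketch of how an agent and its task are typically declared in CrewAI (rendered as "Crue AI" / "TrueAI" in the transcript). The role, goal, and task wording here are illustrative placeholders, not the project's actual prompts.]

```python
# Hedged sketch: declaring a direct record-linkage agent and task in CrewAI.
# The role/goal/task wording is a placeholder, not the project's actual prompts.
from crewai import Agent, Crew, Task

direct_linkage_agent = Agent(
    role="Direct record linkage agent",
    goal="Match records across datasets that share the same name or address.",
    backstory=(
        "You compare rows from multiple datasets and link records that clearly "
        "refer to the same real-world entity."
    ),
)

linkage_task = Task(
    description=(
        "Given the retrieved record snippets below, list every pair of records "
        "that refers to the same entity, with a one-line justification.\n\n"
        "{retrieved_context}"
    ),
    expected_output="A list of matched record pairs, each with a justification.",
    agent=direct_linkage_agent,
)

crew = Crew(agents=[direct_linkage_agent], tasks=[linkage_task])
result = crew.kickoff(inputs={"retrieved_context": "<records from the RAG step>"})
```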
18:49 - Unidentified Speaker
Yep.
18:49 - Multiple Speakers
A. and Dr. M. should we wait for questions until the presentation is over?
18:55 - A. A.
or can we ask questions during the presentation? Yeah, I'm actually done with my presentation. So I'm open for questions right now.
19:08 - Unidentified Speaker
Yeah.
19:09 - Y. P.
Okay. Yeah, the second.
19:11 - A. A.
Yeah. Yeah.
19:12 - Unidentified Speaker
Yes.
19:13 - Multiple Speakers
So then should I go ahead and ask questions?
19:18 - Y. P.
Yes, please. Okay. So I'll start with your last slide. Crue AI versus LangChain. And you mentioned you're finding something difficult or something is not easy. I don't remember the exact words, but you had some difficulties. I would like to know what difficulty you're facing in LangChain. So that if there is anything I could do to help you out.
19:46 - Multiple Speakers
Yeah, I'd not say difficulty. I'm like, I've not implemented yet.
19:50 - A. A.
I was just saying that Crue AI is much easier to implement than LangChain. LangChain might be a bit complicated. That's what I said. If you face any difficulty, let me know.
20:02 - Y. P.
Either me or somebody can help you out if at all you need any help. Yeah, sure, sure.
20:10 - Multiple Speakers
My name is Y. P. or people call me Y.
20:14 - Y. P.
So I wanted to ask you. Thanks a lot.
20:17 - Multiple Speakers
I have a couple of more questions.
20:21 - Y. P.
if I can ask. So, when you're saying agents, agents in Glue AI, agents in other framework, are you actually have, actually, let me start with the first question. When it comes to your use case, are you putting all the data in this framework? Or do you have kind of a segregation process? Yeah. Sorry.
20:52 - A. A.
Yeah, I'm actually getting the data and converting them to chunks. And then those chunks go into the embedding process. And then it goes to the vectors.
21:06 - Y. P.
Yeah. Not that well, I'll explain the other way. So census data is large data, and there are multiple data sets. And see, if you think from processing standpoint, and you were yourself saying that, the traditional rule-based system will work much better than the machine learning or pre-GPT, non-GPT AI engines. And then GPT AI engines will consume the maximum processing power or energy and all that. So when you are doing that, like the way you are thinking of multi-agents within this, have you thought of, now I know you are experimenting might be only on RAG and agents, but is there a scope, Dr. M, also that might be a question for you, where actually you are saying processing-wise, we have eliminated records that really do not need to go through this, where there is clear match, so we are removing that, and then remaining data sets only where it cannot be matched or something, you are putting it for learning. Yeah. So are you doing that or no? Or have you thought of doing that?
22:26 - M. M.
Yes, but he mentioned if there is first step is direct match. This is what you're talking about. Something that is really obvious, we don't continue investigating more. So, we will do this step. We remove what is really obvious and start digging, you know, deeper. Yeah.
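[Editor's note: a minimal sketch of the "remove the obvious direct matches first" step described here, assuming simple CSV inputs with name and address columns. Column and file names are illustrative; as noted below, the project itself is aiming for an LLM-only approach.]

```python
# Hedged sketch of a direct-match pre-filter; column and file names are assumed.
import pandas as pd

df_a = pd.read_csv("dataset_a.csv")
df_b = pd.read_csv("dataset_b.csv")

# Normalize the fields used for the direct comparison.
for df in (df_a, df_b):
    df["key"] = (
        df["name"].str.lower().str.strip() + "|" + df["address"].str.lower().str.strip()
    )

# Obvious matches: identical normalized name + address in both datasets.
direct_matches = df_a.merge(df_b, on="key", suffixes=("_a", "_b"))

# Only the remaining, harder records would go on to the LLM / multi-agent step.
unmatched_a = df_a[~df_a["key"].isin(direct_matches["key"])]
unmatched_b = df_b[~df_b["key"].isin(direct_matches["key"])]
```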
22:48 - A. A.
I mean, the thing is, the census data that we got, it's like a very dirty data set. So, in the column called names, there will be the address. In the address, there will be some identification numbers. Pretty jumbled. So what we are trying to do is that we are trying to automate it without any rule-based approach using LLMs alone.
23:18 - Y. P.
So that's what we are trying to do.
23:21 - Multiple Speakers
Thank you for the explanation.
23:24 - M. M.
Yeah, we actually published several papers using transformer, you know, this is something new with multi-agents, but we actually proposed and already proved that with large language models and particularly transformer and explainable transformers give us very, very good results. So we already use for census and we're the first actually university that suggest them, you know, to use the large language models. They reject so many times, but right now they are asking us the court and they approve our work already. So I want to share again, D., with you probably, the new code that I have for this semester for everybody that is in our group to use NVIDIA courses for free. Because they change every semester, they change the code. We have a new code right now. Well, because this is how A. and several of my students are using Lama. Yeah, so we have actually another presentation. And A., you want to show the
24:46 - A. A.
I think, yeah, S. will show that right now.
24:51 - Multiple Speakers
S. will show another too? Yeah.
24:55 - M. M.
Yeah, I like, we have another. Presentation which S. can show.
25:01 - M. S. M.
I wanted to show the N8n platform from YouTube as we have not started working with N8n yet. I think it's better if we show everyone the YouTube video that explains how good N8n is in regards to codeless programming for AI agents. I love it this one but do you want to show first the Storytelling the storytelling Storytelling I think it's better if we show the storytelling in another another day Why we have so many people oh my goodness you have
25:39 - M. M.
the video please prepare the video Multi-agent that then you can you can you can go ahead with that. I'll show the storytelling.
25:49 - Multiple Speakers
Let me just grab the video Okay, grab the storytelling So when you share a screen to show the video, you need to checkmark that it's optimized for video or something like that.
26:05 - D. B.
In Zoom, there's a checkmark. Otherwise, the video won't come out. So when you go to share the screen, there's a special place to check that it's a video.
26:18 - M. M.
Can you see it?
26:20 - Unidentified Speaker
Yeah.
26:22 - M. M.
This is one of the recommended. I have a list of 10 most popular, and this framework is actually without code. Using Telegram, which is an app you can use on your computer.
26:36 - D. B.
No, it's not coming. I'm not seeing a video.
26:40 - Y. P.
Yeah.
26:41 - Multiple Speakers
I think it's in another screen or document.
26:44 - D. B.
I see your Zoom launch meeting screen, so it's the wrong window. Or the wrong tab.
26:52 - Unidentified Speaker
Or if you send, okay, there you go, you're sharing. That looks like a video, yes. Yeah, this is the video.
27:06 - M. M.
And S., please prepare our video, it's very good. Agents are taking over right now. So in this video, I'm going to show you how you can automate anything. Can you hear it? I hope so. Yeah. Yes, ma'am. Yeah, yeah, we can hear. I'm going to be showing you a couple of examples of how you can add tools to an AI agent, have a conversation with a bot using Telegram, which is an app you can use on your computer, on your phone, on the go. And this is multimodal. So I can type into it, I can speak into it, whether I'm on my phone, on my desktop, and I can give it requests like it's my and it can go out and do things for me. Let me show you a couple of examples of how this actually works, because as you can see, I have a bunch of things connected. I have Google Calendar, I have Airtable, I have Gmail. Things are all over the place, but this AI agent can manage your tools and then decide which one to use on its own. So you give it basic natural language queries, and it comes back after using all your tools with all of your info, and it actually gives you a good response, schedules things for you, can email people, update your CRM, search your CRM, do so many things. Now this is just the tip of the iceberg. If I go to my Google Calendar, as you can see, tomorrow I have a discussion with J. and AI brainstorming. Discussion with J. is two to three, AI brainstorming eight to nine. If I ask Telegram something like, what's on the schedule for tomorrow? And I could send that off, use very natural, small language, because AI is really good at interpreting that and understanding that. What it's going to do is return schedule that I just showed you in a matter of seconds. As you can see, here's your schedule for tomorrow. Morning routine, discussion with J., AI brainstorming. We go to my calendar, this is exactly what I've got tomorrow. Morning routine, discussion with J., AI brainstorming. I could now add an event or something like that. I could be on the go, remember. I could just use my voice. I could say, can you add an event around 11 p.m. to 11:30 p.m. that goes over AI agent discussions with C. You know, horrible. I'm talking even horribly to it. And it's going to parse that, transcribe the recording, and then create an event for me tomorrow after I gave my schedule. In a matter of seconds, guys. So it says 11 to 11:30 p.m. agent discussions with C. tomorrow scheduled. Go down here, boom, AI agent discussions with C. scheduled. I could then ask it questions about certain people. So maybe C. is in my CRM. I could say, return info about C. to me from my CRM. And I could, you know, have it do multiple things for me. So now I'm even going deeper down the rabbit hole. Event's been created, and then I asked for information about C., and now it's giving me things like his name, his email, his company name, notes about him. So now with his email, I can even, you know, use my emailing tool in order to email him the event details. So things are moving very quickly. I created this thing in 45 minutes, but it might be more difficult for you. So that's why I'm here in order to teach you. And in this video, you're going to learn lessons that are much greater than just, you know, putting together a template that you forget about. You're going to be learning things like how to set up a telegram trigger, how to craft amazing instructions like this. And look, watching a video of me building all this, it's going to help you. 
You're going to learn some stuff, but if you want to learn in a more structured way, you want just a full guide so that you can whip things up like this in 45 minutes for any use case that you desire, then I highly, highly recommend joining our AI Foundations community. This network is hyper-focused on building AI agents right now, and we've also just released a full course on agent building within N8N. And this course literally takes you from beginner to pro in N8N, literally setting up your workspace all the way to crafting agentic systems like the one that I'm about to build you today. Now, it's important to have the knowledge in order to do this on your own, and in order to thrive in a world full of agents, you're gonna wanna know how to build these. So that's why we made this in-depth course for you to actually learn the tactics of building agents rather than just relying on other people giving you tutorials and templates. We want to give you the actual way to build them. And that's what this course offers. We go in-depth on so many different topics and also the community, the network, the live calls, everything you get in here is just going to excel your AI agent building abilities. So I'll leave a link in the top comment and the description yeah I think that yeah we just grab this template and go over continue uh do you want me to share the video here we have it a prompt helping us create our yes yeah sure yeah so I think that this is free and N8N is free I tried little bit.
32:00 - Unidentified Speaker
I like it because it's no no cotton good. So maybe S. is ready.
32:09 - M. M.
When you know how to how I can I I can stop sharing and S. can share but I can show you the uh thank you the Video and I have from Medium a lot of leaks. They give us for the best for the best frameworks for multi agents but this is the video it's a long so I just show you I'm I'm excited. I need some help for calendar at least somebody to to help me with the calendar, the reminder and stuff like this. And can do whatever job you want, you know, summarize papers, searching for a ticket for you, whatever you want. S., are you ready to show, please?
33:15 - M. S. M.
Yeah, Dr. M., I'm ready. I can show the video that me and my friend, basically I., we created a video for NVIDIA AI competition. So we used the multi-agent NVIDIA and NVIDIA guardrails to create a storytelling project. So it was basically the storytelling was done by agents and then the scripts that was derived from the storytelling. I used those to create images that goes according with the story. And at the end, I have parsed all the images together and then used a text-to-audio tool to basically generate audio for the file. So if you guys want, I can show the video of which I. posted on his LinkedIn. So I can show you the video. We basically, this was our project. Let me share. So this is my friend I. He's basically presenting this and we worked on this together. Everyone, this is me, I. presenting on our entry for the NVIDIA AI Generative Contest and our project today is a story production crew, which basically generates videos of children's stories from just a text prompt. Here you can see I'm putting an input for a story about a monkey, a dog, and a dragon. And you can see our elements will generate a story from this prompt. So what we do here basically is we create three different separate types of agents, one responsible for detailing the scene, the characters and then writing the scripts and the story. So from the story, we generate all the other information and then we pass that on through NVIDIA guardrails and we generate both images and text. And then the text is converted to audio. And then we combine the generated images and the audio to make a video. And here you can see we got the output and soon we'll demo the video. That was produced. So enjoy. Meet M., a friendly dog. During one of his explorations, they quickly become friends and enjoy playing hide-and-seek. M. tells M. about a legendary dragon that lives deep in the forest. Seeking M.'s curiosity, determined to meet the dragon despite M.'s initial hesitation, they embark on an adventure together. They eventually find the dragon, E., who has emerald scales and a fiery mane. Though E.'s initial roar is intimidating, M. and M. stand their ground and explain their desire to be friends. Touched by their bravery, E. reveals that he used to have many friends, but humans grew to fear him over time. E., delighted to find new friends who aren't afraid of him, offers to teach them about fire breathing, wingspans, and the importance of friendship. The three of them go on exciting adventures, exploring caves, chasing butterflies, and having campfires. Through their journey, M. and M. learn that true friendship can bridge any difference, and E. finds joy in their companionship.
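[Editor's note: a minimal sketch of the final assembly step described above, where per-scene images and narration audio are combined into one video. It uses the moviepy 1.x API; the file names are illustrative, and the project's actual tooling may differ.]

```python
# Hedged sketch: stitch per-scene images and narration audio into a single video.
# Uses the moviepy 1.x API; file names are illustrative placeholders.
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

scenes = [
    ("scene_01.png", "scene_01.mp3"),  # one image + one narration clip per segment
    ("scene_02.png", "scene_02.mp3"),
]

clips = []
for image_path, audio_path in scenes:
    narration = AudioFileClip(audio_path)
    # Hold each image on screen for exactly as long as its narration lasts.
    clip = ImageClip(image_path).set_duration(narration.duration).set_audio(narration)
    clips.append(clip)

concatenate_videoclips(clips, method="compose").write_videofile("story.mp4", fps=24)
```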
37:06 - Unidentified Speaker
OK.
37:07 - M. M.
Thank you. Thank you. OK.
37:09 - Multiple Speakers
So the problems that I'm facing now currently with this project, so they can be divided into three parts. So the first would be parsing.
37:21 - M. S. M.
So the parsing part is for each individual story segment, I have to basically the code gives us different images. So I'm still working on systems so that you don't have to compile each segment of the text to get an image. Instead, we get all the images together. And the second part of the problem that I'm facing is there's consistency on image generation. As you've seen before, in the images, the breed of the dog, the color of the dog, or the background, or any other characters, they don't seem to be consistent. So there was a lot of manual, manual processing involved when I actually fixed it. And also I am trying to figure out how we can keep this parsing consistent, this image generation consistent across all the images, because I'm still facing this problem and I haven't found any, or maybe I haven't looked hard enough. I haven't found any, let's just say there's this one possible solution, one simple solution which I can follow and I can solve this problem. I haven't found that any. And the third one would be, as you've seen, this was generated by stable diffusion. And this was done in the code manually. I would also like to apply this text in DALL-E for image generation and compare what happens if we use it in the DALL-E platform and the Mid-Journey platform. And what happens if we don't use those platforms, but instead use the API keys for the project. So these three are the challenges I'm now facing and currently working with. Thank you so much, Dr. M., for giving me the opportunity to show this video.
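[Editor's note: one common, partial mitigation for the consistency problem described here is to reuse a fixed character description and a fixed random seed for every scene. A minimal sketch with Hugging Face diffusers follows; the model name, prompts, and seed are illustrative assumptions, and this does not fully solve cross-image consistency.]

```python
# Hedged sketch: fixed character description + fixed seed across scenes as a
# partial consistency aid. Model, prompts, and seed are illustrative only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Keep the character description identical so every scene is conditioned on
# the same appearance.
character = "a small brown beagle with a red collar, children's storybook style"
scene_prompts = [
    f"{character}, playing hide-and-seek in a sunny forest",
    f"{character}, meeting a dragon with emerald scales at a cave entrance",
]

images = []
for prompt in scene_prompts:
    generator = torch.Generator("cuda").manual_seed(42)  # same seed for each scene
    images.append(pipe(prompt, generator=generator).images[0])
```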
39:24 - M. M.
Sure, sure. And we want help from all all of you in this project, you know, the agents can do multiple things like I show you. So any ideas that you have, and I mean, all kinds of companies can benefit about.
39:42 - D. B.
I have a couple of questions. So how much code, actual, you know, code did you have to write to create that movie?
39:53 - M. S. M.
There was quite a bit of coding involved. I mean, for the image generation, there was the normal stable diffusion, your run-of-the-mill stable diffusion coding that I got from a GitHub project and I repurposed it and also I did some of my own coding for that.
40:13 - D. B.
Okay, how many lines would you estimate?
40:16 - M. S. M.
That was more than at least 70, 80, not more than 150.
40:21 - D. B.
Okay, well, that's not that much code for generating a whole movie, right?
40:26 - M. S. M.
I mean, it's not generating a whole movie per se, Dr. B. It's basically generating images. And keeping consistency in those images is what I'm facing problems now. As you can see, there was a lot of manual tweaking involved. And also, at one point, I wasn't getting one of the images right. So I had to use the, you know, the website the DALL-E website to basically fix it. So the DALL-E website has this one, this markup tool where you can mark in the image and it will fix the image for you however you want. Like there was one part of the image that the background of the image that the forest, the forest wasn't coming up right as much as good as I want. So I had to do that manually. So you can see it's still a work in progress and there's still a lot of, you know, tinkering that needs to be done in this case.
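[Editor's note: the masked fix-up described here, marking a region of an image and asking DALL-E to regenerate it, can also be scripted. A minimal sketch against the OpenAI images edit endpoint follows, assuming DALL-E 2 style inpainting; the file names and prompt are illustrative.]

```python
# Hedged sketch: masked image editing (inpainting) via the OpenAI API, the
# scripted equivalent of the DALL-E website's markup tool. Paths/prompt assumed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.edit(
    model="dall-e-2",
    image=open("scene_03.png", "rb"),      # original generated scene
    mask=open("scene_03_mask.png", "rb"),  # transparent where the fix should go
    prompt="a dense green forest background behind the same dog and monkey",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)
```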
41:21 - D. B.
So I mean, did you have one agent generate a story and another agent request images for each segment of the story? Exactly, exactly. And did anyone else have any questions about this? I mean, I don't want to overwhelm, ask all the questions.
41:40 - Y. P.
Ben, I'm always a man with questions. If you have questions. You have a bunch of questions. No, no, I was like I have a man always with questions, but if you have more questions, finish them. I can.
41:57 - D. B.
I can hold my question if we can go back. You had it.
42:02 - M. M.
You showed a diagram showing the agents structure. Yeah, yeah, and every agent the prompt is with language. We're using Crue AI. In this case, and OpenAI. And only the, something with NVIDIA is this guardians, but guardians is just about the security, that the text that the agent generates is not something to hurt people or hurt the child, child, or something. You know, this is the control, security of the text. Can you go to the diagram really? He has three different agents because the initial query is just a story about these characters. So everything else is computer generated. S., can you please show three agents?
42:57 - M. S. M.
Sure, Fany.
42:59 - Multiple Speakers
So everything is computer generated. We didn't put any words. Any. And it's generating. I think one is generating the script. We have a good description.
43:13 - M. M.
Okay, yeah. So the scene generator, character, and the scriptwriter. See, they are the three. So the scene, I think that has all the text. Character description, how they look like, you know? And a script writer is, that they exchange. So every time to create the new image, we're using the new text, the dialogue.
43:47 - D. B.
Yeah.
43:47 - M. M.
So we have an audio and image and combining the work that S. is doing is actually generating the images when we will come to every scene, every frame, what people, the dialogue, the text that they talk. I have the whole document, but I think I., not many people participate. I. presents the whole document, what is included, how the text is generated in each part. And we like it so much. And actually, it can be a game. It can be a game. There are games right now like this.
44:44 - Unidentified Speaker
Any questions? We almost finished, yeah. Yeah, hi.
44:48 - Y. P.
The question that I had was around the inconsistency that you were finding on images.
44:56 - Unidentified Speaker
Yes.
44:56 - Y. P.
Now, I have not personally played with video creations and this kind of effort. My work mainly has been on the data side, but similar to the earlier presentations, is there a possibility of building a RAG model? And have you thought of putting these images the way you want somewhere and call a capability or these images each time when you're doing saying that, hey, you have done this, but I would like you to use this. So is there a possibility of controlling these inconsistencies using a RAG framework or similar framework?
45:43 - M. S. M.
Yeah, that is a great advice. Actually, we are trying to do that. I mean, we're still trying to work or make a workaround. As you know, this there wasn't much RAG involved here. There were agents, but not a lot of RAG. So I think we are trying to work around a frame where we can apply RAG into it.
46:11 - M. M.
Well, to copy, to save, like you suggest, maybe a first generated appearance of these characters, we can copy and save it and give that to keep yes and to give the task to keep this description one agent needs to do exactly this role to keep the description use exactly exactly this image exactly this appearance we can do this and this is great suggestion yeah yeah kind of like a feedback loop correct and map I have one more I
46:52 - Y. P.
did not, I mean, I saw the video, but I was driving while I was seeing, but there is also a possibility where you might have to creep one, two, three, four, and tag those images and tag those images with
47:10 - Unidentified Speaker
background or script or something.
47:12 - Y. P.
So that when you're calling that, because one image perhaps may not be different because the color can change. Change the expression can change and etc so you might have to create literally a database of either hey this is the script or if this is the background this is what I want something like that I mean that was was running in my mind but maybe I'm going to deeper first you have to find the framework and then you have to suggest how you're going to build the framework but those are some of the questions or thoughts that came to my mind and I wanted to share with you all. Thank you.
47:52 - M. S. M.
That is actually a great discussion also. But when we're using stable diffusion or any kind of image generation code segment, there's this notion that we think, that preconceived notion that the images that we are going to get after the generation, that's already basically trained on a thousand of images before. So in In our mind, we did not have that kind of idea to keep a database, but yeah, that can be done. But then I think the workaround will be a bit time-consuming. For a small project, it would be time-consuming, but it is a great advice if we actually can do that. I'll have to look more into it. Thank you so much for your service.
48:44 - Multiple Speakers
Actually, I. extended this. I think that he put multiple different solutions.
48:50 - M. M.
So we can invite I. if it's interesting for the group to give us the feedback, what is new right now. Yeah, why not?
49:01 - D. B.
Yeah, this is the initial step.
49:04 - M. M.
This was six months ago, I think, or more. So it's not new. So we have a new update right now.
49:15 - Unidentified Speaker
Wow.
49:15 - D. B.
Okay. Yeah. Great. Well, thanks. Uh, thank you all for attending and thank you to the presenters for bringing that information. Uh, that was quite interesting. Um, any last comments before we adjourn?
49:35 - D. D.
Thank you guys. Thank you.
49:37 - M. M.
Great presentation. Yeah. Thank you.
49:40 - D. B.
All right. You all have a good weekend.
49:48 - Unidentified Speaker
See you next week. Thank you. See you next week. Thank you. Thank you. Bye. Thank you all.
Conversation opened. 1 unread message.
ML Discussion Group
Fri, Jan 31, 2025
0:07 - Unidentified Speaker
Hi, everyone. Hello.
1:41 - D. B.
Well, is there anything on the schedule in particular today? I know that Dr. M.'s students were considering telling us about Lang chain and crew AI.
2:00 - M. M.
Yeah, I want to introduce A. is our new PhD student. That is coming to our group. I appreciate that S., yeah, S. A. actually is joining us. I really appreciate the participation of all of you. Welcome, S. Glad to see you again.
2:27 - Multiple Speakers
What did you say? I said, welcome to S.
2:32 - D. B.
Glad to see you again. Yeah, yeah, yeah.
2:37 - M. M.
Thank you so much.
2:39 - Multiple Speakers
Thank you for the invitation.
2:41 - M. M.
I'm so glad to see you. Yes, thank you for coming. We organized with D., with D., we organized this blog and discussion group every Friday at four o'clock. Please come and join us when you are available.
2:58 - Multiple Speakers
Yeah, if anyone who's here and has not been here before would like to be added to the calendar entry, just let me know.
3:08 - D. B.
I'll put your email on the calendar entry, and it'll be on your Google Calendar. Do you want me to give their emails, R.'s email?
3:18 - M. M.
Well, I mean, if she wants to be on the list, I guess.
3:23 - D. B.
I mean, you want to just give it to me now, or they can contact me later? Yeah, please can you admit Okay, what's your email address?
3:37 - S. A.
S-N-W-R. S-N-W-R.
3:59 - M. M.
particularly A. and S. that are here right now, they participate in several hackathons and competitions. Well, they're well trained for large language models and computer vision and multi-modal techniques, generative AI. So one of the that we have with Census Bureau with Dr. T., we suggest to work with multi-agents and was very well accepted the idea. So how do you do then? Can give, I think A. will make a presentation.
4:47 - D. B.
Yeah, that'd be great.
4:49 - M. M.
If you give the access to, Do you want to share a screen?
4:56 - D. B.
I can unshare my screen if you can share or whatever.
5:01 - Multiple Speakers
Yeah, I can share my screen, Dr. B.
5:05 - Unidentified Speaker
Okay.
5:06 - M. M.
Okay, so they will, please feel free, because it's a discussion group, so feel free. It's not so official, official presentation. Feel free to ask questions and of course, Your feedback is very important. There are many, many multi-agent techniques. Probably, A. can share one of the links that we have. But first, he will start with his experience and how he's using and currently how he's using generative AI. And after that, we will show you some video. OK? So, A.
5:48 - A. A.
Yeah. OK, so good afternoon, everyone. I'm A. I'm currently working in this entity resolution project for the U.S. Census Bureau. So this entity resolution project is like matching records between multiple data sets. So if we have some records that represent the same real-world entity, we directly match them across multiple data sets. The problem is not a lot of machine learning or deep learning algorithms are able to do that perfectly. But when the LLM and the multi-agent systems were introduced, we thought that we would extend this research and try it out to check whether it does a good job or not. So for this, I'll be going through the base of multi-agent system and what we added to the whole application. And then I'll be showing a small code snippet on how I did that using TrueAI. So to get started, so the multi-agent systems are equipped with their own large language models. So usually if we have like multiple tasks that we need to do, the single agent large language model system finds it very difficult because the use users' queries are too much for them to handle. So multi-agent systems do a good job where we can segregate some tasks and we'll be able to combine those outputs of different agents and give it back to the user. This actually increases the reliability of the results. Seen that myself. So the important thing here is we need to give them specific instructions so that they can do the tasks on their own. The main part with this multi-agent system is that the prompts that we give should be precise and up to the mark. Otherwise, it makes mistakes along the way. So let me show you how the multi-agent system works. The diagram here shows us that first of all, we'll be sending the query. The user will give the query to the system. So basically what it does is the parsers might contain our data sets, which may be a PDF file or an Excel or a CSV file. And then the third part is we'll do the embeddings on these documents and create a vector database. So the main reason why we create a vector database is that it helps the LLM models look for the semantic meaning for a particular word so that it would be easier for the LLMs to identify the words which have a similar meaning join them together. And then it goes to the multi-agent systems where they have specific instructions on what to do. So in this particular diagram, we have three agents. We have a ranker agent, the reader, and an orchestrator. Each has a different task. The ranker agent will rank the particular Retreat documents and the reader will read through it and check and the orchestrator will summarize the whole document and give it back to the user as a whole response. So this is how a multi-agent system works, but we did not use the same multi-agent systems. We did a different thing with that. I'll be getting back to it.
10:02 - A. A.
The main thing is, even though multi-agent systems are good at doing subtasks, the problem is that once we input a larger file, like we'll be inputting multiple PDFs or CSV files, which will be like a 1 GB of CSV file or 1 GB of PDF file, so that it becomes harder for the LLMs to loop all the data inside that and give us a result, it takes a lot of time and sometimes it will hallucinate. So because of that problem, we added a retrieval augmented generation system. So this retrieval augmented generation system, what it does is, first of all, we do the embeddings from the data which was given to us, and then it will convert them into vectors. The vectors usually capture the semantic meaning of the document, so that it would be easier for the LLM to combine the words later on. But once the vectorization has been done, the data goes to a Retrieval Augmented Generation system so that the LLM can interact with the Retrieval Augmented Generation system and say like, hey, I only need this particular data from the data set. So the RAG system, what it does is it only takes that particular data which is relevant to the context and gives it to the LLM. Now the LLM does not have to loop through all the data which is in the dataset and give us the answer. So by doing this, it reduces a lot of time and increases the reliability of the answers which are provided by the multi-agent system. So that's why we are using a RAG system. Moving on, in our particular project that we are working on, we are using an improvised version of a RAG system which is called as a multi-query RAG system. So usually when this was initially released, we were using a single query retriever RAG system where the user, where the LLM asks the RAG system that they need this particular information from the document and it gives them the, gives only that information. But the problem is the prompts that we give is directly given by the LLM to the RAG system. So there is a high chance that the RAG system does not give additional relevant information on that. So a lot of data is being lost in that particular process. So they came up with a solution where it's a multi-query RAG system. So what it does is, once the LLM gives the query, this particular system comes equipped with another LLM system inside it. So what the LLM does is that the original query which is given by the RAG system. The LLM uses that query and generates multiple queries. I think, I don't know how many generated queries it executes, but the minimum generated query it executes is like five. So these generated queries are then given to the RAG system once again, so that since we are giving like five or six queries, different queries at the same time, the results will, the RAG system will give them different results, but it would be in the same context. So combining all these results, it will give us like a more, give us a broader context for the LLM to work with so that minimal data is lost in this process. So the context we are getting is much higher than using a single query retrieval RAG system. And then these data are given to the multi-agent system to do the task. So I have tried this on our project with the single query and the multi-query retriever, but the multi-query retriever does an even better job. Job by inputting a larger context and giving us a much more better result than a single query retriever RAG system. 
So, yeah, I highly suggest that if we use multi-query RAG system, it is better to capture additional information, additional relevant information, and it would be more accurate than a single query RAG system.
15:10 - A. A.
Yeah, so let me just stop my screen for a moment and I'll share a small code snippet and I'll explain that in just a second. Yeah, so here, yeah, so right now we are for our entity resolution task, we are using an agent framework. So in this agent framework we will like, Assign roles for multiple agent. So here in this this first agent is a direct record linkage agent. So the task for this agent is that it will see look through multiple data sets and if it like directly. Directly matches with another data set or by address, it will match it. So we have like multiple agents doing tasks. And in this agent's file, we'll just define the role and their goal for those particular agent. And in the task, we will give them precise instructions on what to do. We'll also provide them with examples so that they can understand the context really well and then do the task with much more preciseness. So in Crue AI, it is very much easier to add the agents and allocate them specific tasks. So we just need to write these prompts and it is much easier to implement it with Crue AI. But while executing this project, The problem with Crue AI is that it is very much better in using text as an input and generating its own answers. But when we use our own data set and try to do a specific task by allowing these agents to read our data and give us the output, it is not doing a good job. Over time, it is like these agents are combining themselves together and then they are trying to repeat itself over and over again. So, Crue AI is not really working in this particular task as we expected. So, we are currently moving towards LangChain. Specifically, they have a library called LangGraph, which is an agent architecture as well, which contains the nodes and edges. So it is supposed to be much more better than TrueAI, but the problem is it is a bit harder to implement with LangChain. TrueAI offers a much easier implementation where we can just plug and play, but in LangChain, it is a bit complicated if we execute it, but it is much better than Crue AI. That's what we know. That's what we can know. Right now we are trying to change the whole system to use LangChain multi agent system so that we need to check their results as well. I hope it is better than Crue AI.
18:49 - Unidentified Speaker
Yep.
18:49 - Multiple Speakers
A. and Dr. M. should we wait for questions until the presentation is over?
18:55 - A. A.
or can we ask questions during the presentation? Yeah, I'm actually done with my presentation. So I'm open for questions right now.
19:08 - Unidentified Speaker
Yeah.
19:09 - Y. P.
Okay. Yeah, the second.
19:11 - A. A.
Yeah. Yeah.
19:12 - Unidentified Speaker
Yes.
19:13 - Multiple Speakers
So then should I go ahead and ask questions?
19:18 - Y. P.
Yes, please. Okay. So I'll start with your last slide. Crue AI versus LangChain. And you mentioned you're finding something difficult or something is not easy. I don't remember the exact words, but you had some difficulties. I would like to know what difficulty you're facing in LangChain. So that if there is anything I could do to help you out.
19:46 - Multiple Speakers
Yeah, I'd not say difficulty. I'm like, I've not implemented yet.
19:50 - A. A.
I was just saying that Crue AI is much easier to implement than LangChain. LangChain might be a bit complicated. That's what I said. If you face any difficulty, let me know.
20:02 - Y. P.
Either me or somebody can help you out if at all you need any help. Yeah, sure, sure.
20:10 - Multiple Speakers
My name is Y. P. or people call me Y.
20:14 - Y. P.
So I wanted to ask you. Thanks a lot.
20:17 - Multiple Speakers
I have a couple of more questions.
20:21 - Y. P.
if I can ask. So, when you're saying agents, agents in Glue AI, agents in other framework, are you actually have, actually, let me start with the first question. When it comes to your use case, are you putting all the data in this framework? Or do you have kind of a segregation process? Yeah. Sorry.
20:52 - A. A.
Yeah, I'm actually getting the data and converting them to chunks. And then those chunks go into the embedding process. And then it goes to the vectors.
21:06 - Y. P.
Yeah. Not that well, I'll explain the other way. So census data is large data, and there are multiple data sets. And see, if you think from processing standpoint, and you were yourself saying that, the traditional rule-based system will work much better than the machine learning or pre-GPT, non-GPT AI engines. And then GPT AI engines will consume the maximum processing power or energy and all that. So when you are doing that, like the way you are thinking of multi-agents within this, have you thought of, now I know you are experimenting might be only on RAG and agents, but is there a scope, Dr. M, also that might be a question for you, where actually you are saying processing-wise, we have eliminated records that really do not need to go through this, where there is clear match, so we are removing that, and then remaining data sets only where it cannot be matched or something, you are putting it for learning. Yeah. So are you doing that or no? Or have you thought of doing that?
22:26 - M. M.
Yes, but he mentioned if there is first step is direct match. This is what you're talking about. Something that is really obvious, we don't continue investigating more. So, we will do this step. We remove what is really obvious and start digging, you know, deeper. Yeah.
22:48 - A. A.
I mean, the thing is, the census data that we got, it's like a very dirty data set. So, in the column called names, there will be the address. In the address, there will be some identification numbers. Pretty jumbled. So what we are trying to do is that we are trying to automate it without any rule-based approach using LLMs alone.
23:18 - Y. P.
So that's what we are trying to do.
23:21 - Multiple Speakers
Thank you for the explanation.
23:24 - M. M.
Yeah, we actually published several papers using transformer, you know, this is something new with multi-agents, but we actually proposed and already proved that with large language models and particularly transformer and explainable transformers give us very, very good results. So we already use for census and we're the first actually university that suggest them, you know, to use the large language models. They reject so many times, but right now they are asking us the court and they approve our work already. So I want to share again, D., with you probably, the new code that I have for this semester for everybody that is in our group to use NVIDIA courses for free. Because they change every semester, they change the code. We have a new code right now. Well, because this is how A. and several of my students are using Lama. Yeah, so we have actually another presentation. And A., you want to show the
24:46 - A. A.
I think, yeah, S. will show that right now.
24:51 - Multiple Speakers
S. will show another too? Yeah.
24:55 - M. M.
Yeah, I like, we have another. Presentation which S. can show.
25:01 - M. S. M.
I wanted to show the N8n platform from YouTube as we have not started working with N8n yet. I think it's better if we show everyone the YouTube video that explains how good N8n is in regards to codeless programming for AI agents. I love it this one but do you want to show first the Storytelling the storytelling Storytelling I think it's better if we show the storytelling in another another day Why we have so many people oh my goodness you have
25:39 - M. M.
the video please prepare the video Multi-agent that then you can you can you can go ahead with that. I'll show the storytelling.
25:49 - Multiple Speakers
Let me just grab the video Okay, grab the storytelling So when you share a screen to show the video, you need to checkmark that it's optimized for video or something like that.
26:05 - D. B.
In Zoom, there's a checkmark. Otherwise, the video won't come out. So when you go to share the screen, there's a special place to check that it's a video.
26:18 - M. M.
Can you see it?
26:20 - Unidentified Speaker
Yeah.
26:22 - M. M.
This is one of the recommended ones. I have a list of the 10 most popular, and this framework actually works without code, using Telegram, which is an app you can use on your computer.
26:36 - D. B.
No, it's not coming. I'm not seeing a video.
26:40 - Y. P.
Yeah.
26:41 - Multiple Speakers
I think it's in another screen or document.
26:44 - D. B.
I see your Zoom launch meeting screen, so it's the wrong window. Or the wrong tab.
26:52 - Unidentified Speaker
Or if you send, okay, there you go, you're sharing. That looks like a video, yes. Yeah, this is the video.
27:06 - M. M.
And S., please prepare our video, it's very good. Agents are taking over right now.
So in this video, I'm going to show you how you can automate anything. Can you hear it? I hope so. Yeah. Yes, ma'am. Yeah, yeah, we can hear. I'm going to be showing you a couple of examples of how you can add tools to an AI agent and have a conversation with a bot using Telegram, which is an app you can use on your computer, on your phone, on the go. And this is multimodal: I can type into it or speak into it, whether I'm on my phone or on my desktop, and I can give it requests like it's my assistant, and it can go out and do things for me. Let me show you a couple of examples of how this actually works, because as you can see, I have a bunch of things connected: Google Calendar, Airtable, Gmail. Things are all over the place, but this AI agent can manage your tools and decide which one to use on its own. You give it basic natural-language queries, and it comes back after using all your tools, with all of your info, and gives you a good response: it schedules things for you, can email people, update your CRM, search your CRM, do so many things.
Now this is just the tip of the iceberg. If I go to my Google Calendar, as you can see, tomorrow I have a discussion with J. and AI brainstorming. Discussion with J. is two to three, AI brainstorming eight to nine. If I ask Telegram something like, what's on the schedule for tomorrow?, and send that off, using very natural, simple language, because AI is really good at interpreting and understanding that, what it's going to do is return the schedule I just showed you, in a matter of seconds. As you can see: here's your schedule for tomorrow. Morning routine, discussion with J., AI brainstorming. If we go to my calendar, this is exactly what I've got tomorrow: morning routine, discussion with J., AI brainstorming.
I could now add an event or something like that. I could be on the go, remember; I could just use my voice. I could say, can you add an event around 11 p.m. to 11:30 p.m. that goes over AI agent discussions with C.? You know, I'm talking even horribly to it, and it's going to parse that, transcribe the recording, and then create an event for me tomorrow after giving my schedule. In a matter of seconds, guys. So it says 11 to 11:30 p.m., AI agent discussions with C., tomorrow, scheduled. Go down here, boom: AI agent discussions with C., scheduled. I could then ask it questions about certain people. Maybe C. is in my CRM; I could say, return info about C. to me from my CRM, and have it do multiple things for me. So now I'm going even deeper down the rabbit hole: the event's been created, and then I asked for information about C., and now it's giving me things like his name, his email, his company name, notes about him. With his email, I can even use my emailing tool to email him the event details.
So things are moving very quickly. I created this thing in 45 minutes, but it might be more difficult for you; that's why I'm here to teach you. In this video, you're going to learn lessons that are much greater than just putting together a template that you forget about. You're going to learn things like how to set up a Telegram trigger and how to craft instructions like this. And look, watching a video of me building all this is going to help you.
You're going to learn some stuff, but if you want to learn in a more structured way, if you want a full guide so that you can whip things up like this in 45 minutes for any use case you desire, then I highly, highly recommend joining our AI Foundations community. This network is hyper-focused on building AI agents right now, and we've also just released a full course on agent building within N8n. This course literally takes you from beginner to pro in N8n, from setting up your workspace all the way to crafting agentic systems like the one I'm about to build today. Now, it's important to have the knowledge to do this on your own, and in order to thrive in a world full of agents, you're going to want to know how to build these. That's why we made this in-depth course: to actually teach you the tactics of building agents rather than relying on other people giving you tutorials and templates. We want to give you the actual way to build them, and that's what this course offers. We go in-depth on so many topics, and the community, the network, the live calls, everything you get in here is going to accelerate your AI agent-building abilities. So I'll leave a link in the top comment and the description.
Yeah, I think that, yeah, we can just grab this template and go over it, continue. Do you want me to share the video? Here we have it, a prompt helping us create our... Yes, yeah, sure. So I think this is free, and N8n is free; I tried it a little bit.
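In code terms, the routing behavior shown in the video (one agent deciding whether a request is a calendar, CRM, or email job) boils down to tool calling. A rough Python sketch of that idea, not n8n itself; the tool names and schemas below are made up for illustration.

    import json
    from openai import OpenAI

    client = OpenAI()

    TOOLS = [
        {"type": "function", "function": {
            "name": "get_schedule",
            "description": "Return the user's calendar events for a given day.",
            "parameters": {"type": "object",
                           "properties": {"day": {"type": "string"}},
                           "required": ["day"]}}},
        {"type": "function", "function": {
            "name": "lookup_crm",
            "description": "Return CRM info (email, company, notes) for a contact.",
            "parameters": {"type": "object",
                           "properties": {"name": {"type": "string"}},
                           "required": ["name"]}}},
    ]

    def route(user_message):
        """Let the model pick a tool; the caller then executes whatever it chooses."""
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_message}],
            tools=TOOLS,
        )
        msg = reply.choices[0].message
        if not msg.tool_calls:              # model answered directly, no tool needed
            return None, msg.content
        call = msg.tool_calls[0]
        return call.function.name, json.loads(call.function.arguments)

    # route("what's on the schedule for tomorrow?")
    #   -> ("get_schedule", {"day": "tomorrow"})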
32:00 - Unidentified Speaker
I like it because it's no code. Good. So maybe S. is ready.
32:09 - M. M.
When you know how, I can stop sharing and S. can share, but I can show you the video, thank you. I also have a lot of links from Medium; they list the best frameworks for multi-agents. But this video is long, so I'll just show part of it. I'm excited. I need some help, at least for the calendar, somebody to help me with the calendar, the reminders and things like this. And it can do whatever job you want, you know: summarize papers, search for a ticket for you, whatever you want. S., are you ready to show, please?
33:15 - M. S. M.
Yeah, Dr. M., I'm ready. I can show the video that my friend I. and I created for the NVIDIA AI competition. We used NVIDIA multi-agents and NVIDIA Guardrails to create a storytelling project. Basically, the storytelling was done by agents, and then from the scripts derived from the storytelling, I created images that go along with the story. At the end, I put all the images together and used a text-to-audio tool to generate audio for the file. If you guys want, I can show the video, which I. posted on his LinkedIn. This was our project; let me share. This is my friend I.; he's presenting this, and we worked on it together.
Everyone, this is me, I., presenting our entry for the NVIDIA Generative AI Contest. Our project is a story production crew, which generates videos of children's stories from just a text prompt. Here you can see I'm putting in an input for a story about a monkey, a dog, and a dragon, and you can see our agents will generate a story from this prompt. What we do here is create three separate types of agents: one responsible for detailing the scene, one for the characters, and one for writing the scripts and the story. From the story, we generate all the other information, pass it through NVIDIA Guardrails, and generate both images and text. The text is converted to audio, and then we combine the generated images and the audio to make a video. Here you can see we got the output, and soon we'll demo the video that was produced. So enjoy.
Meet M., a friendly dog. During one of his explorations, they quickly become friends and enjoy playing hide-and-seek. M. tells M. about a legendary dragon that lives deep in the forest, piquing M.'s curiosity. Determined to meet the dragon despite M.'s initial hesitation, they embark on an adventure together. They eventually find the dragon, E., who has emerald scales and a fiery mane. Though E.'s initial roar is intimidating, M. and M. stand their ground and explain their desire to be friends. Touched by their bravery, E. reveals that he used to have many friends, but humans grew to fear him over time. E., delighted to find new friends who aren't afraid of him, offers to teach them about fire breathing, wingspans, and the importance of friendship. The three of them go on exciting adventures, exploring caves, chasing butterflies, and having campfires. Through their journey, M. and M. learn that true friendship can bridge any difference, and E. finds joy in their companionship.
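The assembly step described here (one image and one narration clip per story segment, stitched into a single video) can be sketched in a few lines of Python. This is a guess at the shape of the pipeline, not the team's actual code: generate_image() is a placeholder for the Stable Diffusion call, gTTS stands in for whatever text-to-audio tool was used, and the stitching uses the moviepy 1.x API.

    from gtts import gTTS
    from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

    def generate_image(segment_text, path):
        """Placeholder: call Stable Diffusion / DALL-E here and save the image to `path`."""
        raise NotImplementedError

    def build_video(segments, out_path="story.mp4"):
        clips = []
        for i, text in enumerate(segments):
            img, mp3 = f"scene_{i}.png", f"scene_{i}.mp3"
            generate_image(text, img)      # one image per story segment
            gTTS(text).save(mp3)           # one narration clip per story segment
            audio = AudioFileClip(mp3)
            clips.append(ImageClip(img).set_duration(audio.duration).set_audio(audio))
        concatenate_videoclips(clips, method="compose").write_videofile(out_path, fps=24)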
37:06 - Unidentified Speaker
OK.
37:07 - M. M.
Thank you. Thank you. OK.
37:09 - Multiple Speakers
So the problems I'm currently facing with this project can be divided into three parts. The first would be parsing.
37:21 - M. S. M.
The parsing part: for each individual story segment, the code currently gives us separate images, so I'm still working on a system where you don't have to compile each segment of the text separately to get an image; instead, we get all the images together. The second problem is consistency in image generation. As you've seen, in the images the breed of the dog, the color of the dog, the background, and the other characters don't stay consistent, so there was a lot of manual processing involved when I fixed it. I'm trying to figure out how to keep image generation consistent across all the images, because I'm still facing this problem and I haven't found, or maybe I haven't looked hard enough for, one simple solution I can follow to solve it. The third one: as you've seen, this was generated by Stable Diffusion, done manually in the code. I would also like to send this text to DALL-E for image generation and compare what happens if we use the DALL-E platform and the Midjourney platform, versus not using those platforms and instead using API keys in the project. Those are the three challenges I'm currently facing and working on. Thank you so much, Dr. M., for giving me the opportunity to show this video.
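One common mitigation for the consistency problem described above, sketched here with the Hugging Face diffusers library, is to fix the random seed and prepend a single shared character description to every scene prompt. This is a generic workaround, not the project's code, and seed pinning alone does not fully solve character drift (approaches like LoRA fine-tuning or IP-Adapter go further); the model name and descriptions are placeholders.

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # One fixed "character sheet" reused verbatim for every scene (text is illustrative).
    CHARACTER_SHEET = ("M. is a small brown dog with a red collar; "
                       "E. is a dragon with emerald scales and a fiery mane.")

    def render_scene(scene_prompt, seed=42):
        generator = torch.Generator(device="cuda").manual_seed(seed)  # same seed every scene
        return pipe(f"{CHARACTER_SHEET} {scene_prompt}", generator=generator).images[0]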
39:24 - M. M.
Sure, sure. And we want help from all all of you in this project, you know, the agents can do multiple things like I show you. So any ideas that you have, and I mean, all kinds of companies can benefit about.
39:42 - D. B.
I have a couple of questions. So how much code, actual, you know, code did you have to write to create that movie?
39:53 - M. S. M.
There was quite a bit of coding involved. For the image generation, there was the normal, run-of-the-mill Stable Diffusion code that I got from a GitHub project and repurposed, and I also did some of my own coding on top of that.
40:13 - D. B.
Okay, how many lines would you estimate?
40:16 - M. S. M.
That was at least 70 or 80 lines, but not more than 150.
40:21 - D. B.
Okay, well, that's not that much code for generating a whole movie, right?
40:26 - M. S. M.
I mean, it's not generating a whole movie per se, Dr. B. It's basically generating images, and keeping consistency in those images is where I'm facing problems now. As you can see, there was a lot of manual tweaking involved. At one point, I wasn't getting one of the images right, so I had to use the DALL-E website to fix it. The DALL-E website has a markup tool where you can mark a region of the image and it will fix that region however you want. There was one part of the image, the forest in the background, that wasn't coming out as well as I wanted, so I had to fix that manually. So you can see it's still a work in progress, and there's still a lot of tinkering that needs to be done.
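The manual "mark the region and fix it" step on the DALL-E website corresponds to the OpenAI images.edit (inpainting) endpoint, which can be scripted instead. A sketch; the file names and prompt are placeholders, and the edit endpoint currently targets DALL-E 2 and expects a PNG mask whose transparent pixels mark the area to regenerate.

    from openai import OpenAI

    client = OpenAI()

    result = client.images.edit(
        model="dall-e-2",
        image=open("scene_3.png", "rb"),       # the image that came out wrong
        mask=open("scene_3_mask.png", "rb"),   # transparent where the forest should be redrawn
        prompt="a dense green forest background, children's storybook style",
        n=1,
        size="1024x1024",
    )
    print(result.data[0].url)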
41:21 - D. B.
So, I mean, did you have one agent generate the story and another agent request images for each segment of the story? Exactly, exactly. And did anyone else have any questions about this? I don't want to overwhelm things by asking all the questions myself.
41:40 - Y. P.
Ben, I'm always a man with questions. If you have questions... You have a bunch of questions. No, no, I'm always the man with questions, but if you have more questions, finish them first. I can wait.
41:57 - D. B.
I can hold my question if we can go back. You had it.
42:02 - M. M.
You showed a diagram of the agent structure. Yeah, yeah, and for every agent the prompt is in natural language. We're using CrewAI in this case, and OpenAI. The only NVIDIA piece is the guardrails, and the guardrails are just about security: making sure the text the agent generates is not something that would hurt people, or hurt a child, or anything like that. You know, this is the control, the security of the text. Can you go to the diagram, really? He has three different agents, because the initial query is just a story about these characters, so everything else is computer-generated. S., can you please show the three agents?
42:57 - M. S. M.
Sure, Fany.
42:59 - Multiple Speakers
So everything is computer-generated. We didn't put in any words, none. And it's generating. I think one agent is generating the script. We have a good description.
43:13 - M. M.
Okay, yeah. So the scene generator, the character describer, and the scriptwriter; those are the three. The scene generator, I think, has all the text; the character description is how the characters look, you know; and the scriptwriter is what they exchange. So every time we create a new image, we're using the new text, the dialogue.
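A minimal CrewAI sketch of the three roles in the diagram (scene generator, character describer, scriptwriter). The goals and backstories below are illustrative guesses, not the project's actual prompts.

    from crewai import Agent, Task, Crew

    scene_agent = Agent(
        role="Scene Generator",
        goal="Break the story idea into an ordered list of visual scenes.",
        backstory="You plan children's stories scene by scene.")
    character_agent = Agent(
        role="Character Describer",
        goal="Write one consistent visual description per character.",
        backstory="You keep character appearance identical across scenes.")
    script_agent = Agent(
        role="Scriptwriter",
        goal="Write the narration and dialogue for every scene.",
        backstory="You turn scene outlines into a read-aloud script.")

    prompt = "a story about a monkey, a dog, and a dragon"
    tasks = [
        Task(description=f"List the scenes for: {prompt}",
             expected_output="numbered scene list", agent=scene_agent),
        Task(description="Describe each character's appearance.",
             expected_output="character sheet", agent=character_agent),
        Task(description="Write the full script, scene by scene.",
             expected_output="script with narration and dialogue", agent=script_agent),
    ]

    crew = Crew(agents=[scene_agent, character_agent, script_agent], tasks=tasks)
    print(crew.kickoff())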
43:47 - D. B.
Yeah.
43:47 - M. M.
So we have audio and images, and combining them is the work S. is doing: actually generating the images as we come to every scene, every frame, with the people, the dialogue, the text that they speak. I have the whole document, but I think not many people participated when I. presented the whole document: what is included, how the text is generated in each part. And we liked it so much. Actually, it can be a game. There are games like this right now.
44:44 - Unidentified Speaker
Any questions? We're almost finished, yeah. Yeah, hi.
44:48 - Y. P.
The question that I had was around the inconsistency that you were finding in the images.
44:56 - Unidentified Speaker
Yes.
44:56 - Y. P.
Now, I have not personally played with video creation and this kind of effort; my work has mainly been on the data side. But similar to the earlier presentations, is there a possibility of building a RAG model? Have you thought of storing these images, the way you want them, somewhere, and calling that capability or those images each time, saying, hey, you have done this, but I would like you to use this? So is there a possibility of controlling these inconsistencies using a RAG framework or a similar framework?
45:43 - M. S. M.
Yeah, that is great advice. Actually, we are trying to do that; I mean, we're still trying to work out a workaround. As you know, there wasn't much RAG involved here; there were agents, but not a lot of RAG. So I think we are trying to work out a framework where we can apply RAG to it.
46:11 - M. M.
Well, to copy, to save, like you suggest: maybe we can copy and save the first generated appearance of these characters, and give one agent the task of keeping that description, so its role is exactly to keep the description and use exactly this image, exactly this appearance. We can do this, and this is a great suggestion. Yeah, kind of like a feedback loop. Correct.
46:52 - Y. P.
And I have one more. I did not... I mean, I saw the video, but I was driving while I was watching it. But there is also a possibility where you might have to keep images one, two, three, four, and tag those images with
47:10 - Unidentified Speaker
background or script or something.
47:12 - Y. P.
So that when you're calling it, because one image perhaps may not be enough: the color can change, the expression can change, and so on. So you might have to create literally a database of, hey, this is the script, or, this is the background, this is what I want, something like that. That was running in my mind, but maybe I'm going too deep; first you have to find the framework, and then you have to decide how you're going to build it. But those are some of the questions and thoughts that came to my mind, and I wanted to share them with you all. Thank you.
47:52 - M. S. M.
That is actually a great point for discussion as well. When we're using Stable Diffusion or any kind of image-generation code, there's this preconceived notion that the images we're going to get have already basically been trained on thousands of images. So in our minds, we did not have the idea of keeping a database, but yeah, that can be done. I think the workaround would be a bit time-consuming; for a small project, it would be time-consuming, but it is great advice if we can actually do it. I'll have to look more into it. Thank you so much for your suggestion.
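The "save the first appearance and reuse it" idea from this exchange does not need a full RAG stack for a handful of characters; a small persisted store of descriptions and tags, injected into every later prompt, captures it. A sketch with hypothetical names (a real RAG setup would add embedding search over many more references).

    import json
    from pathlib import Path

    STORE = Path("character_store.json")

    def load_store():
        return json.loads(STORE.read_text()) if STORE.exists() else {}

    def remember(name, description, tags=None):
        """Record a character's first generated appearance (and optional tags)."""
        store = load_store()
        store.setdefault(name, {"description": description, "tags": tags or []})
        STORE.write_text(json.dumps(store, indent=2))

    def prompt_with_references(scene_prompt, characters):
        """Prepend stored descriptions so later images reuse the same appearance."""
        store = load_store()
        refs = "; ".join(store[c]["description"] for c in characters if c in store)
        return f"{refs}. {scene_prompt}" if refs else scene_prompt

    # remember("M.", "a small brown dog with a red collar", tags=["dog"])
    # prompt_with_references("M. meets the dragon at the cave entrance", ["M.", "E."])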
48:44 - Multiple Speakers
Actually, I. extended this. I think he put together multiple different solutions.
48:50 - M. M.
So we can invite I., if it's of interest to the group, to give us feedback on what is new right now. Yeah, why not?
49:01 - D. B.
Yeah, this is the initial step.
49:04 - M. M.
This was six months ago, I think, or more. So it's not new. So we have a new update right now.
49:15 - Unidentified Speaker
Wow.
49:15 - D. B.
Okay. Yeah. Great. Well, thanks. Uh, thank you all for attending and thank you to the presenters for bringing that information. Uh, that was quite interesting. Um, any last comments before we adjourn?
49:35 - D. D.
Thank you guys. Thank you.
49:37 - M. M.
Great presentation. Yeah. Thank you.
49:40 - D. B.
All right. You all have a good weekend.
49:48 - Unidentified Speaker
See you next week. Thank you. See you next week. Thank you. Thank you. Bye. Thank you all.