Friday, February 14, 2025

2/14/25: HM on "Evaluation of Snowflake Data Cloud Data Pipelines and AI/ML Capabilities"

    Artificial Intelligence Study Group

Welcome! We meet from 4:00-4:45 p.m. Central Time. Anyone can join. Feel free to attend any or all sessions, or ask to be removed from the invite list; we have no wish to send unneeded emails, of which we all certainly get too many. 
Contacts: jdberleant@ualr.edu and mgmilanova@ualr.edu

Agenda & Minutes  (150th meeting, Feb. 14, 2025)

Table of Contents
* Agenda and minutes
* Appendix 1: Details
* Appendix 2: Transcript (when available)

Agenda and minutes
  • Today: HM informally presents proposed MS project on "Evaluation of Snowflake Data Cloud Data Pipelines and AI/ML Capabilities"
  • Announcements, updates, questions, presentations, etc. as time allows
    • Soon: VK will report on the AI content of a healthcare data analytics conference attended in FL. 
    • Feb. 21: BH informally presents proposed PhD project on "Unveiling Bias: Analyzing Federal Sentencing Guidelines with Topological Data Analysis, Explainable AI, and RAG Integration"
    • Wednesday, Feb. 26: presentation on AI at Windstream. Pizza at 11:30 a.m., presentation 12:15-1:30 p.m. in the EIT auditorium and perhaps online. See details in the appendix, below. 
    • Fri. March 7: CM will informally present. His "prospective [PhD] topic involves researching the perceptions and use of AI in academic publishing."
  • Recall the masters project that some students are doing and need our suggestions about:
    1. Suppose a generative AI like ChatGPT or Claude.ai were used to write a book or content-focused website about a simply stated task, like "how to scramble an egg," "how to plant and care for a persimmon tree," "how to check and change the oil in your car," or any other question like that. Interact with an AI to collaboratively write a book, or an informationally near-equivalent website, about it!
      • BI: Maybe something like "Public health policy." 
      • LG: Thinking of changing to "How to plan for retirement." (2/14/25)
        • Looking at the CrewAI multi-agent tool, http://crewai.com, but it is hard to customize; now looking at the LangChain platform, which federates different AIs. They call it an "orchestration" tool. (See the Python sketch just after this agenda list.)
        • MM has students who are leveraging agents, and LG could consult with them.
      • ET: Growing vegetables from seeds. (2/14/25)
        • Plan to try using Gemini
        • ChatGPT didn't produce enough words
        • Plan to make a website, integrating things together. 
        • Try prompting by asking for limited text, like 1,000 words, then asking it to give another 1,000, and so on. (BH)
  • News: new freshman level AI course! See details in the appendix below.
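Below is a minimal sketch of the kind of "orchestration" LG describes: one model call drafts an outline, and a second pass expands each section. It assumes the langchain-openai package and an OpenAI API key in the environment; the model name and prompts are illustrative only, not from LG's actual project.

  from langchain_openai import ChatOpenAI

  # One LLM instance plays two roles: outliner and section writer.
  llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)

  outline = llm.invoke(
      "Draft a 10-chapter outline for a short book on how to plan for retirement."
  ).content

  book = []
  for line in outline.splitlines():
      if not line.strip():
          continue
      # Pass each chapter heading to the writer role.
      section = llm.invoke(
          "Write roughly 500 words for this chapter of a retirement-planning "
          "book: " + line
      ).content
      book.append(section)

  print("\n\n".join(book))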

The meeting ended here.

  • We are up to 19:19 in the Chapter 6 video, https://www.youtube.com/watch?v=eMlx5fFNoYc and can start there.
  • Schedule back burner "when possible" items:
    • If anyone else has a project they would like to help supervise, let me know.
    • (2/14/25) An ad hoc group is forming on campus for people to discuss AI and teaching of diverse subjects. It would be interesting to hear from someone in that group at some point to see what people are thinking and doing regarding AIs and their teaching activities.
    • JK proposes complex prompts, etc. (https://drive.google.com/drive/u/0/folders/1uuG4P7puw8w2Cm_S5opis2t0_NF6gBCZ).
    • The campus has assigned a group to participate in the AAC&U AI Institute's activity "AI Pedagogy in the Curriculum." IU is on it and may be able to provide occasional updates when available, though not every week.
      • 1/31/25: There is also an on-campus discussion group about AI in teaching being formed by ebsherwin@ualr.edu.
  • Here is the latest on future readings and viewings

Appendix 1: Details on (i) Windstream presentation, and (ii) new AI course

(i) Windstream presentation
From Dr. Pierce:
Windstream AI Presentation
On Wednesday, February 26, 2025, representatives from Windstream will be at the EIT Auditorium to talk about how their company is using Artificial Intelligence.  This event is open to all students, faculty and staff who would like to attend.  
   When: Wednesday, February 26 in the EIT Auditorium
·       Pizza and soda available from 11:30 a.m. to 12:15 p.m.
·       Windstream presentation from 12:15 p.m. to 1:30 p.m.
Agenda
AI at Windstream
    Team Structure and Collaborations
        Overview of Our AI Team and Roles
        Partnerships with Other IT Teams
        Collaboration with Business Units for Strategic Alignment
Operationalizing GenAI
    Overview and Implementation
        Key Strategies and Operational Framework
        Integration with Business Processes and Goals
    Projects and Innovations
        Key Projects in Progress
        Strategic Vision and Expected Outcomes
High-Level Architecture and Tools
    Technological Framework
        Core Technologies and Platforms
        Innovative Tools and Techniques
Impact and ROI of AI
    Business and Economic Impacts
        Measuring Return on Investment
        Case Studies Illustrating Value Addition
Please share this announcement with your colleagues and students (both undergraduates and graduates). This is currently an in-person event, but we will attempt to record the session so that those who cannot attend in person can still benefit from it.
   Note:  Zoom Link is below (just no promises on how well a remote session would go).  
Topic: Windstream AI Workshop
Time: Feb 26, 2025 12:00 PM Central Time (US and Canada)
Join Zoom Meeting
https://ualr-edu.zoom.us/j/86789504204
Meeting ID: 867 8950 4204
 
(ii) New course
 The Department of Computer Science is planning to offer:
    •  CPSC 1380: Artificial Intelligence Foundations

       Course Description

       Credit Hour(s): 3

      Description: This course introduces key principles and practical applications of Artificial Intelligence. Students will examine central AI challenges and review real-world implementations, while exploring historical milestones and philosophical considerations that shed light on the nature of intelligent behavior. Additionally, the course investigates the diverse types of agents and provides an overview of the societal impact of AI applications.

       Prerequisites: None

       Course Learning Objectives

      Upon successful completion of this course, students will be able to:

      ·         Describe the Turing test and the “Chinese Room” thought experiment.

      ·         Differentiate between optimal reasoning/behavior and human-like reasoning/behavior.

      ·         Differentiate the terms: AI, machine learning, and deep learning.

      ·         Enumerate the characteristics of a specific problem related to Artificial Intelligence.

      Learning Activities

      ·         Overview of AI Challenges and Applications - Introduces central AI problems and highlights examples of successful, recent AI applications.

      ·         Historical and Philosophical Considerations in AI – Discusses historical milestones in AI and the philosophical issues that underpin our understanding of artificial intelligence.

      ·         Exploring Intelligent Behavior

      o   The Turing Test and Its Limitations

      o   Multimodal Input and Output in AI

      o   Simulation of Intelligent Behavior

      o   Rational Versus Non-Rational Reasoning

      ·         Understanding Problem Characteristics in AI

      o   Observability: Fully Versus Partially Observable Environments

      o   Agent Dynamics: Single versus Multi-Agent Systems

      o   System Dynamics: Deterministic versus Stochastic Processes

      o   Temporal Aspects: Static versus Dynamic Problems

      o   Data Structures: Discrete versus Continuous Domains

      ·         Defining Intelligent Agents - Explores definitions and examples of agents (e.g., reactive vs. deliberative).

      ·         The Nature of Agents

      o   Degrees of Autonomy: Autonomous, Semi-Autonomous, and Mixed-Initiative Agents

      o   Decision-Making Paradigms: Reflexive, Goal-Based, and Utility-Based Approaches

      o   Decision Making Under Uncertainty and Incomplete Information

      o   Perception and Environmental Interactions

      o   Learning-Based Agents

      o   Embodied Agents: Sensors, Dynamics, and Effectors

      ·         AI Applications, Growth, and Societal Impact - Provides an overview of AI applications and discusses their economic, societal, and ethical implications.

      ·         Practical Analysis: Identifying Problem Characteristics - Engages students in exercises to practice identifying key characteristics in example environments.

      Tentative Course Schedule

      Subject to change at the discretion of the instructor.

      Week 1: Course Introduction & Overview of AI Problems
      ·         Topics: Overview of central AI challenges; examples of recent successful applications
      ·         Learning activities: Lecture introducing course objectives and structure; reading assignment on current AI trends

      Week 2: Philosophical Issues and History of AI
      ·         Topics: Examination of philosophical issues in AI; overview of AI’s historical evolution
      ·         Learning activities: Student presentations summarizing key course takeaways; course review session and Q&A in preparation for the final assessment

      Week 3: What is Intelligent Behavior? I – The Turing Test and Beyond
      ·         Topics: The Turing test and its flaws; introduction to related philosophical debates (e.g., Chinese Room)
      ·         Learning activities: Lecture with historical context; small-group discussion on Turing test limitations; reading assignment on classic AI thought experiments

      Week 4: What is Intelligent Behavior? II – Multimodal I/O & Simulation
      ·         Topics: Multimodal input and output in AI; simulation of intelligent behavior
      ·         Learning activities: Demonstration of multimodal systems (videos/demos); lab session exploring a simple simulation environment; reflective writing on how simulation approximates intelligence

      Week 5: Intelligent Behavior: Rational vs. Non-Rational Reasoning
      ·         Topics: Comparison of optimal (rational) decision-making and human-like (non-rational) behavior
      ·         Learning activities: In-class debate on the merits of optimality vs. human-like reasoning; case study analysis

      Week 6: Problem Characteristics I – Observability and Agent Interactions
      ·         Topics: Fully vs. partially observable environments; single vs. multi-agent systems
      ·         Learning activities: Group workshop analyzing example environments for observability and interaction challenges

      Week 7: Problem Characteristics II – Determinism, Dynamics, and Discreteness
      ·         Topics: Deterministic vs. stochastic systems; static vs. dynamic and discrete vs. continuous problem spaces
      ·         Learning activities: Hands-on group exercise mapping out the characteristics of a provided problem scenario; group discussion on design implications

      Week 8: Defining Agents: Reactive and Deliberative
      ·         Topics: What constitutes an agent; examples of reactive versus deliberative agents
      ·         Learning activities: Interactive lecture with in-class examples; group exercise classifying agents from provided case studies

      Week 9: Nature of Agents I – Autonomy and Decision-Making Models
      ·         Topics: Autonomous, semi-autonomous, and mixed-initiative agents; reflexive, goal-based, and utility-based decision frameworks
      ·         Learning activities: Interactive exercise designing a decision-making framework for a hypothetical agent; group presentations of frameworks

      Week 10: Nature of Agents II – Decision Making Under Uncertainty & Perception
      ·         Topics: Handling uncertainty and incomplete information; the role of perception and environmental interactions in agent behavior
      ·         Learning activities: Lab experimenting with a simple decision-making simulation; group discussion on sensor integration challenges

      Week 11: Nature of Agents III – Learning and Embodiment
      ·         Topics: Overview of learning-based agents; embodied agents (sensors, dynamics, and effectors)
      ·         Learning activities: Group lab exploring embodied agent models using simulation tools; group discussion on design trade-offs

      Week 12: AI Applications, Growth, and Impact
      ·         Topics: Survey of AI applications across industries; economic, societal, ethical, and security implications
      ·         Learning activities: Case study analysis evaluating the societal impact of an AI application; group discussion on ethical dilemmas and future trends

      Week 13: Deepening Understanding Through Application
      ·         Topics: Practice identifying problem characteristics in real/simulated environments; additional examples on the nature of agents; extended discussion on AI’s broader impacts
      ·         Learning activities: Interactive workshop analyzing a complex AI scenario in small groups; peer review of group findings; hands-on exercises using simulation tools or provided datasets



Appendix 2: Transcript
 
 
AI Discussion Group  
Fri, Feb 14, 2025

0:13 - R R  
Good afternoon, all. Good afternoon, Dr.

0:16 - Unidentified Speaker
P. Good afternoon.

0:17 - M M
Yes, everybody. Good afternoon, Dr.

0:19 - Unidentified Speaker
M.

0:19 - R R
How are you?

0:21 - M M
I'm fine. I'm fine, yes. So, yeah. Are you going to present today? It's this guy, H, right?

0:31 - D B
H.

0:32 - M M
But I don't see him here yet.

0:36 - Multiple Speakers
But he's not late, so not yet.

0:40 - M M
I actually invited some of my students from AI; I can see J here. Yeah.

0:48 - D B
Oh, good. I invited students from the PhD seminar today.

0:53 - M M
So some of them are here too.

0:57 - D B
So we got a bunch of new people.

1:01 - M M
Yeah. So actually, for my students and for everybody, I will send an announcement for a workshop that we're offering again, an NVIDIA workshop, Fundamentals of Deep Learning, with certificate. We will start Monday for the AI class, but everybody else is invited. If you are interested, just let me know. But for my students, yes, I will send you the invitation and we will start Monday, but you can continue on Tuesday and Wednesday if you don't finish. So it's a little bit long, and with certificates. Oh, the guy's not coming.

1:44 - D B
I expect and I invite students for this presentation.

1:49 - Unidentified Speaker
Yeah. Welcome, H. Oh, D is here. I don't see any other old members. J is here.

2:00 - D D
Hello, everybody. Hello, D.

2:02 - M M
Yeah, D, congratulations. It's accepted; I hope there are no errors.

2:09 - Unidentified Speaker
Yeah. Yeah.

2:10 - D D
Yeah, I got it. I found one error. There's some comma business going on, but I was kind of worried about messing with the commas, because sometimes it helps, sometimes it doesn't. But I found one thing, and I'm still double checking it.

2:30 - Unidentified Speaker
OK.

2:30 - M M
If it's a small error, I don't care. OK. J?

2:35 - D B
All right, well, so this is an AI study group. We meet every Friday at 4 o'clock, and it's open to anybody. There's no obligation, and it's free. And if you get tired of the emails, let me know; I'd be happy to delete your email from the list, because I know people get more email than they can deal with these days, and I don't want to be part of the problem. So today, H will informally present his proposed master's project on evaluation of Snowflake data cloud data pipelines and AI/ML capabilities. Next week, B H will present his proposed PhD project. These are all welcome to be highly informal, although if people have slides and so on, they're certainly welcome to bring them; it's not required. Then the following week, we'll do our regular agenda. I want to mention that a week from next Wednesday is a presentation by the Windstream company, with pizza, in the EIT auditorium. So some people might find that of interest; local employer and everything. And then C M will present his prospective PhD topic on March 7th. So we've got a lot of presentations. Normally we don't do so many presentations; we just sort of talk about papers and videos and things like that. But we go with the flow; whatever there is, is what we do. So H, if you'd like to go right ahead, I'll stop sharing and you can share.

4:43 - H
Sounds good. Let me share my screen.

4:49 - Unidentified Speaker
Do you see my screen? We do.

4:54 - D B
All right, perfect.

4:56 - H
My name is H. I work as a cloud engineer at Snowflake.

5:05 - H
I've been working at Snowflake for the past four years, so I thought it would be interesting to work on what I do. I primarily work on Snowflake Cloud, and recently we have invested in a lot of AI/ML features. So I thought I would work on a project which would help explore how these different features come together to work as a data pipeline. The basic idea is to have some data in an S3 bucket hosted on an external AWS account, and we would have a data pipeline which will ingest this unstructured data; there can be images, or there can be scanned PDF documents, which will be stored in the S3 bucket. There are features which can load unstructured data into Snowflake tables, and we also support features where we directly read data from an external bucket using external tables, or directly parse the files in external storage using pre-signed URLs. So we will have a data pipeline which will ingest from this external storage into Snowflake tables, and there can be files residing in S3 with a pre-signed URL. We have some AI/ML features which can access these files using the pre-signed URLs and provide some analysis, which I will explain going further. As you may already be aware, Snowflake is basically a SaaS provider; we provide data analytics and data warehousing services hosted on all major public clouds. As part of this project, I am primarily focusing on exploring the Snowflake features in data management, storing structured and unstructured data, and auto-ingestion and batch ingestion capabilities. The entire pipeline will be developed using a Snowflake notebook, which can contain Python and SQL scripts. Snowflake has its own Anaconda channel: Anaconda has a Snowflake channel, which is basically a set of packages that are publicly available but security-certified and published in the Snowflake channel. As part of this, we have some Python packages which are provided by Snowflake for ML functionality, especially Snowpark ML and model registries. We also host container services, where you can develop an application. So basically, I will be using these features to develop a pipeline, plus Document AI and Cortex LLM. These two are machine learning capabilities. Document AI is primarily for parsing a document: it does OCR to read through the document and tries to tabularize the data, or it can answer questions based on summary functions. Cortex LLM also has a bunch of machine learning capabilities, like summary and translation, as well as... sorry, my son is here. But yeah, these are the features which I want to incorporate as part of the project. This would be the high-level pipeline. Basically, there are files in Amazon S3, and we would have an SQS notification, which will send out an event notification to the Snowpipe whenever files land there. This can be some external service; I'm trying to see if I can incorporate some public API to have frequent files coming into S3. But the idea is that some IoT device or some external application can drop files in Amazon S3. There is an event notification which sends an event whenever a file lands in the S3 bucket. For a put-object call, we will have an SQS notification triggering a Snowpipe operation. A Snowpipe is nothing but a copy job, so it basically reads the file based on the file format; currently we support JSON, Parquet, CSV, and a few other formats.
So basically, it just takes the files, uploads them into the Snowflake DB, and we will have some Python jobs which are going to take this data and further process it. Once it lands in the table, we will have something called a Stream, which is nothing but a CDC mechanism: it will basically capture all the change data and supply that to the Document AI as well as the LLM functions. These two are doing two different jobs: one is for parsing the files and capturing that into structured tables, and the LLM model will take the data, do some summary operations and some sentiment analysis, and write into one more table. This will all be published into a comparison report, and that will be published as a web UI inside a Streamlit app. On a high level, Snowflake ML has a lot of features, but as part of this project, I'm primarily focusing on ML functions, which are built-in functions, and we will use a model registry to have fine-tuned models. Then the container runtime will provide CPU and GPU hardware resources. This is a trial account which I'm going to use for the project, but container services still have the capability to run any of these models or machine learning data pipelines. Currently, I'm not using any streaming; this would be more like a scheduled job, which will push data in and publish it to the dashboards. Yeah, and Cortex Analyst is part of the project, which I just want to capture here. This is like a chat capability: once the data is evaluated and loaded to the final table, there would be a Streamlit application, which is like a web UI where we can ask questions against the data to try to get answers.
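A rough sketch of the auto-ingestion half of the pipeline H describes, written with the Snowpark Python API. The object names, S3 URL, and credentials below are placeholders; the stage/pipe/stream DDL is standard Snowflake SQL, but this is an illustration, not H's actual code.

  from snowflake.snowpark import Session

  # Connection parameters are placeholders; fill in from your own account.
  session = Session.builder.configs({
      "account": "<account>", "user": "<user>", "password": "<password>",
      "warehouse": "demo_wh", "database": "demo_db", "schema": "public",
  }).create()

  # External stage over the S3 bucket that receives the incoming files.
  session.sql("""
      CREATE STAGE IF NOT EXISTS raw_stage
        URL = 's3://example-bucket/incoming/'
        CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
  """).collect()

  # Landing table plus a Snowpipe; AUTO_INGEST = TRUE makes it fire on the
  # S3 event notifications (SQS) that H mentions.
  session.sql("CREATE TABLE IF NOT EXISTS raw_docs (doc VARIANT)").collect()
  session.sql("""
      CREATE PIPE IF NOT EXISTS raw_pipe AUTO_INGEST = TRUE AS
        COPY INTO raw_docs FROM @raw_stage FILE_FORMAT = (TYPE = JSON)
  """).collect()

  # A stream gives change-data-capture over the landing table, so downstream
  # steps only see rows that arrived since the last read.
  session.sql("CREATE STREAM IF NOT EXISTS doc_stream ON TABLE raw_docs").collect()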

12:42 - Unidentified Speaker
So this is like the steps which I'm planning to work on.

12:48 - H
But at a high level, we will have data coming into Snowflake, and there is some pre-processing or model training; some of the models are already provided with Snowflake. So we will have a comparison of fine-tuned models versus what comes by default. And everything will be captured as part of a Python UDF, so we can schedule that. There is something called a Snowflake task, which reads the stream and runs the Python UDF, and it generates the data. This order can change; I mean, this is not the final one. This is something I have captured from one of the published papers, but yeah, things can change. This is a high-level presentation.
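For the "task reads the stream" step, something like the following would fit, continuing the session from the previous snippet. SNOWFLAKE.CORTEX.SUMMARIZE and SENTIMENT are built-in Cortex functions; the schedule, column paths, and table layout are made up for illustration.

  # Results table for the Cortex output.
  session.sql("""
      CREATE TABLE IF NOT EXISTS doc_analysis
        (id STRING, summary STRING, sentiment FLOAT)
  """).collect()

  # A task that wakes up hourly but only runs when the stream actually has
  # new rows, then writes Cortex summaries and sentiment scores.
  session.sql("""
      CREATE TASK IF NOT EXISTS process_docs
        WAREHOUSE = demo_wh
        SCHEDULE = '60 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('DOC_STREAM')
      AS
        INSERT INTO doc_analysis
        SELECT doc:id::STRING,
               SNOWFLAKE.CORTEX.SUMMARIZE(doc:text::STRING),
               SNOWFLAKE.CORTEX.SENTIMENT(doc:text::STRING)
        FROM doc_stream
  """).collect()
  session.sql("ALTER TASK process_docs RESUME").collect()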

13:41 - D B
Thank you. Well, any questions for H? Well, I have a couple.

13:50 - Unidentified Speaker
So one of my questions is, so fundamentally, is Snowflake a database system?

13:59 - D B
It's a data warehousing.

14:02 - H
So they started as a cloud-based data warehousing system. Eventually, they started adding an ETL layer, which started with streaming; so we support Kafka streaming, or Snowpipe streaming, and auto-ingestion capabilities. Now we have added all these AI and ML capabilities on top of it. Well, I mean, it makes sense.

14:27 - D B
If you have a database, why not analyze the data, right?

14:31 - H
Yep, that's the idea. So I mean, some of these functions are very easy to use for end users because they're out of the box. We provide Anthropic, DeepSeek, NVIDIA, Meta, I mean, all the models which are published by major players out there. So customers have these functions as SQL functions; similar to a date function, a customer can call SELECT with Cortex COMPLETE, or give Cortex SUMMARIZE a big page of data, and it will summarize that. So yeah, a lot of functionality has been developed recently, and I think it's getting attention, but it still needs to penetrate.

15:17 - D B
Do you work for an organization that is using Snowflake or do you work for Snowflake?

15:24 - H
I work for Snowflake. I'm an employee of Snowflake, and I work with customers, trying to develop solutions for them. So if someone who is on an on-prem system or a database system wants to migrate to Snowflake, or if someone wants to implement an AI/ML capability, they engage with us.

15:49 - D B
Okay. And you mentioned that Snowflake is a SaaS.

15:53 - H
What is SaaS? Software as a service. So basically, we provide a platform as a service where they can run their jobs, I mean, all their analytic pipelines.

16:07 - D B
Okay. I have a question.

16:10 - R S
Can Snowflake be used for streaming data?

16:15 - Multiple Speakers
Right, so Snowflake has its own streaming.

16:19 - H
So we have something called Snowpipe Streaming, which is developed based off Apache Kafka. So we can stream from multiple sources. I don't know if this is small for you, but I can open it. Yeah, so basically we have an SDK published out there, so we can stream data out of either Kafka or any client-side SDK. So this can stream into tables. We can also have some transformation layers as well. We recently acquired a company called Datavolo, and we're also releasing some new features, which will be like Informatica or DataStage.

17:10 - R S
I mean, you can click and drag, so something like that. Well, I assume the name Snowflake was coined because you're increasing the granularity of the data.

17:23 - H
Right. I mean, the initial product was data warehousing.

17:27 - R S
So the Snowflake schema, that's part of it. Yeah, because, you know, there's star schemas and the snowflake is increasing the granularity of a star schema. Then you can have a collection of stars, you get a constellation. Is there any snowflake software that's available to faculty for academic use?

17:54 - H
Yeah, I think Snowflake has free trial accounts for students, and it gives a 90-day period, or actually a 120-day period.

18:08 - R S
So basically, if you have a student account, you can request this.

18:16 - D B
So in a nutshell, you're working with Snowflake, you work for the Snowflake company, you're getting a master's degree in what, information science?

18:30 - Unidentified Speaker
Right.

18:30 - D B
And in a nutshell, in a snowflake, in a nutshell, what are you going to do for your project?

18:39 - H
What's the objective of the project? So my idea here is this: we have all these features, but there is little information on how they can all work together to develop an end-to-end pipeline. A pipeline for data analytics, or, in case there is some real-time data coming out of some streaming pipeline, how can we bring all the features which I listed earlier together to develop a data analytics pipeline?

19:11 - D B
Is this project something that your boss wants you to do for work too?

19:18 - Unidentified Speaker
No.

19:18 - H
Probably I will publish this, once I have it, as part of Snowflake. We have something called Snowflake Summits, which we do every year. So my idea is, once I have this all worked out, I will probably submit this for a summit session, which will be done soon. Oh, okay.

19:38 - Multiple Speakers
And will it be a publicly accessible article? Yes. Right, right.

19:43 - H
So it's a summit where we will have over 10,000 customers visiting us every year, and this will be live streamed across the world.

19:50 - D B
Who's your advisor for your project? You. OK.

19:56 - H
I'll verify that.

19:57 - Multiple Speakers
Am I able to serve on that committee?

20:01 - R S
You want to be on the committee?

20:04 - D B
Do you have a committee yet, H?

20:08 - H
Yeah, I think I would be happy to have more people on that. I think Dr. P and, I think... yeah, I have the committee.

20:20 - D B
Oh, you already have me, Dr. P and Dr.

20:24 - H
M? Right. Not M. I think Dr. R S can go.

20:28 - M M
I like it, but I have so many students.

20:32 - D B
Well, R S is willing to do it.

20:35 - M M
So that's good news. So now you have a committee.

20:40 - Multiple Speakers
I'm kind of looking at it.

20:44 - D D
Is there any graphs? I don't see any graphing capabilities. Is it just not part of it?

20:55 - Multiple Speakers
So Snowflake has notebooks which can basically run any Python packages.

21:01 - H
So you can call Plotly or any other package, whichever supports graphs. We also have... Snowflake brought in the open-source Streamlit, where you can develop a UI, something like this. I may have better examples here. We can develop an application, like a public application, where the Python code uses the Streamlit package to publish data or reports, so you can do it like this.
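Since graphing came up: a tiny Streamlit sketch of the kind of report page H shows. The data frame here is made-up comparison data, purely to illustrate the package; in the real pipeline the numbers would come from the Snowflake results table.

  import pandas as pd
  import streamlit as st

  # Hypothetical comparison numbers, for illustration only.
  df = pd.DataFrame({"model": ["arctic", "mistral-large", "llama3"],
                     "summary_score": [0.81, 0.78, 0.75]})

  st.title("Model comparison (illustrative data)")
  st.bar_chart(df.set_index("model")["summary_score"])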

21:43 - Unidentified Speaker
OK.

21:43 - D D
All right. Any other questions for H?

21:47 - D B
I have some questions.

21:50 - M M
Yeah, wonderful ideas and presentation. But I see that you are using pandas and these libraries. Can you consider, or do you think it would be good to go to, RAPIDS, the accelerated computing libraries for big data? This is one of my suggestions. And another question that I have is: did you try different large language models already? Because, you know, the concept is very general, like you say, but you need so many tasks inside of all of this, even to consider different large language models. Have you experimented with some of them?

22:39 - H
Right. So the project is primarily focusing on developing the pipeline, but I would be happy to develop some models.

22:50 - Multiple Speakers
So the Snowpark ML has its own, I mean, it was developed on a lot of these publicly available APIs.

23:00 - H
So it supports a lot of regular Python packages which are out there for modeling and training and whatnot.

23:13 - M M
So this is what I.

23:16 - H
Yeah, language models. For the large language models, we have this Cortex. So we have pretty much the major ones; if you see, there's Claude, Llama, Mistral, Reka. We are currently supporting a whole lot of them, but my primary focus is on Snowflake Arctic, which is an in-house developed model, and probably I would have the pipeline developed using this as well as one of the well-known models like Mistral or, if not, Llama. All right.

24:01 - M M
Yeah, the final result will be this framework, but you have to have some cases to prove it that it's working, yeah? Right, right.

24:13 - H
So the idea is how to get all these features together to build an analytics pipeline. As part of that, I'm using these Cortex LLM functions, which are published here. So I may not use every model listed on this list here, but I would use Snowflake Arctic and compare that against probably Mistral or, if not, Llama 3.

24:49 - D B
How do you evaluate the performance when you compare?

24:54 - H
I don't have the plan yet, but probably once I have the metrics, I may need to go back and forth on what we can publish, because my idea is to bring this out for the Snowflake customer base as well. So I need to also have some internal discussions on what we can publish versus what we can add. More questions? Yeah, thank you.

25:28 - M M
Oh, awesome.

25:29 - R R
H, fantastic. I really liked what you shared. I think a lot of my questions were stolen by Dr. M. But I'm failing to understand from your presentation what you're trying to accomplish in this project. I see a lot of Snowflake services that you are trying to use, but what I'm failing to understand is: what's the objective? What's the measurement criterion? What outcome are you looking to accomplish? What is the use case? Things of that nature. That way I can put my brain to it and say: okay, this is what your goal is, this is the data you're going to use, this is the outcome that you want, and here is the use case, you know.

26:30 - H
Sure. Yeah, so the idea is, as I mentioned: the current customer base for Snowflake is primarily using SQL analytics. They have data coming from different sources, and they have some SQL-based reports which are running, and they may be publishing to Tableau or Power BI or whatever dashboards, right? So the idea of this project is to bring all these services together to develop an analytic data pipeline, because currently customers may be using just one part of this. We have a majority of customers who are using structured data; I mean, there are a good number of customers who are using JSON, but the idea is to provide a comprehensive data pipeline which someone looking at it can adapt for their workloads. So instead of using a database, an ETL tool, and some external service for their AI or other analytics, I'm trying to develop something which will be like a one-stop shop in Snowflake, where they can run all these analytics. The data is in Snowflake, and there are features which also support their machine learning or Python-based workloads. The idea is to lay out that entire end-to-end pipeline so customers, when they read it, will know, okay, we can bring all these features together, instead of going to some Bedrock or, if not, SageMaker. Currently we have a big customer base who have data in Snowflake but push it to SageMaker for their machine learning workloads. So the idea is to educate customers on how we can get all these features in line so they can develop a daily, weekly, or monthly analytic pipeline.

28:33 - D D
So are you talking like a script? Do you mean you're going to take a script, and from that script you're going to be able to press Go and run all these features at one time, sequentially?

28:48 - H
So basically, there would be a notebook pipeline; Snowflake Notebooks supports that. So all this is developed as part of a notebook.

29:00 - D D
Like a Python notebook?

29:02 - Unidentified Speaker
Yes.

29:03 - H
Snowflake has its inbuilt notebook, which can be scheduled. They can be scheduled based on a cron scheduler or any other. The idea is to have this all developed as part of a notebook which can run on a schedule. Basically, there can be two or three notebooks, for training as well as the regular data analysis pipeline. But everything will be a script. It can be Python as well as SQL, because everything will be part of a couple of notebooks. One can be scheduled, one can be on demand, or something like that. I don't have the final task list.

29:52 - D D
Yeah, that sounds like fun.

29:54 - R R
One of the things that I was thinking, for me to grok this as an outsider: pick one use case using structured data as your data input. Pick Amazon e-commerce or eBay e-commerce and see how they resolve an entity-type definition end to end, and then see the outcome you're looking for from your use case. Doing that for one use case, and then going to another use case using the services you have within Snowflake, would help me understand, customer by customer or domain by domain, how customers use it; and then you can tie it all together for a finale type of scenario. That's what I'm thinking; it would make a lot more sense, and it'll also be helpful for me to see an outcome and your measure of that outcome. What was your baseline, and what is going to be better using Snowflake versus, you know, anybody else's? Yeah, yeah.

31:06 - H
I mean, the basic driver here is, I mean, I previously worked at AWS before joining Snowflake, right? So currently what we are seeing is these data silos. We have data in Snowflake, we have data in some on-prem Oracle, IBM, whatever. And we have services which are also serving these BI capabilities and analytics. On Amazon, we have EMR, which is running the Spark jobs; we have Glue, which is also ETL; we have SageMaker for machine learning capabilities; and now we have Bedrock and so on. So what is happening is, when you have these data silos, especially, let's assume companies have Snowflake, Databricks, Oracle, whatever, we have to move this data across these different services when we have to address these workloads. It can be analytic workloads, it can be machine learning workloads. Customers have to push maybe gigs of data each day across cloud platforms, across services, to get their final reports. The idea of having everything inside Snowflake is that the data is local. If the data is local, my belief is that the compute should also reside next to it instead of shipping data across the network. It improves the overall efficiency, and we don't see all these network costs. It's also much more secure when we are sitting in only a single VPC or a single platform, right? So that's the idea: I was trying to develop a pipeline which can serve all these workloads inside Snowflake instead of shipping out to some external cloud or Azure DevOps or whatever, right? So that's the basic sense of it. Okay, I think you're just getting started.

33:16 - R R
You have long ways to go, yeah?

33:19 - H
Yeah, yeah.

33:20 - Multiple Speakers
I think you have a great team of advisors. They'll be able to help you to zero in.

33:30 - H
Thank you for presenting.

33:33 - D B
All right. Thanks a lot. I appreciate your time. All right. Well, thank you again.

33:41 - Unidentified Speaker
Let's see. Let's go back to the agenda. Okay.

33:47 - D B
So next week, B H, B H, right, B? Yep, I'll be ready. All right.

33:53 - B H
It sounds like getting a couple of slides together might be beneficial, so. Well, it's totally up to you.

33:59 - D B
I mean, I'm not pressuring people to, you know, have to highly prepare, but of course, you know, if you want to, then it's more of a rehearsal for the real thing when you do your defense. It's totally up to you. We had people do it very informally, just share their actual document, proposal document and kind of step us through it. So it's totally up to you.

34:24 - B H
Yep. Understood. Okay. Let's see.

34:26 - D B
So other things on the agenda. So we have a couple of master's students who are using AIs to try out the process of writing a book, or an equivalent informational website, using AI. And the intent was that they would do this and identify the problems that occur, what this process is like, and how it can go right or go wrong. So I thought each of those students could give us a moment of update on what they're doing and what problems they're encountering, if any, and we can try to help them out. So I see L is here and E is here, so why don't one of you start off with your recent activities? OK, I guess I'll go. So recently, what I've been working on is two parts.

35:31 - L G
So the first part is more procedural. So I didn't know if, you know, I need to go back and kind of put together my research proposal. So that's what I've been working on right now. I did test out CrewAI. I had some real difficulties. It did do it; like, it would do it, but it would get kind of stuck or repeat. Hello? Yeah.

36:01 - D B
Oh, for some reason, my screen went black. Sorry about that. OK. It would repeat kind of over and over again. It kind of gets stuck in a loop.

36:10 - L G
You know, so I start a prompt; it writes a certain section on retirement. So we work it down to chapters and sections. And then it's kind of communicated back and forth between two different AIs, and we kind of get stuck repeating information that was previously there. So experimentally, I'm not sure if that's going to work or not. And I've yet to get LangChain working right for me, so I'll probably need to reach out to some other students for help. What are you going to do about this problem where the AI keeps giving you the same stuff? Well, one idea that I had: right now, we have one AI set up, like, an outline, and then the second AI responds to it. And I thought the contextual window would be long enough for it to know what it's already said. I don't think it's working that way. One idea I had was to switch the prompting, like to change how the prompts work between the two. So right now, it's kind of just prompting with basic text information, but making it a more detailed prompt, including the previous information, to see if that worked. But I couldn't get it to work that well programmatically. So you're not using the user interface chatbot; you're programming it. Yeah. Yeah. I'm not using, like, me telling it directly; I'm using CrewAI, like, you have one agent, let's say ChatGPT, come up with something and send it to a different agent to kind of work through some of the things. And that's where it's kind of created this loop.
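One common fix for the repetition loop LG describes is to give both agents the same running history, so the writer always sees what has already been produced. A minimal sketch with the OpenAI Python client; the model name, prompts, and three-section loop are illustrative, not LG's actual setup.

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  def ask(system, history, user):
      # Every call replays the shared history, so no agent "forgets" what
      # the other one already wrote.
      messages = ([{"role": "system", "content": system}] + history +
                  [{"role": "user", "content": user}])
      resp = client.chat.completions.create(model="gpt-4o-mini",
                                            messages=messages)
      return resp.choices[0].message.content

  history = []
  outline = ask("You are the outlining agent.", history,
                "Outline a short book on how to plan for retirement.")
  history.append({"role": "assistant", "content": outline})

  for i in range(3):  # three sections, for brevity
      section = ask("You are the writing agent.", history,
                    f"Write section {i + 1}. Do not repeat anything that "
                    "has already been written above.")
      history.append({"role": "assistant", "content": section})
      print(section)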

38:01 - Multiple Speakers
Okay. I'm sorry. Yes, sir. Yeah. Do you have any questions? Well, the questions that I had were more procedural.

38:09 - L G
So I just wanted to know if maybe you and E could meet just to make sure we know how to move forward. Do you want us to present here, to present our research proposal here, to move forward with the process?

38:25 - D B
OK, the next step, and this applies to E as well, is to write a proposal document explaining your proposed timeline, what you expect or plan to do, and some discussion of related work, like what you can find about other people who have tried writing books using AIs. And that's a document, and it's got to be approved by me and your committee. And I would include what you've got so far on your book as an appendix to the proposal, even though it's only partially done. OK. Does that answer your question? Yes, sir.

39:11 - L G
I'm going to have to log out and log back in, though. My computer is locked up somehow.

39:21 - E T
OK. And we'll go on to E. Hello. So last week, I talked about ChatGPT creating a 500,000-word document, right? So after the meeting, I was kind of suspicious and was like, is it actually 500,000 words? So I checked that document and asked ChatGPT to count the number of words, and it wasn't even near that. So what I realized is, even if ChatGPT says that it's creating this many words of a document, it actually is not. I asked ChatGPT to expand it and add some other parts, to make it closer to at least 20,000 words, but it failed. I also asked ChatGPT to analyze the document for redundancy and for logical parts going together. It analyzed some parts and showed me the redundancy; unfortunately it couldn't fix those parts, but it highlighted the parts where the redundancy happened. So, as a next step, I'm planning to try the same thing on Gemini, because apparently ChatGPT failed to get even close to my word target. I'm planning to change the prompts as well; maybe searching for different prompts will help get closer to the word target.
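ET's finding, that the model's self-reported word counts are unreliable, is easy to guard against by counting locally instead of asking the model. A trivial sketch; "draft.txt" is a placeholder file name.

  # Count whitespace-separated words in a generated draft, rather than
  # trusting the model's own claim about its output length.
  def word_count(path):
      with open(path, encoding="utf-8") as f:
          return len(f.read().split())

  print(word_count("draft.txt"), "words")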

41:20 - Unidentified Speaker
Okay.

41:23 - D B
You know, these AIs have limits, right? If you gave one a 500,000-word document, even if it ingested it, it wouldn't remember what was at the beginning of the document, because it has a maximum length that it will remember back to. And I don't know what that is for ChatGPT; I remember claude.ai used to be 80,000, but these things change on a daily basis. So I don't think you're ever going to get one of these AIs to just... you're not going to be able to say "write a book about X" and have it write the entire thing, because there are too many words. Although, who knows, I could be wrong about that.

42:13 - D D
Anyone else have any comments or thoughts? Have you thought about trying to talk to it with, Instead of words, tokens?

42:24 - E T
I haven't. I can try that.

42:28 - H
Yeah, I think GPT-4 has 32,000 tokens of context. How many?

42:38 - D B
32,000 for GPT-4.

42:41 - H
I think the public version may have fewer tokens. But I think four characters are counted as one token, if I'm not wrong.
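H's four-characters-per-token rule of thumb can be checked locally with the tiktoken package. An assumption here is the cl100k_base encoding used by GPT-4-era models; "draft.txt" is again a placeholder.

  import tiktoken  # pip install tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  text = open("draft.txt", encoding="utf-8").read()
  n_tokens = len(enc.encode(text))
  # Compare the rough 4-chars-per-token estimate with the real count.
  print(f"{n_tokens} tokens, {len(text) / n_tokens:.1f} characters per token")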

42:54 - Multiple Speakers
Yeah, there's always the question of, you know, if you'd use the paid version, would you get better results?

43:04 - E T
Well, I do use the paid version, but it still fails.

43:09 - D B
Yeah. So hello, everyone.

43:11 - J O
You hear me? Yes.

43:13 - D B
So I have a question.

43:17 - J O
or what file data are you putting into ChatGPT? Because if something is specialized, it also wouldn't work, because the training data is very broad; it's not specialized. Even with the higher $200 premium membership, it's not specialized. So that's why ChatGPT now has an option to create your own GPTs based on what you want; you can train them and model them on what you have, specifically, in your files. I was trying to do something with my chemical background, and it doesn't make any sense, because it's only trained on reading and most of the cognitive stuff, not actually chemistry, engineering, that type of stuff. So the content you feed it, I think, is also a limitation there.

44:13 - M M
This is a correct point, J, and this is why we have RAG systems. A RAG system can give you the opportunity to connect with your data set and the particular documents that are appropriate for the task: retrieval-augmented generation. And at NVIDIA, we have wonderful courses for RAG, particularly in prompt engineering. So check the courses; they are available to you for free.
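A bare-bones illustration of the retrieval-augmented generation idea MM mentions: embed your own documents, retrieve the closest chunk for a question, and hand it to the model as context. It uses the OpenAI embeddings and chat APIs; the model names, document chunks, and single-chunk retrieval are simplifications for the sketch, not a production design.

  import numpy as np
  from openai import OpenAI

  client = OpenAI()
  docs = ["Persimmon trees prefer well-drained soil...",   # your own
          "Seedlings should be hardened off before..."]    # domain text

  def embed(texts):
      resp = client.embeddings.create(model="text-embedding-3-small",
                                      input=texts)
      return np.array([d.embedding for d in resp.data])

  doc_vecs = embed(docs)

  def answer(question):
      q = embed([question])[0]
      # Cosine similarity against every stored chunk; keep the best one.
      sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
      context = docs[int(np.argmax(sims))]
      resp = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[{"role": "user",
                     "content": f"Using this context:\n{context}\n\n"
                                f"Answer: {question}"}])
      return resp.choices[0].message.content

  print(answer("When should I plant persimmon seedlings?"))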

44:53 - D B
E, do you have any questions you want to get any feedback on at this point?

45:02 - E T
Thanks so much for the feedback so far. I'm always open to feedback, honestly, because any sort of guidance would help me a lot to get to my goal. OK. Hey, A, this is B.

45:20 - B H
Real quick, something you might want to try when you're prompting it: sometimes, as Dr. B mentioned, it runs into word limits and word counts, but if you're trying to have it kick out something like 5,000 words or more, what you can do is break it up. Say, hey, write a 5,000-word book, but give me the first 1,000 words first, and then tell it that you're going to prompt it each time for an additional 1,000 words. And it actually works pretty well with that. I've written some pretty long programs, in the thousand-line range, and had it debug them.

46:11 - D D
And it actually did a pretty good job with it.
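BH's chunked-continuation tip, written as a loop. A sketch with the OpenAI Python client; the model name, word targets, and five-round count are illustrative.

  from openai import OpenAI

  client = OpenAI()
  messages = [{"role": "user",
               "content": "Write a 5,000-word guide to growing vegetables "
                          "from seed. Give me only the first 1,000 words; "
                          "I will ask for the next 1,000 each time."}]
  parts = []
  for _ in range(5):
      resp = client.chat.completions.create(model="gpt-4o-mini",
                                            messages=messages)
      chunk = resp.choices[0].message.content
      parts.append(chunk)
      # Feed the model its own output so the next chunk continues rather
      # than repeats.
      messages.append({"role": "assistant", "content": chunk})
      messages.append({"role": "user",
                       "content": "Good. Continue with the next 1,000 words."})

  print("\n\n".join(parts))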

46:17 - Unidentified Speaker
Wow.

46:18 - D B
Thank you for that.

46:20 - Unidentified Speaker
Sure.

46:20 - E T
I'll try that. I tried it, not with a thousand or 5,000 words; I think I tried it with 10,000. It still failed for me, but I'll definitely try that as well, asking for a smaller number of words. Maybe it'll get closer to the target.

46:39 - D B
I've seen these AIs just, you know, you ask it to give you a lot and it'll just, it'll start giving it to you, but then it'll just stop in the middle. Well, yeah, and that's the point.

46:55 - B H
So then what you say is, OK, that was great to this point. Can you continue writing it? And then say, give me another 5,000 words or 1,000 words.

47:08 - D B
OK, cool. All right, any other comments? Anyone on either L or E's projects? Are you going to create a paper?

47:18 - R S
to work for the committee? Yeah, I think I'll advise them to get a little further on the document before they worry about the committee.

47:28 - Multiple Speakers
But if you're willing to be on their committees, that'd be great.

47:32 - D B
And I'll be on it. And they need to find a third person.

47:37 - R S
Yeah, because I talk about Snowflake in one of my classes. We don't go into much depth. It's obviously he's done. Obviously, it's something that I have interest in.

47:50 - D B
Yeah, well, I mean, L and E are not using Snowflake, but they are writing books or the equivalent.

47:58 - R S
OK, anything else? Let's see.

48:00 - D B
I just found out that the computer science department is proposing a new freshman level artificial intelligence course. Are you going to be teaching that, M? I have no idea, but it’s a good idea.

48:17 - Unidentified Speaker
Somebody is teaching.

48:18 - D B
I think, not today, but maybe next time or sometime when we have more time, together we'll just go over the description in the syllabus a little bit, just to see what they're doing. When will they start? That was the plan, I think, I heard. We have a new machine learning for beginners. I mean, you know, just like a few years ago, all the programs started getting heavily into data science, having data science courses and data science certificates and even changing the names of the degree programs. The same thing is happening with AI. And I've never heard of a freshman-level AI course before, but nowadays it's different, and now they need it. OK. Yeah, that's crazy, isn't it?

49:15 - D D
Is it a prompting and prompting course? I think prompting is going to be in it.

49:22 - D B
We'll take a look at this. Maybe next week we might have a chance to look through this and see what we think.

49:30 - R S
I thought H was working directly on Arctic, on using Snowflake? Yeah, he is.

49:36 - D B
Yeah, OK. What I'm saying is I'm willing to serve on his.

49:41 - R S
Oh, CrewAI even.

49:43 - M M
It's good. Yeah, OK.

49:45 - Multiple Speakers
Well, all right. Oh, they jumped to, yeah.

49:49 - M M
OK, well, I guess we're sort of at the end of our time.

49:54 - D B
And we'll go ahead and adjourn. And I'll see you next week for B's presentation, and we'll continue on from that. I guess next week we won't look at the course unless we have time, but the following week we will do that if we don't get to it next time.

50:16 - R R
Could we compare that with Dr.

50:18 - R S
M's course as well, just for my understanding? Can you just take this one?

50:23 - D D
Thanks, guys.

50:24 - M M
How to compare? We're doing 7,000. This is 1,000.

50:27 - R R
Ah, okay.

50:29 - D B
So 7,000 is advanced graduate level. 1,000 is freshmen. OK. OK. Yeah.

50:35 - R R
That's different. OK.

50:36 - M M
Pretty soon they're going to teach it in high school.

50:41 - D B
Oh, that's what it is.

50:43 - R R
That's what 1,000 means. OK, got it. So the number differentiates the depth of the concept.

50:50 - Multiple Speakers
Exactly. Got it. Sounds good.

50:53 - R R
OK, thank you.

50:54 - M M
But the instructor put in stuff that is more advanced. Multi-agent systems is an advanced topic, but anyway, they will present it in a way that they can.

51:05 - Multiple Speakers
Yeah. I think freshmen can learn how to use prompts more effectively, you know.

51:11 - M M
Yeah, they can. They can, of course. So, of course, everybody is using them right now.

51:18 - R R
So, it's an introduction. Yeah. Yeah. It's good to have it.

51:22 - M M
Yeah. People need to learn. Right. Yeah. Send your emails, new people that are here, please. Yeah, welcome to everybody: some people from your class and some people from the seminar this morning. Thank you.

51:37 - R R
Yes, please send your email to D.

51:40 - D B
Yeah, if I haven't communicated with you about this, I don't have your email on the calendar reminder, so you wouldn't hear about it next week. But if you let me know, I'll add you in. Maybe you could put your email in the chat right now, if you like, and I'll be happy to add you to the list.

52:05 - R R
I think E added me earlier this morning.

52:07 - M M
Okay. Yeah. I did add you.

52:09 - R R
Yeah, I did add you. Okay. Yeah. Okay. Thank you.

52:12 - Unidentified Speaker
Yeah.

52:12 - D B
He gave a list of people and I added them, but some other people were not on it; you know, they were from Dr. M's class.

52:20 - Unidentified Speaker
Perfect.

52:21 - R R
Great opportunity for me.

52:22 - Multiple Speakers
to kind of get to listen to you all and kind of get in the guts of technology if need be, or as applicable.

52:33 - D B
Great. All right. Well, thanks again, everyone.

52:37 - Multiple Speakers
Sorry. See you next time. I have posted my Snowflake email by mistake.

52:43 - M M
But yeah, I posted my UALR email. OK, good. All right.

52:48 - D B
Bye, everyone. Thanks, all.

52:50 - R R
Thank you.

52:51 - Unidentified Speaker
Bye.

53:02 - J O
Thank you, Dr. L. Bye-bye.

 
