Gradient Dissent

Evaluating LLMs with Chatbot Arena and Joseph E. Gonzalez

Tuesday, December 17, 2024 · 55m

What You'll Learn

  • Chatbot Arena is a platform created by Gonzalez's team to evaluate and compare state-of-the-art LLMs, allowing users to chat with and vote on different models.
  • The team has been studying the concept of 'vibes' - quantifiable metrics that capture the style and behavior of LLMs, beyond just their accuracy.
  • Factors like formality, conciseness, and formatting are found to be important in determining user preferences for different LLM outputs.
  • The team's work on memory management and question-answering over tabular data is also discussed as an exciting area of research.
  • The episode highlights the importance of looking beyond just model scores and considering the nuances of how LLMs communicate and interact with users.

AI Summary

The episode discusses the research work being done by Joseph E. Gonzalez and his team, with a focus on evaluating large language models (LLMs) using Chatbot Arena, a platform they created. They explore topics like model vibes, conciseness, and the importance of considering factors beyond just model accuracy when evaluating LLMs. The conversation also covers Gonzalez's startup RunLLM and the team's work on memory management and question-answering over tabular data.

Key Points

  • 1Chatbot Arena is a platform created by Gonzalez's team to evaluate and compare state-of-the-art LLMs, allowing users to chat with and vote on different models.
  • 2The team has been studying the concept of 'vibes' - quantifiable metrics that capture the style and behavior of LLMs, beyond just their accuracy.
  • 3Factors like formality, conciseness, and formatting are found to be important in determining user preferences for different LLM outputs.
  • 4The team's work on memory management and question-answering over tabular data is also discussed as an exciting area of research.
  • 5The episode highlights the importance of looking beyond just model scores and considering the nuances of how LLMs communicate and interact with users.

Topics Discussed

Large Language Models (LLMs), Chatbot Arena, Model Evaluation, Vibes, Conciseness, Memory Management, Question Answering

Episode Description

In this episode of Gradient Dissent, Joseph E. Gonzalez, EECS Professor at UC Berkeley and Co-Founder at RunLLM, joins host Lukas Biewald to explore innovative approaches to evaluating LLMs. They discuss the concept of vibes-based evaluation, which examines not just accuracy but also the style and tone of model responses, and how Chatbot Arena has become a community-driven benchmark for open-source and commercial LLMs. Joseph shares insights on democratizing model evaluation, refining AI-human interactions, and leveraging human preferences to improve model performance. This episode provides a deep dive into the evolving landscape of LLM evaluation and its impact on AI development.

🎙 Get our podcasts on these platforms:

  • Apple Podcasts: http://wandb.me/apple-podcasts
  • Spotify: http://wandb.me/spotify
  • Google: http://wandb.me/gd_google
  • YouTube: http://wandb.me/youtube

Follow Weights & Biases:

  • https://twitter.com/weights_biases
  • https://www.linkedin.com/company/wandb

Join the Weights & Biases Discord Server: https://discord.gg/CkZKRNnaf3

Full Transcript

You're listening to Gradient Dissent, a show about making machine learning work in the real world, and I'm your host, Lukas Biewald. Today I'm talking with Joey Gonzalez, a leading AI researcher at Berkeley with a real focus on LLMs. He's the founder of Chatbot Arena, which is the main place that people in the industry go to see the differences between the top LLM models. He's also the founder of RunLLM, a product that many people use for technical documentation. He's a world expert on building production LLM systems and an expert on how to evaluate models live in production. And so this is a really practical, interesting conversation that I hope you enjoy. Maybe we start with the question of what's the research that's most exciting to you right now, and then we'll go into the stuff I've prepared. Cool. So hi, I'm Joe Gonzalez. I'm happy to talk about some of the research we're doing in my group. I am fortunate enough to have a really cool group of students, so we're doing a lot of fun things. The Chatbot Arena is one of the big projects in my group, thinking about how we evaluate LLMs in the wild. I can tell you the story behind it, which is kind of funny, and how it came to be where it is today, which is pretty influential, actually, for a project that started almost as an accident. I can talk about some stuff. We've been looking at tool use with Gorilla, thinking about how to evaluate agents. We've been looking in my group at memory and how LLMs manage their memory, their context, over time. We've been doing some really exciting recent work around working with large tables and being able to ask questions. So, I'd like to talk to my data, which sounds like text-to-SQL, but I think there's a lot more to it. Often being able to make sense of what's in my table requires AI to look at the rows themselves, and how we do that, how we make that cost effective, has been a lot of fun. One of my students, a big Weights & Biases user, actually, has been looking at vibes, which is, like, crazy to be doing research on vibes. But in 2025, that's what we do. And so the idea is, can I look at the output of an LLM and get a sense of the vibe of that LLM quantitatively, with confidence intervals, and then use that to figure out, you know, is this particular vibe going to evoke a positive response from users or a negative response, which we find to be situation dependent, which is interesting. So different vibes for different settings. Doing a little bit with vision. So a lot of fun stuff. I was excited to talk about vibes, actually. I love that you have a Substack. I've always kind of felt like, you know, the research paper format doesn't quite work for a lot of the interesting, you know, stuff that's going on. So I was enjoying your articles. And I thought one of the most compelling ones, which I totally agree with the conclusion, so maybe that's dangerous, maybe it's confirmation bias here, but I like, what's the title? It's a great title. It's like, in defense of vibes-based evaluation. Yes. Which I love. I love it. I love it. Because I actually think, you know, I was thinking about it. Like, I remember when I was working in ML, I had a boss that was like, I like to hire ex-biologists, not ex-physicists. And I was like, that's really interesting. Why? And he's like, well, physicists are trained to sort of just look at the data and the graph, and then biologists look at the individual examples, which is sort of like, you know, he liked vibes-based evaluation.
He's like, you know, in ML, you know, we spend all this time kind of looking at the aggregate data and we don't dig into the individual examples and see what's going on. And I always felt that a lot of the early success I had in my career was from actually looking at individual examples and just getting a feel for it. And then, you know, people really started to act like this, you know, testing by vibes, is kind of stupid, because I think a lot of new people came into the field and your first thought is, like, testing by vibes: you're just looking at the data, how does it feel? And I actually think that that is an incredibly good perspective of, like, let's not forget to not just look at, you know, F1 score or something, and actually look at, okay, what's going on? But then you did a detailed, you know, analysis of where it works and doesn't. So maybe you could just summarize the results and we could talk about it. Yeah, so we've had some tension here in my group. And I will say coming to this, I had a very machine learning perspective. And again, my students were very used to coming to my group meetings and, like, let's look at the actual data, which I will again say, they're Weights & Biases users, used to zooming in and, like, let's look at the pictures where it failed. Like, oh, interesting. I see. So those students were pushing me pretty hard. Like, Joe, we've got to get beyond thinking about just the score or the Arena ranking. The style in which it answers really matters. And my startup, it's funny, also came to the same conclusion. We had a startup, RunLLM. We're building AI agents to help people solve problems. And solving a problem is part of the process, but how it does it really matters. Like, people don't like it when it's overly preachy. People don't like it when it's broadly speculative. They want short, clean answers. And if it doesn't know it, just say, I don't know, and, you know, point me at a reference. And so that sense that it's more than just being correct, that it is about the behavior, the style, the interaction, it's neat. It's not where I want it to be. So coming into this, I'm like, what is a vibe? And my student, Lisa Dunlap, who kind of pushed this idea very effectively, worked out a definition, which I like. You know, a vibe is something that helps me understand the style of how a model communicates. It should be something that others could understand. This is a very formal vibe or a very friendly vibe. The vibe is it likes to give examples, give narratives along with whatever it says. It should be something that helps you differentiate models. Like, this model, Claude, is better at giving these examples. And something that also aligns with preference. So, you know, sometimes I want lots of examples. Sometimes I just want the answer. And, you know, being able to sense in different settings when that vibe is important. And what my students have found is you can use LLMs to compare and contrast various conversations to start to extract likely vibe candidates. Given vibe candidates, you can start to use metrics and the LLM-as-a-judge approach to sort of model how that LLM aligns on a particular vibe. I mean, we can use this across many different LLMs, different samples, to get an estimate and even confidence on how much a model favors a friendly explanation with lots of formatting or examples versus a more terse, short, concise explanation.
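To make that concrete, here is a minimal sketch of the kind of pipeline described here: one LLM call proposes candidate vibe axes from paired outputs, and an LLM-as-a-judge call scores how strongly each answer exhibits a given vibe. This is an illustration, not the group's actual VibeCheck code; the OpenAI-style client, the judge model name, and the prompts are assumptions.

```python
# Sketch only: propose "vibe" axes from paired model outputs, then score them.
# Assumes an OpenAI-compatible client; "gpt-4o" and the prompts are placeholders.
from openai import OpenAI

client = OpenAI()
JUDGE = "gpt-4o"  # placeholder judge model

def ask(prompt: str) -> str:
    out = client.chat.completions.create(
        model=JUDGE, messages=[{"role": "user", "content": prompt}]
    )
    return out.choices[0].message.content.strip()

def propose_vibes(pairs, n_axes=5):
    """pairs: list of (prompt, answer_a, answer_b). Returns candidate vibe axes."""
    sample = "\n\n".join(
        f"PROMPT: {p}\nMODEL A: {a}\nMODEL B: {b}" for p, a, b in pairs[:10]
    )
    text = ask(
        f"{sample}\n\nList {n_axes} stylistic axes (e.g. formality, use of examples, "
        "markdown formatting) on which these two models consistently differ. "
        "One short phrase per line."
    )
    return [line.strip("-• ").strip() for line in text.splitlines() if line.strip()]

def score_on_vibe(answer, vibe):
    """LLM-as-a-judge: 1-5 rating of how strongly an answer exhibits a vibe."""
    reply = ask(
        f"On a 1-5 scale, how strongly does this response exhibit the style "
        f"'{vibe}'? Reply with a single integer.\n\nRESPONSE:\n{answer}"
    )
    return int(reply[0]) if reply[:1].isdigit() else None
```

Averaging score_on_vibe over many responses per model, with bootstrap resampling, is one way to get the per-model estimates and confidence intervals mentioned here.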
And so, yeah, this idea that correctness is only a small piece of the story is pretty neat. And it gets back to, I guess, human preference, that human preference is about correctness, but it's about a whole lot more. And vibes give us a tool to understand that a lot more. And any practical results? Can you describe the vibes of the various models? So Llama is a funny model. I think the best result I've gotten so far is the Llama 3 latest generation models, they're much more friendly. They're entertaining. They're less formal. And we've compared to, like, ChatGPT, GPT-4o, it's more formal. It likes longer explanations. We've seen formatting is pretty critical. People care a lot about formatting in the answer. So using Markdown, using, you know, LaTeX when needed, seems to be important. So formatting is predictive of human preference. Formality is predictive of human preference, but it's very situation dependent. Some people want a friendly answer and that's preferred. In some situations, I want, again, a concise, formal answer. I don't want to joke around. So yeah, it's pretty neat. There's a paper, VibeCheck, with more on this. What about conciseness? I feel like, you know, my experience is, like, OpenAI is a little less concise than the other models. And I can kind of feel it when people, you know, write to me and they've obviously used it. I feel like the tell to me is being more verbose than necessary. Is that, is my vibe accurate? Your vibe is accurate. And we found that, in fact, one of the early kind of inclinations in the vibe work was Lisa would go, you know, Joey, I can guess which model I'm talking to in the Arena. It's got a vibe. It's got a sense to it. And when it's giving me the most verbose answer to a simple question, that's probably GPT-4. And it's true. There is this implicit signal. It differentiates models very effectively. And in our startup, we find that giving a concise answer is preferred. When I may be trying to explain something to a student, if I'm using AI to come up with an explanation, having a long, thought-out process can be helpful. I think it connects to another trend that we're seeing in AI today, which is kind of interesting: this movement towards test-time compute, which I think is a silly term. It's prediction-time compute, but in the AI world, we only have training and testing. So test-time compute it is. But the idea that if I think more out loud through my process, this chain of thought, but pushed to the extreme, that I will get ultimately to a better answer. And I think the folks at OpenAI are pushing that perspective hard. And it is leading to, in many cases, better final answers, but it also leads to a very drawn-out explanation. Oh, I see. So the length of explanation is helping the accuracy. That is actually, right, I got you. It's a behavioral trick, and it's a good one. Like, teach the model to think through what was the question you asked. Let me restate your question for you. Like, I don't want to hear it, but yeah, let me restate your question. And how should I think about it? Here's how I'll think about it. And then here's your answer. And that process is important for accuracy, but it comes at the cost of the vibe of being concise. All right. Well, I was hoping to hold off on Chatbot Arena to save something for the end, but I feel like that ties so directly into Chatbot Arena. So maybe start for people that don't know what it is, who've been living under a rock, like, what it is, and tell me the story of it.
And then I want to get into how it ties into the vibes stuff. Maybe I'll tell you the story behind Chatbot Arena. Tell me the story. I love it. Yes, please. So let me give some brief context. Chatbot Arena, very, very wildly successful. I'm lucky to have a project like this in my group. Totally. It's a service that we've created where anyone can go and chat with the state-of-the-art models today, open source and commercial, and some of the models that aren't even available on the commercial website. So, you know, the next generation of OpenAI and Google's models. And you get to talk to two models at once and, you know, ask your question. You get two answers for free and you can choose the better of the two, and please do vote. And when you vote on which one's better, we use that information to build a leaderboard that ranks models. And we do really cool stuff now where we can take those conversations and say, you know, you were asking about homework. It looks like math homework. We can now segment the feedback you give us into different categories. So we can say, look, these models are better at math homework, but not so good at, you know, writing a funny story. And so that kind of segmentation allows the AI community to better understand where models are headed. It allows the user community to go, you know what, if I want a funny story, I think I need Llama 3, and to get a better sense of, you know, where models rank on different kinds of tasks. And it's helpful for a lot of the big model developers because it's hard for them to run these studies across all the commercial models. And the same for the open source communities. We develop models at Berkeley and at other places. We want to know how those models compare. And getting this kind of broad community feedback, if you will, kind of the Wikipedia of model evaluation, is important. And it helps kind of level the field. In fact, I'm getting ahead of myself. One of the things that we need to do better at with the Arena, and it would be great to get more help on, is getting it out to more people. I think the technology world has picked up on the Arena, which is great. But we need the rest of the world to give feedback too, because, you know, getting a perspective that broadly represents, you know, the human perspective is important to us. Totally. Now, the Arena story is kind of funny. It didn't start out with such a grand ambition. Long ago in 2023, we had two models coming from Berkeley and one model from Stanford. And, you know, as good grad students do, we wanted to know, is our model better than their model? And so we had started this project to evaluate the Vicuna model from Berkeley, the Koala model, also from Berkeley, and the Stanford Alpaca model. And we had actually started using GPT-4 to evaluate the models. We found out that, thanks to GPT-4, the Vicuna model is slightly better than the other two. And so we launched a website with our really cool model and a way to chat with it. It was a little single chat box, and it was neat. And some user had a great question: hey guys, can you put two side by side so we can put Stanford and Berkeley side by side and chat with both of them? And so we did that. And I think at the last minute, at the end of the meeting, we're like, hey guys, what if we had it so it was just random too? You could also just chat with two random models. We wouldn't tell you which ones. You could pick which one was better. Call it battle mode or something. Nice. Yeah, that'd be funny. Let's put it on the website. So we put it on the website.
And people used it. And two weeks later, we had enough data to actually build a ranking. And we argued about how we should do it: Elo, Bradley-Terry, all these techniques. We just picked one. We put it out. And people were like, wow, this is a great way to just look at human ordering of models. We started adding commercial models. And they were like, oh, wait a second. I can now run GPT-3, GPT-4. Vicuna is moving up or down. And it became a place that people came to see where models are moving. Again, also to chat with models for free. So we're giving some value back to the people who work with the service. And we also say, when you use this, you're giving data to the world. So that data you contribute, your conversations, will potentially be opened up someday. And we have open sourced about a million already. We do a lot of work to try to remove anything you accidentally put in that conversation that could identify you or other people. But this is an opportunity for us to build a data set that will help the open community build the next generation of chat AI. So it's not just owned by big tech. So it's become a bigger mission, but it started out with a pretty small, like, wouldn't-it-be-fun story? Totally. Okay. And so, I mean, what have you learned? I actually didn't know that you were able to categorize the content. And where do I find that? I somehow have missed it. Yeah, so if you go to the Chatbot Arena, which is now lmarena.org, I think, we've moved the domain to make it easier. Is that .ai? It looks like .ai. Yeah, yeah. Well, yeah, .ai makes more sense. Yeah. So if you go to that domain, you can chat. If you go to the leaderboard, there's a little pulldown on the leaderboard that lets you pick. Maybe I'll bring it up so we can take a look at the latest results. Oh, categories. Sorry. I don't know how I missed this. Cool. What's kind of neat is it changes a lot and pretty quickly. So we're kind of surprised at how things move over time. But if you pick a category like math, you can see how rankings move up or down. o1-preview does better. If you go to, let's say, coding, all of a sudden, I think o1-mini does better. So in any one of these categories, we're seeing differences in models. We build these bootstrap confidence intervals, so we can see that they're really statistically different based on the votes that we've collected. If you look at different languages, some models are better in other languages than others. So we see a lot of variability. The other thing that's been kind of exciting is the open source models have come up really quickly. Like, you know, a while ago, I was like, man, these will never succeed. It's so hard to compete. And they have, and they've moved up. So, yeah, the open world is also really competing on the leaderboard. So we started adding a vision arena. So we'll have a vision leaderboard soon. And you'll see how, when you start to introduce images, you get different rankings of models. Pretty interesting. Yeah. So a lot of cool stuff is happening now. And so, okay, going back to vibes and style, I would imagine that, you know, the style of response, you know, for just a human side-by-side evaluation, is incredibly important to which one feels better. So how do you control for style? Ah, so we've started to do this on the leaderboard itself. And it's kind of neat. If we go back to this Bradley-Terry model, how we build the rankings. So it's kind of a neat mathematical question. You have an A and a B. Which one's better?
You know, we say A is better, B is better, A is better, and so on. So we get just these binary votes. And we use a model that's kind of like logistic regression. Right now, in the simplest sense, there's just one feature for model A, one feature for model B. And then, you know, if model A wins, it's a plus one, and if model A loses, it's a minus one. That's a logistic regression problem. The weights attached to each of those features are the rankings on the leaderboard. We scale them so they look like Elo numbers for the chess enthusiasts out there, but it's essentially a logistic regression coefficient. You can add another term, which is like the presence or absence of formatting for model A or model B. And then we can start to see how those weights explain away some of the value of model A. And so we extend the logistic regression model, the Bradley-Terry modeling framework, to account for these extra factors. Oh, interesting. So that lets us account for it, yeah. It's not like the style of writing. It's like, is there, it's like the style of formatting. Yeah, so right now when we look at style, we're looking at formatting, use of markdown, length. We started to control for length. So some people like longer or shorter stories. Is the length what's really determining the ranking or not? And so we started to add control for that. In fact, I think you can choose style control right now, which I believe is just kind of markdown. I guess we don't have length control in here yet. But yeah, we're adding support for breaking down things like length. Interesting. Longer queries. Yeah, it's not up yet. From your modeling, do you have advice, like, if OpenAI wanted to win the side-by-side comparison, should they go longer? Should they use more formatting? It's a good question. So the interesting point is it depends on the context, right? And so if I'm looking for an essay or a story, I might want longer. If I'm looking for a code snippet, I might want shorter. It's very context dependent, the style or vibe factors. But in your logistic regression, do you have second order variables? Great question. So right now, the way we approach logistic regression is we do one calculation for all things. That is the Arena overall score, which in some sense is silly in retrospect, since there's a lot of variability within a score. But people want to know, what's the ranking, just on average, of all conversations. But we also now, if you say coding, for example, we just throw away all the conversations that didn't mention coding or didn't involve some coding component, which we use another model to assess. Is this a coding conversation or not? And then we just look at those conversations and recompute the leaderboard score with style control on and off. And you can see, if I add style control on, how does that affect the rankings? And I think actually it came out when Llama came out, the Llama 3 model, people were like, oh, Llama's winning, but it's because it's so good at markdown. That must be it. It's like, oh, maybe we need to control for that. And this is why this emerged. Yeah, it's funny. I mean, I feel like on one hand, it's like maybe that's cheating. On the other hand, it's kind of giving people what they want, at least in a context where they're choosing between two things. It almost feels like useful product feedback for a model. Yeah. Certainly, if you were designing a model today, you should incorporate style if you're doing that. Do think about how you present your answer.
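As a rough illustration of the Bradley-Terry-as-logistic-regression setup Joe describes, including one extra style covariate that can "explain away" part of a model's advantage, here is a small sketch. The battle-record format, the length feature, and the Elo-style scaling constants are assumptions for illustration, not the Arena's actual code.

```python
# Sketch: Bradley-Terry ratings via logistic regression, with a style covariate.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_leaderboard(battles, models):
    """battles: dicts with model_a, model_b, winner ('a' or 'b'), and a style
    feature such as len_diff = len(answer_a) - len(answer_b)."""
    idx = {m: i for i, m in enumerate(models)}
    X, y = [], []
    for b in battles:
        row = np.zeros(len(models) + 1)
        row[idx[b["model_a"]]] = 1.0      # +1 for the model shown as A
        row[idx[b["model_b"]]] = -1.0     # -1 for the model shown as B
        row[-1] = b["len_diff"]           # style-control covariate
        X.append(row)
        y.append(1 if b["winner"] == "a" else 0)
    clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
    coefs = clf.coef_[0]
    # Scale the model coefficients so they read like Elo numbers.
    ratings = 400 / np.log(10) * coefs[: len(models)] + 1000
    return dict(zip(models, ratings)), coefs[-1]  # ratings, style weight
```

Refitting on bootstrap resamples of the battles gives the confidence intervals mentioned earlier, and turning the style term on or off is essentially what the leaderboard's style-control toggle does.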
Across the board, style seems to help. And models that take advantage of it will do better. But we want to understand if that's the only factor. And there'll be more, you know, as time progresses and we do more analysis, other factors that we could try to control for, on and off, and see how they affect the rankings. And so you've kind of also put together these benchmark sets, right? Like, it's not just the Arena evaluation. Can you talk about those? Yeah. So part of running the Arena is collecting this data. And we want to give this data back in a way that will help people make better decisions. Putting a model on the Arena is a great way to get a ranking, but it's a hard way to develop a model. Totally. So you want more immediate feedback. And so having benchmarks on the kinds of conversations that we see is important. What we've been doing is we use models to try to identify which conversations are challenging. We collect statistics and we break it down by different categories, then try to turn those into new tasks. We calibrate an LLM judge to evaluate each task so that you can run your model on that specific task, that conversation or that initial prompt, and then assess your model's performance using the judging frameworks. What's your sense, I guess, for models these days? I mean, you're spending all this time looking at model evaluations. I feel like when I ask people how models are doing, I get wildly different takes on it. Like, what's your feeling? Like, what is the boundary of what models do today? So I guess I already touched on the open source models are doing a lot better. Model quality is just getting better quickly. In some sense, you know, the middle is converging to something that's very usable. So both open source and commercial models are at a point where I think a lot of innovation now will happen on how we use them, what we build on top of them: the compound AI composition, how we transform data as we feed it to them. At my startup, we're definitely seeing that. It used to be we'd run one model. Now I think we run many, and we change them. And in fact, one of the things that we've learned is you've got to be agile, because tomorrow there's another model out there. You probably need to try that one too. And for us at the company, having good benchmarking, having good processes for, hey, there's another model, let's plug it in and see what happens. Does it improve our analysis, our rubrics, on the downstream tasks? And we look at things like, is the explanation concise? Is it friendly? So being good at how we evaluate things became important, because models are changing so quickly and the best model for one task can change overnight. And that's exciting. It's also challenging when you're building products. So, totally. What about, are there benchmark tasks that are still hard, where models don't get superhuman performance? Like, when I look around at, you know, the benchmarks, if they're old enough, it almost always seems like, you know, they look impossible, models are struggling, and then, you know, quickly the models find a way. Like, what's left that we can actually, you know, automatically judge that doesn't work? It's funny, I've been collaborating with psychology colleagues at Berkeley, and there's things like theory of mind tasks that are a little challenging. And it's funny, you could think of them as a psychological, like, a neat thing to study, but I actually think they're a real problem.
And the reason why is we build agentic systems and we have agents working together. We will make better agents by not having every agent know everything every other agent has said. The idea that your team, your company, runs in one giant Slack channel called General is not a good way to run a company. And so you want to be able to silo or constrain the context for the models that you're working with so they can be more informed and more focused about the things that are important and not the things that aren't. But that means that they need to be able to work together and know, you know what, you don't know what I know. And I need to tell you that so that you can make progress on the tasks that I asked you to do. And also, I had a great conversation with another person here, and I know that, but again, you don't know what I said to that person or what they said to me. And so this idea of being able to work through an understanding of what other actors in the system know is important to building complex multi-agent systems. And they're not really there yet. So it's something we need to work on. I think, yeah, there's a lot of opportunity in, again, how we build on top of these things to address some of these problems like theory of mind, but also how we just manage context and memory to make them better. Totally. Cool. Okay, another area of research that you started to allude to, but you have some interesting work on, is LLM as a judge. I feel like you were talking about this, you know, really early on, and sort of looking at, you know, how well that works. I feel like LLM as a judge went from research topic to, like, best practice that literally every organization I talk to under the sun does. So what's your current take on the state of the art of LLM as a judge? And I guess maybe you should describe what it is in case someone doesn't know. And again, I can tell the funny story of how it came to be. You have two models, a Berkeley model, a Stanford model. They're trained on the same base models. They kind of know the same things. So you ask them to do a standard benchmark, they get the same score. So how do you know which one's better? Well, I mean, you talk to one and it seems a whole lot better. It's Berkeley. It's got a great story behind it. It's, you know, entertaining, engaging. We needed a way to tell that. And, you know, we have a few weeks and we're in a race to get this out. So we could just pay a bunch of people, which is hard, and figure out a benchmark. Or we could ask GPT-4 to look at the responses of two models on open-ended tasks with follow-up questions. And so the students, not me, they're brilliant, proposed this idea of let's just ask GPT-4 to rank pairs of models. We give them a rubric, you know, score them, because we can't, you know, get humans to do this quickly enough. And it kind of worked. And we started to see pretty interesting differentiation. We flipped things around a little bit and saw that it was kind of consistent. And so that's how the first, like, Vicuna, the first result came out. It wasn't until later that we tried to formalize this process. And at that point, we realized there are some bugs. So when you look at LLMs as a judge and you say, hey, what's better, A or B? GPT-4 goes, A, probably A. It has a preference for the first thing it sees, which is a bias that humans have as well, which is kind of interesting. So the order in which you present things matters. You have to try both directions to get a good measure.
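A minimal sketch of that both-directions trick follows, assuming an OpenAI-compatible client and a placeholder judge model; the prompt wording is illustrative, not an actual benchmark rubric.

```python
# Sketch: pairwise LLM-as-a-judge, run in both orders to control position bias.
from openai import OpenAI

client = OpenAI()
JUDGE = "gpt-4o"  # placeholder judge model

def judge_once(question, first, second):
    prompt = (
        "Two assistants answered the same question.\n"
        f"QUESTION: {question}\n\n[RESPONSE 1]\n{first}\n\n[RESPONSE 2]\n{second}\n\n"
        "Which response is better overall? Answer with exactly '1' or '2'."
    )
    out = client.chat.completions.create(
        model=JUDGE, messages=[{"role": "user", "content": prompt}]
    )
    return out.choices[0].message.content.strip()

def judge_pair(question, answer_a, answer_b):
    """Return 'a', 'b', or 'tie' after judging both orderings."""
    forward = judge_once(question, answer_a, answer_b)   # A shown first
    backward = judge_once(question, answer_b, answer_a)  # B shown first
    if forward == "1" and backward == "2":
        return "a"
    if forward == "2" and backward == "1":
        return "b"
    return "tie"  # disagreement often means position (or length) bias dominated
```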
Wait, was it GPT-4, the one that didn't have the bias? I was just... So, GPT-3.5? I forget. We looked at a bunch. It was... I think it might have been GPT-4. Anyways, this was a common bias that we saw in the models we evaluated, which was frustrating, because it turns out we had always put our model second. So we actually did even better than we thought. Oh, wow. And our initial experiments did it in a way that wasn't to our advantage. Nice. Yeah, so there's a position bias, which was interesting. There was length bias at that point, and maybe it was GPT-4 again, it preferred longer explanations. Big surprise. And so if you had a longer explanation, you would get a better result. So you need to try to account for length bias. We also looked at self-preference bias. So models prefer their own outputs generally, which is, again, not super surprising. So we had some issues initially. It's expensive doing lots of binary comparisons. So we started looking at rubrics as a mechanism to collect more signal when you read a document, and well-defined rubrics turned out to be a pretty useful strategy, and we use that in other areas as well. I think the binary comparison is still maybe a more consistent way to get signal, and the latest MT-Bench, the latest hard benchmarks, use these binary comparisons as well. Yeah, it started out as a neat solution to a problem that we had that I think a lot of people have adopted. And it's a good strategy to get quick signal on how something's performing, with good rubrics. Totally. And I mean, now you have an incredible amount of data to look at, right? I mean, so have you gone back and checked how well it's working? You know, it's a good question. So in the latest benchmarks release, we started using, or continued to use, the judging mechanism. We've gone back and looked. Have we gone back and rerun some of those experiments with newer models? I don't know. It's an interesting question. Like, if we had today's judges, would they do better? We started looking at panels of judges to see if that would improve stability and get some lift. One of the other weird things I had noticed is that models seem pretty consistent in how they behave when given open-ended things. What's your favorite color? Blue. And models share some of those biases, too. And so it's possible that even though we're putting panels of judges together, we're not getting as much diversity as we think. And that's something we're actively looking at now. It's like, are models really that different? And this idea that if you generate a bunch of answers, they'll be different. You look closely, they're not actually that different. And so understanding diversity is an important part of maybe where these things are headed. You mean asking the same model the same question? Yeah, so both. Asking the same model to answer things differently. Like, we looked at this question of, name a state in the United States. California, okay, do it again. California, do it again. Texas. Do it again. California. Okay, you get enough of these and it's hard. You start saying, don't use these, and it starts to give you one of those anyway. So they are not as diverse within a model. And then also we see across models some similarity as well in their preferences, which we sort of expect: if they're trained on the web as a whole, there's probably some distribution on, let's say, states. And those models sort of converge to that distribution. But that might affect how we use sampling techniques to get a broad range of answers and our confidence.
But if models only have one opinion, you're not going to see that. Yeah, that makes sense. And I mean, is there a sense that using an LLM as a judge on itself could, I mean, the obvious thought is it could prefer what it's doing. I could think of lots of different reasons that could end up being the case. So it doesn't work that well. They do prefer themselves. In our benchmark, it was hard to use the judge as a participant in the benchmark itself. So the panel was one way to try to get around that, but I don't think that actually works. So tell me about tables and talking to your data. I actually hadn't seen that when I was researching you. So I'm coming in here cold, so catch me up on what you're doing. Absolutely. So it's actually a fun collaboration with my former thesis advisor, Carlos, at Stanford, and Matei Zaharia, who was formerly at Stanford, now at Berkeley. Is this Carlos Guestrin? Guestrin, yeah, exactly. No way. Yeah, he was in my lab when I was a baby at Stanford. Oh, wow. You guys are right. That's very cool. Yeah. Yes, this is a really fun project. And maybe the simplest way to describe it is I'd like to be able to ask questions of my table. But here's the thing. I don't want to just ask questions about the data in my table. I want to be able to ask questions that span data in my table and human knowledge. And I want to be able to ask questions that dig into unstructured parts of my table, like the text. I think the example we've been kind of playing around with is, like, what's the best cult classic romance film? And what were the general reviews around that film? And anyone who knows, it's Titanic, right? What's your table? So my table doesn't have cult classic, for example. It just has romance films. Oh, a table of romance films. Yeah, sorry, sorry. I need to give some more context. So suppose I take a standard table of movies. I have, like, all the films, their gross. So, like, what was the most popular? I can get that information. I have genre. It's like, it's romance or not romance, but I don't have cult classic. Like, that's just not in the table. And I have the reviews as text, but I want to know, what do people think of the best cult classic romance film? And that's a question that spans human knowledge and tabular information. That's a question that requires calculation. The best means I want to find the max over gross, like best selling, most profitable. LLMs are terrible at that, by the way. You know, don't use an LLM to do math over large amounts of data. It's expensive. It's not accurate. And it's slow. So, like, yeah, we should use the database for that. But, you know, asking a database to summarize the text in a column is not a great strategy either. Like, you can group it, but, you know, making sense of it, that's tough. So they don't have kind of natural language reasoning. And then, you know, being able to incorporate knowledge that's not in the database but is general knowledge, that should be something we should be able to do. So we've been thinking about how to bring LLMs and databases together, where it's not, I just suck all the data out and stick it in an LLM, and it's not where I just run the LLM on all the rows or just on a final tuple. It's like, how do they work more synergistically? And so this is table-augmented generation. If you're familiar with RAG, retrieval-augmented generation, this is TAG, the analogy for tables. And it sort of has two, three main parts.
One is, like, I ask you a question, you should be able to take that question in natural language and generate a program, a SQL program or a sequence of SQL programs, that will generate the data to answer that question. And those programs themselves will also potentially call out to an LLM to be able to reason about the data that the database cannot reason about. And then the final answer might be a table, but what's the gist of Titanic, right? So it should be able to take that final table and then extract a natural language response. You really think Titanic, before we move on from the Titanic, is the best cult classic romance? It's the highest grossing, according to IMDb. We actually ran this experiment. Yeah, so I think it's the largest grossing film. It is a cult classic and it is romance. And the reviews are mixed. The reviews are like, we liked it, but the plot was kind of predictable. Yeah, so it's a great example of a thing that a database can't do and an LLM can't either. But together they can. And I think this is a pretty exciting future, because it means that people, humans, without sophisticated tooling or knowledge of data systems, can ask really tough questions against large collections of data and hopefully get answers that make sense. In fact, you built a demo, right, for the election, that lets you look into various donors, and you can ask natural language questions, and it'll mix analysis of their political affiliations, interests, and stuff with the text about their bills and stuff. So yeah, to be able to demonstrate these ideas. Well, interesting. What an interesting project. I think it came from this, the movie thing was really the, um, the motivating example here. No, unfortunately Titanic wasn't. Uh, the motivating example for me was like, I want to know how students are doing in my class. I have all this data I'd like to ask questions of, and I don't know how to do it. And I think I need to be able to read the reviews and, like, yeah. But it's an important question, I think. Yeah, totally. You know, it's funny. We have, um, questions like that a lot. Like, you know, I have a kind of long list of, um, you know, customer responses, and I kind of want to interrogate that. And that certainly is tricky to do, like you're saying, because you kind of want different cuts of the table and then different kinds of summarization. And I've even wondered about kind of summarizing where models fail. Weights & Biases obviously tries to surface that data, but it seems like we could do better to be like, summarize what are these examples where the model is struggling, which is kind of what a human, the first thing a human would be doing. So yeah, it'd be fun to plug that in and see what we can do. We should. Cool.
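Before moving on, here is a toy sketch of the table-augmented generation loop just described: the LLM writes a query, the database executes it (optionally calling the LLM per row for semantic judgments), and the LLM turns the result into prose. The llm() helper, the SQLite UDF trick, and the prompts are illustrative assumptions, not the actual TAG implementation, and a real system would validate the generated SQL before running it.

```python
# Toy sketch of table-augmented generation (TAG) over a SQLite table.
import sqlite3
from openai import OpenAI

client = OpenAI()

def llm(prompt, model="gpt-4o"):  # placeholder model name
    out = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return out.choices[0].message.content.strip()

def answer_over_table(question, db_path, schema_description):
    con = sqlite3.connect(db_path)
    # Expose the LLM to SQL as a scalar function so queries can reason over text columns.
    con.create_function("LLM", 1, llm)
    sql = llm(
        f"Table schema:\n{schema_description}\n"
        f"Write one SQLite query that helps answer: {question}\n"
        "You may call LLM(text) on a column for semantic judgments. "
        "Return bare SQL only."
    )
    rows = con.execute(sql).fetchall()   # assumes the model returned valid, bare SQL
    return llm(
        f"Question: {question}\nQuery result rows: {rows}\n"
        "Answer the question in one short paragraph."
    )
```

The division of labor mirrors the conversation: the database handles aggregation and max-over-gross style calculation, while the LLM handles semantic judgments like "is this a cult classic" and the final summary.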
Okay, tell me about tool use. Tool use. Yeah, so tool use is a trendy topic and has gotten way more exciting in the past few months for the AI world. I mean, abstractly, why should an LLM do math? Humans, we don't compute square roots of numbers in our heads. We use a calculator. So an LLM should too. You use the web to figure things out. You don't just think it up on your own. And an LLM should do that too. So there's a lot of excitement, and has been, even before large language models really took off, around tool use. My group was particularly interested in tools as APIs, like services that exist in the wild. Rather than a small set of very specific tools, imagine being able to interact with the world as thousands, millions of services, being able to discover the right tool for a task and then being able to invoke that tool. And that was sort of the vision of Gorilla. But if we're really honest, it's bringing kind of RAG ideas to tool use. We just did it a while ago, before it was trendy. And part of that was being able to find a common way to describe a tool. Like, what's a JSON spec for a service that I could digest into my list of things the model could use, and then let the model decide which one and how to call it, or what arguments to give. And that's become standard now. Most modeling frameworks, most services, in fact, the latest Llama models have pre-training tokens for how to express these kinds of lists of tools, and then the model is trained to invoke a tool with a very specific response that allows the software system running that model to then select the actual API, make a Python call out to a web search engine, run some calculation on a calculator, which really, really augments what these models can do. And so that's been really exciting. When we first did this kind of RAG stuff with Gorilla, we had a Llama 1-based model that could do tool calling that was competitive with the GPT-4 models, which very soon after added very similar tool-calling specifications. Recently, we've been thinking more about how to evaluate them. So there's lots of models in the wild. Can I compare them and their ability to call tools, reason about multiple tool calls in parallel, being able to chain tool calls to accomplish more complex tasks? And so my students have gotten pretty excited about how to evaluate the use of tools in my group. Do you have any suggestions? I mean, this is something that I get asked a lot. Yeah, so great, great question. I mean, we have a leaderboard. It's pretty clear which ones are going to be in the lead right now. One of the things that I think is exciting for me is, what makes a good spec? How do I make it clear what this tool should do so that my models can call it effectively? And having examples, we found early on, was helpful in our specs. Giving detailed descriptions of what's going on is pretty helpful. So putting effort into how we specify our functions makes a big difference in tool use, as one expects. So it's not super surprising. Although we had to discover that the hard way. I was actually just talking with our designer, who was building some agents using tools. And yeah, he was showing me how he improved the spec and that improved the output quite a bit. So a lot of things are obvious in retrospect, but there's so many things to try. It's kind of good to sort of know what the best practices are. Yeah, I guess one of the things that's kind of weird is early on we found chain of thought to be really helpful. A model should think through why it's using a tool before it goes to set the arguments. And classically, that makes sense. The latest fine-tunings of models don't do that. Tool use is just call the tool, which saves on tokens. And if you're just trying to look up the weather, it's a good way to go. But it seems that they moved away from that, which is kind of interesting. And I wonder if that's something we'll see come back as we move to long token sequences before doing things. So maybe smarter tool use.
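For readers who haven't seen what this looks like in practice, here is one common shape of it today, OpenAI-style function calling, where the tool is described as a JSON spec and the model decides whether and how to call it. The get_weather tool is a made-up example; the point from the conversation is that a detailed description (and an example) in the spec noticeably helps the model call it correctly.

```python
# Sketch: describing a tool as a JSON spec and letting the model choose the call.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": (
            "Get the current weather for a city. "
            "Example: get_weather(city='Berkeley')"
        ),
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model
    messages=[{"role": "user", "content": "What's the weather in Berkeley right now?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # the model chose to invoke a tool
    call = msg.tool_calls[0]
    print(call.function.name, call.function.arguments)  # e.g. get_weather {"city": "Berkeley"}
else:
    print(msg.content)  # the model answered directly instead
```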
One thing I saw that's pretty neat is this tool use by clicking, Anthropic's computer use. It's another way to look at tool use. It's kind of cool. More interactive, more human-centric. And being more human-centric seems to be a good strategy as we build these systems. Being able to work in a human world where there's a browser and you click on stuff, and that's how you interact with tools instead of calling a specific API, which I hadn't expected. I didn't see that coming, and it's a cool direction. What about, you had another interesting topic on your Substack around reducing hallucinations. Do you want to talk about those results? They seemed very practical. I'm trying to remember which one this was. The challenges in reducing hallucinations. You're talking about using, like, domain specific data. Ah, yeah. So I guess there are kind of two cuts to this. So one, when we think of hallucinations in our models, I don't like the word, but, you know, people use hallucinations as a way to describe the model generating the wrong thing. Maybe more for me, I like to think you're in a specific domain and there are ways to write very credible things that are wrong. And those are things I want to avoid. I want to align with the domain, but I also want to be correct in that domain. One of the things that struck me, in fact, one of the things that I thought, but turned out to be wrong about, was fine-tuning. So we found that fine-tuning our models definitely helps in this particular context. So I can adapt to the kind of vocabulary and style of a domain through fine-tuning, and that does help reduce these errors from hallucination. Context obviously matters, so getting the right stuff in context matters a lot. Filtering your context down, we found to be pretty useful. This came out both in the tool use work, but also just more generally in kind of RAG, our work on RAFT. So being selective about what you show to a model makes a big difference in the quality of the answers. So it doesn't try to copy the wrong information into the answer. What's the intuition behind, I guess, like, classically, you know, from my ML background, I sort of expect fine-tuning to be generally helpful. But I've seen people have a lot of different intuitions here, different than mine. And what would be the pro and con case for fine-tuning and improving specific results? Let me, I'll argue why I didn't want fine-tuning to be good. Totally. So in-context learning is beautiful in that I don't need to adjust my model to use it for new tasks. In some sense, like, we're in a race to build AGI, but we already have it. Like, the idea that I can take one model and do anything with it. I mean, maybe not human level intelligence, but, you know, I can do a lot of things with one model. It's pretty amazing. I don't need to update the model to do new tasks. In some sense, I wanted to believe that we would fine-tune to change the behavior of models, but to cook new knowledge in, we could do that just by reading a book, by inserting the knowledge in the context of the model. I was wrong. And partly because, I think, the way I read something, having practiced doing that for a while, and having practiced the style of, you know, answering questions in a particular Java syntax, can make a big difference. One interesting example of fine-tuning that probably, in retrospect, should have made sense: I've practiced a lot of Hugging Face function calls.
If I don't retrieve the right documentation, but you asked me to do something, I kind of wing it, because I've gotten a lot of practice fine-tuning against that particular task. And so this fine-tuning helps when there's kind of a very specific domain that's not out in the world. It is critical that you don't just fine-tune blindly on your data, that you construct new tasks to reflect the kinds of things you want the models to do in the long run. And this is kind of, you know, a cautionary tale too. So if I want a model to generate JSON for a function call, I don't give it a bunch of documentation and say, predict the documentation. I give it a question, documentation, and the output JSON that I want it to produce. The consequence of that is it no longer does anything else. So I have a model that just spits out JSON. Ask it for a joke, JSON is what you're going to get. And so you do tend to fit a model to a very specific domain in the process of doing that, which could be problematic if you want the model to do other things. And that's where this movement towards compound AI systems, to me, is kind of exciting. So it means I can fine-tune to have a really good function caller, but I can still have a nice, friendly chatbot that will still talk to you, and the function caller can sit there and call tools when the main chatbot needs help with a tool. And so this idea of decomposing things down, specializing my models, I think it was mentioned in this blog as a way to kind of control hallucination but also allow me to specialize and perform better at different tasks. Can you describe how that system works? There's something on top, like, deciding which model to send the questions to? We don't have great answers. At RunLLM, we have a very specific pipeline. So given the task, is it about code, do I look at code, is there documentation I need, there might be a different model that's specialized for dealing with that kind of thing, based on a decision I can make earlier on from the question. More generally, that's an open question for me. Like, how do I abstractly mix a function fine-tuned model with a general chat fine-tuned model with a RAG-optimized model? And I think when I get back to this kind of world of multi-agent systems, I think this is where the exciting next frontier of AI is going to be: how do I compose these models that are specialized in different domains, that have very different system prompts, to get them to work together effectively? I will say, again, what we've done at the company, what I think a lot of people are doing right now, is they have some routing model that makes some basic decision and delegates to models accordingly. But, you know, that's step one. It would be so much cooler if they could, like, talk together and say, you know, that's not me, pass that on to this other person. Or, you know, I'd like some, you know, feedback from the code expert before I answer this question. And so having more direct interaction between models is, you know, pretty exciting. Totally.
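A toy sketch of that routing pattern: a cheap classifier call decides which specialist handles the request. The category scheme and all model names here are illustrative placeholders, not RunLLM's actual pipeline.

```python
# Sketch: route a request to a specialized model with a cheap classifier call.
from openai import OpenAI

client = OpenAI()

SPECIALISTS = {
    "code": "function-calling-specialist",  # placeholder fine-tuned JSON/function caller
    "docs": "rag-optimized-model",          # placeholder retrieval-grounded answerer
    "chat": "general-chat-model",           # placeholder friendly general chatbot
}

def route(question):
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder cheap router model
        messages=[{
            "role": "user",
            "content": (
                "Classify this request as exactly one of: code, docs, chat.\n"
                f"Request: {question}"
            ),
        }],
    ).choices[0].message.content.strip().lower()
    return SPECIALISTS.get(verdict, SPECIALISTS["chat"])  # default to general chat
```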
That's a little bit of a segue into RouteLLM. Did you want to talk about how you set that up and what that was? RouteLLM was a project the students had built out of the Arena, which in retrospect was pretty obvious and brilliant. Like, we're figuring out which model is best at answering a question based on, you know, different types of questions. We should be able to use that to select between models. We weren't the first; colleagues at Stanford had also started looking at, you know, how to build selection techniques for models. In fact, I think back in, like, 2015, 2016, maybe? Maybe it was 2017. We were doing work on routing inside of computer vision models. So the field of routing across models is, you know, pretty well studied. RouteLLM was playing to the advantage of the Arena. We have all this amazing data to, you know, help make decisions about which model to use to get better answers. And you could imagine doing even cooler things in the future. You know, the students looking at that now are like, can I predict an entire leaderboard based on your question? And do things, you know, maybe I have a cost-performance trade-off that I want to play. Maybe I should mix models in various ways. So routing is kind of selecting models. Can I combine them or have them work together to answer a question? And if you just wanted the optimal performance, like, how much better can you do? Like, I have the sense that lately, you know, o1 is really dominating the leaderboard, although I'm looking at it and it seems like Gemini is at this moment in time, late November. Gemini's coming back. Yeah, it seems like it's coming back. The Google-OpenAI competition is awesome. But how much higher can you go by, like, optimal routing? You know, I don't think we got a significant boost over, say, the state of the art. A lot of the opportunity comes in maybe saving costs right now. But I do think in the future, there is a chance that you could refine questions. You could have models think and work together to get to a better answer. So I don't think we've gone as far as you could with that. And again, one of my students said this: if we were stuck with the models we have today, I think we'll see a decade more of exciting research. And that student, who's now the CEO of LIDA, another startup from the lab, was right. I think he's going to be right. He'll be proven right in 10 years. Even with the models today, how we put them together will make an enormous difference in what they can do tomorrow. Wow. A decade of improvement is a bold statement. Maybe it'll be tough, but I bet we could do a few years. I think he'll be right about that. It's hard to make predictions in this field. Totally, totally. What do you think of the productive lines of inquiry? I didn't see a lot of research on prompt optimization. Do you think that's an interesting area of research? I'm not doing a lot there. We had done some early work, Project Tempera, using RL techniques to do soft prompt optimization. I haven't done as much since then. My colleague Matei has been working on DSPy, that's pretty cool, doing kind of prompt tuning, doing few-shot optimization. I think there's a lot we could do. There are things people could do there. For me, I'm a little more excited about how we manage the context in a kind of repeated conversation, which is a form of prompt optimization.
But the idea that, you know, I might filter what information is presented to the model at each step, which is where the MemGPT project led, is more exciting to me. To think that maybe there's a memory management system that's consciously modifying my prompt to make it better, so that I, as a model, can make better decisions. Frankly, maybe for a human, too: you have too much information on your plate; here's what you need to know for this email. Great, that would be wonderful. And so surfacing the right information at the right time makes a big difference for humans and for AI. And that flavor of prompt optimization is kind of exciting to me.

I guess memory management kind of breaks the model of your arena, right? Like, I do feel like it's hard to tell how much the results I'm getting are customized at this point, but clearly there is a fair amount.

Yeah. Great, great question. So the arena right now treats people as, you know, stateless. You come to the arena, you're a new person. ChatGPT, Claude, all these services are starting to have a memory, and they even tell you when they're remembering something. That's cool. That's important to doing better in a long-term conversation. The context you develop over time is pretty critical, and how we surface it is important. The arena is certainly not testing that right now. It will be, maybe. We're looking at adding agents and kind of accounts, so we could start to track these things over time and help you have a better experience with the arena, and also better model how context plays a role. It's a good point. We're not playing a lot with memory in the arena. I think it is an area where these models will evolve, where we will get wins. I think when we build agents and they start to work together, they have to remember the context of what they're doing; it's important.

Yeah, it's an exciting next direction. Are there other leaderboards that you look at and admire for other topics?

That's a good question. So I still turn to the basics, MMLU. I want to know how things are performing. The recent math benchmarks have become pretty useful for me to get a sense of how things are performing, especially with this test-time compute, the long chain-of-thought reasoning. Where else are we looking these days? For me, at the company, we don't really run around leaderboards internally, and that's a good thing. The arena leaderboard is a great way to get an overall picture, but if you're looking at the arena to figure out what your company should be doing, you might be doing it wrong. You should be looking at what your internal leaderboard is showing. People should start collecting this data and start to try to understand how models are performing for them. It's critical, because it will change. So, yeah, we look at the arena, but for a lot of the tasks I'm looking at, we have very specialized boards we'll look at.

What about SWE-bench? I feel like that has a lot of purchase.

Yeah, we started looking at that. SWE-bench results are terrible. It's exciting because it's a hard task. I think, you know, working on agent systems, they don't work. Multi-agent systems, agents, we're not there yet. There's a lot to do to make them better. And in fact, right now for me, I'm trying to understand how they fail.

Like, do you have a good understanding of why they don't work?
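Picking up the memory-management idea from a moment ago, here is a toy sketch of paging only the most relevant long-term notes into the prompt for each turn. It is only in the spirit of MemGPT; the word-overlap scorer and the in-memory note list are deliberately naive placeholders.

```python
# A toy sketch of "surface the right information at the right time":
# long-term notes live outside the prompt, and only the most relevant few
# are paged into context for the current turn. The overlap scorer below is
# a deliberately naive stand-in for real retrieval.

def relevance(note: str, query: str) -> float:
    q = set(query.lower().split())
    n = set(note.lower().split())
    return len(q & n) / (len(q) or 1)  # crude word-overlap score

def build_prompt(query: str, memory: list[str], k: int = 2) -> str:
    top = sorted(memory, key=lambda note: relevance(note, query), reverse=True)[:k]
    context = "\n".join(f"- {note}" for note in top)
    return f"Relevant notes:\n{context}\n\nUser: {query}"

memory = [
    "User prefers concise answers.",
    "User's project uses PyTorch 2.1 on a single A100.",
    "User's cat is named Miso.",
]
print(build_prompt("Why is my PyTorch training run out of memory?", memory))
```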
When they don't work, what do they do wrong? Our best hypothesis right now is that a lot of the failure is getting lost: getting stuck and then doing the same thing over and over again until you time out. Or, like, I figured out what the bug is, done, when we didn't actually solve it. Oh, I forgot, I was supposed to do that too. And this idea that I've worked up all this context to figure out the problem, and I figured out the problem, and then I forgot what to do, gets back to the story of memory. I think being able to say, here are the steps I'm on, I finished that step, success, but actually now I have to go back to the beginning, and being able to guide the model through its thought process, is pretty critical.

Look, I thought we'd end with RunLM, where you're actually putting these things in production. Do you want to talk about what the use cases are and how far along this product is?

Yeah. So at RunLM, right now it looks like we're building a support bot, but actually it's pretty exciting; we're building something a little bit bigger than that. We're building an AI that interfaces with customers to help them solve technical problems and help them understand the product you are developing. Say they use Weights & Biases. So I want to know, how do I use Weights & Biases to do this? Can I use Weights & Biases to do this? What is Weights & Biases? And be able to ask the dumb questions that you wouldn't ask a real person. Be able to ask the questions like, I'm trying to do this with this data and this endpoint, what's the code to do that? Which you wouldn't ask a support person, but an AI is like, yeah, I'm happy to work with you on your code. And so with RunLM, we've been very focused on doing that right now.

But what I think is super cool is that Weights & Biases can go, hey, what are my customers confused about? Where are they confused? And when they're encountering this problem, can you encourage them to look at this other thing too? We have a new thing coming out; they should try that. And so it's not just an interface for customers to a company to understand their problems, but also an interface for the company back to its customers, filling more of the role of a support engineer. It can talk to customers, it can track issues. We're adding the ability to interact with your support system and with your development interfaces. We've done a lot of work with code, so you don't have to document everything amazingly well. We'll actually just check your code and see how it works. Oh yeah, here's how to answer that question.

It's funny. At RunLM, we've added support for some of the research projects in my group. I run another project, vLLM, which is a serving platform for models, and there are a lot of users out there. And my grad students recently told me that RunLM is powering their support interface; it's running over their docs. That's awesome. And when people get stuck with how to do stuff with vLLM, they go to RunLM, and it actually answers better than we were answering. Like, we didn't document a lot of stuff; it was a research project. And so they don't need the documentation. It just goes and looks at the code and says, here's the thing you need to do differently. And, like, wow, that's amazing.
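As a rough sketch of the answer-from-code idea just described (not RunLM's actual pipeline), the snippet below retrieves the most relevant source files with naive keyword overlap and asks a model to answer from them; the call_llm argument and the file contents are hypothetical placeholders.

```python
# A toy sketch of answering a support question directly from source code.
# The keyword retrieval and the call_llm placeholder are hypothetical
# stand-ins for real retrieval and a real model client.

from typing import Callable

def retrieve_code(question: str, files: dict[str, str], k: int = 2) -> list[str]:
    """Rank files by crude keyword overlap with the question."""
    terms = set(question.lower().split())
    scored = sorted(
        files.items(),
        key=lambda kv: len(terms & set(kv[1].lower().split())),
        reverse=True,
    )
    return [f"# {path}\n{src}" for path, src in scored[:k]]

def answer_from_code(question: str, files: dict[str, str],
                     call_llm: Callable[[str], str]) -> str:
    context = "\n\n".join(retrieve_code(question, files))
    prompt = (
        "Answer the user's question using only the code below. "
        "If the code doesn't show it, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

# Usage with a stub model, just to show the shape of the call:
files = {"server.py": "def start(port=8000, workers=1): ..."}
print(answer_from_code("How do I change the port?", files,
                       lambda p: "Pass port= to start()."))
```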
So yeah, this idea of being able to do more than just answer questions, to engage customers, to provide a voice for the customers back to the company, and to make sense of not just the documentation but the code. We also recently added the ability to find bugs in your documentation. So your documentation disagrees with your code, and we can flag that. And people are getting stuck on this thing; you should take a look at that. This kind of bidirectional voice is something I'm pretty excited about, and it's been a lot of fun.

No, it's super cool. Is there anything different about developing a customer-facing product versus a research-oriented project?

Oh, yeah. There's a lot. Certainly, the core elements have a lot of the flavor of the research. We are trying lots of stuff and trying to collect data constantly. We have, again, a whole suite of things that we test internally, so we can say, hey, o1-mini came out, should we use that one instead? And being able to agilely try new things is kind of like research, but also having good metrics in place and being a little more conservative about what we ship to customers takes some work. I think part of what we maybe haven't done well at RunLM is communicate our bigger story. And that, ironically, is something we tend to do more in research, but in the company we have done less of and could probably do better: telling this kind of new way of thinking about how an AI product is more than just a support interface, it's an interaction. On the technological side, we're certainly following what's happening in the research, with a sense of skepticism about a lot of things. Even things from my own research that we tried don't always work. And so we do try stuff where, yeah, it worked in research, but it didn't work for this thing. And being able to try and fail at stuff is a big part of building the product.

Awesome. Well, this was really fun. Thank you so much for your time.

Thank you.

I think this is one of the more practical interviews I've done. So thank you very much.

It's been fun.

Thanks so much for listening to this episode of Gradient Dissent. Please stay tuned for future episodes.
