

Rethinking Pre-Training for Agentic AI with Aakanksha Chowdhery - #759
TWIML AI Podcast
What You'll Learn
- ✓ Traditional pre-training on static benchmarks is insufficient for building AI agents that can interact with environments and accomplish multi-step tasks.
- ✓ Agentic AI models need capabilities like planning, reasoning over long contexts, and learning new tools on the fly.
- ✓ The attention mechanism in transformers is a key component, but may need to evolve to support long-form reasoning.
- ✓ The training data and loss objectives used for pre-training are crucial and may need to be rethought to develop the desired agentic capabilities.
- ✓ Scaling and debugging pre-trained models is a complex art that requires deep experience, and radical changes are often better suited for post-training rather than pre-training.
Episode Chapters
Introduction
The guest, Aakanksha Chowdhery, discusses her background in building large language models and the mission of her current company, Reflection.
Rethinking Pre-Training for Agentic AI
Chowdhery explains why traditional pre-training on static benchmarks is insufficient for building AI agents that can interact with environments and accomplish multi-step tasks.
Key Capabilities for Agentic AI
Chowdhery outlines the key capabilities needed for agentic AI, such as planning, reasoning over long contexts, and learning new tools on the fly.
Evolving the Attention Mechanism
Chowdhery discusses how the attention mechanism in transformers may need to evolve to better support long-form reasoning for agentic tasks.
Rethinking Training Data and Loss Objectives
Chowdhery emphasizes the importance of the training data and loss objectives used for pre-training in developing the desired agentic capabilities.
Scaling and Debugging Pre-Trained Models
Chowdhery shares insights on the complexity of scaling and debugging pre-trained models, and why radical changes are often better suited for post-training.
AI Summary
The podcast discusses the need to rethink pre-training for AI models to enable 'agentic' capabilities, where models can interact with environments and accomplish goal-oriented tasks over multiple steps. The guest, Aakanksha Chowdhery, argues that traditional pre-training on static benchmarks is insufficient, and models need to be able to reason over long contexts, plan, and learn new tools on the fly. She suggests that while architectural changes may help, the focus should be on the training data and loss objectives to instill these agentic capabilities.
Key Points
- 1. Traditional pre-training on static benchmarks is insufficient for building AI agents that can interact with environments and accomplish multi-step tasks.
- 2. Agentic AI models need capabilities like planning, reasoning over long contexts, and learning new tools on the fly.
- 3. The attention mechanism in transformers is a key component, but may need to evolve to support long-form reasoning.
- 4. The training data and loss objectives used for pre-training are crucial and may need to be rethought to develop the desired agentic capabilities.
- 5. Scaling and debugging pre-trained models is a complex art that requires deep experience, and radical changes are often better suited for post-training rather than pre-training.
Topics Discussed
Agentic AI, Pre-training, Attention mechanism, Training data, Loss objectives, Model scaling and debugging
Frequently Asked Questions
What is "Rethinking Pre-Training for Agentic AI with Aakanksha Chowdhery - #759" about?
The podcast discusses the need to rethink pre-training for AI models to enable 'agentic' capabilities, where models can interact with environments and accomplish goal-oriented tasks over multiple steps. The guest, Aakanksha Chowdhery, argues that traditional pre-training on static benchmarks is insufficient, and models need to be able to reason over long contexts, plan, and learn new tools on the fly. She suggests that while architectural changes may help, the focus should be on the training data and loss objectives to instill these agentic capabilities.
What topics are discussed in this episode?
This episode covers the following topics: Agentic AI, Pre-training, Attention mechanism, Training data, Loss objectives, Model scaling and debugging.
What is key insight #1 from this episode?
Traditional pre-training on static benchmarks is insufficient for building AI agents that can interact with environments and accomplish multi-step tasks.
What is key insight #2 from this episode?
Agentic AI models need capabilities like planning, reasoning over long contexts, and learning new tools on the fly.
What is key insight #3 from this episode?
The attention mechanism in transformers is a key component, but may need to evolve to support long-form reasoning.
What is key insight #4 from this episode?
The training data and loss objectives used for pre-training are crucial and may need to be rethought to develop the desired agentic capabilities.
Who should listen to this episode?
This episode is recommended for anyone interested in Agentic AI, Pre-training, Attention mechanism, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
Today, we're joined by Aakanksha Chowdhery, member of technical staff at Reflection, to explore the fundamental shifts required to build true agentic AI. While the industry has largely focused on post-training techniques to improve reasoning, Aakanksha draws on her experience leading pre-training efforts for Google’s PaLM and early Gemini models to argue that pre-training itself must be rethought to move beyond static benchmarks. We explore the limitations of next-token prediction for multi-step workflows and examine how attention mechanisms, loss objectives, and training data must evolve to support long-form reasoning and planning. Aakanksha shares insights on the difference between context retrieval and actual reasoning, the importance of "trajectory" training data, and why scaling remains essential for discovering emergent agentic capabilities like error recovery and dynamic tool learning. The complete show notes for this episode can be found at https://twimlai.com/go/759.
Full Transcript
I'd like to thank our friends at Capital One for sponsoring today's episode. Capital One's tech team isn't just talking about multi-agentic AI. They already deployed one. It's called Chat Concierge, and it's simplifying car shopping. Using self-reflection and layered reasoning with live API checks, it doesn't just help buyers find a car they love. It helps schedule a test drive, get pre-approved for financing, and estimate trade-in value. Advanced, intuitive, and deployed. That's how they stack. That's technology at Capital One. For the longest time, we were measuring pre-training on static benchmarks. If we want these models to be useful as agents, they need to be able to interact with environments. And when we start caring about those agentic tasks, pre-training needs to be rethought from fundamentals. This is not just a post-training problem to achieve this set of capabilities that we want in the next generation of models. And the kind of benchmarks we need for measuring this kind of intelligence are sometimes not available today. All right, everyone, welcome to another episode of the TWIML AI Podcast. I'm your host, Sam Charrington. Today, I'm joined by Aakanksha Chowdhery. Aakanksha is a member of technical staff at Reflection. Before we get going, be sure to hit that subscribe button wherever you're listening to today's show. Aakanksha, welcome to the podcast. Thank you, Sam. You have a really interesting background. You've trained some of the earliest large language models, including PaLM and Gemini 1.0 and Gemini 1.5. Tell us a little bit about those experiences. I got into large language models at Google while building one of the distributed systems that led to the training of PaLM, which was our largest language model at the time. It had 540 billion parameters. And people stopped publishing the number of parameters after that. And that led me to be at the forefront of pre-training, solving one set of problems after another that come with scale, in the first two generations of PaLM models and then the first two generations of Gemini models. And I think the one thing that you learn when you do pre-training is that at scale, every problem magnifies and things go wrong at every possible part of the stack. So it's always fun and it's always exciting. And you said you were on the infrastructure side? Not just the infrastructure side; I've done both. One thing that is super interesting about pre-training is that you have to be able to think across the stack. Otherwise, if you're going to train something for two or two and a half months, you need to be able to think about every single part of the system. Nice. And so tell us about Reflection. What is Reflection focused on? So the mission of Reflection is to build frontier open intelligence for agentic capabilities. And the company has been focused on building a post-training stack for agentic tasks. And with the most recent fundraise, we are doing training end-to-end. So we're building frontier open agentic models, which are both pre-trained and post-trained in-house. And that's really going to be a focus of what we're talking about today: some of the reasons why you think a different approach to pre-training is key. Is that kind of the way you think about it? Well, the way I really put it is that for the longest time, we were measuring pre-training on static benchmarks.
For example, MMLU is a popular one, or GSM8K, or math Olympiad problems like AIME and MATH, you name it, but extremely static benchmarks. If we want these models to be useful as agents, they need to be able to interact with environments and be useful to us in the workflows where we can use them. The simplest versions of that that we are already starting to see today are coding agents and then deep research agents. The coding agents are extremely useful in the sense that you can put a coding agent to help you understand a large code base, or they can help you, for example, to refactor and apply a fix across multiple files. They might not do it correctly. They're not perfect yet. Or in deep research agents, I think we moved away from just a search bar interface to more like, here is what you want, and then putting a language model on the job of finding multiple articles to investigate, that kind of workflow. And you can imagine that any such goal-oriented task can be given to the language models, and the models can achieve these goals over multiple steps as opposed to just being chatbots. So those are the kind of agentic tasks that we start caring about. And when we start caring about those agentic tasks, pre-training needs to be rethought from fundamentals. This is not just a post-training problem to achieve this set of capabilities that we want in the next generation of models. And the kind of benchmarks we need for measuring this kind of intelligence are sometimes not available today. So I'm happy to talk about that as well. And to maybe underscore this point, a lot of the incremental progress we've made over the past three years since, you know, ChatGPT, whether we're talking about reasoning, whether we're talking about tool use, all these things that contribute to our ability to build quote unquote agentic systems. This has all come from not, you know, thinking radically differently about the way these models, the core, the base LLMs, are pre-trained, but by kind of tacking on new ways of thinking about how we tune them or post-train them, you know, through reinforcement learning and other techniques. Why can't we keep doing that? Oh, we totally can keep doing that. I think it's limiting in the sense that it fundamentally limits you in what you can achieve in terms of capability. So the way I think about models is what are they capable of, and the capabilities can be anything from do they have long context capabilities, can they do retrieval over longer context, or are they good at, for example, solving natural language understanding problems, are they good at natural language generation problems, are they good at reasoning? I think what is especially important about agentic tasks like coding tasks is that if you're trying to read a code base and then execute some parts of it or write something on top of it, you will accumulate context over time because you're reading a lot of content. And then if you execute something, then you also have some additional feedback from execution. So what you need a model to be able to do is have planning as a capability, have the ability to reason over its context length, which might be very long. So what went well, what did not go well. It needs to be able to recover from trajectories that went wrong, for example. So you can have short-term setbacks where it went and did something wrong, but then it needs to correct course for itself. And then it needs to be able to learn new tools on the go.
So it needs to be able to explore tools and figure out what it should use if it's in a new environment. So all of this is possible to some extent, but it's sometimes limited. And then the fundamental problem where you see it's very limiting is the context engineering problem. So when you have companies build systems today, they are building it by doing a lot of context engineering where they tack on a lot of like, it's literally limited to: this is what I can fit into the context length, and that's what my agent will do, and then the next thing will be in the next sub-agent. What gives you the confidence that we can achieve the goals that you're trying to achieve just by changing pre-training? Do we need a more fundamental set of changes, a new architecture, kind of a post-transformer approach? How do you think about that? When I go and talk about rethinking pre-training, I always think about it in terms of what are the capabilities we want out of the models. And to me, architecture is just one part of the equation. A big part of the equation is what is the loss objective and what is the training data. So as I was alluding to before, what we want is these models to be able to do planning and reasoning over multiple steps, look at what they did in their context before and refer to that and use that as a way to inform their next steps in terms of taking actions. And for them to be able to do that, the fundamental mechanism in transformers that enables this is the attention mechanism. Whether the attention mechanism works well over long context or not is a question that we're definitely taking a very strong look at, as to whether that's the best way for us to achieve long-form reasoning. We do know that long-form retrieval, like if you want to retrieve over millions of tokens of context length, that is achievable today with large language models. Reasoning over longer context is a harder problem. The second bit that is extremely important is the training data that goes into these models. And the third bit that's extremely important is the loss objectives. So believe it or not, we have actually trained these models to probabilistically predict the next token. What is the most likely next token? And out of that, we have managed to get these models to be extremely strong reasoners. Is that the best loss objective that one could go after? I'm betting on the fact that we need these models to have more fundamental long-form reasoning capabilities and need to be able to learn how to learn from their context. Yeah, what I heard in there was a little bit of, like, you include a little bit of architectural shift in pre-training. You know, when you talk about rethinking pre-training, that is inclusive of maybe tweaking the way attention is done, but it's not, you don't believe that, well, you believe that we can do better with what we have and you're not necessarily requiring state space models or some kind of funky next generation thing. Like you think, you know, you're riding the horse that you have, so to speak. That is fair. And I will add, I think you have to have gone through the pre-training experience for multiple generations of models to kind of know where to make the bets. And pre-training is one of the sciences which is both fun and exciting and high risk and high capital expenditure. So when you make those bets, you want to be as sure of your bet as possible. And that's why pre-training sees mostly incremental changes and post-training sees more radical changes, is my experience.
Otherwise, you spend a lot more time debugging through what you've changed as opposed to getting a model that works. There are a lot of interesting proposed architectures that I come across where the researcher is leaving scaling to someone else, because they don't have the resources, but it still leaves a big question as to, you know, whether you'll see the expected results at scale. I think for folks who are experienced in scaling and have had the fortune of doing it, it's truly an art and a science to get it right. And it's not just like, we changed this thing and we changed this other thing and it just works. It's a fun adventure every single time. And so when you think about the first of those elements, attention, like how do you think attention needs to evolve in order to support agents better? When I think about agentic capabilities, I think the fundamental bit that we need out of the attention mechanism, and I'm not proposing exactly the solution, is the ability to look at not just things around its context length, but also being able to refer to things that were further in the past. Because, as you can expect, for specific examples like, say, deep research agents, you might have collected a hundred articles, and then you need to somehow organize them into topics. So your attention mechanism needs to have some form of a summary of these articles, and then it needs to somehow organize them into content that is a synthesized report that you will give to the user. So a lot of the heavy lifting in this case is being done by the attention mechanism here. So how do you think about that long-form reasoning, what is the token attending to, is that enough or is that not enough? Those are the kinds of questions we ask. There's a lot of work happening in research as well as in labs looking at memory architectures and expanding or augmenting current LLM approaches with different ways of thinking about memory. How do you think about that relative to, you know, this conversation about enhancing attention? I see memory as an additional tool in addition to the attention mechanism. So at any given point in time, what we want the language models to be able to reason well about is what tool to use, and memory is one of them. And as long as they are strong reasoners, they should be able to refer back to an index and say, this is where I need to look up content, and this is what I need to go back and read into the context. But today, the way some of these systems end up being engineered is as multi-agent systems where you put like sub-agents and like something is going to go fetch that content. But effectively, what you're referring to is sort of the ability of the large language model to use a tool which can look up a database or memory or something that's perhaps stored away. That's how humans do it, right? So that type of memory approach doesn't necessarily remove the requirement to have a different approach to attention that enables different recall modes, I guess. Yes. Clearly, there's a lot of work being done looking at the attention mechanism, both from the perspective of increasing computational efficiency and increasing long context performance. What research out there do you find kind of inspiring and in the direction of what you think needs to happen to better support agents? You'll find me extremely conservative on architecture stuff because I've scaled too many models and debugged and been in the trenches for too many models, basically.
Like, I've trained five scaled-up models and every single time I have to debug them. Like, I'm not the only one, there's usually a war room and there's like a few other people, but I'm the person who has to be in the trenches. What has that informed for you? There has to be a good reason, and one has to be extremely rigorous about evaluations. One has to be extremely rigorous about the way you do scaling. And innocuous changes that feel like fun and exciting don't work at scale. And? So there has to be a good reason to make that bet. So every single bet is extremely well vetted. Consequently, transformers have been extremely well vetted relative to the next thing, the next best thing. I think for a startup like Reflection, we also have to make the first bet to be as strong a model as possible. So, yeah, there are ways that we might want to tweak attention, but it's only one part of the pre-training recipe; there's also fine-tuning the loss. You talked a little bit about the loss and some of the ways that people are thinking about that. Can you elaborate a little bit on how some of the loss tweaks tie directly to agent performance? Like, if you think about next token prediction for an LLM, and think about the initial tasks that we expected LLMs to do, like generate text, that makes sense. It's intuitive. Like, when we think about agents, there are so many more things that we need these agents to be good at. Reasoning is one thing that you mentioned. Tool use is another. What are some of the ways that you think that we can kind of tune the losses to support those types of behaviors? So some of the things that have been done in the past, and this is kind of extremely well known, is that when we were training coding models to just be autocomplete models, we don't just, you know, give them code files and then train them on the next token. We have this fill-in-the-middle approach where we actually break up the file, put the first half and then the second half, and then the middle part goes at the end. So the model has to go predict that. When we train these models for tool use today, we actually mask out certain portions so that the model learns which tool to use. So it's not just about which tool, or which search query, to use. I think those aspects get masked out in how you train the model so that you can teach the model by masking. And if you go to the history of language models, for example, BERT and so on, masked language modeling was an objective that was heavily explored and leveraged in meaningful ways. So the way I look at it is that what you pay attention to is also important in the loss objective. So masking is one particular way, and then how the training data is augmented is another way, in getting the model to pay attention to specific parts of what is there in each sequence. And that teaches the model to pay attention to that part or predict that part more versus the other. So from a loss perspective, it sounds a little bit like what you're saying is that a lot of what we're doing in post-training, I guess not coming up with new pre-training things necessarily, but it's finding ways to pull some of the things that we're doing in post-training into pre-training in a scalable way so that they're more fundamental capabilities of the models. Yes. I would say it unifies the two paradigms.
I think the pre and post is fundamentally coming from the world of: we had to start out with pre-training because that's where we saw the scaling gave us the largest set of gains in terms of amplifying the natural language capabilities, while post-training allowed us to make these models more accessible as chatbots. And they were extremely strong when you got them working with, say, reinforcement learning with human feedback. As we move into this agentic paradigm, is reasoning a capability you want to endow in the model in the last stage, or does that move up front? And whether we need a very strict divide between the two paradigms is a great question to ask. From a training data perspective, talk a little bit about how you think that needs to evolve. Yeah, right now we're training on all the data that we have. Well, training data. So pre-trained models are fundamentally dependent on large volumes of data that is high quality and is diverse. I think the bit that we have exploited the most is go for as much scale as possible. So more and more data is always better, and go for as diverse data sources as possible. I think what we have seen from our counterparts in the East, say, Qwen models or DeepSeek models or Kimi models, is that the quality of curation in training data matters as well. So I think that is one axis in which you get a lot more compute efficiency out of pre-trained models when you get your training data to be extremely high quality. And the second axis that matters is more along the axis of reasoning. So how do you get really good reasoning traces? It's almost like asking the question, if you train on the internet, reasoning emerges kind of by default, but then how do you get those expert traces of people working on problems in all possible domains? And today we are doing that by getting these models to explore in, say, reinforcement learning environments and whatnot. But once we have some semblance of understanding how these models can generate those traces, what should be put in and what is available at scale in pre-training starts to become a question. So this kind of alludes to the fact that what would be good reasoning traces that can feed back into the models at large scale is a question that is something we're thinking a lot about. Do you have a sense for the, if I were to guess, like try to compare the, you know, token volume of reasoning traces to the token volume of found, you know, core training data, like it's off by several orders of magnitude. It's a very small fraction of the total training data. Like how do you, let's say you had, you know, these reasoning traces, how do you even make it so that it makes a difference in the training process if it's so outnumbered by Reddit chatter in pre-training? That's a fair point. I think the question that is worth asking is, what is the dominant pre-training data source, and how can that be augmented to generate that kind of volume of data? And there are some pointers in that direction, but that's not... For example? For example, the dominant training resource would end up being, say, articles on the internet. How do you use articles on the internet but augment them in interesting ways? And there have been some results that show that you can formulate them into question-answer pairs or into conversational sorts of things, or, if you have problem-solving tasks, then you can augment them in interesting ways.
So there is starting to be work in this direction, but you have to tap into the fundamental volumes of data to generate similar volumes of data before we can go down the path of completely new generated data sources. I hear you talking about the volumes of data, and I'm also trying to piece together, like, you know, if we're talking about needing to do an inference on, you know, some significant fraction of all of the training data that we use for pre-training in order to extract, you know, question-answer pairs or reasoning traces or things like that. I'm trying to decide if I think that's tenable or not. It sounds expensive. How expensive do you think it is? I mean, DeepSeek was trained on 15 trillion tokens. I can't say the numbers for Gemini, but let's assume that it's tens of trillions of tokens on the pre-training side. That is not horribly expensive if you're doing like, you know, a fraction of a dollar per million tokens in and some small multiple of that. Maybe not. Exactly. Yeah. I think the question that is worth answering is that for the longest time, that has not been a focus area, because you always get better performance out of natural data that is representative of human workflows. And synthetic data has challenges in the sense that the models don't perform as well because you're not actually expanding the data distribution in any meaningful ways. So what is the fundamental bet that you're making in this particular case? And I think what I'm alluding to is that that is the axis along which we are thinking as to, here are the model capabilities we fundamentally care about, and the way you get those capabilities would be to go down this path. I guess the other challenge that's often raised in the context of training on synthetic data is kind of the smoking-your-own-exhaust problem. Like, especially if you're scaling up the amount of synthetic data, do you have a sense for how that's overcome? At the end of the day, the fundamental underpinning of how you determine what is going to help and what is going to hurt is dependent on what the underlying data distribution is. And if you are changing the distribution substantially to be all synthetically generated, I expect the models will not do well. So the goal here is to very much keep the underlying data distribution to be very representative of the natural data distribution that you want the models to be trained on. And so I think that leads us naturally to talking a little bit about measurement and kind of benchmark construction for these types of tasks. What are you seeing so far that you think is kind of in the right direction? That's a great question. I think the fun thing about benchmarks is that every year there are a few new ones that come out and then they get saturated. And then you have to build a new set of benchmarks. I think for a lot of the frontier development for these models, the current paradigm that we have kind of settled on is that everyone builds benchmarks in-house to measure what is the set of model capabilities they're going after. So I was alluding to the fact that, for example, we want these models to be able to reason better, long-form reasoning over their context length. We want them to be able to do multi-step problems. So they need to be able to tackle not just the next step, but what their actions might be multiple steps from now.
We want these models to be able to recover from, say, failed trajectories, to plan better, to be able to learn how to learn how to use tools. All of these capabilities require very, very specific ways of measuring, even though I'm giving you overarching concepts, like when you break them down and put them, for example, into actual, say, coding tasks or SWE-bench-style tasks. And we're starting to see some of these tasks with varying horizons come into the picture. Like METR has this set of software tasks of varying lengths. Then there are a few other benchmarks. What you will see is that you basically want tasks that are of varying complexity and also varying along these dimensions of capability to test the models. So it's almost like, if you were to break down your real workflow in a day, what are the sub-problems that the models could go after, or what are the sub-sub-problems that the model could go after? And even if those are hard, then why is it hard? And making that a challenge for the model is roughly how a lot of these benchmarks are constructed. And just to give you a sense of why this is super important: back like three, four years ago when we were training PaLM, it was a large model. But we tried it on a new benchmark that was crowdsourced through researchers who were basically coming up with reasoning tasks that they found large language models could not do back in the day. And we tried our freshly trained, actually halfway trained, model on this benchmark. And what we found was, like, suddenly this benchmark had a step change, and that led us down the path of figuring out, oh, the model was actually starting to showcase reasoning capabilities. And that's fascinating to some extent, but that would not really have been available had we not had that plethora of benchmarks to look at, like including understanding emojis or like how does the model reason over, say, a movie script or whatnot. So I think sourcing these questions in the right ways and understanding what is hard with the current generation of models is often the way to make progress towards the next set. To what extent do you see these, and I think this question applies to, you know, all of the kind of implications on pre-training that we've talked about, but, you know, for measurement in particular, like, how do you think about which of these things you think the community will provide, or like everyone's moving in the same direction by the time you need it or get there, versus, you know, what a small lab like Reflection takes on to try to incorporate into your own models? The fundamental set of capabilities that I talked about, which is roughly three or four that I can count on one hand, are things that, if we believe those are fundamental paths, we'll definitely prioritize, in addition to the usual array of benchmarks. I also believe that we are just at the start of getting reinforcement learning to work well with LLMs; the first reasoning models, where the reinforcement learning really came in, came out a year ago. So I do believe that a lot of the workflows that are being tried out in the real world will make it into benchmarks over the next year. So we will have more benchmarks that are representative of real-world workflows that will test the models in challenging ways and help us learn where the models are challenged versus not. I think the fun part about planning pre-training is that it ends up being a longer-term endeavor.
So you have to kind of think ahead of time of what the fundamental model capabilities you're shooting for are, and plan for that. When you think about the direction the field is heading relative to larger models and scaling laws, you know, kind of the march towards ever-increasing models, and then we've seen a resurgence in interest in smaller, more efficient models. Which of those are you aiming for? So when you go for larger scale models, you often find capabilities much sooner than you will in smaller models. So typically, model scaling is one way in which we notice model capabilities that we haven't noticed before. And often the model capabilities that become available at scale then also, with the right training data and the right set of things and training longer and whatnot, become available in smaller models as well. So, o4-mini came out after, say, the largest reasoning model was constructed out of, say, I don't know the specifics, but let's assume GPT-4, right? So, you take the largest models, you post-train them for reasoning, and then the next generation of models had that capability even in the smaller models. So, what I would say and what we are shooting for is the most capable models, and then the most cost-efficient, most capable models. But oftentimes, to discover that capability, you do have to go to the right scale. And then you can come back and make them cost-efficient. And often that involves an additional step of distilling, which helps. So that's the way to look at it. I think you kind of got what I was getting at. And maybe the question is, as a small lab, given that there is a lot of interest in small models, can you just target building a small model? And what I'm hearing is not really, because a lot of these capabilities are emergent in the sense that you need to hit a certain level of scale to see them. And so therefore you want to go as big as you can, given your resources, and then worry about making smaller later in order to, you know, achieve all the benefits of smaller models. Yes. Oftentimes, at least in my track record and history of playing with large language models and training them, emergence is something where you find fun and interesting things in the larger models. And then you often distill them or get them into the smaller models in the next generation because you've figured out exactly the recipe that it takes to get them. Yeah, we've talked about this idea of long-form reasoning kind of casually throughout this conversation. But dig into what that means, like what is long-form relative to short-form? So long-form reasoning, I'll give you very specific examples of what is long-form. So, for example, for the longest time you might have wanted an AI model to understand your entire code base, including the parts of the code base that another team wrote. Or perhaps, if you have a large enough organization, then you basically want it to go look at different parts of the code base, and anytime you tap on that particular subset of the code base, you want it to be able to understand it well. What long-form reasoning would allow you to do is that, for this part of the code base that another team is working on, you can reference that part of the code base and use it like a library without thinking about it, because AI can help you understand it and use it seamlessly. So you don't actually have to walk up to that team and spend time finding team members there. So that's one form of long-form reasoning.
What that roughly means is, if you put a lot of these things in the context length, the model is able to reason about this disparate set of things that are pretty far apart in some ways, in where they are positioned in the context length. Another way to look at it is that, and that is a hard problem because, for the longest time, if you just want to retrieve relevant parts of the code base, that's an easier problem for the models to solve. And needle in a haystack or multiple needles in a haystack is a solved problem. But when you go to benchmarks that have multi-hop reasoning, say MRCR v2 or LOFT, these benchmarks have only recently started to show signal that models are getting better. For the longest time, these were not the benchmarks where you were seeing strong capabilities, even in very long context models. So even though you have millions of tokens in your context length, you're not necessarily able to think about all of them. Or if you put in a very confusing set of documents, like let's say you want to do deep research and you want to go on a vacation and you have 10 articles about your favorite vacation spot, the model will not necessarily always be able to think about each one of those articles and give you a synthesized form from each one of them. That's a harder problem for the model to solve. So that's one form of long-form reasoning that we're talking about. The second form of long-form reasoning that I'm talking about is more along the side of trajectories. So when you want the models to do multi-step reasoning, what you're asking really of the models is to almost think ahead. So it's like the game of chess, where you are thinking ahead, and the next action that you take should move you, or at least have some sense of projection of where it will take you, towards your goal. And that bit has not existed, because they're realistically just generating the next token. So how do you reason about where this next set of tokens will take you? What does that mean in terms of trajectories, with the next action that you take? Those are also bits that haven't really existed. Or, how has the past affected things up to this point in time? If you're looking at an existing trajectory where there are failures and then recovering from that, often the models will get stuck in repetitive loops today because they don't really know that they made that failure. So they will go try that same thing again, because that's the most probabilistic thing to go through. So those are some examples of long-form reasoning that are pretty pertinent in today's generation of models. I thought the distinction you were going to make was between, you know, the model spitting out an answer and the model kind of going through thought traces and like inference-time scaling, that kind of thing. But it sounds like at least part of what you're pointing to in describing it as long-form reasoning is like abstract reasoning, meaning not necessarily you're going from A to B to C, but you know about A, you know about B, you're working on some separate problem C that might benefit from that knowledge. How do you incorporate that into the solution? Yes. And I see the first thing that you brought up as a natural evolution of reasoning. So I'm talking about, like, where do we go from there? In your description, you talked a little bit about failure and the model's ability to recover from failure, the agent's ability to recover from failure. Dig into that a little bit and what that implies and requires.
Typically, when you think about trajectories, and let me define the word trajectory. So let's say you are solving a coding task in, say, Terminal-Bench or in SWE-bench, or you're solving a goal-oriented task in general. Then you will have come up with a plan of, like, these are the relevant set of files to go look at, here is the set of changes that would help achieve the objective. And then you would go try those changes. And then you would, say, have some form of a verification process where you would go run tests, for example, to verify whether these changes will solve the problem or not, right? When you collect such traces, where here was the relevant context, here was my solution to the problem, here is what I tried, and here is the output, the execution feedback from trying this, what you'll find is that over multiple steps, the model might need to figure out that what it learned from the past step should not be redone, or that it should choose a different set of actions in the subsequent set of steps. And that's a harder problem for the model, both from the perspective of paying attention to, like, okay, these are a discrete set of things, so it needs to choose a new action space. So that's partly a reinforcement learning problem, but it's also a pre-training problem as to, what was the model paying attention to? Did it understand that these were failed steps? Was the trajectory passed to it in the right format so that it noticed that these were things to correct on? So it's a reinforcement learning objective problem, but it's as much a problem of, like, what is the length of context the model is able to pay attention to and learn from? It sounds like you're talking about kind of inference-time failure and an awareness on the part of the model of these types of failures. Yeah, we want the model to learn how to learn and how to correct itself. And I guess my question was going to relate that to hallucination as a property of LLMs and the degree to which, you know, often these failures come from, you know, hallucination of one type or another. And if that's this fundamental property of LLMs, like how do you imagine that you would overcome it without fixing that? Let me try to rephrase. I mean, depending on the business objective, hallucination is a positive or a negative word, and I'm not making a judgment about it. I'm just saying that it's something that happens. And so, like, if these failures that we're talking about are when the model thinks something is true and it's not true, but the model doesn't have an awareness of that, how do you give the model that awareness? And if we could do that, haven't we fixed hallucination? So I see hallucination as more of, like, what probabilistically was the highest probable next set of tokens that the model thought it should output. So to me, was there anything else that could have made it into its context, or could it have gotten additional feedback from interacting with a set of tools to change course in certain ways, right? So that's the question I am asking. Like, given the goal and given the history of all the things that it has tried to achieve that objective, is there a next set of probabilistic tokens that it can output that would have a higher probability of moving it forward towards the goal? And so when you talk about this recoverability, that's at inference time, right? The goal is to get it at inference time. And are you? It's not a solved problem. Let me just say that.
Yeah, no, I get that. I get that. Are you envisioning, like, the model, you know, having some awareness of, you know, when it doesn't have enough information, for example, and asking for more information, or, like, halting as opposed to continuing? Are the ideas in that direction? So, yes, it doesn't have enough information to verify that what it proposed is the correct solution. So it goes and tries a few other ways to, like, verify the approach that it was trying, and then comes back and it's like, okay, now I have enough information to make an informed decision. Okay. Or ask the user for help, point me in the right direction. Right, right, right. As opposed to a lack of awareness of, you know, not having enough information and just powering on through with wrong information or information that it made up. Yes. Or constantly telling the user that it came up with the right answer. How do you see this particular facet, recoverability, tying back to the pre-training levers that we talked about, attention, training data, loss, measurement? So I think it comes down to training very strong reasoners. And there are some levers, both on the training data side and the loss objective side, that we think will help here. So it really comes down to really strong reasoners that can reason well over context length and in general across domains. When thinking about building agents that can learn new tools, talk a little bit about that one. MCP immediately comes to mind, and like this drama that we've seen with, oh, hey, MCP is this great thing. Let's all use MCPs. Whoa, MCPs blow up the context and use more of it than the question that the agent is trying to solve. Let's instead, you know, delegate it to software programs that the agent can write so it doesn't have all that context. And then, oh, no, well, that is kind of hard. Let's actually make it so that the agent can search its tools. Like, if there's anything on this list that cries for a more foundational solution, maybe it's tool use and an agent's ability to learn new tools. Is that kind of the thing that you're talking about when you describe this? Yes. So I think what we have taught our models to kind of learn so far is really: what we put in context is what they know. If we could point them in the direction of, here are, say, n tools, and by interaction they are able to learn how these tools are useful versus not, that is something that has been tried in the past in the form of, say, continual learning, or, in the world of robotics, learning by imitation, so learning by watching how a tool is being used. So the question comes back to, how can we get the models to meta-learn, like learn how to learn? What exactly is the format in which this exploration of the new environment needs to be done so that the model learns to take actions in that space? And one of the promising directions there has been some form of unified domain-specific language, perhaps in code format, that would allow for this kind of reasoning to emerge. There have been a few papers in that direction. But overall, I do believe that the ability to explore new tools and learn by exploration and feedback from trying those tools allows for more flexible and capable models than, like, literally either having to fit them all in the context length, or having to fine-tune them with all the right traces, or, like, pruning the set of tools that goes into the context length so that you have the perfect system built to go in.
So I think there is a fair bit of work to happen to get that bit going. It's not off-the-shelf available today. Throughout the conversation, you've returned to coding agents, code development agents, as a touchstone. To what degree do you think that the ideas that you're generating for these coding agents apply to, like, workflow agents, you know, agents for enterprise-oriented workflows or consumer workflows, but, you know, not code generation, but you've got a bunch of tools, generate an itinerary or, you know, solve some workflow task or set of tasks. Like, do all the same principles apply? Do they differ a little bit? So even when we are talking about coding agents, I do believe the same principles apply, though slightly broader. Even when we are building coding agents, and even before that, when we were building large language models, one of the fundamental capabilities that we get at pre-training scale is that the models learn to reason across modalities. And what that means is they learn to reason across languages, they learn to reason across, say, code and text, they learn to reason across different languages even within code, for example, or even, like, modalities that are not text, right? So to us, the fundamental principles for building anything that requires reasoning across a set of domains come from this kind of capability. So to me, the fundamental principles still very much apply, and it's mostly a matter of maybe the last stage of post-training will be different, but the models need to be able to reason across domains, across, say, text and language, across different domains or languages, and so on. Cool. So, you know, we've talked a little bit about what you're working on at Reflection. For folks for whom these ideas resonate and who, you know, want to kind of further explore this, are there papers out there or, you know, NeurIPS workshops or various fora? How do folks dig into these problems? Come work with us. I take it you're hiring? Yes, we are hiring across pre-training and post-training. We recently closed our funding round with backing from NVIDIA. And our mission is to build frontier open intelligence. We are a team of about 60 researchers. So a very, very small team, still hoping to build a frontier lab for open agentic models. And we have researchers from ex-DeepMind, OpenAI, and top universities, who had leading contributions to PaLM, ChatGPT, Gemini, AlphaProof, and AlphaGo. So I think it's a team you would enjoy and feel challenged by. And besides coming to work with you, how else can they dig into these problems? So if you're trying to really make a dent in the community at this point in time, a lot of the work starts with building good systems of measurement. So a lot of what I talked about are capabilities that don't really have fundamentally strong benchmarks that actually measure them. And so the best way to think about measuring intelligence is that, oftentimes, you think about, oh, if I could use this large language model for this particular thing, and then you actually find the model is not good at it. That's a starting point of, like, okay, why is it not good at it? What is missing in the models today? And being able to create that in a packaged form, as evaluation benchmarks for the things that don't work well. I spend time at Stanford as well. Oftentimes I'll work with students to figure out, where are the gaps in the current set of models?
How does that lead to an evaluation benchmark that we can use? That is often a very good starting point. And from there, you can start with post-training, which is a relatively inexpensive way for you to go try and see what you can already start to close in terms of gaps for the models. And if that excites you, then come join us, or any of the frontier labs, to take on a bigger adventure to build these models from scratch. Awesome. Awesome. Well, Aakanksha, thanks so much for jumping on and sharing a bit about what you're seeing with regards to agents. Thank you.
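To make the loss-objective ideas from the conversation concrete, here is a minimal, hypothetical sketch of the two tricks Chowdhery mentions: fill-in-the-middle reordering of a code file, and masking so that the loss is only computed on the spans you want the model to learn to predict (for example, a tool call). The tag strings, the toy whitespace tokenization, and the function names are illustrative assumptions, not any lab's actual recipe.

```python
# Illustrative sketch only: fill-in-the-middle (FIM) reordering and loss masking
# for tool-use spans. Tag names and tokenization are made up for clarity.

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def make_fim_example(code: str, mid_start: int, mid_end: int) -> str:
    """Break a file into prefix / middle / suffix and move the middle to the end,
    so plain next-token prediction forces the model to infill the missing span."""
    prefix, middle, suffix = code[:mid_start], code[mid_start:mid_end], code[mid_end:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

def make_loss_mask(tokens: list[str], target_spans: list[tuple[int, int]]) -> list[int]:
    """Return a 0/1 mask over tokens: 1 where the loss should be applied
    (e.g., the tool name and its arguments), 0 everywhere else."""
    mask = [0] * len(tokens)
    for start, end in target_spans:
        for i in range(start, min(end, len(tokens))):
            mask[i] = 1
    return mask

if __name__ == "__main__":
    file_text = "def add(a, b):\n    return a + b\n"
    print(make_fim_example(file_text, mid_start=15, mid_end=31))

    # Pretend these tokens came from an agent trace; we only score the tool call.
    trace = ["User:", "find", "flights", "Assistant:", "search_tool(", "'SFO->JFK'", ")"]
    print(make_loss_mask(trace, target_spans=[(4, 7)]))
```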
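The "trajectory" training data discussed in the conversation, a plan, a sequence of actions with execution feedback, and failed steps the model should learn to correct rather than repeat, could be recorded roughly as follows. The field names and structure are hypothetical, chosen only to mirror the plan / attempt / verification loop described above.

```python
# Illustrative sketch of a multi-step agent trajectory of the kind discussed above.
# Field names are hypothetical; the point is that failed steps and execution
# feedback stay in the record so a model can learn to correct course.

from dataclasses import dataclass, field

@dataclass
class Step:
    action: str          # e.g. "edit file", "run tests", "call search tool"
    observation: str     # execution feedback from the environment
    succeeded: bool      # did the verification (tests, checks) pass?

@dataclass
class Trajectory:
    goal: str
    plan: list[str]
    steps: list[Step] = field(default_factory=list)

    def failed_steps(self) -> list[Step]:
        """Steps the model should learn not to repeat verbatim."""
        return [s for s in self.steps if not s.succeeded]

traj = Trajectory(
    goal="Make the test suite pass after refactoring module X",
    plan=["locate relevant files", "apply fix across files", "run tests"],
)
traj.steps.append(Step("apply fix to x.py only", "3 tests still failing", False))
traj.steps.append(Step("apply fix to x.py and y.py", "all tests pass", True))
print(len(traj.failed_steps()))  # 1 -> the earlier failed attempt remains visible in context
```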
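The distinction drawn between long-context retrieval (needle in a haystack) and multi-hop, long-form reasoning can also be sketched as two toy evaluation probes. The documents, facts, and questions below are invented placeholders and are not items from MRCR, LOFT, or any real benchmark.

```python
# Toy illustration of the retrieval-vs-reasoning gap discussed above.
# Real benchmarks (e.g., MRCR, LOFT) are far more involved; this only shows the shape.

import random

def build_haystack(needle: str, n_filler: int = 1000) -> str:
    """Single-hop probe: bury one fact in a long context and ask for it back."""
    filler = [f"Filler sentence number {i}." for i in range(n_filler)]
    filler.insert(random.randrange(n_filler), needle)
    return " ".join(filler)

def build_multi_hop(n_filler: int = 1000) -> tuple[str, str, str]:
    """Two-hop probe: the answer requires combining two facts that sit far apart,
    which is closer to the long-form reasoning models still struggle with."""
    facts = ["Alice's badge number is 7421.", "Badge 7421 unlocks the west lab."]
    filler = [f"Filler sentence number {i}." for i in range(n_filler)]
    filler.insert(50, facts[0])              # first hop, early in the context
    filler.insert(n_filler - 50, facts[1])   # second hop, near the end
    question = "Which lab can Alice unlock?"
    answer = "the west lab"
    return " ".join(filler), question, answer

context, q, a = build_multi_hop()
print(q, "->", a, f"(context length: {len(context.split())} words)")
```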
Related Episodes

AI in 2025: From Agents to Factories - Ep. 282
The AI Podcast (NVIDIA)
29m

Why Vision Language Models Ignore What They See with Munawar Hayat - #758
TWIML AI Podcast
57m

From Hiring to Growth and the Future of Workforce Strategy - with Meghna Punhani of Eightfold AI
The AI in Business Podcast
35m

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757
TWIML AI Podcast
48m

Proactive Agents for the Web with Devi Parikh - #756
TWIML AI Podcast
56m

AI Orchestration for Smart Cities and the Enterprise with Robin Braun and Luke Norris - #755
TWIML AI Podcast
54m