

Is It Time to Rethink LLM Pre-Training? with Aditi Raghunathan - #747
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
What You'll Learn
- ✓ Benchmark performance does not necessarily translate to real-world performance, as models can fail in surprising ways when deployed in slightly different contexts.
- ✓ Increased pre-training data and compute can sometimes make models worse at fine-tuning, as the relationship between pre-training and fine-tuning performance is not straightforward.
- ✓ The community needs to focus more on understanding model adaptability and how to ensure reliable performance, not just raw capability.
- ✓ Existing large language models like Llama 3 have been found to be harder to fine-tune than earlier models such as Llama 2, despite their higher benchmark scores.
- ✓ Beyond a certain point, a higher ratio of pre-training tokens to model parameters corresponds to worse performance after fine-tuning on specific tasks.
AI Summary
The podcast discusses the limitations of current large language models (LLMs) and the need to rethink their pre-training. The guest, Aditi Raghunathan, shares her research findings on how increased pre-training data and compute can sometimes lead to worse performance when fine-tuning the models on specific tasks. This highlights the gap between benchmark performance and real-world deployment, and the importance of understanding model adaptability beyond just raw capability.
Key Points
1. Benchmark performance does not necessarily translate to real-world performance, as models can fail in surprising ways when deployed in slightly different contexts.
2. Increased pre-training data and compute can sometimes make models worse at fine-tuning, as the relationship between pre-training and fine-tuning performance is not straightforward.
3. The community needs to focus more on understanding model adaptability and how to ensure reliable performance, not just raw capability.
4. Existing large language models like Llama 3 have been found to be harder to fine-tune than earlier models such as Llama 2, despite their higher benchmark scores.
5. Beyond a certain point, a higher ratio of pre-training tokens to model parameters corresponds to worse performance after fine-tuning on specific tasks.
Topics Discussed
- Large language models
- Pre-training
- Fine-tuning
- Model adaptability
- Benchmark performance vs. real-world deployment
Frequently Asked Questions
What is "Is It Time to Rethink LLM Pre-Training? with Aditi Raghunathan - #747" about?
The podcast discusses the limitations of current large language models (LLMs) and the need to rethink their pre-training. The guest, Aditi Raghunathan, shares her research findings on how increased pre-training data and compute can sometimes lead to worse performance when fine-tuning the models on specific tasks. This highlights the gap between benchmark performance and real-world deployment, and the importance of understanding model adaptability beyond just raw capability.
What topics are discussed in this episode?
This episode covers the following topics: Large language models, Pre-training, Fine-tuning, Model adaptability, Benchmark performance vs. real-world deployment.
What is key insight #1 from this episode?
Benchmark performance does not necessarily translate to real-world performance, as models can fail in surprising ways when deployed in slightly different contexts.
What is key insight #2 from this episode?
Increased pre-training data and compute can sometimes make models worse at fine-tuning, as the relationship between pre-training and fine-tuning performance is not straightforward.
What is key insight #3 from this episode?
The community needs to focus more on understanding model adaptability and how to ensure reliable performance, not just raw capability.
What is key insight #4 from this episode?
Existing large language models like Llama 3 have been found to be harder to fine-tune than earlier models such as Llama 2, despite their higher benchmark scores.
Who should listen to this episode?
This episode is recommended for anyone interested in Large language models, Pre-training, Fine-tuning, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
Today, we're joined by Aditi Raghunathan, assistant professor at Carnegie Mellon University, to discuss the limitations of LLMs and how we can build more adaptable and creative models. We dig into her ICML 2025 Outstanding Paper Award winner, “Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction,” which examines why LLMs struggle with generating truly novel ideas. We dig into the "Roll the dice" approach, which encourages structured exploration by injecting randomness at the start of generation, and the "Look before you leap" concept, which trains models to take "leaps of thought" using alternative objectives to create more diverse and structured outputs. We also discuss Aditi’s papers exploring the counterintuitive phenomenon of "catastrophic overtraining," where training models on more data improves benchmark performance but degrades their ability to be fine-tuned for new tasks, and dig into her lab's work on creating more controllable and reliable models, including the concept of "memorization sinks," an architectural approach to isolate and enable the targeted unlearning of specific information. The complete show notes for this episode can be found at https://twimlai.com/go/747.
Full Transcript
You know, we measure performance on a benchmark, and if that's all we care about, it seems like we can do really well, because if you collect data that looks like the data you want to do well on, you can just, you know, throw a lot of compute at it. But like, does that actually solve the task if we, you know, just test it in a slightly different way that is also meaningful from a deployment perspective? But like, when does that break the models and why does that happen? And how does the, you know, the training dynamics or the data curation, like what aspects of these actually influence this behavior? and what's the right way to intervene and make these problems go away. All right, everyone. Welcome to another episode of the TwiML AI podcast. I am your host, Sam Charrington. Today, I'm joined by Aditi Raghunathan. Aditi is an assistant professor of computer science at Carnegie Mellon University. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Welcome to the podcast, Aditi. Yeah, thanks for having me here. Excited for the conversation. I'm excited for the conversation as well. We are going to be digging into one of your recent papers, which won the outstanding or an outstanding paper award at ICML 2025. That paper is called Roll the Dice and Look Before You Leap, going beyond the creative limits of next token prediction. But more broadly, your lab has been really digging into some of the limitations of current LLM architectures and opportunities and, you know, what we need to understand to better make use of AI models. And I'm excited to talk through those with you. And to get us started, I'd love to have you share a little bit about your background and, you know, what got you excited about AI and machine learning. So I guess in my undergrad, I was always excited by complexity theory. And I really liked the elegance of thinking about what's possible and what's not. But at the same time, like maybe like several others, I also had the itch to work on something that had immediate impact, was practical. And when I started doing research was around 2015, 2016, when deep learning was really taking off. and this was post-ImageNet and so on. And so naturally I got excited about it. But really the moments that shaped my thinking around this was when I was at Stanford, I was attending a couple of talks and one of them was on adversarial examples, which showed me how these really capable models can also fail in spectacularly kind of surprising and maybe seemingly dumb ways. And they also had some immediate practical questions around security, reliability in the wild and so on. So that's sort of what got me thinking about, like, how do we think about, you know, these really capable systems that also have these failures? And it also tied into my itch on, you know, complexity and think about abstractions and making precise statements about these things. And it kind of felt like a nice combination because a lot of these aspects cannot be captured by just specific numbers and a benchmark and really need to go one step beyond to think about how do we, you know, what do these models learn? When do they work? When do they fail? And so on. So that's sort of what got me into all of these questions. And of course, the field's really been changing rapidly. And so I've kind of gone along for the ride. 
And what's kind of interesting is that progress has really, you know, models have become very capable in a lot of ways, but some of these fundamental failures still remain. And so we can keep asking the same questions about various different models. And maybe it's also sobering that it's not that we've actually pushed the reliability as we have pushed the capability of these models. And so in some ways, the same questions have remained over time, but also a lot of the context of these questions has been changing. It's pretty spectacular what we can do with the models. And yet we're still talking about not really understanding how they work and how it's a little bit of magic. Yes, absolutely. And I think the part that's also concerning is when this lack of understanding becomes a real issue. When, you know, when people think about getting models to be safe in some way, to not utter toxic content or to release dangerous information to people or manipulate people in some way that's dangerous. And a lot of our guardrails around these things are very brittle. And we currently don't have a way to do better because we don't really understand these systems. And so I want to really use our understanding in a way to shape actually making these models more reliable in all of these contexts. So talk a little bit about with that context in mind, how you've kind of crafted a research agenda. What are your core focus areas? Yeah, so I've always been interested in thinking about, like, this is maybe a catch-all phrase of distribution shifts, but it does capture this idea that, you know, we measure performance on a benchmark, and if that's all we care about, it seems like we can do really well, because if you collect data that modally looks like the data you want to do well on, you can just, you know, throw a lot of compute at it, and it seems like that just works really well. But at the same time, like, how does that, does that tell us anything about, like, how the model works in any situation that's slightly different, but you still expect the model to work well. So that's sort of the angle that you've taken in a variety of different questions that we ask is like, we can minimize some loss or like try to target a certain benchmark, which is what a lot of these models are trying to do. But like, does that actually solve the task if we, you know, just test it in a slightly different way that is also meaningful from a, you know, from a deployment perspective, but like, does like, when does that break the models and why does that happen? And how does the, you know, the training dynamics or the data curation, like what aspects of these actually influence this behavior and what's the right way to intervene and make these problems, you know, go away. This growing gap between benchmark performance and the experience of using these models, I think, is kind of a growing concern and one that we're hearing a lot of, you know, this particular moment of time, I think, because of the recent release of GPT-5, which, according to the benchmarks, is the smartest model around, but the user experience for many has been lacking in a lot of ways. And I think that kind of exemplifies this idea of benchmarking, as we think of it today, being somewhat inadequate. Yes, and maybe one concrete way of thinking about this that I feel like is very important, but people haven't thought too much about, is many times we want to use this as a starting point and do some kind of fine tuning. 
And so this could just be for safety alignment, that kind of stuff that we can say, oh, maybe one company just takes care of that. But also a lot of companies have their own proprietary data that they want to fine tune on, or like you want to personalize it in some way to your context, or you want it to improve over time and things like that, or the world is changing. So we don't have a good way of measuring this adaptability of these models, which I also think is actually a very fundamental question. And as an example, I was chatting with a colleague, Graham Neubig, who kind of asked me this question several months ago on like, which model should we start off with if we had some data that we wanted to fine tune on? Should we take the model that, like, as such on our benchmark or, like, on this data works really well, like the zero-shot performance of the model as is? Does that automatically mean it's better after fine tuning? And yes, I mean, the answer is like, really, no. And for a lot of things, actually, model performance on a benchmark kind of tracks performance, including reliability, actually, or robustness to a lot of distribution shifts. What the community has found is, in general, the more data you train on, all of these numbers go up. But this one aspect of how easy is it to adapt these models after you train them on data, we actually found the reverse thing happens at some point. So if you take a small model and keep training on a lot of data, we see that eventually the model that has been trained on more data, that you've thrown more compute at, is worse as a starting point for fine tuning than an earlier checkpoint that you had. So this was actually a really striking result because it's one of the first realistic cases where more compute in a very non-contrived setting, like on high quality data, more compute is actually kind of making a model worse, not just that it saturates, but it's actively worse. And another context in which this happens is, you know, when people are trying to serve quantized models to improve efficiency, there also we find this kind of trend where at some point, like as you throw more data at these models and then you quantize them, the model trained with more data still is better. But then you see this U-curve where at some point showing more data actually means that this model is worse after quantizing, which is again a case where your downstream model is strictly worse, even though you've spent more compute with the best of intentions and shown good data. So I think this was like a really interesting kind of finding that I think people should think more about, is that one important aspect of using these models is actually fine tuning and adapting. And our current, you know, push towards just doing what we're doing now by improving this pre-training, or like the first step of this process, but really optimizing that doesn't mean it's going to be good after doing your fine tuning or post-training. And this is not just a theoretical limit. We actually see that, you know, a lot of people talk about how Llama 3, so this is before the Llama 4 release, so people are just talking about Llama 3 versus Llama 2, and a lot of people found that Llama 3 was much harder to fine-tune. You know, like academics regularly fine-tune models. And so, you know, this was sort of Graham's question too.
And it kind of ties into this idea that it's because we have trained on so much data that at some point Llama 3 is still really good at benchmarks, but it's just worse if you want to kind of use it for your task. And similarly, we ran experiments on the OLMo checkpoints because, you know, great work. They're releasing all these checkpoints so we can do analysis on them. And we found that the smallest size model, 1B, I believe, that was trained on 3 trillion tokens is actually worse than the model that was trained on fewer tokens after we do this kind of fine tuning or post training on, you know, very realistic benchmarks that people care about. So this was kind of an exciting result that shows that, you know, one axis of rethinking pre-training is like, how do we get a good starting point for fine tuning? Is it specifically the ratio between the number of tokens that the model is trained on and the number of parameters that you found to be inversely proportional to performance in fine-tuning? Yes. So for the precision result, that's exactly what we found. And, you know, we tried to allow for different exponents, but it turns out that in our experiments, the exponents turned out to be the same. So it's literally the ratio. But in our fine tuning, we kind of didn't do precise curves because a lot of that depends on the exact distribution of interest. So there's no clean mathematical result that would hold, or the trend won't be the same for all data sets. But in general, we find that a larger model can take in more tokens before it shows this sort of inverse effect compared to a small model. So in some sense, a larger model can absorb more tokens efficiently, but it's not always exactly the ratio. It could be a different exponent. In some ways, that strikes me as like an intuitive result in the sense that, you know, when that ratio is larger, the model is potentially more overfit and it would be harder to unlearn and learn things as you're trying to fine tune. I feel like another way to think about this is usually when we train models, and, you know, I've looked at learning dynamics and things like that for a while now, and the converging result in that space is models learn simple things first, and then they learn increasingly complex things. And so like my colleague had this kind of visual image of, you know, trying to build something with cards. Like usually you start off with something really solid and then you kind of add more and more complicated things. But then like that also means that that structure gets less stable. And so if you actually try to adapt these models in some way, then like everything collapses in some sense. And so I think we see something similar, that you're forcing the model to learn more and more complex things, which is good because it's fitting the data. But then that also means the model is brittle in some way, that you try to push the model a little, you know, in some direction by minimizing some gradient steps in a direction, but that introduces so much noise or kind of breaks the model and causes a lot of forgetting. So that's sort of what happens, and it's the same thing for precision, too. Like, you know, you can think of that as adding some kind of noise by changing the weights. And so the models become less robust; they cannot absorb that noise and they just break.
It strikes me that looking at this ratio is a fairly coarse-grained metric, but that it would be interesting to go even further. Like if you could, for example, understand the relative distribution of the training data on the model relative to the direction you want to fine-tune it to. Like you might have a model that is, you know, benchmarks worse, you know, is smaller, or other reasons why you might think it wouldn't do as well, but because of some distribution overlap or something might do better. Is that a reasonable direction, you think? Yeah. So there are different ways to think about this. So one is purely kind of, is the model more stable, in that it can move in different directions without losing too much, right? And so this token-to-parameter ratio could roughly correspond to that. And then the other question, like what you said, is how much do we actually need to move? And we have some experiments in our, you know, this catastrophic overtraining ICML paper, where one proxy for how much to move is just the learning rate. You know, so if the model gets good performance even with a small learning rate, then that roughly means the model hasn't moved too much and it kind of works. The distribution is closer in some sense. So we do find some interesting trends over there as well. So if we start measuring or fine tuning on things that are very close, then we actually don't see this effect. The models are able to take in more tokens and still show good performance because we don't really end up updating the model much. And in the limit, if the fine-tuning matches the pre-training, then we just get back the usual, like things go down as we show more data. But then as we have larger changes, then, surprise, these kinds of trends start happening where at some point the model is actually getting worse. And one challenge is that it's not very easy to say which distributions are close or far, because we might have some intuition for what this is, but how the model stores information might be different, which is sort of why we look at learning rate itself as a proxy for how much the model changes, as a way of saying how different the distributions are. And then when I hear catastrophic overtraining, I think of not just that there's like this inverse relationship, but also that there's maybe like a cliff. Like you reach a point and then your fine-tunability kind of falls off a cliff and it's much worse. Did you kind of characterize what that point is? Yeah. So what actually happens is, overall, there is a regime where showing more data does help, no matter what the model size, because the model is just learning stuff. And then at some point the model gets so brittle that whatever it learns from the more data kind of just is overcome by its brittleness. And so we start seeing this jump. So there is a point where more data starts hurting you. So it's like a U-shaped kind of situation. And the point at which this U turns really depends, kind of like we discussed, on the data that you're fine-tuning on. And yeah, so we are able to characterize these precisely in the quantization setting because there are just fewer factors to account for, because you just add noise. But in this fine-tuning setting, it's a little bit more tricky because we can't exactly say what the distribution of interest is.
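As a concrete illustration of probing this U-curve, here is a minimal sketch (not the paper's setup): fine-tune several intermediate pre-training checkpoints of the same model with an identical tiny recipe and compare held-out loss; a checkpoint past the turn of the U will come out worse despite having seen more tokens. The model name, revision strings, texts, and hyperparameters below are placeholders for illustration, and swapping the checkpoint loop for a learning-rate sweep is one way to probe the learning-rate proxy mentioned above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: (repo, revision, approximate pre-training tokens seen).
# Substitute the real intermediate-checkpoint revisions published for your model.
CHECKPOINTS = [
    ("allenai/OLMo-1B-hf", "revision-early-placeholder", 1.0e12),
    ("allenai/OLMo-1B-hf", "revision-late-placeholder", 2.0e12),
    ("allenai/OLMo-1B-hf", "main", 3.0e12),
]

train_texts = ["a few in-domain fine-tuning examples ..."]   # placeholder task data
eval_texts = ["a held-out example from the same task ..."]   # placeholder eval data

def finetune_and_eval(repo, revision, lr=1e-5, steps=50):
    tok = AutoTokenizer.from_pretrained(repo, revision=revision)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(repo, revision=revision)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    def batch_of(texts):
        b = tok(texts, return_tensors="pt", padding=True)
        labels = b["input_ids"].clone()
        labels[b["attention_mask"] == 0] = -100   # ignore padding in the loss
        return b, labels

    b, labels = batch_of(train_texts)
    model.train()
    for _ in range(steps):                         # same tiny recipe for every checkpoint
        loss = model(**b, labels=labels).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

    model.eval()
    eb, elabels = batch_of(eval_texts)
    with torch.no_grad():
        return model(**eb, labels=elabels).loss.item()

for repo, rev, tokens in CHECKPOINTS:
    print(f"{tokens:.1e} pre-training tokens -> fine-tuned eval loss "
          f"{finetune_and_eval(repo, rev):.3f}")
# If a later checkpoint ends up with *higher* eval loss than an earlier one,
# you are past the turn of the U-curve for this task and recipe.
```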
But if we fix a certain distribution or a certain learning rate that we care about, then again, this point actually becomes predictable. So we didn't go into modeling this functional form too carefully in the fine tuning setting in our paper. And I think that would be very interesting future work, for once we decide, you know, what kind of distributions you want to fine tune on and you want to make decisions on how to allocate compute and do the scaling laws and so on. But we do see trends where it looks fairly predictable with a nice mathematical form. So having done this research and, you know, identified these results, if you were to, you know, approach a future task and need to fine tune a model, how would it change the way you think about base model selection? You know, beyond the generalities, like would you, it's good to know because it helps you understand, but you would still like test everything and see how it performs on your data? Or is there just a whole set of models that you would no longer look at, for example? Yeah, that's a great question. I think if you are using a really, I guess what we found is the 1 billion model size, which again, you might want to use for efficiency reasons. So if it's trained beyond 2.5 trillion tokens, or maybe even two trillion tokens, because we didn't have too much granularity in the public checkpoints, I would say like, okay, maybe that's a point where you don't want to use that model unless you're sure that you don't want to change the model too much. So that's sort of like one thing that's clear. But in all the gray areas in between, I think we do have to do some preliminary experiments where we take a bunch of different checkpoints and see what the fine-tuning performance actually ends up looking like. And that would give us some sense of which part of the U-curve we are on. And so I would say that's the understanding it gives us, is that we kind of expect this shape and we can try to probe to see which regime current models are in. And so that would guide us to where the optimal could be. And this is the paper, Overtrained Language Models Are Harder to Fine-Tune. We'll link to all the papers that we discuss in the show notes for folks. And so just kind of contextualizing this, and we'll get to creativity in a second, but you published a really interesting blog post that kind of surveyed all of your lab's papers at ICML, and it was kind of broken up into limitations and opportunities. And one of those limitations is overtraining. And that's kind of what we just talked about. The next one was unlearning and the challenges associated with unlearning, which is related to this idea of fine tuning. Talk a little bit more about what you've seen with unlearning. Yes, that's like, yeah, it's actually very related. And so when you think about what are the use cases of fine tuning, one is just to specialize and push a few numbers on your use case rather than the benchmark. But actually, another important use case of all of these post-training or fine-tuning methodologies is safety in some way. And so a lot of the alignment work is actually trying to teach the model post-hoc what is good and what's bad. And similarly, we might try to unlearn harmful knowledge or unlearn private information and so on. So that's a very safety-specific use case of fine-tuning.
And we tried to look at why it is so hard to do unlearning. There's, like, so many papers that are published, and it very much reminded me of adversarial examples from my PhD, where people had all these defenses, ideas for defenses, but then, you know, Carlini would break all of them. And so, yeah, it basically kind of felt like that. And if you look at some of the assumptions, or kind of how the unlearning field has progressed, people have tried to take the base model as it is, or treat the starting point as it is, and then try to assume certain things that might be happening in these models and use that to get algorithms. And to maybe interject and to be more concrete about unlearning, you mentioned safety, but, you know, an example might be, you may have in the training data how to create a chemical weapon, and there's a whole, you know, line of work around building guardrails to detect that and suppress the model from talking about that. But another direction is to try to just extract that information, erase it from the model. And that's unlearning. Yeah. And it could be harmful information. It could also be private information that they shouldn't have trained on, or someone wants to remove that information. And so it's something that exists in the model and you kind of want to remove that. And so I guess maybe the privacy angle also tells you sort of why guardrails don't feel sufficient, you know, you kind of really just want to remove it from the model and want it to be like it wasn't ever trained on this data. So, you know, that's the other use case for all of these things. And so what people have been trying to do is, so the first, like, the first way to try to address this would be to fine tune so that the model has high loss on all of these things that you want to forget. And then people realize that you can't actually do too well, because then you have to really change everything a lot and that destroys a lot of information in the model. And then there's another sense that maybe this information could be localized, or maybe we can find specific neurons or specific subspaces and try to just erase those parts. And even that has actually limited success for two reasons. One is we have to first find a way to figure out where those neurons are that store this information. And second is, even after we find that, it's not clear what's the right way to erase that without destroying everything else. But the assumption here is that there exist such neurons in the first place. And what we show is that that actually is not true, that there's no reason in how we train these models that should allow this information to be disentangled in this nice way. And it seems like maybe we see such separation, but because the separation is not very good, that's sort of why our unlearning methods don't work very well. And so we instead say, instead of constraining ourselves to work with the starting point that is not very good, in that it's not really disentangled all of these things, what if we gave ourselves the flexibility of, like, what if we could train our models in a way that enables this kind of downstream unlearning, because we know that that is a use case which we might care about.
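For concreteness, here is a minimal sketch of the naive unlearning baseline described above: fine-tuning so the model has high loss on the things you want to forget, i.e. gradient ascent on a forget set. The model name and forget texts are placeholders for illustration; as discussed, this tends to damage far more than the forget set, which motivates the architectural approach that follows.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m"          # placeholder small model, not from the episode
tok = AutoTokenizer.from_pretrained(MODEL)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_texts = ["the specific fact we want the model to forget ..."]   # placeholder forget set
batch = tok(forget_texts, return_tensors="pt", padding=True)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100     # ignore padding in the loss

model.train()
for step in range(100):
    loss = model(**batch, labels=labels).loss
    (-loss).backward()        # gradient *ascent*: push the loss on the forget set up
    opt.step()
    opt.zero_grad()
# In practice this degrades much more than the forget set -- the failure mode
# discussed above -- because the gradients touch all parameters indiscriminately.
```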
So that inspired this work on memorization sinks, which is kind of taking the same idea or assumption that people implicitly try to make about models, that maybe information is isolated to neurons, but instead of waiting for it to happen by magic, we are like, let's try to actually enforce that by design. And we find in our paper, both through experiments and analysis, that normal training does not actually lead to this assumption being true. Because, like, yeah, we have analysis in the paper, but kind of the main idea is, like, we don't really encourage or force this sort of disentanglement. And it really depends on the bias of the training algorithm. And the current training algorithms don't have a bias to actually enable this kind of separation. But we can encourage this separation. The assumption that we're talking about is that knowledge is localized, essentially. Exactly. Yes. Yes. That knowledge is localized. Yes. So there's no, like, if you look at the training objective, you're just passing gradients to all the parameters. There's nothing to do with that. So there's nothing about it. There is some, you know, separation that seems to emerge, but it's not perfect because it doesn't have to be. It's not trained to be. But what if we instead trained to encourage this sort of separation? And that's exactly the idea behind memorization sinks, where we're trying to say, for every document, let's say that these are some specific neurons that are only updated on that document, with the hope that all the information that is specific to this document goes to those neurons, and those neurons are not touched or updated on other documents. And so that's the main idea here. And we, of course, want to have shared neurons because we also want the models to actually learn from all of this information. And we don't want to train completely decentralized models because those won't be as capable. So when you're doing pre-training, we want the model to learn something that is shared. And this is where, again, the beauty of the training or the inductive bias comes in, which is with this architecture, the model actually is incentivized to keep the shared information, that is shared across all the documents, in the neurons that are updated on all the documents. That's just a strictly better solution. And the stuff that's very specific to a particular document is in these memorization neurons. And since those are not updated on any other documents, that information is sort of preserved, disentangled, kept aside. And we find through our experiments, on a somewhat small scale, but we're scaling that up now, that this architecture actually enables this kind of nice separation between what is special or unique to your particular documents, that's all kept in specific neurons, whereas what is shared is allowed to be learned in these shared neurons. And at test time, you can just drop out these memorization neurons and then you're good to go. A couple questions. So at train time, are you identifying the information that you will later want to pull out of the model? Yeah, that's a good question. So the way we're setting it up right now is that we are just defining the units that we might want to remove. So if you feel like this document is a unit whose information you might want to remove, then we kind of associate this entire document to a specific set of neurons. So it could also be like a topic.
Like let's say we say that we don't want to be at the document level, but we want to be at the topic level. Then you want to have neurons for all the documents for a particular topic. We want to selectively kind of activate only those neurons. So we need to know kind of that abstraction that we might want to remove later. It seems like the number of these memorization sink neurons would be a hyperparameter, and like the ratio of those to the total number of parameters is kind of an interesting... Yes, exactly. To what degree did you experiment with all of that? Yeah, we experiment with all of this. So we do need to, and like, so the models have to be a little bit bigger in this way because we do want more neurons. But one other kind of nice trick is we don't need them to have completely separate neurons. We just need to make sure that the neurons are somewhat orthogonal. And so we can just pick random high-dimensional directions and they are almost orthogonal. So that kind of means, instead of operating on individual neurons, we take these neurons but activate different subspaces that are fairly orthogonal. And that also works as well. So that's a trick to prevent having a really large model size if you were to otherwise have a specific neuron for, like, every document or something like that. So that's one way in which we're able to make sure that the model size, you know, doesn't go up way more than what it is. So we have ablations in the paper where we find that for moderate increases in model size, we're actually able to kind of encourage this behavior. Are these neurons localized in the architecture, like to a particular layer, or are they distributed? Is kind of the topology totally learned, or is it a priori set up where these neurons are? Yeah. So what we do in our paper is, for every document, we have a hash, or like some kind of random neurons, such that that particular combination of neurons is activated. And so this effectively marks out, like, one direction in a high-dimensional space that's allotted to this new document. One could consider smarter schemes, or like there might be improvements to this on kind of which layers should we look at. So we only looked at the MLP layers, and we only introduced these things within the MLP layers, sort of building on the intuition people had that facts or factual information is generally stored in the MLP layers. But I could imagine there's a lot of research that one could do to, you know, be more intelligent about this. But we don't really learn this; right now it's just random. We could also try to have, say, related documents maybe share similar subspaces in some way or something. So we could even have a softer version of this where, instead of everything being orthogonal, we can maybe put things that are slightly close together to be closer together, or so on, and that might further push up performance. I think there's a lot, like a lot of things that one could do here. And I'm very excited about this because I think it really kind of tells us ways in which we can get more control over these models by design. It also makes me think about kind of the overlap with the Anthropic circuit tracing work, and if there is some way to combine these techniques to better localize concepts. Yeah, that's a great point.
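As a rough illustration of the idea described here, below is a toy gated-MLP layer where each document hashes to a small set of "sink" units that are only active (and therefore only updated) when that document is being trained on, while shared units stay active for everything; at inference the sinks are simply dropped. All sizes, the hashing scheme, and the placement are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MemSinkMLP(nn.Module):
    """Toy MLP block with shared units plus per-document 'memorization sink' units."""

    def __init__(self, d_model=256, d_shared=512, d_sink=512, sinks_per_doc=16):
        super().__init__()
        self.up = nn.Linear(d_model, d_shared + d_sink)
        self.down = nn.Linear(d_shared + d_sink, d_model)
        self.d_shared, self.d_sink = d_shared, d_sink
        self.sinks_per_doc = sinks_per_doc

    def sink_mask(self, doc_id: int) -> torch.Tensor:
        # Deterministic hash from document id to a small subset of sink units.
        g = torch.Generator().manual_seed(hash(doc_id) % (2**31))
        idx = torch.randperm(self.d_sink, generator=g)[: self.sinks_per_doc]
        mask = torch.zeros(self.d_sink)
        mask[idx] = 1.0
        return mask

    def forward(self, x: torch.Tensor, doc_id=None) -> torch.Tensor:
        h = torch.relu(self.up(x))
        shared, sink = h[..., : self.d_shared], h[..., self.d_shared:]
        if doc_id is None:
            # Inference: drop every memorization sink; only shared knowledge remains.
            sink = torch.zeros_like(sink)
        else:
            # Training: only this document's sink units stay active, so only their
            # weights (plus the shared ones) receive gradient for this document.
            sink = sink * self.sink_mask(doc_id).to(sink.device)
        return self.down(torch.cat([shared, sink], dim=-1))

mlp = MemSinkMLP()
x = torch.randn(2, 10, 256)
train_out = mlp(x, doc_id=42)   # document 42's sinks active during training
eval_out = mlp(x)               # sinks dropped at test time
```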
So one thing that could be interesting is, so maybe natural training already has a propensity to have certain kinds of structures, but like it's not perfect. So maybe we can start with, like, okay, these are structures that seem easy to enforce, so let's try to actually hard-enforce them. And so maybe that's a way to take these fuzzy kind of interpretability findings and actually train a model to make those more hard and more concrete, to enable better control. Yeah. And to a first order we kind of do something like that, which is, people have kind of thought that knowledge is isolated, and so we were like, OK, let's try to really isolate that because it seems plausible. So we could imagine doing other things like that. That paper is called Memorization Sinks: Isolating Memorization During LLM Training. It also brings to mind for me, you know, maybe it's just this word overlap of memorization, but a lot of conversation now is talking about the role of memory with LLMs and, you know, in particular how, you know, memory is not a very robust feature of attention-based LLMs and how memory architectures are a promising way to increase performance. Like, does that relate to this in any way? I mean, I can think of a philosophical way in which I think that's somewhat related. In that, I think in some sense, another way that memorization sinks actually could be useful is not just for privacy, but it tells us we can disentangle things that should be constant or kept in the model and things that should be updated. So for example, facts change, but we want to still preserve the ability to reason or the linguistic capabilities of the model. And so if we had ways to disentangle those, then that helps with some of these things. And so you can think of memory, in the context that you were saying, as stuff that we want the model to actually remember. And so the more we can kind of disentangle stuff that should be kept around versus stuff that's independent of that, I think in that sense, they are sort of related. And maybe there are some ideas that we can cross-share on how the exact architectures are developed and so on. Yeah, yeah, yeah. I don't know what the kind of neuroscience implications would be or what we've learned from neuroscience, but it strikes me that like in humans, like fact memory and concept memory are different. And so we should have different things on the AI side. Exactly. And I think architectures that are scalable, but still kind of try to enforce something like this, I think are very promising. And so lastly, on the list of limitations you profiled is creativity. And that's where your big paper comes in. You know, talk a little bit about the motivation to explore creativity. So I guess like every AI researcher, I try to also use LLMs as much as I can to automate and make my life easy. And, you know, sometimes, and this may or may not have happened in real life, you prompt the model to generate homework problems, right? Like, especially because you want the model to generate something new, or like, you know, I'm like, oh, can the model come up with something really clever, like to test students, that I couldn't come up with or something. Or new research ideas, to, you know, write some new grant proposals or something. And I've almost always found that, or maybe always, like the models have never been able to give me something that's truly like, aha, I hadn't thought of that.
So they're great at almost every other task that I try to use them for. Like they're great at summarization. They're great at like, you know, looking at what's common across a bunch of different things and drawing some stuff there. But they're not really good at these open-ended tasks that I give these models. And so that was sort of like the kind of motivation or like just, you know, that's always something that was lingering as like, yeah, I just don't feel like, or like, how do we think about that? And there's also work that people have been trying to do, you know, in the community about analyzing research ideas, like actually running human studies to see can models generate ideas or not. And there's been a lot of back and forth. So it's something that like is very fuzzy and like seems like things people are thinking about. And this was in collaboration with Vaishnav, who's a researcher at Google. And so we were having this conversation And he was also an author on this next limitations of next token prediction paper. And so I wanted to do some kind of a similar research of like, like, let's try to identify like, what are the core principles that we might need for creativity? And can we test like, can that actually emerge from these training, like training objectives that we have from these models? And that's like a way that we can get an answer to this because having actual benchmarks and testing creativity and who knows what's in the training data, all of that makes it harder to answer this at scale. So we tried to take a different approach, be like, okay, let's just like start from first principles here and think about creativity, try to devise simple tasks and see in these concrete tasks, like what is the right objective? Like what does next token prediction do? How do we extract creative solutions? So that was sort of how that got in, how we started thinking about this. At the same time, I also had another project in my lab that looks at this from slightly more realistic settings of problem solving, like reasoning. And there again, we found that models, especially after you train them more, they started kind of collapsing in their solutions. So for people who are familiar with inference time settings, you can either give the model one shot to answer or you can query it multiple times. And what we find is that when we try to train models, in general, their performance at this one shot setting does go up. But they also get worse at this, like, give it multiple times. So what it kind of means is that when models are trained too much, it seems like they start giving the same solution and try the same incorrect thing rather than actually trying diverse solutions. So I was kind of convinced that this is actually a problem and that models are not very good at being kind of trying out creative, diverse solutions, even on realistic tasks. And we kind of wanted to study this from a first principle. So that's sort of really what motivated us to think about creativity. And then once we had some really simple tasks to work with, that actually allowed us to make formal statements or give intuitive statements about what do different objectives do? Like, how do they perform? Can we do something better? And we found some really nice, interesting alternatives to the current paradigms that people have to improve creativity. And we are now trying to run these at scale and put some of these ideas in an actually training larger scale models. 
And we're kind of excited to see where we can take it in terms of making models more creative by changing the way we train them. So talk a little bit about what an objective for creativity means. That seems very difficult to capture in an objective. Yeah, so we drew inspiration from a lot of work in cognitive science, particularly Boden's work that tries to formalize some notions of creativity. So we are in no way, kind of, really not thinking about what's the right definition of creativity, but more like, let's lean in on the definitions that cognitive scientists have taken about creativity and use that. So as a very concrete example, there's this notion called combinational creativity. And I think a nice example of that is wordplay. So if we think of a joke, for example, let me, why did the scare, what's, let me try to pull up exactly, I missed the punchline here. Yeah, I feel like I've seen this so many times that I'm like, it's not funny to me. So, yes. Yeah. So I guess the one example is, why did the scarecrow win an award? And the punchline is, because he was outstanding in his field. So if we think about why this is creative or why this is funny, it's because scarecrow and award are two kind of seemingly different words, like unrelated. But there is this actual connection, like outstanding, that is somewhat novel, or something that you hadn't thought about, that actually links these two words together. So that's an example of, sorry, that's an example of this combinational creativity, where we're trying to see whether the model can find unexpected connections, like through a graph, or find two words that have a common parent, like these two different interpretations of outstanding. And so, like, can the model actually give new sort of ways of doing this? And so the way we abstract this is, now let's say we have a graph and we teach the model all the edges in the graph. Can the model actually discover new siblings? So, like, two nodes that actually share a parent. So the two nodes don't look connected, but they share a parent. So can the model actually see some of these and then generate more of these? Like, can it find new connections in the graph? So that's one example of creativity that's abstracted, really following, you know, Boden's work on combinational creativity. And they have a lot of examples on how a lot of things that we do that we think are creative are actually kind of finding these connections, unexpected connections between things. The other kind of creativity we consider is just exploratory, which is we just want to freeform find interesting things in the world. But of course, they still have some structure to them, but it's just sort of a latent structure. But we just want to find new things with that structure. And like Boden explains, a lot of the work that artists do, for example, is this kind of exploratory creativity. And we try to write this down in math and symbols that we can train models on. So, let's say we train on a bunch of, like, can the model generate circles, for example. It's the same in the graph kind of setting: can the model find new circles or new triangles in the graph that the model wasn't trained on? So these are the ways we think about creativity. They're not perfect by any means, but I think they capture some of these core principles, and they already show why certain training objectives might actually be good or bad for these notions of creativity.
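To make the graph abstraction concrete, here is a toy sketch of the sibling-discovery setup as described above (not the paper's exact construction): training documents are just parent-child edges of a hidden graph, and a "creative" generation is a pair of nodes that was never shown together in training but does share a parent. The graph sizes, document format, and scoring below are illustrative assumptions.

```python
import random
from collections import defaultdict

random.seed(0)
parents = range(50)
children = range(50, 550)

# Hidden graph: every child node gets two random parents.
edges = [(p, c) for c in children for p in random.sample(parents, 2)]

# Training documents the model would see: just the edges, one per line.
train_docs = [f"parent_{p} -> child_{c}" for p, c in edges]

# Ground truth: sibling pairs, i.e. two children sharing at least one parent.
kids = defaultdict(set)
for p, c in edges:
    kids[p].add(c)
siblings = {tuple(sorted((a, b))) for ks in kids.values() for a in ks for b in ks if a != b}

def creativity_score(generated_pairs, shown_pairs):
    """Fraction of generated pairs that are valid siblings AND were never shown in training."""
    novel_valid = [pr for pr in generated_pairs if pr in siblings and pr not in shown_pairs]
    return len(novel_valid) / max(len(generated_pairs), 1)

# Usage: after training a small LM on train_docs (plus a few example sibling pairs,
# the shown_pairs), sample pairs from it and measure how many are new, valid siblings.
print(len(train_docs), "training edges,", len(siblings), "possible sibling pairs")
```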
And how do we kind of, and we can, you know, dig more into that, like what's the right way to train these models, or what could our current training be missing, and how should we be thinking about, you know, maybe new ways to pre-train these models to encourage this kind of creativity. Taking a step back and maybe getting a little bit philosophical, how do you distinguish the kind of creativity that you are thinking about broadly, meaning not necessarily, you know, these particular constructs, but, you know, more broadly, the direction that you'd like to see LLMs go? You know, I ask an LLM to, you know, tell me a joke with a funny accent, and it can do these kind of creativity-ish kinds of things. Like how do you, for someone who says, oh, I use ChatGPT all the time and it's very creative, like how do you distinguish? Yeah, I think, I feel like one notion of creativity is just, is it generating something that's not in the training data, and something meaningful? And like this was already, for a long time, not easy. And so it's, like, it's very impressive that models can do this and can generate poems in the style of blah blah blah, right? Like we can do all of that stuff. I think the difference becomes, what I feel is, like, open-ended versus closed-ended. So in these cases we're still kind of saying roughly where we want to go. Like if I say write X in the style of Y, I kind of, I'm telling the model what to do, and then the model can fill out the path. But what I'm thinking about with creativity is, like, take me to things that I've just never seen or thought of before. And I think that's the part where, I feel like in all our constructs, as well as, I'd say, even from our own use of these models, I feel like that's maybe where we want to think more. So I think, as another example, when I look into these benchmarks, or people trying to use models to, for example, KernelBench, to come up with, to solve problems in new ways, it feels like we still have to carefully kind of prompt the model on the strategies, or the things it has already done and things it should explore and so on. So I think that part is still coming from our explicit instruction, even though it's generating new things. And so I kind of want to see, like, can they actually do this part on their own too? And I think that's what it would mean to meaningfully kind of go beyond what we are able to do and find new things. And if we are able to do that, and we're able to do that at scale, then we're actually going to really discover a lot of interesting new things and find new connections. And then how about the idea that, hey, I've already got a creativity dial, it's temperature, and I can crank it all the way up, and wow, that's doing things that I totally didn't expect or didn't prompt it to do. Before we jump into that, I want to also say, like when I was trying to prompt these models for this ICLR talk we were working on, we tried to just get, can the model generate jokes that I can use, like the scarecrow kind of thing, like new jokes. And it was actually really bad. And I think we were joking that it's because models were trained with, you know, these kinds of objectives that we are just showing in our paper are not very good.
And so that was actually a case where I was trying to get a model to say something that was new, that was a new joke, but still have this structure, and it wasn't able to, to the best extent that I could prompt engineer it. Meaning that, you know, maybe if you're thinking about it casually, like in just asking the model to tell you a joke, maybe, you know, it'll come across some that you haven't heard before, because there are a lot of jokes out there. But like, you know, if you really dig into it, you know, they're probably not novel and, you know, probably not funny. Yeah. And we couldn't get one that obeys the structure of, like, connecting. Like, for example, the prompt would be, tell me a joke that's funny because it connects two unexpected entities, right? And that was the kind of wordplay that we were trying to get at. So it's finding a new connection that I hadn't thought about, but it was kind of bad at doing that. So you could think of many different structures of jokes, but this particular aspect of, kind of, this hidden punchline that the model discovered, uh, it didn't. And I think there's a very deep reason, like we explained in the paper, which is that a lot of the training, kind of, the training data doesn't really have supervision on sort of this; like, it doesn't say first the punchline and then the joke, so that the model can learn to do that. Instead, the punchline is sort of latent. And so that actually means that the models are not able to kind of learn the right structure, and they end up learning some things locally that are sort of memorization. And they don't learn this true process that we want, where you think of something, search over all possible things, find something new, and then generate it. And so that's sort of the limitation that we try to show in the training of these models. Two things jump out at me. One is that I'm not sure that a human would do very well with that kind of instruction, like generate this joke on the fly, given, you know, this criteria, funny. But also that, you know, there's something maybe orthogonal to the idea of, you know, structure and creativity, or structure and, like, impulse, and trying to convey both of those maybe is confusing to the LLM, like rules and, like, create. Yes, actually, that's a good point. So a lot of the creativity is not just simply saying something crazy, but we want it to be still structured and meaningful. And if you think about use cases like molecular biology or, you know, drug discovery, it's not like we want to generate random new things. We actually want the model to infer sort of the right latent structure in all of these and generate new things according to that structure. So that's really the kind of creativity that I'm thinking about, and, like, you know, yeah, that we try to model in these tasks. And that's a great point that, like, how do we get a model to try new things while still obeying some structure? And I think you were just asking me a question about the temperature. And I think that really leads into that, where, like, you know, if you just crank up the temperature, you can start getting crazy things, but then the model is also going to be less structured. So we really want this sort of structured exploration from these models. And similarly, when I try to get a model to generate a homework problem, I don't want it to put together something that's, like, meaningless.
So that's, you know, creative, but like really, I want it to follow some certain logic or some structure there. And it's an interesting idea that the most valuable examples of creativity are, you know, a balance of structure and impulse. Absolutely, yes, yes, yes. That paper is Roll the Dice and Look Before You Leap: Going Beyond the Creative Limits of Next-Token Prediction. Where do the roll the dice and look before you leap parts come into that? Okay, so first I'll talk about the leap, because I think that ties in closer to what we just chatted about, which is sort of this wordplay or this connection. We are looking at tasks where there is a leap of thought that has to be made, which is not often spelt out in the training data, which is why models struggle to actually infer that thought. And so what we're saying is that models should actually be trained to kind of learn how to take those leaps, or take the structured leaps especially, rather than just showing the outcome of the leaps, which is kind of what our current training data is, because those just, they don't have the thought process that goes behind these. Which sounds a little bit like training on thought traces as opposed to training on answers. Like, do you think there's a path there? Yes, I think that that's certainly one way to do that. But I think another part that we show in the paper is actually different training objectives, like teacherless training. So that could be things like multi-token prediction, where we are trying to discourage the model from getting the right answer by just looking locally. But if it has to actually generate the entire pattern or the entire set of things, then that actually encourages the model to have this global understanding. And people are excited about diffusion models lately. We find that diffusion models also kind of have a similar thing, where they don't show all the tokens and they kind of mask out different things. And so that actually encourages the model to have some sort of planning, or, you know, the ability to do more global things. And so we find these two objectives actually work a lot better in our experiments. One is multi-token prediction, where we just force a model to produce multiple tokens at a time, rather than just one token, see the correct answer, and then the next token. And so in multi-token prediction, the model does not get that local supervision; it has to get everything right before it gets reward. And diffusion models can be thought of as having these different masks or different orderings. And so that also encourages the model to have more global understanding. So both of those are alternatives that might be worth pursuing more seriously if we care about getting these kinds of diverse generations from the models. So that's the part about look before you leap. The part about roll the dice is also really interesting. I think there's an underexplored aspect of how to get diverse things from the model. Right now, there are two ways. One is we think about the diverse stuff ourselves. And like, you know, when people prompt the models, you give more specific instructions, and then you're like, actually, I want to do something else, and then we tell the model, go do this instead. So that's one way. The other way, that, you know, machine learning people may be thinking about, is just increase the temperature. And that, like, gives you more, you know, tokens.
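As a rough sketch of the multi-token prediction objective mentioned above, the idea is that position t is trained to predict tokens t+1 through t+k jointly, so the model cannot lean on seeing the ground-truth next token before predicting the one after it. The shapes, the number of extra heads, and the head design here are illustrative assumptions, not the paper's recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """k output heads: head i predicts the token i steps ahead of each position."""

    def __init__(self, d_model: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.k = k
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(k)])

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, d_model) states from any decoder backbone; tokens: (B, T) ids.
        B, T, _ = hidden.shape
        total = 0.0
        for i, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, : T - i])        # predict the token i steps ahead
            targets = tokens[:, i:]                  # shifted targets
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total / self.k                        # i == 1 alone is ordinary next-token loss

# Usage with a hypothetical backbone that exposes hidden states:
# mtp = MultiTokenHeads(d_model=512, vocab_size=32000, k=4)
# loss = mtp.loss(hidden_states, input_ids)
```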
The part about "roll the dice" is also really interesting. I think there's an underexplored aspect of how to get diverse things from the model. Right now, there are two ways. One is we think about the diverse stuff ourselves: when people prompt the models, they give more specific instructions, then decide they actually want something else and tell the model to go do that instead. The other way, which machine learning people may be thinking about, is to just increase the temperature, and that gives you more varied tokens. And like I said earlier, that also tends to really destroy the structure. So we tried to think about what the right way is to elicit randomness from the model while keeping it structured. One way that seemed more natural than temperature sampling is to first generate a random idea, and then follow that idea and generate all your tokens. So the right place to introduce randomness is probably at the beginning, where the model commits to exploring something, and then it sticks to that path. If you keep increasing the temperature, it's just going to go all over the place; instead, we pick something and then we stick to it. So at test time, if you want more diverse generations, we would sample new prefixes or new starting points and then let the model do its thing. That's where we're trying to introduce randomness, which is the "roll the dice" part: we roll the dice first, and based on the outcome, we pick that thought and go with it. And of course, this requires training the model to be able to do that. We have a very simple way to do this in the paper: in the training data, we prepend random prefixes, so the model conditions on a random prefix, which you can think of as a proxy for a random idea, and then it generates things. And at test time, we can get new ideas or new generations from the model by changing this prefix.

Sorry, a random nonsensical prefix, or a random sensical prefix?

Yes, great question. In our paper, because our experiments are so simple, a random nonsensical prefix actually just works, which is surprising. We don't really understand why it works; that's very interesting future work too. But as a proof of concept, it seems like there's maybe some meat here. What we're looking at now in my group is having more meaningful prefixes, which actually capture what exactly the idea is, some semantics of the idea, and then conditioning on that to generate things. And if this paradigm of pre-training ends up working, it would also give us more controlled, diverse generations. Instead of prompting the model with explicit instructions, we could just change this prefix that we condition on in some nice way, and we'd get diversity in that sense as well. So what we're trying to say is that rather than doing temperature sampling, maybe we can have the model condition on new ideas that can then be randomized or made more diverse. That way of training, as well as inference, might actually be a better way to get structured diversity from these models.
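As a rough illustration of the seed-conditioning recipe described here (random prefixes at training time, fresh prefixes instead of higher temperature at test time), the following sketch assumes a Hugging Face-style causal LM. The prefix alphabet, the "::" separator, the helper names, and the generation settings are all illustrative guesses rather than the paper's actual code.

```python
# Minimal sketch of seed conditioning: prepend a random "seed" prefix during
# training, then vary that prefix at inference time instead of raising the
# sampling temperature. All names and formats here are illustrative.
import random
import string

def random_prefix(length=8):
    # A nonsensical random string, standing in for "a random idea".
    return "".join(random.choices(string.ascii_lowercase, k=length))

def make_training_example(target_text):
    # Training time: the model learns to condition its generation on the seed.
    return f"{random_prefix()} :: {target_text}"

def sample_diverse(model, tokenizer, prompt, n=4):
    # Test time: roll the dice once per sample by drawing a new prefix, then
    # decode greedily so each continuation stays structured.
    outputs = []
    for _ in range(n):
        seeded = f"{random_prefix()} :: {prompt}"
        inputs = tokenizer(seeded, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        outputs.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return outputs
```

In this sketch the diversity comes entirely from the prefix draw, which matches the "roll the dice first, then commit" behavior described above, rather than from per-token randomness.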
We mentioned GPT-5 and also this idea of inference-time compute, inference-time scaling. One of the things I think we're learning about this model is that, at least at the higher tiers, it does a bunch of parallel inferences and integrates those together. In some sense, you might think that by doing this stuff in parallel, you're going to get this diversity of thought, but there also seems to be an averaging effect that dries out the expression of that thought. And it strikes me that this is connected to your work, and maybe they need more random seeds in their parallel threads or something.

Yeah, it also depends on how these parallel thoughts are actually produced. I don't know how exactly those thoughts are being generated behind the scenes there. So, two comments. One is, if they're doing RL on top of a starting point, the question is how we leverage that starting point appropriately to get the right diversity. Can you actually go beyond what the base model can generate? Are some things more likely than others, and if so, can we correct that in some way? What if we want to generate more of one kind than another? All of these kinds of things are still going to be quite challenging with the current paradigm. So hopefully, if we do pre-training in a way that allows this kind of diverse, controlled sampling, then there's more we can get. But the thing that's basically unclear is how we get that diversity and how we actually span things. If we just let it all be end-to-end from the model, then again there might be some collapse, and the model might not generate all the diverse things that we need. Some of that you can simulate by carefully collecting training data, but of course, if you want to do things that are very different from your training data, how does all of that work? That's hard to say, again, with the parallel-traces kind of thing. But I think morally it is what we are saying: we want the model to try a new idea and then stick to it, rather than introducing randomness at every step.

And the random prefixes, in your case, are at inference time. So you're prepending this random prefix to the prompt. Yeah. In our toy settings, there's no real notion of a prompt; there's no instruction. But yes, you can think of it as prepending. So you're giving your question, then adding the random string, and then letting the model generate; that's how it would work in practice. Yes, but you also have to train the model to be able to do this. At training time too, we have to simulate this, so the model sees random strings at the beginning that it uses while training. Meaning, don't expect that you can just add some randomness to the beginning of a prompt in ChatGPT and it's going to give you better answers. Yes, exactly.

And you talked about RL earlier. Do you think this approach is compatible with RL training? Yeah, this is fascinating. I think overall, any improvement we can make to the base model in terms of how we sample directly translates to RL, because a lot of RL is: you try a bunch of things from the model, and you make the model do more of what it's doing well. So if you just have better starting points, that will automatically improve all of these things. And the way I think a lot of this could be useful is in ideas like structured exploration and so on. People keep bringing up exploration as one of the biggest challenges for the next frontier. And if we want to do some kind of structured exploration over many different spaces, the more control you have over the kind of diversity, the better. Yeah.
So all of that, I feel, is very compatible and would actually be really useful for further RL training.

So, I've referred back to this blog post a few times. There are these limitations, and then these opportunities, and we've actually covered the opportunities: this idea of memorization sinks, seed conditioning, which is prepending the random text, and multi-token learning. You've hinted at some future directions for your research, but talk a little bit more broadly about the direction you see these various efforts going.

Yeah, we really want to scale up a lot of these things, both to see for ourselves and to convince people that these are interventions worth investing in for training their future models. And especially the adaptability part. I often think about how people keep their models fresh. Right now we're at a point where everyone is training new models every year, so maybe we haven't thought about this issue, but at some point we're not going to be able to do that. And I think the straw man answer here is: just do retrieval, that gets you the latest facts and everything's okay. But we had this other paper, an oral at the previous conference, where we show that models are not that good at using the context and overriding their parametric information. So retrieval is not really a magic solution to keeping models updated. The aspect I find really interesting is: how do we actually make models that are easy to update? What's the right way to decompose or disentangle what should be preserved and what should be updated? I'm really excited about that. I'm also really excited about scaling up these ideas for improving diversity and creative generations. We're looking at things like theorem proving and trying to build systems that help mathematicians. And every time I talk to people who want to use LLMs, I keep coming back to: how do we make sure models can actually search and find creative things? For example, finding counterexamples. We could use LLMs for that, and they would be great, but how do we make sure they do that kind of structured exploration, finding creative counterexamples in some way? So I would be very excited about using these ideas in all of these different applications that are top of mind for people right now, the interesting applications that can really push the frontiers across various notions of science. Those are the things we're thinking about. And of course, as we spoke about a little, there's also trying to understand these models. A lot of our experiments are trying to be less about "here is a state-of-the-art method that gets high numbers" and more about giving objectives and insights that stand the test of time, showing new understanding about fundamental properties that go beyond specific datasets or specific training decisions you might make. That would help guide the next generation of models we try to train. So we are continuing our efforts along that axis as well.

Well, Aditi, thank you so much for sharing a bit about your research.
It's very interesting stuff. Yeah, thank you so much. And I especially want to plug that I'd like more people thinking about trying to understand these models, because we really need a lot of work there. I think it's a scientific question: these are complex systems, and we really have to think a bit like a scientist, setting up the right controlled experiments, making formal hypotheses, and testing them out. There's a huge opportunity here, and I think that will really unlock how we push models further, especially as people are starting to see maybe diminishing gains from just pushing our current paradigm. So there's a lot of opportunity, and hopefully we can have a bigger community looking at these aspects. Awesome. Very good. Well, thank you very much. Thank you.