

New top score on ARC-AGI-2-pub (29.4%) - Jeremy Berman
Machine Learning Street Talk
What You'll Learn
- ✓Jeremy Berman is a research scientist at Reflection AI who recently achieved the top score on the ARC-AGI-2-pub leaderboard using an evolutionary approach.
- ✓Berman's approach generates descriptions of algorithms and iteratively refines them, rather than generating explicit code.
- ✓Berman believes that language modeling with reinforcement learning is crucial for developing AI systems that can synthesize new knowledge and understanding.
- ✓The ARC challenge is designed to test a machine's ability to extrapolate transformation rules from a few training examples, which is easy for humans but difficult for current AI systems.
- ✓Berman's initial approach was inspired by Ryan Greenblatt's work on generating and refining Python programs, but he found that language models struggled with small errors even on easy tasks.
- ✓The second version of the ARC challenge features more compositional tasks that require multiple iterations, which Berman's evolutionary approach was able to handle better than his previous solution.
AI Summary
The episode discusses Jeremy Berman's recent success in the ARC-AGI-2-pub challenge, where he used an evolutionary approach to generate and refine descriptions of algorithms rather than explicit code. Berman explains his background in AI research and his belief that language modeling with reinforcement learning is key to achieving generalization beyond current AI systems. The discussion also touches on the importance of compositional and symbolic reasoning in AI, as well as the use of human data to evaluate and fine-tune AI models.
Topics Discussed
Evolutionary algorithms, Program synthesis, Language modeling, Reinforcement learning, Compositional reasoning, Symbolic AI
Frequently Asked Questions
What is "New top score on ARC-AGI-2-pub (29.4%) - Jeremy Berman" about?
The episode discusses Jeremy Berman's recent success in the ARC-AGI-2-pub challenge, where he used an evolutionary approach to generate and refine descriptions of algorithms rather than explicit code. Berman explains his background in AI research and his belief that language modeling with reinforcement learning is key to achieving generalization beyond current AI systems. The discussion also touches on the importance of compositional and symbolic reasoning in AI, as well as the use of human data to evaluate and fine-tune AI models.
What topics are discussed in this episode?
This episode covers the following topics: Evolutionary algorithms, Program synthesis, Language modeling, Reinforcement learning, Compositional reasoning, Symbolic AI.
What is key insight #1 from this episode?
Jeremy Berman is a research scientist at Reflection AI who recently achieved the top score on the ARC-AGI-2-pub leaderboard using an evolutionary approach.
What is key insight #2 from this episode?
Berman's approach generates descriptions of algorithms and iteratively refines them, rather than generating explicit code.
What is key insight #3 from this episode?
Berman believes that language modeling with reinforcement learning is crucial for developing AI systems that can synthesize new knowledge and understanding.
What is key insight #4 from this episode?
The ARC challenge is designed to test a machine's ability to extrapolate transformation rules from a few training examples, which is easy for humans but difficult for current AI systems.
Who should listen to this episode?
This episode is recommended for anyone interested in Evolutionary algorithms, Program synthesis, Language modeling, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
We need AI systems to synthesise new knowledge, not just compress the data they see. Jeremy Berman is a research scientist at Reflection AI and recent winner of the ARC-AGI v2 public leaderboard.

**SPONSOR MESSAGES**
- Take the Prolific human data survey - https://www.prolific.com/humandatasurvey?utm_source=mlst - and be the first to see the results and benchmark their practices against the wider community!
- cyber•Fund https://cyber.fund/?utm_source=mlst is a founder-led investment firm accelerating the cybernetic economy. Oct SF conference - https://dagihouse.com/?utm_source=mlst - Joscha Bach keynoting(!) + OAI, Anthropic, NVDA, ++. Hiring a SF VC Principal: https://talent.cyber.fund/companies/cyber-fund-2/jobs/57674170-ai-investment-principal#content?utm_source=mlst. Submit investment deck: https://cyber.fund/contact?utm_source=mlst

Imagine trying to teach an AI to think like a human, i.e. solving puzzles that are easy for us but stump even the smartest models. Jeremy's evolutionary approach, evolving natural language descriptions instead of Python code as in his last version, landed him at the top with about 30% accuracy on ARC v2. We discuss why current AIs are like "stochastic parrots" that memorize but struggle to truly reason or innovate, as well as big ideas like building "knowledge trees" for real understanding, the limits of neural networks versus symbolic systems, and whether we can train models to synthesize new ideas without forgetting everything else.

Jeremy Berman: https://x.com/jerber888

TRANSCRIPT: https://app.rescript.info/public/share/qvCioZeZJ4Q_NlR66m-hNUZnh-qWlUJcS15Wc2OGwD0

TOC:
- Introduction and Overview [00:00:00]
- ARC v1 Solution [00:07:20]
- Evolutionary Python Approach [00:08:00]
- Trade-offs in Depth vs. Breadth [00:10:33]
- ARC v2 Improvements [00:11:45]
- Natural Language Shift [00:12:35]
- Model Thinking Enhancements [00:13:05]
- Neural Networks vs. Symbolism Debate [00:14:24]
- Turing Completeness Discussion [00:15:24]
- Continual Learning Challenges [00:19:12]
- Reasoning and Intelligence [00:29:33]
- Knowledge Trees and Synthesis [00:50:15]
- Creativity and Invention [00:56:41]
- Future Directions and Closing [01:02:30]

REFS:
- Jeremy's 2024 article on winning ARC-AGI-1-pub: https://jeremyberman.substack.com/p/how-i-got-a-record-536-on-arc-agi
- Getting 50% (SoTA) on ARC-AGI with GPT-4o [Greenblatt]: https://blog.redwoodresearch.org/p/getting-50-sota-on-arc-agi-with-gpt (his MLST interview: https://www.youtube.com/watch?v=z9j3wB1RRGA)
- A Thousand Brains: A New Theory of Intelligence [Hawkins]: https://www.amazon.com/Thousand-Brains-New-Theory-Intelligence/dp/1541675819 (MLST interview: https://www.youtube.com/watch?v=6VQILbDqaI4)
- Francois Chollet + Mike Knoop's lab: https://ndea.com/
- On the Measure of Intelligence [Chollet]: https://arxiv.org/abs/1911.01547
- On the Biology of a Large Language Model [Anthropic]: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
- The ARChitects [won 2024 ARC-AGI-1-private]: https://www.youtube.com/watch?v=mTX_sAq--zY
- Connectionism critique, 1988 [Fodor/Pylyshyn]: https://uh.edu/~garson/F&P1.PDF
- Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis [Kumar/Stanley]: https://arxiv.org/pdf/2505.11581
- AlphaEvolve interview (also program synthesis): https://www.youtube.com/watch?v=vC9nAosXrJw
- ShinkaEvolve: Evolving New Algorithms with LLMs, Orders of Magnitude More Efficiently [Lange et al]: https://sakana.ai/shinka-evolve/
- Deep Learning with Python, Rev 3 [Chollet] - READ CHAPTER 19 NOW! https://deeplearningwithpython.io/
Full Transcript
You can describe every single ARC v2 task in 10 bullet points of plain English, most of them in five bullet points. And I think this actually gets to the heart of ARC, right? Everything is quite simple. It's not very hard. And I think this is also how we do it too, right? Like we, when we look at these ARC grids, we're coming up with these bullet points in our head and we're, you know, checking them. Okay, this was right, this was right. And Python doesn't have these features. It's just not as expressive as natural language. MLST is sponsored by Cyberfund. Link in the description. I guess actually, even more fundamentally, like the ideal system would be we have a set of data. Our language model is bad at a certain thing. We can just give it this data and then all of a sudden it keeps all of its knowledge and then also gets really good at this new thing. We are not there yet. And that to me is like a fundamental missing part. Really what you want is a more expressive program. And so that's why I switched from Python to English, which is a much more expressive program. You can always teach a language model a skill, right? But it's the meta skill. It's the skill to create the skills that is AGI. And to me, that's reasoning. Reasoning is that meta skill. And so to put it another way, I think if you fundamentally learn the skill of reasoning, you should be able to then apply that skill to learn all the other skills. That is the meta skill. You know, kick whatever weights out you need to align the model to reason. And then from there, you have a foundation from which you can actually build general intelligence. Okay, folks, hot off the press. Many of you would have seen last week that Jeremy Berman, who is a research scientist at Reflection AI, is now the winner of the ARC-AGI v2 leaderboard, the public version of the leaderboard. He's using an evolutionary approach. Now, remember, last year in December he published a similar approach, generating Python functions and then refining those functions in a kind of iterative loop. His new architecture is generating descriptions of algorithms rather than code and iteratively, in an evolutionary sense, refining those ones and discarding the ones that don't work. He's now at the top of the leaderboard. It's a really, really cool and elegant algorithm. And by the way, he works for Reflection AI. So he's doing reinforcement learning with verifiable feedback. And he's trying to address the biggest gap in AI at the moment, which is that we want systems that can synthesize new knowledge and new understanding. Current systems just get trained with a whole bunch of data and they only know what they've been trained on.
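For readers who want the shape of the loop described above, here is a minimal, illustrative sketch: sample a broad pool of candidate natural-language rules, score each by how many training pairs it reproduces, keep the best, ask the model to revise them, and discard the rest. The `llm_generate`, `llm_revise` and `llm_apply` callables are assumed wrappers around whichever model is used; nothing here is Berman's actual code.

```python
def score(description, train_pairs, llm_apply):
    """Fraction of training outputs reproduced when a model follows the English rule."""
    hits = sum(llm_apply(description, inp) == out for inp, out in train_pairs)
    return hits / len(train_pairs)

def evolve_descriptions(train_pairs, llm_generate, llm_revise, llm_apply,
                        pool_size=20, generations=3, keep=5):
    """Evolve natural-language transformation rules for one ARC task (illustrative)."""
    # Generation 0: broad sampling of candidate rules.
    pool = [llm_generate(train_pairs) for _ in range(pool_size)]
    for _ in range(generations):
        ranked = sorted(pool, key=lambda d: score(d, train_pairs, llm_apply), reverse=True)
        survivors = ranked[:keep]
        if score(survivors[0], train_pairs, llm_apply) == 1.0:
            break  # the best rule already explains every training pair
        # Next generation: keep the survivors, breed revisions of them, discard the rest.
        children_per_parent = max(1, (pool_size - keep) // keep)
        pool = survivors + [llm_revise(d, train_pairs)
                            for d in survivors for _ in range(children_per_parent)]
    return max(pool, key=lambda d: score(d, train_pairs, llm_apply))
```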
They can't kind of think outside the box by creatively synthesizing new knowledge. Prolific are really focused on the contributions of human data in AI. And the reason this is important is actually the dirty secret of Silicon Valley, the extent to which human data is used to evaluate and fine-tune AI models. The reason for that, as we discuss in today's show, is that current AI does not understand the world in a grounded way. It doesn't have a deep, abstract understanding of the world, which is why the only way that we can make AI work effectively is by grounding the generation and supervising the training of AI models with human data. Prolific are putting together a report on how human data is being used in AI systems, and they need volunteers. You can just go and fill out this form to help them produce this report, and you will get privileged access to see the report before anyone else. The link is in the description. There was an amazing part in, I think it was in your first paper, where you said a parrot that lives in a courthouse will regurgitate more correct statements than a parrot that lives in a madhouse. Thank you, thank you. My sister, who doesn't know anything about language models or AI, she pointed that out and said that was a great line. So at least I have that. I've already used it. I credited you, but I'll be using that quite a lot. Well, Jeremy, it's amazing to have you on MLST. I've wanted to have you on ever since you released your first blog post. You know, it was December last year. I was at NeurIPS at the time. And at the time, you actually got the highest score on the public ARC v1 leaderboard just before the famous o3 launch. Do you remember when they did this ridiculous $200 per task thing and they knocked you off the board? But Jeremy, can you just tell the audience a little bit about yourself and maybe we should start with your first ARC solution? Yeah, sure. So I actually have only been working in research for about eight months. Before that, I had a company right out of college. I got into Y Combinator. And so I've been running a company for the last four and a half years as CTO. And I've always been very interested in reasoning in the brain. And I actually picked up Jeff Hawkins' book, A Thousand Brains. And I read that. And at the same time, I was kind of coming into language models. And something just clicked inside of me and I just knew I had to be working on this. And, you know, I believe that general intelligence, artificial general intelligence, will be the most important invention of hopefully my lifetime. And so I decided to drop everything. I stepped down as CTO; the company is still going well. So it was a difficult decision. And I actually had gotten in touch with Mike and Francois because I thought ARC-AGI was such an elegant way of describing the problems with current language models and the difference between them and the human brain. And so I just kind of dug in. That was my first research project independently. And yeah, I ended up getting the top score on that. And that was really great. And after that, I got recruited into Francois and Mike's AGI lab, Ndea, where I was working on program synthesis. And as you described earlier, over time, I've become more convinced that language modeling with reinforcement learning will yield generalization far beyond what we see today. And so I decided to move to a company that was focused purely on language models. And that's where I am now.
So I'm currently working on reasoning and post-training at Reflection, where we're building frontier foundation models. Very cool. Maybe we should save that bit for a tiny bit later. But, you know, one of the take-home messages in your, well, no, I mean, it's super interesting. And one of the take-home messages on your new approach is that rather than producing explicit programs, you are evolving descriptions of programs. And Francois is a neurosymbolic guy; he thinks that we need to have a symbolic substrate where we, you know, represent the kinds of problems that we can do. And we need to do this kind of compositional form of intelligence. So we need to kind of be working in the symbolic layer, but perhaps guided by, you know, deep learning models. But maybe we should get to that in a minute. So in your first solution, it was an evolutionary approach. It was using Sonnet 3.5. And you had about four iterations, I think. And essentially, you know, you were working on the ARC challenge and you were producing these programs through evolution. Maybe just for folks that don't know about the ARC Challenge as well, could you introduce that and get into your solution? Sure, yeah. So the ARC Challenge is kind of like an IQ test for machines. It's a set of input-output grids. And the whole point is to be able to figure out how to transform input grids into output grids given a common transformation rule. And so what's interesting is these are really easy for humans, right? The average human gets around 75% accuracy on ARC v1. And at the time, the best language models, GPT-4, Sonnet 3.5, were getting maybe 5%. And so, yeah, basically the idea is you have a few training examples, and then you're trying to extrapolate the transformation rule on the final test example. And so I approached this, I was actually inspired by Ryan Greenblatt, who had a solution earlier, which was to generate a ton of Python programs that would encapsulate the transformation rule. And Python programs are great because they're deterministic, and you can pretty quickly check whether or not the Python program works or not, which is really cheap. So it's cheap to verify. And you can be relatively sure if the Python program works on all of the training examples, that it'll work on the test example. So I started with his approach, but then I noticed that the language models actually struggled on first attempts. Even if you ask the language model a thousand times to generate Python programs, they were always off by small amounts on easy tasks, which I thought, you know, presumably it's in their distribution. They should be able to solve this. So what I found is that it actually worked by taking the top programs, the top-performing programs, and then running them in a revision loop. So asking Sonnet 3.5, hey, here's what you got wrong. Here are the cells you got wrong. Here's your original Python program; improve it. That started to really work well. And then I thought, well, why not just increase the depth, right? Why not ask it 10 times, you know, to revise until I'm happy, until the solution passes some sort of accuracy threshold. So that's kind of how I was inspired by it. And, you know, I didn't think of it as evolutionary at first, I was just thinking about, you know, broadly what would work. And over time, I kind of understood, you know, there was something a bit deeper going on here, which is that evolving solutions is a powerful technique generally. And I think it's actually going to play a role in future technologies.
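The cheap-verification property described here is the whole trick: a candidate Python program can simply be executed against the training pairs, and only programs that reproduce every training output are kept. A minimal sketch, assuming candidate programs arrive as source strings defining a `transform(grid)` function (the function name and harness are illustrative, not the original implementation):

```python
def verify_candidate(program_source: str, train_pairs) -> bool:
    """Run a model-written program on every training input and accept it only
    if it reproduces every training output. Crashes count as failures."""
    namespace = {}
    try:
        exec(program_source, namespace)            # candidate source comes from the LLM
        transform = namespace["transform"]         # assumed entry point
        return all(transform(inp) == out for inp, out in train_pairs)
    except Exception:
        return False

# Usage: sample many candidates, keep the survivors, and feed the near-misses
# back to the model for revision.
# passing = [p for p in candidate_programs if verify_candidate(p, train_pairs)]
```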
But yeah, that's generally guided by just intuition. Yeah, I had Ryan Greenblatt on the show. I'm a huge fan of his. He's a very, very smart guy. And I asked him a similar question because he did this iteration, right, where you have a certain depth of iterations. And I guess one approach is that you have a shallow method, right? So you just try 200 different variations. And in your blog post, you kind of said there's a Goldilocks zone where you want to have a certain number of tries of different variations of things. But you also want to be able to refine your solution because that allows you to do this kind of composition. And composition is very, very important for problems that require iteration. And indeed, for the second version of the ARC challenge, I think the tasks were selected so that they had at least a couple of iterations in them, which meant that they needed to have this depth. Can you talk about that trade-off? First of all, ARC v2 is in a sense fundamentally different than ARC v1 because of what you're describing: they're compositional, there are many rules that you have to go through, and this is partially why I found that my solution on ARC v1 did not perform well. Yeah, so there's a constant trade-off between how deep you go, how many revisions you take, and then how broad you start out. The problem with going deep and not going so broad is there are some edge solutions that you'll never get to, right? But then, of course, most solutions end up being somewhere within the bounds of your first broad attempt. So that's generally the trade-off, and the trade-off is different for ARC v1 and ARC v2. Interestingly, I found that for ARC v2, it was more important to be broad. And I think this is surprising to a lot of people. And partially this is because the models now think, and that's great. So the models actually do a lot of the deep revision for you in their thinking block. And this is a fundamental change from when ARC v1 existed. And when I just started out in the field, I think I'm a bit embarrassed by a lot of the things that I wrote in my first post because it was two weeks before o1 was released. And, you know, everything about o1 changed how I think about these things, which is, before, you could kind of simulate, emulate thinking, to the quote you described with the parrot, the stochastic parrot. I think before you had reinforcement learning over, before you actually taught the machines, the language models, to think with reinforcement learning, you were almost doing this stochastic guessing. That was not a very efficient revision loop, basically an internal revision loop. And so you needed to artificially create that revision loop with code. But in v2, I was able to use a very powerful thinking model, which does a lot of the deep revisions for you. So I found it was best to kind of increase entropy, let it explore the space itself. And then I'll add a revision loop on top of that. But the revision loop is less important in v2. Yeah, so the first one was Sonnet 3.5.
So that didn't have this thinking thing built into it so in your prompt you told it to think step by step and there I think you're inspired by Ryan Greenblatt's uh Greenblatt's prompt right so you had a whole bunch of ways in there for representing the board state and you said I want you to think now and I want you to you know that this is a an abstract reasoning challenge and I want you to think from first principles and it would kind of go through that and then it'll give you the answer but you're saying that on the um the the RL trained models it was significantly better at doing that right Exactly. You could think of RL trained models as having inbuilt revision loops. They are trained to explore the space in a deep way, thinking generally for themselves in a way that – so you really don't need to prompt thinking models to think step by step. They already do it. Yeah, I mean, I wanted to challenge you on this a tiny bit, right? So you kind of said in your, I think it was in the second version of the blog post that you just released last week, that at the moment, the models can do domain specific thinking, so they can do math thinking, and they can do code thinking. and what we want to do is imbue like the core machinations of thinking into these models and I'm a little bit skeptical I feel that these models because they're not Turing complete because they're not symbolic you know similar to what Francois believes that I'm sure you read that LLM biology paper as well they were talking about these circuits that we can find in papers that do things like multiplication and addition and what we what we saw was that they are quite patterned they're quite templated. They're not thinking in a very general sense. And my suspicion is it will always be that way because the models don't have semantics, they're non-symbolic and so on. Do you think we could ever make them truly think in a general way? Yeah, I think fundamentally taking a step back, the fact that our brains can do it and our brains are generally running similar algorithms, to me, this means that we will eventually be able to inject general reasoning into the language models. I don't think there's a fundamental reason why neural networks can't behave like biological neural networks. So that's, I guess, the higher level point. And then zooming in, right now, the models are as bad as they're ever going to be. There's generally more compute going into pre-training and there is reinforcement learning and of the compute going into reinforcement learning, you know, a subset is going into specific general reasoning. And so I think that over time, you're going to see the models get better and better at general reasoning. But I guess a question I would have for you is, do you think there's a fundamental difference between the way the brain works, where there's some sort of symbolic nature to the brain and it's not possible to inject that type of nature into an artificial network? Yes. Yeah. I mean, you mentioned Jeff Hawkins. I interviewed Jeff. 
He's absolutely amazing, and of course his HTM algorithm is computationally stronger than a neural network; it's Turing complete. And our brains, even though they are finite, they run a Turing complete algorithm, which means our brains know how to expand their memory, right? We can go and write things on a whiteboard and we can go and, you know, get another notebook. And that is a special type of algorithm which is not traversable with stochastic gradient descent. So, you know, the rough argument is, yes, there is a difference there. And I also wanted to touch on this RL with verifiable rewards thing, which is that we do that at training time. I'm very excited in the future about an active inference version of that, like an agentic version where we're actually doing this kind of transductive active fine-tuning in an agential way. Right. So, you know, I take an action. I get some new information from the environment and I update my weights. And that would be truly adaptive; that would be intelligent. But what we do now is we do all of this stuff at training time, and the resulting frozen artifact is still an LLM. It still has just a bunch of patterns in there. And I think that while that can uplift reasoning in many ways, I don't think it has the intelligence. And according to Chollet, intelligence is simply the ability to search through the space of Turing programs, right? And I don't think that's what's happening with these LLMs at the moment. I think you're generally correct that it's not happening at the moment, but I still think, fundamentally, I don't think there is a fundamental blocker physically for why they won't be able to do it in the future. And it's possible that SGD, right, like stochastic gradient descent, is an issue fundamentally, and I think we're going to overcome that. I guess what I would say is artificial neural networks, I think, have the structure capable of basically being as smart in every way as a human brain. And I subscribe to Francois's definition of general intelligence as well. Yeah, I mean, I think we mostly agree. I mean, you know, let's look at AlphaZero or MuZero or something like that. They did this training loop where they were actually updating the, you know, the value network and the policy network, and then it was frozen and they did some kind of, you know, Monte Carlo tree search. So they were achieving adaptivity through exhaustive search during the actual games. And in an ideal world, we would have this adaptivity that's actually updating the weights. Now, I believe the only reason we can't do that at the moment is just computational tractability, right? We have these huge models; we couldn't possibly have a dynamically updating model for every single person that's using ChatGPT. It would just be ridiculously slow. But I think you and I agree that if that were possible, that would be an entirely different kind of form of intelligence. I don't think that's so intractable, actually. I think my guess is, and, you know, NVIDIA just put $100 billion into OpenAI. Sam Altman's plan is to produce a gigawatt of compute a week, something like that. I actually don't think, with, you know, ever more efficient algorithms, that that is like crazy far off. I mean, right now you could buy a GPU, you could have it running in your house, and it could be running GPT-OSS 120B, right? And fine-tuning is relatively trivial compared to, you know, the entire process for pre-training.
I actually think that is totally within the realm of possibilities in the next 10 years. And I think that actually is potentially where this goes. Yeah, I mean, you know far more about this than I do. But I think the reason why fine-tuning is so expensive is, you know, we have this continual learning problem. And when you fine-tune a model on OpenAI, they're not just fine-tuning it on the data you give them. They, you know, just to stop this catastrophic forgetting problem, they presumably have to sample in a bunch of the original training data and maintain the distribution and so on. And if they did this for everyone, it would be insane. But I am excited about it just like you are, because I interviewed the ARChitects, and I think they got first place on the private version last year. And they were doing this transductive active fine-tuning. And they actually said, by the way, that this is a curious oddity with transformers. If you start with a, you know, almost like a virgin eight-billion-parameter transformer, it almost doesn't matter what it knew about before. You could just pretty much start training it from scratch on the ARC challenges. So they did a whole bunch of augmentation and active fine-tuning and they built an intelligent artifact. I mean, intelligence is domain specific as per Chollet. And they actually built this system, which was, per task, adapting and solving the tasks. And they were updating the weights and it was beautiful. So that was an existence proof, if nothing else, that this thing could work. And that was on a Kaggle notebook. Yeah. You know, in 10 years, this is going to be what, like the, you know, Apollo mission computer. I think that, I think what you're describing, I'm actually not even totally convinced that continual learning is fundamentally the blocker. But I think if it is the fundamental blocker, that's actually incredible because we will solve continual learning. Like that's something that's physically possible. And I actually think like it's not so far off. Now, the forgetting issue, that is a much more fundamental issue in my mind. And not just the fact that every time you fine-tune, you have to have some sort of very elegant mixture of data that, you know, goes into this fine-tuning process so that there's no catastrophic forgetting.
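The per-task, test-time fine-tuning being described (augment the handful of demonstration pairs, adapt a copy of the model on them, then predict the test output) looks roughly like the sketch below. This is a schematic in PyTorch, assuming a small `base_model` that maps a grid tensor to per-cell colour logits; it is not the ARChitects' actual pipeline.

```python
import copy
import torch
import torch.nn as nn

def augment(pairs):
    """Expand the few demonstration pairs with rotations and flips,
    the kind of geometric augmentation used to build per-task training data."""
    out = []
    for inp, tgt in pairs:
        for k in range(4):
            out.append((torch.rot90(inp, k, (0, 1)), torch.rot90(tgt, k, (0, 1))))
            out.append((torch.rot90(torch.flip(inp, (1,)), k, (0, 1)),
                        torch.rot90(torch.flip(tgt, (1,)), k, (0, 1))))
    return out

def test_time_finetune(base_model: nn.Module, demo_pairs, steps=100, lr=1e-4):
    """Adapt a throwaway copy of the model on one task's demonstrations only;
    the base model stays untouched for the next task."""
    model = copy.deepcopy(base_model)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    data = augment(demo_pairs)                     # (input grid, target grid) long tensors
    for step in range(steps):
        inp, tgt = data[step % len(data)]
        logits = model(inp.unsqueeze(0).float())   # -> (1, H, W, num_colours), by assumption
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model                                   # use this copy to predict the test grid
```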
um this is i think actually a fundamental problem so i the the um and and it's a fundamental problem that that you know even open ai has not solved right um and i think francois is a great example and i think this is an important example you know if you have the perfect weights for a certain problem and then you fine-tune that model on more examples of that problem the weights will start to drift and you will actually drift away from the um from the correct solution his answer to that is, well, we could make these systems composable, right? Like we can freeze the correct solution, and then we can add on top of that. I think there's something to that. I think actually it's possible there's a research direction that where, you know, maybe we freeze experts, or maybe we freeze layers for a bunch of reasons that isn't possible right now, but people are trying to do that. But yeah, I think fundamentally compute is not the issue. I think it's this catastrophic forgetfulness. Yeah. So I'm inclined to agree. I've long dreamed about there being a docker for language models right you know in docker you can kind of freeze dry a state of you know like let's say linux operating system with an application with this security update so you have these kind of immutable layers and the composability that we often talk about could actually happen at the architectural level and we could do dynamic model merging between different layers and and whatnot that would be very very exciting so you know but also just to come back to what you said before i've never really heard this before you're distinguishing like forgetting and learning right when we were talking about you know catastrophic forgetting and continual learning um can you just sketch out that distinction a bit more so uh the way i think about it you know you have a neural network and it has all these weights inside of it right anytime you update those weights you are pushing some weights out and presumably you are pushing some correct answers that you've previously you know trained and they are getting pushed out. And the benefit, I think, fundamentally of symbolic systems is that doesn't happen, right? Symbolic systems are deterministic. When you get the right answer, you can be sure you have the right answer. You stash it away into your library of correct solutions. This is the problem with continuous structures. But this is also actually now why I think it's important to draw from the brain, which is this similar thing happens actually with the brain. I believe the brain is much more composable than neural networks biologically. But I think there's no reason why we can't, you know, we won't be able to figure this out, right? Like, again, it could be it's as easy as we end up freezing experts. Again, like the freezing of the layers. I think this is an underexplored area. And I think it's actually, I think we're going to go through basically this RL S-curve. And then I think this is the next S-curve is figuring out how to make language models composable, like figuring out how to actually even more fundamentally, like the ideal system would be we have a set of data. Our language model is bad at a certain thing. We can just give it this data and then all of a sudden it keeps all of its knowledge and then also gets really good at this new thing. We are not there yet. And that to me is like a fundamental missing part of general intelligence. Yeah, completely agree. So it sounds like we have very, very similar intuitions. 
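The "freeze what already works" idea floated here (freeze experts, freeze layers, then fine-tune only what remains) is easy to express mechanically, even if making it work well is the open research question. A minimal PyTorch illustration; the prefix names are placeholders, since real module names depend on the architecture.

```python
import torch.nn as nn

def freeze_by_prefix(model: nn.Module, frozen_prefixes=("embed", "layers.0", "layers.1")):
    """Mark the named parameter groups as non-trainable so a later fine-tune
    cannot overwrite them; return the parameters that remain trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = not name.startswith(frozen_prefixes)
    return [p for p in model.parameters() if p.requires_grad]

# Usage: build the optimiser only over the unfrozen parameters, so the
# frozen "library" of behaviour is left bit-for-bit intact.
# optimiser = torch.optim.AdamW(freeze_by_prefix(model), lr=1e-5)
```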
And Chollet talks about this as well. I mean, interestingly, his Measure of Intelligence paper was actually about the measure of intelligence. He's never really spoken about the machinations of intelligence. He talks about it just casually. He says, you know, we need the Spelke priors, so those are like the basis functions, and we need to do library learning and library transfer, and we do some kind of, you know, symbolic compositional process to adapt to novelty. And he's kind of sketched out the mechanics of it, but he's never actually formally spoken about it. I assume that's what he's building at his company. But there was a famous guy called Jerry Fodor in 1988. He had this connectionism critique. He had this beautiful paper, and he was basically saying that symbolic systems have systematicity and productivity. And systematicity is this compositional thing. It's the ability to generalize between Mary loves John and Mary loves Jane, right? So you have semantics. You have these kind of, you know, symbolic relations, and they have certain computational properties. Like you can do variable binding and quantification over potentially infinite domains. Like we intuitively understand that symbolic things have very interesting properties. And then what we're trying to do is, like, we know neural networks are really good, and we want to somehow graft this capability onto neural networks. Yes. Yeah. And I actually think neural networks are in some ways a superset of symbolic systems. I think you can generally, you should be able to encapsulate a symbolic system with a neural network. In the same way, I think you can do the same thing with the brain as well. I think there's nothing fundamentally blocking it. But of course, once you have this symbolic system in the neural network, it might catastrophically forget when you fine-tune it. Like, I guess that might be where we disagree. But I think everything you're describing is totally possible, but then when you're coming to train it again, there's no guarantee that it keeps the same structure. I think it's possible. Because a neural network is not Turing complete, I think in principle it can't do many of these things, but you can build a controller, right? So you could just build a very simple kind of envelope, just as you did with your solution. So you had a bunch of code, and it was doing this, you know, this basically compositionality in code on top of the neural network substrate. And that gives you many of those things. You know, for example, we often talk about library learning and library transfer. And I'm not sure if you've seen Eric Pang's solution. I'm speaking to him in Hong Kong, actually, in a couple of weeks.
But rather than the dream coder approach where they do this explicit library learning um he was doing it in a kind of implicit way using the llms and i think there's a whole spectrum between you know you don't have to do it explicitly i think you can kind of use neural networks and you can do some kind of implicit composition and you can get many of these features also i i want to say um generally when i speak about language models i assume that they have um basically a python terminal that they can run uh so uh not so i guess two things the first is i think uh if you have a large enough neural network um i think uh generally almost everything is you could you could represent a symbolic system but of course it's not turn complete um but given a neural network plus uh the ability to write programs then um i think we're basically at the system that we're at the um human brain equivalent so yes so that that is a hybrid system and that certainly is significantly more powerful um i'm just regurgitating my co-host dr duggar because this is his like favorite point he always likes to make but but he says that um that's true but stochastic gradient descent does not find the algorithms that allow the systems to behave as if they are turing machines god knows how it happened in our brains there was some dint of evolution or something where you know We've suddenly got the merge operator or God knows what happened. And we've got this incredible like Turing complete algorithm in our finite brain. And so we're getting into that trainability thing. So, yes, maybe there is a set of weights that we might find one day and it can access like a Python tool and it can do all of those things. And its capability now, is it now effectively searching the space of Turing machine programs? I think it's not. Like there's lots of problems there. Like, how would it know which ones halt and which ones don't? And how would it be able to efficiently search that space? It feels like there's a gap now, but I agree with you that it's significantly stronger than not being able to use the tools. Yeah, I think I but but you think that the human brain is running a Turing system? Yes, I think the algorithm that runs in our brain is a is a Turing machine algorithm. So, you know, like a Turing machine has a code book, which is a finite state automata. 
And then it has this, like, you know, read-write access to these two potentially infinite tapes. And, you know, the algorithm that you put in that Turing machine, that is very difficult to find. I don't disagree with that. But why wouldn't we be able to find that algorithm for neural networks, right? Like, why wouldn't, you know, we train neural networks much bigger than the brain, we put a lot of compute towards them. Do you just not think that finding the same algorithm is possible with SGD? I think with SGD, because the fascinating thing is that, you know, if you look at all of the FSA algorithms, a tiny sliver of those algorithms are capable of controlling, you know, a Turing machine and expanding their memory and so on. So it's in the space. And I don't know if you saw that amazing paper by Kenneth Stanley, the fractured entangled representations paper, and he had this beautiful diagram, and he said that, you know, SGD finds the algorithms over here and neuroevolution algorithms find the ones over here. And it just so happens that the neuroevolution algorithms find ones that have these factored, you know, representations, which means they find representations that are about the world, that are grounded in the world, that carve the world up at the joints. And if only we could find those things. You know, when I spoke to Schmidhuber, he said the same thing. He said, like, you know, it is actually possible to find the right weights in a neural network to make it, you know, effectively Turing complete, with some caveats and so on. But when we do SGD, because there are all of these shortcuts, right, it's a bit like Goodharting, it will always just find the wrong thing. I need to think about that a bit more. Okay, so on the first one as well, you were generating Python programs explicitly. And because of all the things that we're just talking about, I'm a big fan of that, because I feel intuitively, and I think you did, that there's something special about Python programs. And then you did this iterative updating of those programs and you converged on the right one. You also had this amazing diagram in your first blog post where you kind of visualized the space of all the possible programs and you kind of showed what was happening in every iteration. In the first one, I used Python programs because Python programs are deterministic and it's really easy to verify whether or not it's correct. Did it run, and then did it run on the training examples and produce the correct outputs? So it's like a perfect program, right? Like it is a program. The problem is, you know, Python programs are brittle in that, you know, there are many things that are very difficult to describe with Python. ARC grids in v2 being one of them, right? So you have some grids that are very easily described by Python, but then almost the majority, the overwhelming majority in ARC v2, are very hard to describe in Python. The correct Python formulation is lines and lines and lines. And really what you want is a more expressive program. And so that's why I switched from Python to English, which is a much more expressive program. You can describe every single ARC v2 task in 10 bullet points of plain English, most of them in five bullet points. And I think this actually gets to the heart of ARC, right? Everything is quite simple. It's not very hard. And I think this is also how we do it too, right? When we look at these ARC grids, we're coming up with these bullet points in our head and we're checking them.
Okay, this was right. This was right. And Python doesn't have these features. It's just not as expressive as natural language. And I think another way to put it would be you have this inductive, transductive tradeoff, right? You could think of language models as being trained inductively. and then they have an inductive bias and you almost want to let that inductive bias express itself fully in a way. And the way you do that is to give it the full power of how it was trained. And I think this is the same thing with humans too, right? If I told you to solve Arc with Python programs, you'd do a way worse job, even if you were an expert at Python. And so I think fundamentally it's more general and it leads to general and better solutions. I mean, the accuracy is much higher when you use natural language. Now, the problem is you actually have to then verify whether the instructions are correct. You can't run natural language on our grids. This was the fundamental problem with the solution. This is what made iteration challenging, especially because for each grid, for each training example, you have to run the natural language instructions and it takes a really long time, especially with this thinking model. So I originally started with a weak model. You know, it's the checker model. It's the checker agent. Let just use GPT mini whatever nano And it did terribly So I ended up it was actually more important that the checker was stronger than the actual instruction creator which i i think is uh interesting um but yeah that just highlights you know the the trade-offs with using natural language it's it's uh you can express uh you know much more concisely um programs that you want to run but then they're not runnable programs you actually have to check them inductively um so that was the trade-off but it was worth it for ArchV2. Yes, so fascinating. And just for the audience, we've been using transduction and induction to distinguish predicting the solution space versus predicting a program. I had this discussion with Clem and Bonnet, need not detain us now, but I think in traditional machine learning, transduction means that the test example is a function of your prediction. I had this discussion with the architects as well, that when you have this natural language description, natural language is more expressive, which simply means that there are more degrees of freedom. And this is the beauty of LLMs, that there's this huge kind of space that you're traversing around. And when you use natural language, you can just traverse to more places in that space more easily. So it seems like it would be a win. And I'm really fascinated to find out whether that is just like a huge component of your solution, because on Eric's solution, he's still predicting programs and still doing very well. So I'm not sure about that. And the other thing is, I wasn't entirely sure whether you are actually using a transductive method. So in your solution checker agent, is it directly going to the solution space or is it generating a program and testing it? In the checker? Yeah. In the checker, it takes in the natural language and then it outputs a grid. That's all it does. It just outputs a grid. Okay, cool. So you've moved to a transductive modality. Did you see any errors in that? So did it sometimes produce the wrong grid or? All the time. Yeah. Yes. Yeah. All the time. 
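The trade-off described here is that an English rule, unlike a Python one, cannot be executed directly: a second, stronger "checker" model has to follow the instructions on each training input and output a grid, and only rules whose grids match are trusted. A hedged sketch of that check, where `checker_apply(instructions, grid) -> grid` is an assumed wrapper around the checker model:

```python
def checker_verify(instructions: str, train_pairs, checker_apply) -> float:
    """'Run' a natural-language rule by asking the checker model to apply it
    to every training input, and return the fraction of outputs it reproduces.
    Each check costs a model call, and the checker itself can emit a wrong grid,
    which is why a strong checker matters more here than for the rule writer."""
    hits = sum(checker_apply(instructions, inp) == out for inp, out in train_pairs)
    return hits / len(train_pairs)

# A rule is only trusted (and applied to the test input) when checker_verify
# returns 1.0 on the training pairs -- the natural-language analogue of
# exec()-ing a candidate Python program and comparing its outputs.
```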
um and it's worth noting actually uh you know the python solution was i obviously tried my v1 solution on v2 right um and it wasn't so much worse but part of what i wanted to do with v2 is show that as language models get more powerful and we get to use thinking models we can start using more general solutions and i just thought there was something elegant about using natural language and then it also happened to be that there were problems that i could tell the python functions we're never going to get. And there are basically no programs that ARC v2, my solution for ARC v2, won't get. So when Grok 6 comes out or GPT 7 comes out, you can use my v2 solution and it will win. It will beat ARC. That is not the case for my v1 solution. The other important thing is you're now using Grok 4, which is very, very powerful. I assume you chose Grok 4 because it just happened to be the winner on the leaderboard for the base ArcV2. And how much uplift is coming from that? So for example, if you used Grok4 on your solution last year, how much better would it be? So I actually don't think it would be so much better for some reason. And this is what I talk about in my blog post. These language models are very spiky in certain things where they were trained heavily on, right? So I think what happened with Grok is there was a distribution of similar shape tasks, grid tasks, just reasoning in the type of general direction that allowed Grok to have a special capability in this area. And I actually tested each model. So I tested Grok versus GPT. I didn't just go by the leaderboard. And Grok definitely outperformed. The problem is for my V1 solution, you also have to generate code. And Sonnet 3.5 is really good at thinking about code and generating code. And I prefer Sonnet to Grok for code generation. So my guess actually would be that if you use my v1 solution it's highly possible uh you know opus 4.1 would be the best i haven't tested that that would be very expensive on opus 4.1 but maybe it's worth testing um but i think that the general idea is that uh these networks are very spiky when you get into specific domains the net uh it actually very much matters which model you use and the arc is a great example of this right like the leaderboard is super spiky uh in ways that uh other benchmarks or not. I did an interesting interview at NeurIPS last year with the Google guys, and they were talking about this adaptive temperature and language models for reasoning, because, you know, there's this constant trade-off between reasoning, we want to be quite constrained, right? So we actually want to kind of like go a particular pathway. We want to be constrained by our knowledge. And when we're being quite creative and flexible, we want to be able to go in different places. 
and I'm really interested in creativity for example and and I think creativity is like you you you it's very similar to reasoning as Cholet talks about you know you're composing together these constraints there's this phylogeny of knowledge and you need to respect it as much as possible because if you don't respect it you're not grounded anymore so it kind of feels to me that intuitively code is great because it means that I'm actually respecting the constraints and the semantics are correct and it's grounded in in the real world do you feel in any way that by using these natural language descriptions that you're kind of creating something which might by dint of chance or search find the right solution but is isn't correct and verifiable yes okay yes tell me more um yes i i for sure um i think generally uh when models think in natural language and they output a natural language, they are higher entropy, right? I think the second you start prompting with code, they go into code mode. And there are a lot of papers that show just by prompting it in a certain direction, it activates certain weights that are just naturally lower entropy. But that was part of a thing that I wanted. I actually wanted to introduce entropy because, you know, still most arc tasks for V2, the models don't get close, right? You know, my solution was the top and it's at 30%. So I actually wanted to inject as much entropy as possible, which is partially why my prompts are so broad. You know, I could definitely improve my accuracy on a few tasks by making the prompts more specific, but I wanted to just constantly berate it, like more entropy, more entropy. So I actually found that to be a positive, not a negative. Interesting, interesting. on the efficiency of the solution so the o3 model from open ai that was about 200 per task and that was i think it was did we ever find out i think it was sampling right so they just sampled it much of times they had a basic verifier is that correct um i don't think we ever figured that out so you think it could be so because when i interviewed chole he was he was being quite kind of not cagey but it seemed like he was suggesting they were actually performing a search algorithm and and i think the open ai guys said on twitter no they were just doing very basic sampling and i think they even published the code that they used um yeah i'm not quite sure yes i i also spoke to the open ai guys about this and i i i'm not sure after all of this what they were doing i i think it's probable that they were doing sampling okay i mean it took a very you know uh hard to imagine they weren't doing they were certainly doing sampling i i'm not sure what else they were doing, whether the model is fine-tuned. My best guess is that they were sampling and it actually was not a fine-tuned model. Oh, very interesting. Yeah, I remember there was that big hoo-ha at the time that they, you know, it was scandalous that they were training on the training set. But anyway, that's one side. There's also like the thought occurs that if they did do something like what you were doing, so, you know, like this approach of iterative refinement with verification at every single step, would they have done even better? For sure. Yeah, for sure. I think OpenAI generally, they want to do the right thing, and they want their solutions to be very general and broad. And this is the sense I get. I think it's to their culture. And I spoke to the OpenAI guys. 
They did include the training data in that O3 model, but I think that's fair game, right? I don't think they fine-tuned on it, right? So it's just part of the corpus uh that when it's pre-training which to me is fair game right this is this is totally fine okay very cool so so on arc v1 that their efficiency was 200 per task what was your efficiency oh on arc v1 i maybe 10 something like that um i need to something like that okay in order of magnitude maybe maybe i need i need to check yeah so talk to me more about this So, and Eric Pang's solution, he was slightly, came in slightly worse than you, but I think he was a fair bit more efficient. I think his one, I'm going off my memory now, was it about $8 per task? Was yours about $30 per task on ArcV2? Oh, this is the V2. So my, yeah, my latest solution. Right. My latest solution was around $30 on V2 and $8 on V1. Oh, okay. And I think Eric's one was slightly more efficient. And he was indicating that it was because he was doing the library learning and transfer. And even that, I was left kind of thinking, first of all, it's interesting that you got better results. And is that because there isn't much transfer? Where does the library transfer come into this? Because maybe the broader question is, if you were to make your solution significantly more efficient, what would you do? I had a version that does do library transfer. Basically, I would save the traces from training and try and basically pull those in during test time. And I actually just out of simplicity sake, because I was getting such high scores with the simple solution, I wanted to just push the simple solution. And I might actually, we'll see if someone's going to beat my score, I might bring that back in. um i it's for sure that will improve the score and it's useful there is a lot of transfer efficiency i just found what i was doing very um elegant and so i i actually liked like to keep it um but um you know no third party dependencies or anything like that uh but that for sure helps accuracy um i think the fundamental reason why i got higher and also i could match his efficiency and i would still get higher um because i was using natural language natural language is a much more efficient um area to play in yeah that's at least what i found yeah i just wonder how close do you think are we getting to the kind of pareto optimal of this approach i mean just to give you a few examples um we interviewed the alpha evolve team that was fascinating and maybe you can contrast with with those guys sakana ai yesterday yeah robert lange was the first author they've released this um i think it's called shrinker and that was a kind of similar kind of evolutionary you know program thing and they had some cool features in there like you know using bandits and using UCB and I guess like are we getting to the point where we're going to really figure out what is the most optimal way to do this by the way they were also switching between different foundation models I think improvements will be log right will be will be logarithmic so I wouldn't expect using these basically using the language models we have today I would not expect anyone to break, let's say 40%. But you could probably make my solution twice as efficient, I would say. You wouldn't get more than a few percentage points more accurate, is my guess. But you could make it a lot more efficient. There's a ton of efficiency gains to be made. 
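The library-transfer variant mentioned here, saving traces from training-set tasks and pulling similar ones back in at test time, can be pictured as a small retrieval store. Everything below (the fingerprint, the similarity measure, the class itself) is a naive illustration, not the version Berman describes shelving.

```python
class TraceLibrary:
    """Store winning traces from solved tasks and retrieve the closest ones
    to seed the prompt for a new task."""

    def __init__(self):
        self.entries = []  # (fingerprint set, trace text)

    @staticmethod
    def fingerprint(train_pairs) -> set:
        """A crude task signature: grid size and colour count of the first example."""
        grid = train_pairs[0][0]
        colours = {cell for row in grid for cell in row}
        return {("h", len(grid)), ("w", len(grid[0])), ("colours", len(colours))}

    def add(self, train_pairs, trace_text: str):
        self.entries.append((self.fingerprint(train_pairs), trace_text))

    def nearest(self, train_pairs, k=3):
        """Return the k stored traces whose fingerprints overlap most with this task."""
        fp = self.fingerprint(train_pairs)
        ranked = sorted(self.entries, key=lambda entry: len(fp & entry[0]), reverse=True)
        return [trace for _, trace in ranked[:k]]
```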
We've been dancing around this a little bit, that, you know, Chollet's measure of intelligence was all about resisting memorization. And there is this question now, you know, which is, to what extent are we actually building systems that we might call intelligent? And he says that intelligence is simply like the efficiency of knowledge acquisition. And I'm really on board with that. And I think it's fair to say at the moment that, let's say, like, you know, your solution and Greenblatt's solution, it's quite ephemeral and stateless, which is to say that when you have a new task come along, you kind of start again from scratch, which means it's not really, like, adapting and acquiring new knowledge and transferring that knowledge. So maybe you would agree that in the spirit of Chollet's measure of intelligence, at the moment, it's more of a kind of searching approach. But what do you think we would need to do to kind of, you know, make it more adaptable? Right. So I think test-time fine-tuning would be the way to, like, fundamentally make it adaptable. But I also think, you know, Chollet hits at a core problem with language models, which is their reasoning is domain-specific, right? I kind of, in my blog post, I described that when you train a language model to reason about math, for some reason, most of the reasoning circuits it just gained live in the math weights. And then you try and train it on science. And there's some generalization, but not as much as you would want. And I think not nearly as much as what humans get. Humans have this generalization engine that is our reasoning capability. And this is the fundamental hole in language models today. And I think, in fact, actually, I would say, I actually, you know, generally agree with Francois. And he says, you know, you can always teach a language model a skill, right? But it's the meta skill. It's the skill to create the skills that is AGI. And to me, that's reasoning. Like, reasoning is that meta skill. And so to put it another way, I think if you fundamentally learn the skill of reasoning, you should be able to then apply that skill to learn all the other skills. That is the meta skill. And we need to figure that out. And so that is the fundamental problem. And you need to do whatever you can, you know, kick whatever weights out you need to align the model to reason. And then from there, you have a foundation from which you can actually build general intelligence. So I guess I don't know if that was, that's maybe a higher-level answer to your question. But I think, you know, what I'm focused on is really just fitting all of reasoning into these models. And I don't really care what else is left. I just want all of reasoning. Yes, I pretty much agree. And I mean, you probably know that I'm Chollet's biggest fan. So I've obviously, you know, been a huge fan of his for years. But by the way, he's just released revision three of his Deep Learning with Python book. And I recommend folks to read chapter 19. You can actually read it online for free. And he sketches out this entire vision. You know, it's so exciting. And I think that just to see it so beautifully articulated, because there is a bit of an elephant in the room and in the scene at the moment. I think so many people just don't have such a crisp understanding.
But the only departure I make from Chollet, and from yourself, Jeremy, is that Chollet really focuses on behavioral tests of intelligence: it's reasoning if it can pass the test and actually get the right answer. I think we need to go further, and this is where I was talking about systematicity and symbolic AI: how you got there is important. It's possible to get the right answer for the wrong reasons. If we have a system that has semantics, so we actually know what the symbols mean and we've composed them together in a principled way, then not only do we get the right answer for the right reasons, we also get something evolvable, an efficient epistemic base that allows us to go on acquiring new knowledge in the future. And that, to me, points to the need for a mechanistic view of how we're acquiring this knowledge. Would you agree with that? Yes, though I think about it a bit differently, so let me know if what I say is aligned with what you think. To me, pre-training is kind of the opposite of what you described. I see two types of knowledge. There's knowledge that is memorized, like the capital of New York or the Spanish language, and then there's knowledge that is deduced, like physics: special relativity, general relativity. From axioms you can deduce these things; it's a causal tree. Whereas something like "what is the capital of North Dakota?" is a knowledge network; it's not deductive, it's not a tree. And I think pre-training treats all knowledge as a knowledge web. It's embeddings that are connected, but there's no guarantee that you have the correct causal relationships between things, and this is where the memorization comes in. This is actually where compression fits into intelligence. My view is that intelligence is compression, in the sense that you should be able to build a knowledge tree from almost nothing. You can deduce so much of math; you can deduce special relativity from the very roots of physics. Einstein was extremely intelligent because the hints he needed to come up with special relativity were close to zero: he could start from almost nothing and build up this deductive tree. I think reinforcement learning and reasoning are the process of pruning our knowledge network and replacing it with this tree, and until we have weights that represent the actual deductive nature of knowledge, we won't get generalization. I don't know if this fits, but that's how I think about reinforcement learning: replacing a knowledge web with a knowledge tree. Yes, yes, this is brilliant; we're getting to the center of the bullseye here. I remember reading in the first version of your blog post that we need to do this kind of deduction where we synthesize hypotheses and then test them, this generate-and-test loop. That is what creativity is; it's what reasoning is. When I first read Chollet's paper years ago, I didn't understand whether he was talking about acquisition or synthesis, and I now understand he's talking about synthesis. So reasoning is like Lego: you build this epistemic tree.
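Here is a hedged sketch of the generate-and-test loop discussed above: propose natural-language hypotheses for a task's transformation rule, turn each into a program, and keep only the hypotheses that explain every training pair. `llm_propose` and `llm_to_program` are hypothetical helpers standing in for model calls; this is an illustration of the idea, not Berman's actual code.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]

def generate_and_test(train_pairs: List[Tuple[Grid, Grid]],
                      llm_propose: Callable[[List[str]], List[str]],
                      llm_to_program: Callable[[str], Callable[[Grid], Grid]],
                      rounds: int = 5) -> List[str]:
    surviving: List[str] = []
    for _ in range(rounds):
        # Ask for new hypotheses, conditioned on whatever has worked so far.
        hypotheses = llm_propose(surviving)
        for h in hypotheses:
            program = llm_to_program(h)       # compile description -> function
            try:
                ok = all(program(x) == y for x, y in train_pairs)
            except Exception:
                ok = False                    # broken programs simply fail the test
            if ok:
                surviving.append(h)
        if surviving:
            break                             # some hypothesis explains all demos
    return surviving
```

The key property is that the loop is selection over descriptions: the grids themselves verify each hypothesis, and only verified descriptions feed the next round.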
And actually, this is what we do. There's a difference between knowing and understanding. Knowing is at the high level, while understanding is tracing down through the whole structure, like a big Lego construction, through all the building blocks of how you got there. I actually think that even when you teach students at university, you obviously teach them facts, but then they synthesize their understanding over time; they're doing this composition, and the way they get there matters. We need to build systems that do this. And there's the perennial problem you mentioned: in deep learning we start with this big pattern network and sparsify it, whereas I think reasoning should be more about synthesizing from building blocks. When you synthesize, you can do types of reasoning that are not in the training data. You can build things that simply aren't there; you can just think about things and figure things out. Do you think of that as a gap? Yes, I think that's exactly right. Then the question is whether you can build that system with language models or not, and I think you can. The fact that we're slowly climbing in our ability to synthesize new information is a testament to what I'm saying: reinforcement learning with verifiable rewards fundamentally ensures that whatever circuits led to the right answer must be consistent with the deductive tree. It's basically asking: can you replace all of your pre-trained weights with weights that are coherent with the environment? The problem is that there are so many weights from pre-training that it's very difficult. So one of my hot takes is that pre-training in many ways slows down reasoning; it makes it harder to reason. The analogy I draw is that you have consultants who know the names for things but couldn't deduce the thing, and then you have Feynman, who can deduce anything. Reinforcement learning is turning your consultant into Feynman. This is what I'm most interested in, and it's interesting because you get to play on both sides. You get to play with pre-training: okay, maybe we shouldn't include these things, and let the model figure them out in reinforcement learning, because there's no guarantee that if we pre-train it like this it's going to have the proper deductive circuitry, so maybe this is best left for post-training. This is a hot take; it's not what people currently think. People think: let's jam as much information as we can into pre-training and then reinforcement-learn when we need to. But I think that could be incorrect. Yeah, and I pretty much agree with you, just with the caveats we discussed previously: that we do it on the basis of representations that are actually grounded in the world, rather than things that just happen to give you the right answer for the wrong reasons. That's absolutely true. Just a little bit of a curveball: I think in the first version of the article, you said you were inspired by Yann LeCun's JEPA, these joint embedding predictive architectures.
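To pin down what "verifiable reward" means in this discussion, here is a minimal sketch of a reward function that returns credit only when a generated program exactly reproduces every example, so any policy update (PPO, GRPO, or similar) can only reinforce reasoning that is actually consistent with the task. The convention that the model emits a `solve(grid)` function is an assumption made up for this example.

```python
def verifiable_reward(program_source: str, examples) -> float:
    """Return 1.0 iff the candidate program reproduces every example exactly.

    `examples` is an iterable of (input, expected_output) pairs. Any crash,
    missing function, or wrong output earns zero reward.
    """
    namespace = {}
    try:
        exec(program_source, namespace)        # expects the model to define solve(grid)
        solve = namespace["solve"]
        return float(all(solve(x) == y for x, y in examples))
    except Exception:
        return 0.0
```

The binary, externally checkable signal is the whole point: there is no way to earn it by sounding plausible, only by being right.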
And he's also a big advocate of energy-based models, which are really cool. I don't know if you've seen the recent couple of papers applying them to transformers, where essentially it's a step towards probabilistic models: you get uncertainty quantification, you can do counterfactuals, and you actually have to solve an optimization problem at inference time, so you can do adaptive computation. It's all very exciting. I still have some reservations, but do you think architectures like that are exciting? I think they are exciting. I'm a bit less excited about them now, not because of their merits, but because I think I was underrating transformers when I wrote that. That was before I started being an actual researcher, building transformers and actually coding with them. Since then, I've gained a new appreciation for language models. Where I was coming from was that language models seemed to be overfitting to the next token, and JEPA is so interesting because all of a sudden you have them predicting concepts. Fundamentally, we care about concepts; the words don't really matter, it's the concepts that matter. But I think language models do operate at the conceptual level in the hidden layers, and that was something I slowly came to realize. So I think there's a lot of potential in JEPA frameworks, they're really cool, and I hope people keep pulling on them, but I think most of the benefits I thought came from JEPA already exist in language models; I just didn't see it at the time. A lot of people get an earworm where they get obsessed with an idea and think about it all the time. What is that thing for you? If we do have these language models whose weights are aligned with this tree of deduction, it seems like we're still missing one more thing, which is creativity, which we touched on. You can have the correct deductive tree, but then how do you search through all of the possible premises you can add to this tree? How do you find the right ones? There are experiments I'm looking forward to doing. One of them is ablating pre-training data at will and then building a reinforcement-learning environment to have the model regenerate that information. For example, say you have the ability to ablate special relativity, and all of the physics that came from special relativity, from your pre-training data. That is a gold mine of an environment, because now you can prompt the model, you can do everything, you can really try to get it to deduce special relativity. My hunch is that part of the reason models are not yet great at coming up with novel solutions and information is that they don't have the circuitry of invention. That is a circuit that needs to be developed, and we don't have the environments to develop it yet. I actually just saw today that OpenAI released a math paper, I skimmed it, where they almost came up with novel conjectures, something like that. And that's exactly what was in my head: I want to build environments where the model has never seen something and tries to deduce new things that are outside of the distribution, and over time it learns and practices and builds this invention circuit.
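A toy sketch of the first half of the ablation experiment described above: filter a corpus so no document touches the target concept before pre-training (the later RL stage that rewards rederiving it is not shown). A keyword match is a crude stand-in for whatever classifier a real ablation would need; the patterns here are illustrative assumptions.

```python
import re

ABLATION_PATTERNS = [
    r"special relativity",
    r"lorentz transformation",
    r"time dilation",
]
_ablate = re.compile("|".join(ABLATION_PATTERNS), flags=re.IGNORECASE)

def ablate_corpus(documents):
    """Drop every document that mentions the ablated concept."""
    return [doc for doc in documents if not _ablate.search(doc)]

docs = [
    "Newtonian mechanics describes motion at everyday speeds.",
    "Special relativity follows from the constancy of the speed of light.",
]
print(ablate_corpus(docs))   # only the Newtonian document survives
```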
So I think it's two things: reinforcement learning to make sure the knowledge tree is consistent, and then making sure the model has the circuitry to pull from its entire corpus of understanding of the world and bring that in to fuel the innovation engine. Yeah, I think we're so close. The only slight disagreement is whether there could be such a thing as an invention circuit. It seems to me, just from that LLM biology paper and so on, that it would be very patterned in the weights, templated, specific to certain domains, and I feel that we can do it, but we would need to build a controller on top. I also feel, by the way, that creativity is very domain-specific. What I mean by that is, in the Kenneth Stanley view, there's this big phylogeny of knowledge, and I've noticed that when I hire creative professionals, an editor for example, they can't edit my show even if they're really good at editing other people's shows, because they simply don't know anything about machine learning. I've discovered the crazy degree to which creativity is domain-specific, and I wonder whether there is an algorithm for creativity. I have this epistemic lens on creativity: I can access all of the ancestors in my tree and do some composition, like building Lego, and if I want to use something from another branch of the tree, maybe it's compatible and I can bring it in, or maybe I need to start a new branch, or maybe I need to jump over to the other branch. I'm not sure whether I'm applying the same algorithm when I'm doing that. I think you actually are. And it's not sufficient to just be creative in this case; you need to be creative and you need to be knowledgeable. Otherwise you're creative, but you can't build the tree. I think your editors can't build the tree because they don't have the deductive footing. That would be my best guess, but that's an interesting perspective; I want to think about that. Fundamentally, creativity is knowing which axioms to include in the next branch of the tree. You're at level five of the tree; how do you get to level six? It's knowing which assumptions to pile in to get to level six. It's so beautiful. We're writing an article about creativity at the moment, and I believe that in order to be creative, the depth of understanding of the tree is very important. And, per Chollet, intelligence is the efficiency with which you can acquire that history. A university professor understands the tree very deeply, and that actually makes them unintelligible to a normal person. When Stephen Wolfram talks about the Ruliad all the time, people have no idea what he's talking about. He's actually being very expressive; he's talking about things at a level of abstraction that can refer to anything, but it's beyond most people's cognitive horizon. But there's something to be said for it: when the creative stepping stones you take respect the history deep down into the epistemic tree, they actually have more evolvability, because you're still grounded in the real world; you're not becoming incoherent. So there's something there about really knowing things deeply that is important. Yes, yes.
For the record, I think understanding is a spectrum. On one end it's memorization, which is zero understanding, and on the other end it's the ability to deduce, and to deduce correctly. I agree with what you said: it's not enough to just have the right proof, you actually have to have understood the tree. That is understanding, and then intelligence is just how many things you understand, how wide and high your garden of trees is. Yeah, we're pretty similar. I would say intelligence is the efficiency with which you can acquire the tree, reasoning is building the tree, discursive reasoning is executing the tree, and understanding is simply possession of the tree. So intelligence to you is the speed at which you can build the tree, not how many trees you have, or how large your tree is? Yeah, and I think understanding is how much of the tree you have. I think that's correct, yes, because you could have a very intelligent child who doesn't know a lot about the world but has the ability, the potential, to build a tree. Yes, I think that's correct; there's a spectrum of understanding. So language models, famously, don't understand the tree very deeply; they only understand it a few levels down. When language models are doing autonomous generation, the reason we have to do so many different generations and select the best one is that the model isn't grounded; it doesn't understand the tree very deeply. We can overcome that because we understand the tree deeply: we can put in a prompt that constrains the generation, and now we can make them act as if they understood the tree when they didn't. But we need to build models that do understand the tree deeply, and then we can trust them to generate autonomously. Yes, that is a very good way of saying what I think, and that is what I'm focused on. That's a really good way of putting it: forcing the language models to develop these deep trees from the ground up. You can only develop it from the ground up, I think. So we need to come up with new techniques, new environments, to grow the trees, instead of pre-training, which is pre-filling. It's not random, but it's a web; it's not a tree. Yes. There's also the vexed issue of what happens at the bottom of the tree.
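The "generate many, select the best" step mentioned here is often implemented as a simple vote over candidate outputs. The sketch below is purely illustrative of that selection idea, not any submission's exact procedure: each surviving program predicts an output for the test input, and the most common predictions are kept.

```python
from collections import Counter

def select_predictions(candidate_programs, test_input, top_k: int = 2):
    """Vote over candidate programs' outputs and return the top_k most common."""
    votes = Counter()
    by_key = {}
    for program in candidate_programs:
        try:
            prediction = program(test_input)
        except Exception:
            continue                     # ignore programs that crash on the test input
        key = str(prediction)            # hashable key for voting on nested lists
        votes[key] += 1
        by_key[key] = prediction
    return [by_key[k] for k, _ in votes.most_common(top_k)]
```

Keeping two answers mirrors benchmarks that allow a couple of attempts per task; the vote is doing the work that a deeper, grounded model would ideally do on its own.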
So Chollet argues that this Spelke core knowledge consists of knowledge primitives so fundamental that inside their deductive closure we can talk about anything. You can still get lost in different parts of the tree, I suppose, and there are some issues of intelligibility going between them, but if you understood the tree deeply enough you could go anywhere. But maybe there are different trees; like in physics there are different levels of description for understanding the universe. So do you think it's one big tree or lots of trees? I think it's gated by the laws of whatever you're doing. I've looked into this a few times: there are some axioms you need to take for granted in deducing, say, certain forms of math, and those would be their own tree, because you can't get to one part without the other. If you take this for granted, then you can't deduce this, but you can deduce that. So basically they have to form a logical chain, and of course there might be multiple chains. But maybe everything is grounded in logic, right? Logic is the fundamental. I guess that is true, actually: logic must be the fundamental block of a tree, because everything comes from logic; if we didn't have logic, we couldn't have trees. So I guess it would all be one tree, and it's a logic tree. And I guess there are people who subscribe to logic trees, and then there are people who are illogical and don't have trees at all. Jeremy, it's been an absolute honor to have you on the show. Just before we go, are you hiring, or is there anything you want to say to the audience? For sure. At Reflection we're building open intelligence models, so we're hiring across the stack: pre-training, post-training, large language models. We have a lot of GPUs, so if you're interested in pre-training or post-training, we're in SF, we're in New York, and we're in London. Definitely check out our site, or you can just hit me up on Twitter. Amazing. Jeremy, I've really enjoyed this, thank you so much. Awesome, thank you. Just keep doing what you're doing, man. I really, really think you're onto something here, obviously with the minor discussion about how exactly we're going to do this, but I think the direction is quite clearly set. I know it's such a vexed issue, though. I'm interviewing a bunch of cognitive scientists in Japan next week, and you can really go down the rabbit hole on this. For example, I'm a big fan of externalism, of enactive cognition, and there are also all these philosophical views where consciousness is basically a property of certain types of physical material. Because what we're talking about here with understanding, if you abstract it into physics, is certain types of causal graph.
And you can argue that certain types of cognition actually require certain types of physical instantiation, where in the middle of that graph you have material that is capable of producing consciousness, so some components of understanding are phenomenal, they're conscious. When you start taking it to this philosophical level, there's almost no end to it, because you'll always have people who argue against functionalism and say that cognition must be physically instantiated in a certain way. I'm not sure where I am on that, because even the cognitive scientists are saying, okay, well, we have to admit that even though we can make all of these arguments, I mean, fuck me, these LLMs are doing so well. But yeah, I think it'd be a fascinating discussion, because a lot of neuroscientists are internalists; like Jeff Hawkins, they think that everything happens in the brain, that we have these sensorimotor circuits and this master algorithm in our neocortex that does all of the things. I think there's something to be said for that, but it does ignore quite a lot of the rest of the field. Yeah, well, it all comes down to whether we could build machines that actually have a deep, grounded understanding of the world. Let's assume that doesn't actually have to be physically grounded: if the representation is grounded in the sense that it's a faithful description of what is happening out there, and we can do this creative reasoning on that understanding, then what's to stop us from inventing new things? That's basically the thesis, right? Yeah. Beautiful stuff, Jeremy, thank you so much, man. I really, really appreciate this.
Related Episodes

The Mathematical Foundations of Intelligence [Professor Yi Ma]
Machine Learning Street Talk
1h 39m

Pedro Domingos: Tensor Logic Unifies AI Paradigms
Machine Learning Street Talk
1h 27m

Why Humans Are Still Powering AI [Sponsored]
Machine Learning Street Talk
24m

The Universal Hierarchy of Life - Prof. Chris Kempes [SFI]
Machine Learning Street Talk
40m

Google Researcher Shows Life "Emerges From Code" - Blaise Agüera y Arcas
Machine Learning Street Talk
59m

The Secret Engine of AI - Prolific [Sponsored] (Sara Saab, Enzo Blindow)
Machine Learning Street Talk
1h 19m