
![He Co-Invented the Transformer. Now: Continuous Thought Machines - Llion Jones and Luke Darlow [Sakana AI]](https://d3t3ozftmdmh3i.cloudfront.net/staging/podcast_uploaded_episode/4981699/4981699-1763918480508-6b65ec4dfad2b.jpg)
He Co-Invented the Transformer. Now: Continuous Thought Machines - Llion Jones and Luke Darlow [Sakana AI]
Machine Learning Street Talk
Episode Description
The Transformer architecture (which powers ChatGPT and nearly all modern AI) might be trapping the industry in a localized rut, preventing us from finding true intelligent reasoning, according to the person who co-invented it. Llion Jones and Luke Darlow, key figures at the research lab Sakana AI, join the show to make this provocative argument, and also introduce new research which might lead the way forwards.

**SPONSOR MESSAGES START**

—

Build your ideas with AI Studio from Google - http://ai.studio/build

—

Tufa AI Labs is hiring ML Research Engineers: https://tufalabs.ai/

—

cyber•Fund https://cyber.fund/?utm_source=mlst is a founder-led investment firm accelerating the cybernetic economy

Hiring a SF VC Principal: https://talent.cyber.fund/companies/cyber-fund-2/jobs/57674170-ai-investment-principal#content?utm_source=mlst

Submit investment deck: https://cyber.fund/contact?utm_source=mlst

—

**END**

The "Spiral" Problem – Llion uses a striking visual analogy to explain what current AI is missing. If you ask a standard neural network to understand a spiral shape, it solves it by drawing tiny straight lines that just happen to look like a spiral. It "fakes" the shape without understanding the concept of spiraling.

Introducing the Continuous Thought Machine (CTM): Luke Darlow deep-dives into their solution, a biology-inspired model that fundamentally changes how AI processes information.

The Maze Analogy: Luke explains that standard AI tries to solve a maze by staring at the whole image and guessing the entire path instantly. Their new machine "walks" through the maze step by step.

Thinking Time: This allows the AI to "ponder". If a problem is hard, the model can naturally spend more time thinking about it before answering, effectively allowing it to correct its own mistakes and backtrack, something current language models struggle to do genuinely.

https://sakana.ai/
https://x.com/YesThisIsLion
https://x.com/LearningLukeD

TRANSCRIPT:
https://app.rescript.info/public/share/crjzQ-Jo2FQsJc97xsBdfzfOIeMONpg0TFBuCgV2Fu8

TOC:
00:00:00 - Stepping Back from Transformers
00:00:43 - Introduction to Continuous Thought Machines (CTM)
00:01:09 - The Changing Atmosphere of AI Research
00:04:13 - Sakana's Philosophy: Research Freedom
00:07:45 - The Local Minimum of Large Language Models
00:18:30 - Representation Problems: The Spiral Example
00:29:12 - Technical Deep Dive: CTM Architecture
00:36:00 - Adaptive Computation & Maze Solving
00:47:15 - Model Calibration & Uncertainty
01:00:43 - Sudoku Bench: Measuring True Reasoning

REFS:
Why Greatness Cannot Be Planned [Kenneth Stanley]
https://www.amazon.co.uk/Why-Greatness-Cannot-Planned-Objective/dp/3319155237
https://www.youtube.com/watch?v=lhYGXYeMq_E

The Hardware Lottery [Sara Hooker]
https://arxiv.org/abs/2009.06489
https://www.youtube.com/watch?v=sQFxbQ7ade0

Continuous Thought Machines [Luke Darlow et al / Sakana AI]
https://arxiv.org/abs/2505.05522
https://sakana.ai/ctm/

LSTM: The Comeback Story? [Prof. Sepp Hochreiter]
https://www.youtube.com/watch?v=8u2pW2zZLCs

Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis [Kumar / Stanley]
https://arxiv.org/pdf/2505.11581

A Spline Theory of Deep Networks [Randall Balestriero]
https://proceedings.mlr.press/v80/balestriero18b/balestriero18b.pdf
https://www.youtube.com/watch?v=86ib0sfdFtw
https://www.youtube.com/watch?v=l3O2J3LMxqI

On the Biology of a Large Language Model [Anthropic, Jack Lindsey et al]
https://transformer-circuits.pub/2025/attribution-graphs/biology.html

The ARC Prize 2024 Winning Algorithm [Daniel Franzen and Jan Disselhoff], "The ARChitects"
https://www.youtube.com/watch?v=mTX_sAq--zY

Neural Turing Machines [Graves et al]
https://arxiv.org/pdf/1410.5401

Adaptive Computation Time for Recurrent Neural Networks [Graves]
https://arxiv.org/abs/1603.08983

Sudoku Bench [Sakana AI]
https://pub.sakana.ai/sudoku/
Full Transcript
Despite the fact that I was involved in inventing the Transformer, literally no one's been working on them as long as I have, right? With maybe the exception of the other seven authors. So I actually made the decision earlier this year that I'm going to drastically reduce the amount of research that I'm doing specifically on the Transformer, because of the feeling that I have that it's an oversaturated space, right? It's not that there are no more interesting things to be done with them. And I'm going to make use of the opportunity to do something different, right? To actually turn up the amount of exploration that I'm doing in my research. We just released the Continuous Thought Machine. It's a spotlight at NeurIPS 2025 this year. You should care about it because it has native adaptive compute. It's a new way of building a recurrent model that uses higher-level concepts for neurons and synchronization as a representation, which lets us solve problems in ways that seem more human by being biologically and nature inspired. The atmosphere in AI research was actually quite different back during the Transformer years, because it doesn't feel like something similar could actually happen right now, because of the reduced amount of freedom that we have, right? The Transformer was very, very bottom-up, right? It's not that somebody had this grand plan that came down from on high that this is what we should be working on. It was a bunch of people talking over lunch, thinking about what the current problems are and how to solve them, and having the freedom to have, you know, literally months to dedicate to just trying this idea and having this new architecture fall out. We've spent hundreds of millions of dollars. The biggest sort of evolution-based search is probably in the tens of thousands. We have all this compute. What happens? What happens if you scale up these search algorithms? And I'm sure you'll find something interesting, you know, when someone eventually does bite that bullet and really scale up these evolutionary sort of A-life experiments. Because I pitched it in an environment where people were just going all in on this one technology, I got zero interest. So now I have my own company and I can pursue those directions. This podcast is supported by Cyberfund. Hey folks, I'm Amar, product and design lead at Google DeepMind. We just launched a revamped Vibe Coding Experience in AI Studio that lets you mix and match AI capabilities to turn your ideas into reality faster than ever. Just describe your app and Gemini will automatically wire up the right models and APIs for you.
And if you need a spark, hit I'm Feeling Lucky and we'll help you get started. Head to ai.studio/build to create your first app. Tufa AI Labs is a research lab based in Zurich. They've got a team of amazing ML engineers and research scientists, and they're doing some really cool stuff. If you look at their website, for example, you can see what their approach was for winning the ARC-AGI-3 Pub competition which closed out a few months ago. They are hiring amazing ML engineers and research scientists, and they also care deeply about AI safety, so if any of that is a fit for you, please go to tufalabs.ai and give it a go. The audience will know I'm a huge fan of Kenneth Stanley's ideas. So his book, Why Greatness Cannot Be Planned, changed my life. It was absolutely insane. And what he was speaking to is that we need to allow people to follow their own gradient of interest, unfettered by objectives and committees and so on. Because that is how we do epistemic foraging: when you have too many agendas involved in the mix, you kind of end up with a grey goo and you don't discover, you know, interesting novelty and diversity. And I suppose that's basically the thesis of your company, Sakana, is to lean into those ideas. Yes, exactly. At the company, we're massive fans of that book. We're hoping to have him come and talk at our company next week, actually. And it's a philosophy that we do talk about internally. We have copies of the books, including the recent Japanese translation. As you know, I'm one of the co-founders, and one of my main jobs, one of the main things that I have to keep doing for this company, is making sure that we protect the freedom that the researchers currently have, right? Because it's a privilege, really, that we have the resources to be able to do that. And inevitably, as I've seen happen, as the company grows, more and more pressure comes in and it narrows the freedom. But I think because, you know, we believe in this philosophy so strongly, I'm hoping that we can give people all the research freedom that we do now for as long as possible. And what are those processes that curtail freedom as a company matures? I mean, how would you describe that? It's great that there's never been so much interest and people and talent and resources and money in the industry. But unfortunately, that just increases the amount of pressure people have in order to compete with all the other people working on it and trying to get the value out of this technology and making money. And I think that's what just happens. As a startup, you have a feeling of excitement and trying something new. And right at the beginning, you have a bit of a runway; you have the freedom to try different things, but inevitably people start asking for returns on their investments, or they're expecting you to churn out some product. And this just unfortunately reduces the creativity that the researchers have, because, you know, the pressure to publish or the pressure to create technology that's actually useful for the products that we have goes up. And so the feeling of autonomy, I think, starts to go down. But, you know, I literally tell people when they start working for the company: I want you to work on what you think is interesting and important. And I mean it. There is, I mean, on YouTube, there's a phenomenon called audience capture. Right.
And I think there might be a phenomenon called technology capture, which is that in the early days of Google, it was quite open-ended. And what I mean is, Transformers are now the ubiquitous backbone of all AI technology, and it's a huge achievement that you're involved in that. But I mean, there's a similar story with OpenAI: they're now starting to see all of these commercialization opportunities. They're going to become LinkedIn, they're going to become an application platform, they're going to become a search platform, they're going to become a social network. And I guess this could happen to you guys, and there's a very strong chance, especially with your new paper that we're going to talk about today, this Continuous Thought Machines. It could be a revolutionary technology, but then it will become obvious how it could be commercialized. And that's how those pressures come in. I like the audience capture analogy. I think there's definitely been some kind of capture by large language models, right? They worked so well that everyone wanted to work on them. And I'm really worried that we're kind of stuck in this local minimum now, right? And we sort of need to try to escape it. So we spoke about the Transformers, but there's a time just before the Transformers that I'd like to talk about, because I think it's quite illustrative. So of course, the main technology before Transformers was recurrent neural networks, right? And there was a similar feeling when recurrent neural networks came in and we, you know, discovered this new sort of sequence-to-sequence learning. That was also a massive breakthrough, right? The translation quality went up massively, voice recognition quality went up massively, and there was a similar sort of feeling then of, okay, yes, we've found the technology and we just need to sort of perfect this technology. And back then, my favorite task was character-level language modeling, right? So every time a new RNN-based character-level language modeling paper came out, I got quite excited, right? I'd want to quickly read the paper, like, okay, how did they get the improvements? But the papers were always just these slight modifications on the same architecture, right? It was LSTMs and GRUs, and maybe initializing with the identity matrix so that you could use the ReLU function, or maybe putting the gate in a different place, or layering them in a slightly different way, or having gating going upwards as well as sideways. And I remember one of my favorites was this hierarchical LSTM where it would actually decide to compute or not compute at the different layers.
And if you trained on Wikipedia and you looked at the structure of when it was deciding to compute and not compute, it kind of looked like the structure of the sentences was actually being picked up by the model, and I used to love that sort of stuff, right? But the improvements were always like 1.26 bits per character, 1.25 bits per character, 1.24. That was the result that was publishable, right? That was exciting. But then after the Transformer, the team that I went on to afterwards, we applied for the first time very deep Transformer models, decoder-only Transformer models, to language modeling, and we immediately got something like 1.1, right? Something that was so good that people would actually come to our desk and politely tell us, like, I think you made an error, like a calculation error. Do you think it's nats, not bits per character? And we're like, no, no, it really is the correct number. What struck me later is that all of a sudden, all of that research, and to be clear, very good research, was suddenly made completely redundant. Yes. Right? All of those endless permutations to RNNs were suddenly, seemingly, a waste of time. We're kind of in the situation right now where a lot of the papers are just taking the same architecture and making these endless amounts of different tweaks, of like, you know, where to put the normalization layer and slightly different ways of training them. And we might be wasting the time in exactly the same way, right? Like, I personally don't think we're done, right? I don't think that this is the final architecture and we just need to keep scaling up. There's some breakthrough that will occur at some point. And then it will once again become obvious that we're kind of wasting a lot of time right now. Yeah. So we are a victim of our own success. And this basin of attraction, there are so many basins of attraction. Sara Hooker spoke about the hardware lottery, and this is a kind of architecture lottery. And it actually made me think of the agricultural revolution, which is that this kind of phase change happened, and all of the folks that had these skills that were so necessary, these diverse skills for living and surviving, they died out. And that's actually quite paradoxical, because we need those skills to take the next step. And so we're now in this regime, we've got the term foundation model, and the implication is that you can do anything with a foundation model. In the corporate world we used to have data scientists, you know, they were ML engineers doing these architectural tweaks, even in, you know, mid-size enterprises, and now we just have AI engineers who are just doing prompt engineering and so on. So you're saying that the fundamental skills that we need to be diverse, to think of new solutions and new architectures, they're dying out. I think I'm going to disagree with that.
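As a brief aside on the units in that anecdote: cross-entropy measured in nats converts to bits by dividing by ln 2, which is why the "is it nats?" question mattered. A quick sanity check, using only the numbers mentioned above (plain Python, for illustration):

```python
import math

best_rnn_bpc = 1.24        # a typical publishable RNN-era result, bits per character
reported_figure = 1.1      # the figure Llion's team reported

# If 1.1 had been nats per character, it would convert to a worse bits figure:
as_bits_if_nats = reported_figure / math.log(2)
print(f"1.1 nats/char = {as_bits_if_nats:.3f} bits/char (worse than {best_rnn_bpc})")

# Taken as bits per character, it is a large jump over the RNN numbers:
print(f"1.1 bits/char vs {best_rnn_bpc} bits/char, an improvement of {best_rnn_bpc - reported_figure:.2f} bpc")
```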
I think the problem is we have plenty of very talented, very creative researchers out there, but they're not using their talents, right? For example, you know, if you're in academia, there's pressure to publish, right? And if there's pressure to publish, you think to yourself, okay, well, I have this really cool idea, but it might not work, it might be too weird, right? It might be difficult to get it accepted because I have to sort of sell the idea more. Or I can just try this new position embedding, right? The problem is that the current environments, both in academia and in companies, are not actually giving people the freedom that they need to do the research that they probably want to do. I mean, there's also this interesting thing that, even in spite of great new research, I mean, I was speaking to Sepp Hochreiter and he's got all of these new architectural ideas, and OpenAI aren't implementing them. I mean, Google are doing this diffusion language model, which is quite cool. And I'd like to know your opinion on why that is. So there are a few philosophies floating around, like this concept of a universal representation, that there are universal patterns and the Transformer representations resemble those in the brain. And it's rather led to this idea of, well, we don't need to use different architectures, because if we just have more scale and more compute, then all roads lead to Rome, so why would we bother doing it any differently? There's actually better, right? There are actually already architectures that have been shown in the research to work better than Transformers. Okay, but not better enough to move the entire industry away from such an established architecture, where you're familiar with it, you know how to train it, you know how it works, you know how the internals work, you know how to fine-tune them. You have all this software that's already set up for training Transformers, fine-tuning Transformers, inference. So if you want to move the industry away from that, being better is not good enough. It has to be obviously, crushingly better. Transformers were that much better over RNNs. With Transformers, you just applied them to a new problem, and it was just so much faster to train and you just got such higher accuracy that you just had to move. And I think the deep learning revolution was also another example of that, right? Where you had plenty of skeptics, and people were pushing neural networks even back then, and people were going, no, we think symbolic stuff will work better. But then they demonstrated it as being so much better that you couldn't ignore it. And this fact makes finding the next thing even harder, right? That's the gravitational pull, always pulling you back to, oh, okay, but a Transformer's good enough. And yeah, you made a cool little architecture over here that, yeah, it looks like it's got better accuracy, but OpenAI over here just made it 10 times bigger and it beats that. So let's just keep going. May I also submit that there could be an additional reason, which is, you know, I love that fractured and entangled representations paper. There's this shortcut learning problem. And I think that there's a little bit of a mirage going on here. And there might be problems with these language models that we don't, you know, that we're not fully aware of. And there's also this thing that we're seeing, that we are starting to bastardize the architecture. So we know we need to have adaptive computation for reasoning. We know we want things like uncertainty quantification.
And what we're doing is we're bolting these things on top, rather than having an architecture which intrinsically does all of these things that we know we need. Yeah. And I think our Continuous Thought Machine is an attempt at addressing those more directly, right, which Luke will be able to tell you more about later. There's something still not quite right with the current technology, right? I think the phrase that's becoming popular is jagged intelligence, right? The fact that you can ask an LLM something and it can solve literally, like, a PhD-level problem, and then, you know, in the next sentence, it can say something just so clearly, obviously wrong that it's jarring. Right. And I think this is actually a reflection of something probably quite fundamentally wrong with the current architecture, as amazing as they are. The current technology is actually too good. Okay. Another reason why it's difficult to move away from them, right? So they're too good in the following sense. And you spoke about the fact that we have these foundation models, okay, so that we have the foundation and we can do anything with them. Yes. I think current neural networks are so powerful that if you have enough patience and enough compute and enough data, you can make them do anything. But I don't necessarily think that they want to. We're sort of forcing them. They're universal approximators. But I think there is probably a space of, you know, function approximators that will more want to represent things in the way that a human represents them. So there's actually quite an obscure paper that is my poster child for this. It's called Intelligent Matrix Exponentiation. And I think it was actually rejected. So, you know, you can probably project the image of its figure one, but there's an image of it solving, you know, the classic spiral dataset, where you need to separate the two classes in the spiral. And it shows the decision boundary for both a classic ReLU multilayer perceptron and a tanh multilayer perceptron. And you can see they both solve it, right? Technically, they both solve the problem, because they classify all the points correctly and get a very good test score on this very simple dataset. And then they show you the decision boundary for the M-layer that they build in this paper. And it's a spiral. The layer represented the spiral as a spiral. Shouldn't we? You know, if the data is a spiral, shouldn't we represent it as a spiral? And then if you look back at the decision boundaries for the spiral and the classic ReLU multilayer perceptron, it's clear that you just have these tiny little piecewise linear separations. And that's what I mean. Yes, if you train these things enough and you push these little piecewise linear boundaries around enough, it can fit the spiral and get a high accuracy. But there's no feeling, when I look at that image, that the ReLU version actually understands that it is a spiral, right? And when you represent it as a spiral, it actually extrapolates correctly, because the spiral just keeps going out. You're touching on something fascinating there because, you know, we were talking about the need for adaptivity and adaptive computation. I'm really inspired by Randall Balestriero's spline theory of neural networks, and we've had him on many times. And you can look on the TensorFlow Playground.
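For readers who want to reproduce the picture being described: a ReLU MLP can classify the classic two-spirals dataset almost perfectly while its decision boundary remains a patchwork of small linear pieces. Here is a minimal sketch; the dataset generation and architecture are our own illustrative choices, not taken from the Intelligent Matrix Exponentiation paper:

```python
import math
import torch
import torch.nn as nn

def two_spirals(n=2000, noise=0.05):
    """Generate the classic two-spirals toy dataset."""
    t = torch.linspace(0.25, 3.0, n // 2) * (2 * math.pi)
    x1 = torch.stack([t * torch.cos(t), t * torch.sin(t)], dim=1)
    x2 = -x1  # the second spiral is the first rotated by 180 degrees
    x = torch.cat([x1, x2]) / t.max() + noise * torch.randn(n, 2)
    y = torch.cat([torch.zeros(n // 2), torch.ones(n // 2)]).long()
    return x, y

x, y = two_spirals()
mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 2))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
for step in range(5000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(mlp(x), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = (mlp(x).argmax(1) == y).float().mean()
print(f"train accuracy: {acc:.3f}")
# High accuracy, yet probing a grid of points and plotting the predictions shows
# the tiny piecewise-linear segments Llion refers to, not a spiral-shaped boundary.
```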
You can look at what happens when you have a ReLU network on, you know, this spiral manifold, and you'd be forgiven for thinking that these things are basically a locality-sensitive hashing table, right? Because they partition the space, and they can predict this spiral manifold, right? But we want to do something a little bit different from that. And it also comes into this imposter thing, because just tracing the spiral manifold but not continuing the pattern, there's a big difference between those. So from an imposter perspective, just tracing the pattern is not learning it abstractly or constructively, right? If we learned it constructively, so, you know, you speak about this in your paper, this complexification, the abstract building blocks, and you can do adaptive computation, you understand the spiral. That means that with adaptive computation you can continue the spiral, and then you can update the model's weights, so it has adaptivity, because that's so important for intelligence. So we know that we need models that can do these things, but for some reason they're so sycophantic that they're almost better than an adaptive intelligence system, because they tell us exactly what we want to hear. They seem so intelligent, but we know that they're missing these fundamental properties. I'm still fairly sceptical when I see video generation models. You know, we went through a phase where you could detect them because of the number of fingers on somebody's hand, right? And yes, with more data, with more compute, with better training tricks, okay, they've fixed it, and now they usually do have five fingers. But did we fix the problem, or did we just use more brute force to just, you know, force the neural network to know it's five fingers, where something that actually had a much better kind of representation space might not need that? It's almost mad that it's controversial to say that we should represent a spiral like a spiral. But, you know, something that could do that generally, that if it represented a human hand the way that, you know, maybe I represent a human hand, then maybe it would be much easier to count how many fingers are on a hand. And it's unfortunate that they work so well. It's unfortunate that scaling works so well, because it's too easy for people to just sweep these problems under the carpet. You guys have possibly created what I think might be the best paper of the year. This could actually be the innovation which takes us to the next step. And did you get the spotlight at NeurIPS as well? Yeah, we did. And congratulations on that. So I think that's testament to how amazing this paper is. The CTM, the Continuous Thought Machine, it's actually not that far outside of the local minimum that we're stuck in, right? It's not as if we went and found this completely new technology, right? We took quite a simple biologically inspired idea, right, of the fact that neurons synchronize. And not even necessarily in a biologically plausible way, right? Brains don't literally have all their neurons wired together in a way that they work out their synchronization. But it's the sort of research that I want to encourage people to do, and the way to sell it is quite easy, I think. At no point did we have to worry about being scooped, right? That stress was taken away from us completely. So there was no pressure to sort of rush out with this idea, no feeling of, well, there's probably somebody else working on exactly this.
And I think the reason that we were able to get a spotlight is because we were able to create such a polished paper. We took the time to do the science properly, to get the baselines that we wanted and do all the tasks that we wanted to try. Encouraging researchers to take a little bit more of a risk, right, to try these slightly more speculative long-term ideas: the sad thing is, I don't think it's necessarily a very difficult thing to sell. And I want to have the CTM as, like, a poster child of: it works, right? It was a bit of a risk. We didn't know if we were going to find something interesting, but, you know, it was our first shot. And we did find something interesting and it became a successful paper. If we do find a system which can acquire knowledge, design new architectures, do the open-ended type of science that you're speaking to, can you see a future where at some point the locus of progress will be mostly driven by the models themselves? I think so. Whether or not that's going to replace us completely, I go back and forth on. Powerful algorithms are already helping us do research, right? And I think it might just end up being a more powerful version of that, right? So, I know the AI Scientist that we released, we showed that you could actually go end to end, right? Go from seeding the system with an idea for a research paper and then just take your hands off and let it go think about the idea, write the code, run the code, collect the results and write the paper, to the point that we were actually able to get a 100% AI-generated paper accepted to a workshop recently. But I think we did that to show that you could do it, as a sort of demonstration. In a real system, I think I would want it to be much more interactive, right? I want to be able to seed it with an idea and then have it come back with more ideas, have a discussion with me, then go away to write the code. I want to look at the code and check it, and then discuss the results as they're coming out. So that's the sort of near-term future that I would envision, or how I would like to do research with an AI. And could you introspect on that? Is it because you feel we need supervision because the models don't yet understand? You know, there's this path dependence idea. So we need to do supervision because we have the path dependence, so we can guide the generation of the language models. Maybe in the future, the language models will just understand better themselves. But there's also the output dimension, which is that we want to produce artifacts that extend the phylogeny of human interest. We want it to be human-relevant. Yeah, I think it's more that, you know, in that initial seed idea, it's probably impossible to actually describe exactly what you want. It's exactly the same with, you know, when I have an intern. I can't just have an intern come into the company and go, I have this mad idea, and then just explain it to them and leave them alone for four months. There's a back and forth, because I have a particular idea that I want to explore and I need to keep steering them in the direction that I, you know, that I had in my mind originally. So I think it's more like that, basically. You have such a deep understanding. So you have this rich provenance and history and path dependence. And that means you can take creative steps, intuitive steps, for you, respect the phylogeny. They respect all of this deep, abstract understanding that you have. And interns don't yet have that.
But maybe AI models in the future will have that. Yeah, sure. If they get to the point where my input becomes detrimental, then yeah, that'll be a thing. It's kind of like chess, right? There was a point at which a chess engine and human fusion actually beat chess engines. That's not true anymore, right? Adding a human into the mix actually makes the bots worse. Oh, interesting. I wasn't aware of that. Yeah. So what to do when that day comes for AI scientists is a broader discussion, I think. I think now is a good segue to talk about this paper in a little bit more detail. So this Continuous Thought Machines, you were just pointing to it before. Luke, first of all, introduce yourself and set this thing up for us. My name's Luke. I am a research scientist at Sakana AI, and my primary area of research is the Continuous Thought Machine. It took us somewhere in the region of about eight months working on this project with the whole team. I did a lot of the work, but we also had a lot of people in different areas doing different parts of it. I think an eight-month life cycle for a paper seems a bit long for AI research at the moment. But yes, to the actual technical points of the paper. So we call it Continuous Thought Machines. It originally had a different name. We called it Asynchronous Thought Machines before, but every single time people asked us what the asynchronous part was, it became a bit confusing. So the Continuous Thought Machine basically depends on three novelties. The first one is having what we call an internal thought dimension. And this is not necessarily something new. It's related conceptually to the ideas of latent reasoning. And it's essentially applying compute in a sequential dimension. And when you start thinking about ideas and problems in this domain and in this framework, you start understanding that many of the solutions to problems that look intelligent are solutions that have a sequential nature. So, for instance, one of the primary tasks that we tested in the Continuous Thought Machines was this maze solving task. And solving mazes for deep learning is quite trivial. It's really easy to do if you make the task easy for machines. And one of the ways to do this is you give an image of a maze to a neural network, like a convolutional neural network, and it outputs an image, the same size as the maze, with zeros where there isn't a path and ones where there is a path. There's some really brilliant work showing how you can train these in a careful way and scale them up essentially indefinitely. And this is a fascinating, really interesting idea of how to solve this. However, when you take that approach out of the picture and you ask what is a more human way to solve this problem, it becomes a sequential problem. You have to say, well, go up, go right, go up, go left, whatever the case may be, to trace a route from start to finish. And when you constrain that simple problem space and you ask a machine learning system to solve it like that, it turns out to actually get much, much more challenging. So this became our hello world problem for the CTM, and applying an internal sequential thought dimension to this is how we went about solving it. Two other novelties that we can touch on and talk about. We sort of rethought the idea of what neurons should be. There is a lot of excellent research in this world, in cognitive neuroscience, particularly exploring how neurons work in biological systems.
And then we get, on the other side of the scale, how deep learning neurons work, for which the quintessential example is a ReLU. It's off or on, in a sense. And this very, very high-level abstraction of neurons in the brain feels a little bit myopic. So we approached this problem and said, well, let's, on a neuron-by-neuron basis, let this neuron be a little model itself. And this ended up doing a lot of interesting work on how to build dynamics in the system. The third novelty here is, as I said before, we have this internal dimension over which thinking happens. We ask the question, well, what is the representation? What is the representation for a biological system when it's thinking? Is it just the state of the neurons at any given time? Does that capture a thought, if you wish? If I can be controversial and use the terms thinking and thought. And my philosophy with this is no, it doesn't. The concept of a thought is something that exists over time. So how do we capture that in engineering speak? Instead of measuring the states of the model that is recurrent, we measure how it synchronizes, how neurons synchronize in pairs along with other neurons. And this opens up the door to a huge array of things that we can do with this type of representation. You were talking about this sort of sequential nature of reasoning and, devil's advocate, I mean, there was that Anthropic biology paper and they were talking about planning and thinking, and they were saying that this thing is planning ahead. Because I think your system actually, we can say it does planning. It's actually different computationally. Can you explain that? Yes, I think the boundary in terms of computation from a Turing machine perspective, if you wish, is really interesting, because the notion of being able to write to your tape, read from that tape, and then write again, to be a Turing-complete system, is obviously an incredible idea that has completely changed the world. And I think the primary difference with, let's talk about Transformers versus what we're trying to do with the CTM, is that the process that the CTM thinks in, we can apply that process, that internal process, to breaking down a problem. So the problem itself can have a single solution, and you could produce it in one shot. You could, as I explained with the maze, just process that in one shot. But there are certain phrasings of problems, real problems, where doing so becomes exponentially more challenging. So in the maze task, a really good example is that if you try to predict 100, 200 steps down the path in one shot, no models that we could train, not even our model, could do that. And we needed to actually build an auto-curriculum system where the model first predicted the first step. And then when it could predict the first step, then we started training it on the second and third and fourth step. And the resultant behavior of this is where it gets interesting. One of the ways that I like to do research, and that I encourage people who work with me to do research, is to understand, if you wish, the behavior of a model. We are getting to a point now where the models that we build are demonstrably intelligent in ways that keep surprising us. And breaking that down into a single set of metrics, or even a finite single metric about performance, seems maybe not to be the right way to do it, for me.
Understanding the behavior and the actions that those models take when you put them in a system and train them in a certain way seems to reveal more about what's actually going on under the hood. Very cool. And I think I didn't pick up on this, so you're doing a fixed number of steps, so you have like a context window, and did you say that you've set that around 100 steps? So for the maze task, the model always observes the full image at every step; the CTM will observe the full image. For argument's sake, those images could be tokens from a language, the output of a language model. Those inputs could be numbers that the model has to sort, whatever the case may be. It should be agnostic to data. That's how we've tried to build it. But in the maze task, the model can continuously just observe the data. It can look at the whole image simultaneously, but it uses attention to retrieve information from the data. And it has, let's call it, 100 steps that it can think through. And what we do is we pick up, at some point, that the model solves three steps through the maze. So it says, I'm going to go up, up and right. And it's correct, but then it makes the wrong turn. At that point, we stop supervision. We only train it to solve the fourth step. So one more than what it could. In practice, we do five, but the principle holds. And when you do that, it's a self-bootstrapping mechanism. And I think the intuitive listener will understand how that extends to other domains, other sequential domains, for instance, like language prediction, many tokens ahead, that sort of thing. So I'm really interested in this idea of adaptive computation. So I guess the first question is, how sensitive was the performance to the number of steps? And then the next question would be, could you have an arbitrary number of steps, which means that, you know, perhaps based on uncertainty or some kind of criterion, you could do fewer steps? And then the final question is, could you have potentially an arbitrary or unbounded number of steps? Yeah, really super question. I think I'll answer the uncertainty question first, about the sensitivity to steps. So a very good example of this is when we trained the model on ImageNet classification. And our loss function is quite simple. What we do is we run it for, for example, 50 steps, and we pick out two distinct points. The first one is where it is performing the best, i.e. where the loss is the lowest. And the second one is where it is most sure, or where it is most certain. And those give us two indices, between 0 and 49 inclusive. And we apply cross-entropy at both of those points. We just make the loss the average of the cross-entropy at those points. So what this does is it induces a behavior where easy examples are solved almost immediately, in one or two steps, whereas more challenging examples will naturally take more thinking. And it enables the model to use the full breadth of time that it has available to it, just in a natural fashion, without having to force it to happen. So you've decided to model every neuron as an MLP, which is really fascinating, so let's talk about that. But also there's this notion of synchronization. And I think you use the inner product to determine the extent to which the neurons are synchronized. And this kind of unfurls over time as the driving force. Can you explain that in a bit more detail? Absolutely.
I think it's a good point to explain the neuron-level models, as we call them in the paper, or NLMs, first, because it ties into this. So you can imagine a recurrent system as a state vector, a state vector that is being updated from step to step. We track that state vector, and that state vector unfolds, and for each individual neuron, each i-th neuron in the system, we have an unfolding time series. It's a continuous time series. Well, it's discrete in time, but it's continuous-valued. And those time series define what we call the activations over time. And synchronization is quite simply just measuring the dot product between two of these time series. So you have a system of D neurons, and essentially you have roughly D squared over 2 different synchronization pairs. So neuron 1 can be related to neuron 2 by how they synchronize, and neuron 1 can also be related to neuron 3, etc., etc. The neuron-level models function by taking in a finite history, like a FIFO queue, of activations coming in. And instead of being just a ReLU activation, they use that history as information to produce a single activation out. And that is what moves from what we call pre-activations to post-activations. And the principle here is that this might seem rather arbitrary. And does it help for performance? It turns out it does. But that's not really the catch-all solution here. That's not what we're after. What we're after here is trying to do something biologically plausible: find the line somewhere between biology, which is how the brain implements things in the biological substrate that we have, versus deep learning, which is highly parallelizable, super fast to learn, backpropagation-friendly, all of the nice properties that have got us this far, and find a line somewhere where we can take some sprinkling of biological inspiration but still train it with deep learning. And it turns out that neuron-level models are a nice interim that we can do this with. The concept of synchronization is applied on top of the outputs of those neuron-level models. So on the scaling, I think the time complexity is quadratic with respect to the dimension of the synchronization matrix, right? And in your paper, you were talking about subsampling to improve the performance, but how did that affect the stability, and were there any things that it cost you, doing that? Yeah, that's a neat question. I think in terms of stability, what we found was kind of fun, and this was a sentiment that we had throughout the experiments that we ran for this paper: no matter what we tried it on, it just kind of worked with all spreads of hyperparameters. And the problems that you have with backprop through time, typically with recurrent models like RNNs and LSTMs, it's a challenge. You run for many internal ticks with the RNNs or the LSTMs and the learning seems to break down. But the fact that we use synchronization in some sense touches all of the neurons through all of the time, so it really helps with gradient propagation. A nice interesting point that's maybe a bit oblique to what you asked about synchronization is: we have a system of D neurons, and like I said earlier, there are roughly D squared over 2 possible combinations. This essentially means that our underlying state, our underlying representation of the system, is quite a lot larger than what you would get by just taking those D neurons.
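A rough sketch of the neuron-level models as Luke describes them: each neuron keeps a short FIFO history of its incoming pre-activations and passes that history through its own tiny MLP to produce a single post-activation. This is an illustrative reading of the verbal description above, not the reference implementation; the names and sizes are ours.

```python
import torch
import torch.nn as nn

class NeuronLevelModels(nn.Module):
    """Each of the D neurons owns a small MLP over its last M pre-activations."""
    def __init__(self, d_neurons=256, history_len=16, hidden=8):
        super().__init__()
        self.history_len = history_len
        # One independent set of weights per neuron, applied to that neuron's own history.
        self.w1 = nn.Parameter(torch.randn(d_neurons, history_len, hidden) * 0.02)
        self.b1 = nn.Parameter(torch.zeros(d_neurons, hidden))
        self.w2 = nn.Parameter(torch.randn(d_neurons, hidden) * 0.02)
        self.b2 = nn.Parameter(torch.zeros(d_neurons))

    def forward(self, pre_history):
        # pre_history: (batch, D, M) = FIFO queue of the last M pre-activations per neuron
        h = torch.einsum("bdm,dmh->bdh", pre_history, self.w1) + self.b1
        h = torch.relu(h)
        post = torch.einsum("bdh,dh->bd", h, self.w2) + self.b2
        return post  # (batch, D) post-activations for this internal tick

nlm = NeuronLevelModels()
post = nlm(torch.randn(4, 256, 16))  # batch of 4, 256 neurons, history of 16 ticks
```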
And as to what that means in terms of downstream computation and performance, and the things that we can do with this, is what we're actively exploring right now. You guys used an exponential decay rate? You have the system that unfolds over time. It would be maybe a little bit too constrained if the synchronization between any two neurons depended on the same time scale. So for instance, there are neurons in your brain that are firing over very long time scales and very short time scales. The way that they fire together impacts other neurons and causes those neurons to fire. But everything in biological brains happens at diverse timescales. It's why we have different brain waves for different thinking states, for instance. But besides that point, what we do with the exponential decay in the Continuous Thought Machine is it allows, with a very sharp decay, for us to say that for these two neurons that are pairing together, what only really matters is how they fire together right now. But if we had a very long and slow decay, essentially that's capturing a global sense of how those neurons are firing over an extremely long period of time. So this was essentially a way of us capturing this idea of how different neurons could maybe fire together very quickly and other neurons can fire together very slowly or not at all. And this lets that representation space that I spoke about, that roughly D squared over 2 representation space, again become more rich, and we can enrich that space with more subtle tweaks to how we compute those representations. So we were speaking about this yesterday, Luke, that when folks apply Transformers to things like the ARC challenge, or things that need reasoning, we need to do lots of domain-specific hacks. So the ARChitects, who were the winners of last year's challenge, they did depth-first search sampling. And some folks have been experimenting with using language representations or using DSLs. And some part of this is to do with the reachability of language, right? And language is quite dense, which means you can kind of monotonically increase. But if I understand correctly, your system might have some interesting properties for reasoning, for discrete and sparse domains, and also for sample efficiency, because we want to build a system that can actually do well on things like the ARC challenge. But can you explain in simple terms why you think this architecture could be significantly better than Transformers for doing those things? I think a lot of the really fascinating work in the last few years in the literature of language models has been related to what one can actually call a new scaling dimension. I, in some sense, see chain-of-thought reasoning as a way of adding more compute to a system. That's obviously just one small part of what that really is and what it really means. But I think it's quite a profound breakthrough in some sense. Now, what we're trying to do is have that reasoning component be entirely internal, yet still running in some sort of sequential manner. And I think that that's rather important. And you spoke earlier about Gemini's diffusion language modeling. And I think that there are a lot of different directions exploring this right now. I do think that the Continuous Thought Machine, with the ideas of synchronization and multi-hierarchical temporal representations, gives a certain flexibility in that space that other people are not yet exploring.
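And a companion sketch of the synchronization representation with the exponential decay just described: for each pair of neurons, the post-activation traces are multiplied together and summed over the internal ticks, with a per-pair decay rate deciding whether the pair is measured over a short or a long window. Again, this is our reading of the conversation; the paper's exact formulation and normalisation may differ.

```python
import torch

def decayed_synchronization(z, decay_rates):
    """
    z:           (T, D) post-activation traces over T internal ticks for D neurons
    decay_rates: (D, D) non-negative decay r_ij; r near 0 means 'measure over all time',
                 large r means 'only care about how the pair fires right now'
    returns:     (D, D) synchronization matrix (use the upper triangle, ~D^2/2 pairs)
    """
    T, D = z.shape
    # Weight each past tick by exp(-r * age): recent ticks count more when r is large.
    ages = torch.arange(T - 1, -1, -1, dtype=z.dtype)           # (T,) age of each tick
    weights = torch.exp(-decay_rates.unsqueeze(-1) * ages)      # (D, D, T)
    outer = torch.einsum("ti,tj->ijt", z, z)                    # pairwise products per tick
    sync = (weights * outer).sum(-1) / weights.sum(-1).clamp_min(1e-8)
    return sync

z = torch.randn(50, 128)          # 50 internal ticks, 128 neurons
r = torch.rand(128, 128) * 2.0    # assumed (possibly learnable) decay rates
S = decayed_synchronization(z, r)
```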
And that richness of that space, being able to project the next step to solve the ARC challenge, and the next 100, the next 200 steps, to be able to break that down into a process that a model can then very quickly search in its high-dimensional latent space, becomes something that feels like a good approach to take. Do you see any relationship between this architecture and, you know, Alex Graves' Neural Turing Machine? Yes, that's really interesting. I do. I think that one of the most challenging parts about working with a Neural Turing Machine is the concept of writing to memory and reading from memory, because it is a discrete action, and that has its own challenges associated with it. And yes, I wouldn't go so far as to say that the Continuous Thought Machine is definitively Turing complete, but the notion of doing reasoning in a space that is latent, and letting that space unfold in a way that is rich, points towards a different set of tasks. And this actually brings me to a point that I find quite interesting that I'd like to share with you. Consider again the ImageNet task, or any sort of classification task. It's a nice test bed. There are many images that are really easy, and there are many images that are really difficult. When we train, for instance, a ViT or a CNN to do this task, it has to nest all of that reasoning in the same space. It has to put all of its decision-making process, for a very simple, obvious cat versus some complex, weird, underrepresented class in that dataset, and nest it all in parallel, in a way where we get to the last layer and then we classify. I think breaking that down, where you have different points in time where you can say, now I'm done, I can stop, versus now I'm done, I can stop, lets you take a dataset, or take a task, and actually naturally segment it into its easy-to-difficult components. And I think we know that curriculum learning, and learning in this continuous sense, again, seems to be a good idea. It's how humans learn. And if we can get at that architecturally and just have that fall out of the model, again, this seems like something worth exploring. I'm not sure if you know much about model calibration and how neural networks tend to be poorly calibrated. Well, go for it, Tommy. It's a bit of an old finding, but if you train a neural network for long enough and it fits really, really well and you've regularized it really, really well, you'll find that the model is uncalibrated, which essentially means that it is very certain about some classes where it's wrong and uncertain about some classes where it's correct. Essentially, what you want for a perfectly calibrated model is: if it predicts with 50% probability that this is the correct class, then 50% of the time you want it to be correct about that class, and so on and so forth. So for a well-calibrated model, if it's predicting a probability of 0.9 that it is a cat, then 90% of the time it should be correct. And it actually turns out that most models that you train for long enough get poorly calibrated, and there are loads of post hoc tricks for fixing this. We measured the calibration of the CTM after training and it was nearly perfectly calibrated, which is again a little bit of a smoking gun that this actually seems to be probably a better way to do things.
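For readers who want to check a calibration claim like this on their own models, the standard diagnostic is expected calibration error: bin predictions by confidence and compare each bin's average confidence with its accuracy. A minimal, generic sketch (not Sakana's evaluation code):

```python
import torch

def expected_calibration_error(probs, labels, n_bins=15):
    """probs: (N, C) predicted class probabilities; labels: (N,) true class indices."""
    conf, pred = probs.max(dim=1)
    correct = (pred == labels).float()
    bins = torch.linspace(0, 1, n_bins + 1)
    ece = torch.tensor(0.0)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # |average confidence - accuracy| in this bin, weighted by bin size
            ece += mask.float().mean() * (conf[mask].mean() - correct[mask].mean()).abs()
    return ece  # 0 means perfectly calibrated: 90% confidence implies right 90% of the time
```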
The flavor of this kind of research is such that we didn't actually go out and try to create a very well-calibrated model, right? And we didn't even try to create a model that was necessarily going to be able to do some kind of adaptive computation time. I was a very big fan of the paper Adaptive Computation Time, by Alex Graves, was it? But that paper had a massive amount of hyperparameter sweeps in it, because in that paper he needed to have a loss on the amount of computation that was being done. Because any time you try to do some sort of adaptive computation time research, what you're fighting is the fact that neural networks are greedy, right? Obviously the way to get the lowest loss is to use all the computation that you have access to. So you needed an extra loss with a penalty that said, okay, actually you're not allowed to use all the computation, and only with a very, very carefully balanced loss did you actually get the interesting adaptive computation time behavior falling out of the model in that paper. What was really gratifying to see with the Continuous Thought Machine is that, because of the way that we set up the loss that Luke described earlier, adaptive computation time seems to just fall out naturally.
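A sketch of the loss Luke described earlier, which Llion credits with making adaptive compute fall out naturally: compute a loss and a certainty measure at every internal tick, then average the cross-entropy at the tick with the lowest loss and the tick with the highest certainty. The details here, such as using negative entropy as the certainty measure, are our assumption from the conversation rather than a copy of the paper's code:

```python
import torch
import torch.nn.functional as F

def ctm_style_two_point_loss(logits_per_tick, target):
    """
    logits_per_tick: (T, C) class logits produced at each of T internal ticks
    target:          scalar tensor holding the true class index
    """
    T = logits_per_tick.shape[0]
    losses = torch.stack([F.cross_entropy(logits_per_tick[t:t + 1], target.view(1))
                          for t in range(T)])                       # loss at every tick
    probs = logits_per_tick.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)    # low entropy = most certain
    t_best = losses.argmin()   # tick where the loss is lowest
    t_sure = entropy.argmin()  # tick where the model is most certain
    return 0.5 * (losses[t_best] + losses[t_sure])

loss = ctm_style_two_point_loss(torch.randn(50, 10), torch.tensor(3))  # 50 ticks, 10 classes
```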
So that's more the way that I think research should go, okay? Because we don't actually have, like, a specific goal, a specific problem we're trying to fix like that, or something we're trying to invent. It's more that we have this interesting architecture and we're just following the gradients of interestingness. Yes. And on that point, I think maybe the most exciting thing about your paper is, you know, we were talking about path dependence and having this understanding which is built step by step, this process of complexification. And, I mean, maybe this is apropos of the theme of world models in general, and also active inference, and I say active inference in big quotes because it's not Carl Friston's active inference, you know, maybe adaptive inference or something like that. But we want to build agents that can continue to learn, that can update their parameters and, most importantly, can construct path-dependent understanding. Because that's completely different to just understanding what the thing is; how you got there is very important. And this architecture potentially allows these agents, using this algorithm, to explore trajectories in spaces, find the best trajectories, and actually construct an understanding which carves the world up at the joints. Yeah, that's a really neat perspective. I haven't actually thought about it like that, but yes, I think that particular stance becomes really interesting when you think about ambiguous problems, because carving the world up in one way is as performant as carving it up in another way. Yeah. You know, perhaps the hallucination in language models is carving the world up in some fine way, but it's just not performant under our measure, so we call it hallucination, and actually that's not true. But in some other trace down the path of wanting to carve the world up through autoregressive generation of tokens, you end up in a different carve-up of that world. And being able to train a model that can be implicitly aware of the fact that it is actually carving up the world in a different way, and can explore those manners, those descents down the carve-up, is something that we're after. I think it's quite an exciting approach to be trying to take a stance of: let's break up this problem into small solvable parts and learn to do it like that. And how can we do this in a natural way without too many hacks? Yeah, it's something I've been thinking about, because Chollet, as much as I love his measure of intelligence ideas, for him adapting to novelty is getting the right answer, and the reason why you gave that answer is very, very important. And in machine learning we have this problem that we come up with this kind of cost function that rather leads to this shortcut problem. But, you know, we could just build a symbolic system, we could go full GOFAI, and we could say, okay, we need to do this principled kind of construction of knowledge, maintaining semantics. Well, we're not doing that. We're doing a hybrid system. But there must be some natural way of doing reasoning, where in spite of the end objective being this cost function, because of the way that we traversed these open-ended spaces, we can actually have more confidence mechanistically that we're doing reasoning which is aligned to the world.
I think that's a great way of seeing this particular avenue of research. And obviously we're not the only people thinking like this, and we're not the only ones trying to do this. What we have is an architecture that's amenable to it, and surprisingly so. Again, it wasn't the goal. It's not the goal to do this type of research; it's not the goal to be able to break the world down into these small chunks that we can reason over in a way that seems natural. Instead, what we did was pay respect to the brain, pay respect to nature, and say: well, if we build these inspired things, what actually happens? What different ways of approaching a problem emerge? And when those different ways of approaching a problem emerge, what big philosophical and intelligence-based questions can we then start to ask? That's where we're at right now. So it might feel at times, especially for me, like there are too many questions and too few hands to answer them. But the fun, exciting and encouraging thing, and what I'd encourage other younger researchers out there to do, is: do what you're passionate about, figure out how to build the things you care about, and then see what that does, see what doors it opens up, and see how to explore deeper into those domains.

We were talking about this yesterday, weren't we, that you can think of language as being a kind of maze. What is to stop you from taking this architecture and building the next generation of language model with it?

That's honestly, as you know, something that I am actively trying to explore right now. And I think the maze task gets really interesting when you add ambiguity to it, when there are many ways to solve the maze. Honestly, this isn't something I've tried yet, and maybe it's something I should try next week. But you can imagine an agent, or the CTM in this case, observing the maze and taking a trajectory. And surprisingly, we saw this. We have a section in our recently updated paper on arXiv, the final camera-ready version, where we added an extra supplementary section that is not in the main technical report. That supplementary section is basically: hey, we saw this cool stuff happen. We list, I think, fourteen different interesting things that happened while we were doing the research that obviously didn't make it into the paper, but we wanted people to know about these strange things. This is one of them: we watched what was happening during training, and at some point, maybe halfway through the training run, we could see the model start going down one path in the maze and then suddenly realize, oh no, damn, I'm wrong, and backtrack and take another path. Eventually it gets really good, and it does some sort of distributed learning, because it's got an attention mechanism with multiple heads, so it can figure out how to do this pretty well and refine its solution. But early on in learning, it descends multiple paths, comes back and backtracks. We also have a really fascinating set of experiments, with some supplementary material online showing this, and I don't really know what it says, it's kind of a deep philosophical thing: if you're trying to solve a maze but you don't have enough time to think, it turns out there's a faster algorithm to do it.
And this blew my mind when I saw it. If we constrain the amount of thinking time the model has but still get it to try to solve a long maze, then instead of tracing out that maze it quickly jumps ahead to approximately where it needs to be and traces backwards, filling in that path in reverse. Then it jumps forward again, leapfrogs over the top, traces that section backwards, and leapfrogs again. It does this fascinating leapfrogging behavior that is driven by the constraint on the system. Again, this is just an observation we made, and what it means in a deep sense, how it relates to giving a model time to think versus not, and what different algorithms the model learns when you constrain it in this way, I find quite fascinating and an interesting thing to explore. Does it tell us something about how humans think, about how we think under constrained settings versus open-ended settings? There are a number of cool questions you can ask on this front.

You guys are both huge fans of population methods and collective intelligence, and we can scale this thing up and we can scale it out. What would it mean to scale this thing out, not just in the trivially parallel sense, but in terms of having some kind of weight sharing between parallel models and so on? What would that give you?

Potentially, this is a fun area of research. One of the active things we're trying to explore in our team is the concept of memory, long-term memory, and what that means for a system like this. An experiment one can construct, for instance, is to put some agents in a maze and let them try to solve it, not the way we did it in the paper, but in a very constrained setting where an agent can only see, say, a five-by-five region around it. We give that agent some mechanism for saving and retrieving memories, and the task is to solve the maze, find your way to the end. The model needs to learn how to construct memories such that it can get back to a point it has seen before, know that it did the wrong thing last time, and go a different route. You can then extend this to parallel agents in the same maze with a shared memory structure and see what actually happens when they can all access that memory, a shared, global, almost cultural memory, and solve the global task by having many agents use that memory system (a toy sketch of this kind of setup follows below). And I do think that memory is going to be a very key element of what we need to do in the future for AI in general.

So the subject of reasoning came up just a second ago, and I think there's a perception that we've recently made a lot of progress in reasoning, because it's actually one of the main things people are working on. We released a dataset recently called Sudoku Bench, and I was actually quite happy to see it come up organically on your podcast a few weeks ago, with Chris Moore. Right. Yes.
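Returning to the shared-memory maze experiment just described: below is a minimal sketch, under heavy assumptions, of what such a setup could look like. A toy gridworld stands in for the maze, agents have only a limited local view, and a shared dictionary plays the role of the "cultural" memory. None of this is Sakana's code; the maze layout, the policy, and the memory format are placeholders meant only to show the shape of the experiment.

```python
import random

# 0 = open cell, 1 = wall. GOAL is the target cell agents are trying to reach.
MAZE = [
    [0, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
GOAL = (3, 3)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def local_view(pos, radius=2):
    # The agent only sees a (2*radius+1)-square window around itself;
    # cells outside the maze are treated as walls.
    r, c = pos
    return tuple(
        tuple(
            MAZE[i][j] if 0 <= i < len(MAZE) and 0 <= j < len(MAZE[0]) else 1
            for j in range(c - radius, c + radius + 1)
        )
        for i in range(r - radius, r + radius + 1)
    )

def step(pos, shared_memory, rng):
    # Shared memory maps a local view to the moves any agent has already tried
    # from that view; untried moves are preferred. A crude stand-in for
    # "another agent went that way before, try a different route".
    view = local_view(pos)
    tried = shared_memory.setdefault(view, set())
    legal = []
    for name, (dr, dc) in MOVES.items():
        nr, nc = pos[0] + dr, pos[1] + dc
        if 0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0]) and MAZE[nr][nc] == 0:
            legal.append((name, (nr, nc)))
    fresh = [m for m in legal if m[0] not in tried]
    name, new_pos = rng.choice(fresh if fresh else legal)
    tried.add(name)
    return new_pos

def run(n_agents=4, n_steps=100, seed=0):
    # All agents read and write one memory structure -- the shared "cultural" memory.
    rng = random.Random(seed)
    shared_memory = {}
    positions = [(0, 0)] * n_agents
    for t in range(n_steps):
        positions = [step(p, shared_memory, rng) for p in positions]
        if any(p == GOAL for p in positions):
            return t + 1, len(shared_memory)
    return None, len(shared_memory)

steps_taken, memory_size = run()
print("steps to reach goal:", steps_taken, "| shared memory entries:", memory_size)
```

In a real experiment the random policy would of course be replaced by a learned one, and the memory write would be something the model learns rather than a hard-coded "moves tried" set; the point here is only the shared, queryable structure sitting between parallel agents.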
So I wanted to tell you a little bit about this benchmark, because I've had a bit of trouble promoting it; on the surface it doesn't sound particularly interesting, because Sudoku has a feeling of having already been solved, right? How interesting can a collection of Sudokus be for reasoning? Except we're not talking about normal Sudokus; we're talking about variant Sudokus. A variant Sudoku is usually a normal Sudoku, put the numbers one to nine in each row, column and box, but then with literally any additional rules on top of that. They're all handcrafted, and they all have extremely different constraints, constraints that actually require very strong natural-language understanding. For example, there's one puzzle in the dataset where the constraints are described in natural language, and then it says: oh, by the way, one of the numbers in that description is wrong. So you have to be able to meta-reason about the rules themselves even before you start solving the puzzle. There are other puzzles where a maze is overlaid on the Sudoku and a rat has to find its way through the maze along a path to the cheese, but there are constraints on the path it takes, on which numbers it can pass through and what they can add up to. It's difficult to convey how varied these variant Sudokus are, and I think they're so varied that if anyone were actually able to beat our benchmark, they would necessarily have had to create an extremely powerful reasoning system. Right now the best models get around 15%, and the puzzles they solve are only the very, very simplest and smallest in the set. We're going to be putting out a blog post about GPT-5's performance, and it is a jump, but it's still completely unable to solve puzzles which, you know, humans can solve.

What I really like about this dataset, and what was actually the catalyst for me creating it in the first place, was a quote from Andrej Karpathy: okay, so we have all this data from the internet, but if you wanted AGI, what you would really want is not all the text that humans have ever created; you would want the thought traces in their heads as they were creating that text. If you could learn from that, you would get something really powerful. And I thought to myself: well, that data must exist somewhere. My first thought was maybe philosophy, you know, the kind of writing where you just put down your thoughts as a stream of consciousness; I thought maybe that could work. But then, in my leisure time, I was watching a YouTube channel called Cracking the Cryptic, where these two British gentlemen solve extremely difficult Sudoku puzzles for you. Sometimes their videos are four hours long, and they're professionals, this is their job. And what I realized was perfect is that they tell you, in agonizing detail, exactly what reasoning they used to solve those particular puzzles. So, with their permission, we took all of their videos, which represent thousands of hours of very high-quality human reasoning, thought traces essentially, scraped them, and made that available for imitation learning. We did try to use this internally; it turns out I did a little too good a job of creating a very difficult benchmark.
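A hedged illustration of what "normal Sudoku rules plus arbitrary extra rules" looks like in code: a base validity check for a completed 9x9 grid, with variant rules layered on as additional constraint functions. The example variant rule here (a cage of cells that must sum to a given total, roughly in the style of killer-sudoku cages) is my own illustration, not a puzzle from Sudoku Bench.

```python
def valid_classic(grid):
    # grid: 9x9 list of lists with digits 1..9. Standard rules: every row,
    # column and 3x3 box must contain the digits 1..9 exactly once.
    units = []
    units += [grid[r] for r in range(9)]                            # rows
    units += [[grid[r][c] for r in range(9)] for c in range(9)]     # columns
    units += [
        [grid[br + i][bc + j] for i in range(3) for j in range(3)]  # boxes
        for br in (0, 3, 6) for bc in (0, 3, 6)
    ]
    return all(sorted(u) == list(range(1, 10)) for u in units)

def cage_sum_rule(cells, total):
    # Example handcrafted variant rule: the given cells must add up to `total`.
    def rule(grid):
        return sum(grid[r][c] for r, c in cells) == total
    return rule

def valid_variant(grid, extra_rules):
    # A variant Sudoku solution must satisfy the classic rules plus every extra rule.
    return valid_classic(grid) and all(rule(grid) for rule in extra_rules)

# A standard completed Sudoku grid used only to exercise the checks.
SOLUTION = [
    [5, 3, 4, 6, 7, 8, 9, 1, 2], [6, 7, 2, 1, 9, 5, 3, 4, 8],
    [1, 9, 8, 3, 4, 2, 5, 6, 7], [8, 5, 9, 7, 6, 1, 4, 2, 3],
    [4, 2, 6, 8, 5, 3, 7, 9, 1], [7, 1, 3, 9, 2, 4, 8, 5, 6],
    [9, 6, 1, 5, 3, 7, 2, 8, 4], [2, 8, 7, 4, 1, 9, 6, 3, 5],
    [3, 4, 5, 2, 8, 6, 1, 7, 9],
]
print(valid_classic(SOLUTION))                                           # True
print(valid_variant(SOLUTION, [cage_sum_rule([(0, 0), (0, 1), (1, 0)], 14)]))  # True
```

The real benchmark's rules are far richer (and often stated only in natural language), but the composition pattern, classic rules plus any number of bespoke constraints, is the structural point being made.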
So we're still trying to get that working, and we'll publish it if we have some success. But I want to really sell the fact that this reasoning benchmark is different. Not only do you get something that's super grounded, you know exactly whether an answer is right or wrong, so you can do RL to your heart's content, but you also can't generalize very easily: each puzzle is deliberately designed by hand to have a new and unique twist on the rules, a "break-in" that you have to understand. And right now, despite all the progress we've made, current AI models can't take that leap. They can't find these break-ins. They fall back to: okay, I'll try five, I'll try six, I'll try seven. The reasoning becomes really boring, and nothing like what you see in the transcripts we've open-sourced from this YouTube channel. So I just want to put the challenge out there: this is a really difficult benchmark, and I think progress on it will really mean progress in AI generally.
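The "super grounded, so you can do RL" point can be made concrete with a tiny sketch: a reward that is 1 only when the model's completed grid exactly matches the puzzle's solution, and 0 otherwise. The function names and the idea of comparing against a stored solution are illustrative assumptions, not the Sudoku Bench evaluation code.

```python
def grid_reward(predicted_grid, solution_grid):
    # Binary, fully grounded reward: either the 9x9 grid matches the puzzle's
    # solution or it does not. No partial credit, no learned judge.
    return 1.0 if predicted_grid == solution_grid else 0.0

def cell_accuracy(predicted_grid, solution_grid):
    # A softer diagnostic (not a training signal in itself): the fraction of the
    # 81 cells that are correct, useful for tracking progress between 0 and 1.
    correct = sum(
        predicted_grid[r][c] == solution_grid[r][c]
        for r in range(9) for c in range(9)
    )
    return correct / 81.0
```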
Could you reflect a bit: after watching this Cracking the Cryptic YouTube channel, how diverse were the patterns? Because Chris was saying to me, oh, you know, these guys go on Discord servers and get these creative, crazy ideas. Maybe I'm just being idealistic, but I love this idea of there being a deductive closure of knowledge, that there's this big tree of reasoning and we're all in possession of different parts of the tree to different depths. The smarter and more knowledgeable you are, the deeper down the tree you go. In this idealized form there is one tree, all knowledge originates or emanates from these abstract principles, and we could in principle build reasoning engines that just reason from first principles, though it might be computationally irreducible, so you'd have to perform all of the steps. And it feels like, because we're not in possession of the full tree, what we need to do is fish around: we fish around to find Lego blocks, oh, that's a good Lego block, I can apply that to this problem. Maybe that's just what we need to do in AI for the time being, acquire as much of the tree as possible. But could we just do it all the way down?

Fascinating question. That tree is probably massive, right? And as a human solves these puzzles they're definitely learning in real time and discovering new parts of the tree. It's sort of a meta task, because it's not just reasoning, you're reasoning about the reasoning, and I don't think we have that in AI right now. If you watch the videos, they'll say something like: okay, this looks like a parity task, or this is a set-theoretic problem, or maybe I should get my path tool out and trace this around. And of course the professionals already have this massive collection of reasoning Lego blocks, as you say, in their heads, so they'll recognize: okay, that type of rule usually needs this kind of Lego block. It's actually fascinating to watch how good they are at just intuitively knowing where to look, whereas someone like me, who hasn't solved as many, needs to spend a lot of time looking around: okay, maybe I should try this one, maybe I should try that one. But even they're not perfect, so you can watch them take a certain kind of reasoning, start building it up, okay, maybe we should solve it like this, then go, no, that doesn't disambiguate it enough, backtrack, and go down another path. That's something we do not see current AIs doing when they try to solve this benchmark.

The tree is very big, and I guess the phylogenetic distance between many of these motifs in the tree is just so large that it's difficult to jump between them. I think that's why, as a collective intelligence, we work so well together: we actually find ways to jump to different parts of the tree. And I think that's probably why the current state of the RL algorithms we're trying to apply to this just isn't working: in order to learn how to get these breakthroughs, to understand the nuanced reasoning needed for these puzzles, you have to sample them. And it's such a rare space, such a specific kind of reasoning that's required to get to the specific breakthrough, that this kind of technique doesn't work. And there's definitely a feeling in the community that, okay, this is how you just solve things now.
We have RL, yes, we can get these language models to do what we want; it just doesn't work for this dataset.

Guys, it's been an absolute honour having you on the show. Just before we go, are you hiring? Because we've got a great audience of ML engineers and scientists, and I think working for Sakana would be the dream job.

That's very kind of you. Yes, we are definitely hiring. And as I said earlier in this interview, I honestly want to give people as much research freedom as possible. I'm willing to make that bet; I think very interesting things will come out of it, and we've already seen plenty of interesting things come out of it. So if you want to work on what you think is interesting and important, come to Japan. And Japan just happens to be the most civilized culture in the world.

All right, it might be the opportunity of a lifetime, folks, so get in touch. Guys, seriously, thank you so much. It's been an honour having you both on the show.

Thank you very much. Thank you so much. It's been great.
Related Episodes

The Mathematical Foundations of Intelligence [Professor Yi Ma]
Machine Learning Street Talk
1h 39m

Pedro Domingos: Tensor Logic Unifies AI Paradigms
Machine Learning Street Talk
1h 27m

Why Humans Are Still Powering AI [Sponsored]
Machine Learning Street Talk
24m

The Universal Hierarchy of Life - Prof. Chris Kempes [SFI]
Machine Learning Street Talk
40m

Google Researcher Shows Life "Emerges From Code" - Blaise Agüera y Arcas
Machine Learning Street Talk
59m

The Secret Engine of AI - Prolific [Sponsored] (Sara Saab, Enzo Blindow)
Machine Learning Street Talk
1h 19m