Last Week in AI

#227 - Jeremie is back! DeepSeek 3.2, TPUs, Nested Learning

Last Week in AI • Andrey Kurenkov & Jacky Liang

Tuesday, December 9, 2025 • 1h 34m

What You'll Learn

  • DeepSeek 3.2 is 50% cheaper than comparable models like Anthropic's offerings
  • It performs on par or better than GPT-5 on benchmarks, showing significant capability improvements
  • DeepSeek used a sparse attention mechanism to reduce compute requirements while maintaining performance
  • They scaled up their reinforcement learning compute budget to 10% of total training, enabling more stable and capable RL-driven learning
  • DeepSeek is working towards an even more advanced 'DeepSeek R2' reasoning-focused model

Episode Chapters

1. Introduction

The hosts discuss the return of co-host Jeremie and the busy few months of AI news and model releases they've missed.

2. DeepSeek 3.2 Release

The main focus of the episode is the new DeepSeek 3.2 model release, covering its technical innovations, performance, and positioning in the broader AI landscape.

3. Technical Innovations

The hosts dive into the details of DeepSeek's sparse attention mechanism and reinforcement learning-focused training approach that enable the model's capabilities.

4. Broader AI Landscape

The discussion touches on the importance of pre-training and base models in driving progress towards more advanced AI systems.

AI Summary

This episode covers the release of DeepSeek 3.2, a new version of the open-source language model from DeepSeek. The key highlights include a 50% cost reduction, performance on par or better than GPT-5 on benchmarks, and technical innovations like sparse attention and a reinforcement learning-focused training approach that enables more stable and scalable training.

Key Points

  • DeepSeek 3.2 is 50% cheaper than comparable models like Anthropic's offerings
  • It performs on par or better than GPT-5 on benchmarks, showing significant capability improvements
  • DeepSeek used a sparse attention mechanism to reduce compute requirements while maintaining performance
  • They scaled up their reinforcement learning compute budget to 10% of total training, enabling more stable and capable RL-driven learning
  • DeepSeek is working towards an even more advanced 'DeepSeek R2' reasoning-focused model

Topics Discussed

Large Language Models, Model Architecture, Training Techniques, Model Performance, Open Source AI

Frequently Asked Questions

What is "#227 - Jeremie is back! DeepSeek 3.2, TPUs, Nested Learning" about?

This episode covers the release of DeepSeek 3.2, a new version of the open-source language model from DeepSeek. The key highlights include a 50% cost reduction, performance on par or better than GPT-5 on benchmarks, and technical innovations like sparse attention and a reinforcement learning-focused training approach that enables more stable and scalable training.

What topics are discussed in this episode?

This episode covers the following topics: Large Language Models, Model Architecture, Training Techniques, Model Performance, Open Source AI.

What is key insight #1 from this episode?

DeepSeek 3.2 is 50% cheaper than comparable models like Anthropic's offerings

What is key insight #2 from this episode?

It performs on par or better than GPT-5 on benchmarks, showing significant capability improvements

What is key insight #3 from this episode?

DeepSeek used a sparse attention mechanism to reduce compute requirements while maintaining performance

What is key insight #4 from this episode?

They scaled up their reinforcement learning compute budget to 10% of total training, enabling more stable and capable RL-driven learning

Who should listen to this episode?

This episode is recommended for anyone interested in Large Language Models, Model Architecture, Training Techniques, and those who want to stay updated on the latest developments in AI and technology.

Episode Description

Our 227th episode with a summary and discussion of last week's big AI news! Recorded on 12/05/2025.

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai. Read our text newsletter and comment on the podcast at https://lastweekin.ai/

In this episode:
  • DeepSeek 3.2 and Flux 2 release, showcasing advancements in open-source AI models for natural language processing and image generation respectively.
  • Amazon's new AI chips and Google's TPUs signal potential shifts in AI hardware dominance, with growing competition against Nvidia.
  • Anthropic's potential IPO and OpenAI's declared 'Code Red' indicate significant moves in the AI business landscape, including high venture funding rounds for startups.
  • Key research papers from DeepMind and Google explore advanced memory architectures and multi-agent systems, indicating ongoing efforts to enhance AI reasoning and efficiency.

Timestamps:
(00:00:10) Intro / Banter
(00:02:42) News Preview

Tools & Apps
(00:03:30) Deepseek 3.2: New AI Model is Faster, Cheaper and Smarter
(00:23:22) Black Forest Labs launches Flux.2 AI image models to challenge Nano Banana Pro and Midjourney
(00:28:00) Sora and Nano Banana Pro throttled amid soaring demand | The Verge
(00:29:34) Mistral closes in on Big AI rivals with new open-weight frontier and small models | TechCrunch
(00:31:41) Kling's Video O1 launches as the first all-in-one video model for generation and editing
(00:34:07) Runway rolls out Gen 4.5 AI video model that beats Google, OpenAI

Applications & Business
(00:35:18) NVIDIA's Partners Are Beginning to Tilt Toward Google's TPU Ecosystem, with Foxconn Reportedly Securing TPU Rack Orders
(00:40:37) Amazon releases an impressive new AI chip and teases an Nvidia-friendly roadmap | TechCrunch
(00:43:03) OpenAI declares 'code red' as Google catches up in AI race | The Verge
(00:46:20) Anthropic reportedly preparing for massive IPO in race with OpenAI: FT
(00:48:41) Black Forest Labs raises $300M at $3.25B valuation | TechCrunch
(00:49:20) Paris-based AI voice startup Gradium nabs $70M seed | TechCrunch
(00:50:10) OpenAI announced a 1 GW Stargate cluster in Abu Dhabi
(00:53:22) OpenAI's investment into Thrive Holdings is its latest circular deal
(00:55:11) OpenAI to acquire Neptune, an AI model training assistance startup
(00:56:11) Anthropic acquires developer tool startup Bun to scale AI coding
(00:56:55) Microsoft drops AI sales targets in half after salespeople miss their quotas - Ars Technica

Projects & Open Source
(00:57:51) [2511.22570] DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
(01:01:52) Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Research & Advancements
(01:05:44) Nested Learning: The Illusion of Deep Learning Architecture
(01:13:30) Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
(01:15:50) State of AI: An Empirical 100 Trillion Token Study with OpenRouter

Policy & Safety
(01:21:52) Trump signs executive order launching Genesis Mission AI project
(01:24:42) OpenAI has trained its LLM to confess to bad behavior | MIT Technology Review
(01:29:34) US senators seek to block Nvidia sales of advanced chips to China

See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

Full Transcript

Hello and welcome to the Last Week in AI podcast where you can hear us chat about what's going on with AI. As usual in this episode, we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for articles we will not be covering in this episode. I'm one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at the startup Astrocade. And back with us after hiatus is our other regular co-host. It's nice to be back in the box. Yeah. Hey, everybody. Thank you for your patience while I ducked out for that quick three-month period. Yeah. Wow. What a three months it was. I'm super pumped to be back. This is like something that was missing from my life and schedule in a big way. So it is great to be back in the seat. What a week to be back. DeepSeek 3.2, like all these TPU stories, like insane set of model releases. Not a huge number, but like a couple of really big ones. And then quite a cluster of other things too. So really appreciate everybody being patient. And obviously there were some great podcasts in between. So really thrilled to be back. This is going to be a lot of fun. Yeah. For anyone who has been listening or is going to go back in the archives, it's been an interesting couple of months because I've just wound up rotating in different co-hosts and then like randomly recording every week or two. So I think we are gonna do our best to actually get back to schedule and return to your regularly scheduled programming. And Jeremy, you did miss, I feel like, the last few months was like the last round of frontier model releases: Opus 4.5, Gemini 3, GPT-5.1. I can't remember if it's like Star Wars where they freeze the dude and reanimated him. I'll say Austin Powers, because I know it happened there. But you know, like they thawed me out and it's like, whoa, like all these models happened. So I'll be honest with you, things have been so busy in the last three months that I have not looked into the big frontier releases. I have to go back and see what the hell happened with Anthropic, with OpenAI, with, you know, all these things. So, and obviously Google. So I'm excited to just like learn by osmosis as I get from you. Obviously I'm all caught up and read the papers from last week, but everything before that is this like strange wilderness. So it'll be an interesting catch up time. I hope everybody is in for a lot of like dumb questions from me because that'll be a big part of the show. And we are back to doing 30 news stories an episode. So we'll, as always, try to keep it to an hour and a half. We'll see how it goes. To give a quick preview of what we're talking about, we will be starting with DeepSeek 3.2, a big model release. I guess one of that, like, frontier model release cycle. Beyond that, more actually models for images and video this week. Applications and business, we've got a whole bunch of stuff. Some developments on the hardware front, some new startups raising money, stuff going on with OpenAI and Anthropic, a whole bunch of stuff like we're going to have to race through it. Then we will have a couple of open source stories and research stories, which are just interesting, looking into the developments and a bit into the future of where we might go. And then a few stories on policy and safety to round things out. So should be a fun episode. And let's dive straight in with tools and apps, beginning with DeepSeek 3.2.
This is the new development from DeepSeek, the developers of DeepSeek V3, which is their sort of LLM comparable to GPT and Claude. Then they have DeepSeek R1, of course, which was kicking off 2025 with, at the time, it was a massive deal. They showed basically with an open source model, they could train it to reason as well or comparatively well to O1 and some of these other models at the time where the movement towards reasoning and test time compute scaling was just beginning. So now we've seen all the other labs do that as well, right? Now everything is thinking, everything is reasoning, Opus 4.5, Gemini 3. Gemini 3 doesn't allow you to query it without having some reasoning at this point. And now we have DeepSeek 3.2 and it is, as you might expect, actually quite a big leap. I don't know why they haven't called this DeepSeek 4, I think they could have, but lots of big headline stories here. So first of all, 50% cheaper. It's way cheaper than these other offerings, like compared to Anthropic, for instance. Super affordable. Performs great on the benchmarks. Some of it, you can even say outperforming GPT-5. Just neck and neck and even better in some cases. Some interesting technical stories with the release: as before with DeepSeek, they give us a lot more insight into what they're doing and how they achieved these results compared to the other developers. So we know details like sparse attention, which makes it able to be faster, not just cheaper but also faster. We know some of their training details, like they have refined the reinforcement learning objective, and some of the learnings, for instance, about the optimization method you use and stuff like that. I don't want to dive too deep into it, but this is DeepSeek 3.2. They are working towards DeepSeek R2, which is meant to be like the hardcore reasoning model. So this is not even the reasoning one, this is not like the deep think, this is the Opus 4.5 or, I don't know, whatever you want to call it, the base chatbot model, not the reasoning maxed model. So very exciting for, you know, this is open source again. So people will be able to build on this, fine tune it, et cetera. Yeah, it's actually, you know, I remember, God, I want to say this time last year when DeepSeek came out with V3, the base model at the time. Late December, yeah. Right, yeah. And at the time, actually, we talked about it on the podcast and said, wait for it, there's going to be a big reasoning breakthrough from DeepSeek. People are underpricing DeepSeek right now. And lo and behold, R1 came out. The reason it was so easy to make that call at the time was that your base model is a big determining factor in the quality of the reasoning model you train downstream from that. And if you want to hear it from someone who knows infinitely more than us, just listen to the Dwarkesh podcast with Ilya Sutskever that recently came out where Ilya is making the case, his entire company is based on this assumption that there's something wrong with pre-training. That's the reason that we're hitting, in his view, not a full-on wall, like things will keep improving, he thinks, but that ultimately they won't improve all the way to ASI, AGI, whatever your definition is, unless there's something fundamental that's changed with pre-training. So pre-training is a big part of the story. In fact, it's a huge part of the story. This is a new base model. To your point, it is a bit weird that we're not calling it the, you know, four, especially since it was R1 and now we're doing R2.
So they are incrementing the integer on the R version, but not the base model, which is weird. But anyway, we'll set that aside. They get a pass on this. So there are a couple of big things they do here. DeepSeek sparse attention is one of the big breakthroughs. I'm just going to quickly gloss over this. This is going to be probably the one paper that we do a kind of deeper dive on just because of how much there is and how much transparency we're getting into the makings of this because of course it is DeepSeek. So one piece here is DeepSeek sparse attention. Okay, what happens when you do a normal transformer architecture? You have your attention mechanism that's going to pay attention to all of the tokens in the input sequence, and then it's going to figure out basically which tokens need to be essentially accounted for or weighted more heavily based on the relevance in the sequence for the token that you're currently attending to. What they're doing here is they're saying, well, this is a very expensive calculation. You need to compute attention weights accurately for all tokens. And why don't we train instead a lightweight indexer just to get a rough idea of the attention scores, an approximate attention score, and only keep tokens with high indexer scores. In other words, tokens that we are approximating to have or expect to have high attention values. And then you just toss out all the tokens that don't fit that criterion. And this indexer, by the way, is very lightweight. It has a lot fewer, essentially, attention heads. It's got around 8 to 16 instead of 128, which is what you see for the full-on attention mechanism. It's lower precision, so FP8 instead of FP16 or BF16 for the full-scale thing. So all these things are being done to make it super, super lightweight, very quick approximation. Which tokens should I pay attention to? Ditch the rest and keep just those. And so what they end up doing is they end up keeping about 2,000 of these tokens, regardless of the length of the input sequence, which is really interesting, right? You might expect that for a fixed number of retained tokens, in this case, again, about 2,000, you'd end up discarding more information the longer the full input sequence ends up being, right? You're only keeping 2,000 out of a very large number of tokens. If you have like a full context window of 128,000 tokens, you're really keeping less than 2%. It turns out that that actually does not lead, though, to a significant performance degradation, even over long context tasks. And the reason ultimately is that information is very sparse in context. There are very few tokens in practice that you really need to inform that next prediction. And in practice, 2,000 tokens is like a couple pages of text. So you actually do end up having quite a lot of information to base that on. Yeah, in some sense, this is like actually attention, right? Versus transformer attention is like look at everything and then keep everything in mind all at the same time. Even if it's like 2,000 pages or whatever, this is actually attention in a more human sense of, like, pick out information to process and then process only that information. That's a great – I haven't heard that point. That's a great point, right? Because in a way, it's like it's sort of like a human being reading a book and being like, okay, you need to memorize every single page except also remember that page one is really forgettable. But you need to remember page one but just know that it's super forgettable. What they're doing here is saying, ah, fuck page one. Throw it out completely. Just keep those pages. Much, much more compute efficient.
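To make that indexer idea concrete, here is a minimal PyTorch sketch of "approximate first, then attend only to the top ~2,000 tokens." This is an illustration of the mechanism as described above, not DeepSeek's actual implementation: the function name, the low-dimensional indexer projections, and the omission of causal masking, multi-head structure, and FP8 are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def sparse_attention_sketch(q, k, v, indexer_q, indexer_k, top_k=2048):
    """q, k, v: [seq, d_model] full-precision projections.
    indexer_q, indexer_k: [seq, d_index] cheap, low-dimensional projections
    standing in for the handful of lightweight indexer heads."""
    seq_len = q.shape[0]
    # 1) Cheap approximate relevance scores from the indexer.
    approx_scores = indexer_q @ indexer_k.T                 # [seq, seq]
    # 2) Keep only the ~top_k most relevant source tokens per query position
    #    (about 2,000 regardless of how long the full sequence is).
    k_keep = min(top_k, seq_len)
    top_idx = approx_scores.topk(k_keep, dim=-1).indices    # [seq, k_keep]
    # 3) Run exact attention only over the retained tokens.
    k_sel = k[top_idx]                                       # [seq, k_keep, d_model]
    v_sel = v[top_idx]
    exact = torch.einsum("sd,skd->sk", q, k_sel) / q.shape[-1] ** 0.5
    weights = F.softmax(exact, dim=-1)
    return torch.einsum("sk,skd->sd", weights, v_sel)
```

The point of the sketch is the two-stage structure: the expensive full attention over the whole context is replaced by a cheap scoring pass plus exact attention over a small, fixed-size subset.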
And by the way, classic DeepSeek thing, right? I mean, when you think about what made DeepSeek V3 a really capable model, it was all these little kind of optimizations. It's the software engineering almost of making these models go. That is the secret sauce behind a lot of these releases as well. There's a new RL paradigm that they use. That's got a whole bunch of things going on under the hood. One of which is just scaling the crap out of their RL compute budget. So their RL compute budget is about 10% of the total compute budget dedicated to training this model. That's a lot. You'll note, of course, that Grok has actually hit, they claim, 50% RL compute budget. So it's not unprecedented in the whole space of frontier models. It's the first time we're seeing an open source model with anything like that kind of RL compute. So that is a big deal. There's a whole bunch of training stability improvements. When you do RL at this scale, RL is notoriously fickle. It's really, really hard to get a stable training run out of your RL. And so there's a whole bunch of things that they do. There's a trick called off-policy sequence masking, which roughly speaking, it involves them like keeping, so doing a bunch of RL rollouts. So get the model to produce a bunch of solutions and basically throwing out solutions that lead to incorrect answers and are also really inconsistent with what your base policy, your initial model, would have produced. And this is basically just to avoid situations where there's a question like, what is 2 plus 2? And the response that you would get is like, banana, cucumber, polywoggle, right? It's like a completely implausible and wrong answer. So there's nothing to really learn from there. And so what they'll do is they will keep answers that are wrong if they look plausible under the original model. So, what is 2 plus 2? The response "five" would actually get retained as an incorrect response. You can actually learn from that. It's clear enough that it's not just noise. So that's another kind of micro-optimization here.
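As a rough illustration of that rollout-filtering idea, here is a small sketch. The interface is assumed (rollouts as dicts carrying a correctness flag and a log-probability of the sequence under the base policy) and the plausibility threshold is invented for illustration; the actual masking criterion in the paper is more involved than this.

```python
def filter_rollouts(rollouts, base_logprob_threshold=-50.0):
    """Keep rollouts that carry a learnable signal for the RL update."""
    kept = []
    for r in rollouts:
        if r["is_correct"]:
            kept.append(r)                       # correct answers always train
        elif r["base_policy_logprob"] > base_logprob_threshold:
            kept.append(r)                       # wrong but plausible ("5" for 2+2): keep as a negative
        # wrong AND implausible ("banana, cucumber, polywoggle"): dropped, nothing to learn from
    return kept
```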
The only other one I'll mention here is this keep routing operation. So it turns out that most people, when they do training and inference, you have to do a lot of inference to do your RL loop, right? So that means essentially generating all those rollouts, all those solutions that you're going to get the model to produce and then ultimately train on later. To do that, people tend to use different frameworks for training and inference. And in this case, they're using a mixture of experts models. So essentially a whole bunch of little sub models, mini models, a query comes in, there's a router that sends the query to a subset of those models. And it turns out that during training, you end up sometimes accidentally routing that query to different models than during inference. And so you do your inference, you calculate rollouts and you're like, oh, this rollout was wrong. OK, so I need to punish whatever expert produced this. Well, if your inference produced that rollout from a different expert than what's actually being trained on, you're going to end up propagating the corrections, the gradients, to the wrong experts. And so what they're doing here is just making sure that the expert that creates the rollout is the same one that's having its gradients updated in the training process. Again, this doesn't tend to matter at smaller scales because this becomes more of a kind of a fringe issue. But when you're doing it at large scales, it leads to huge training instability. So there's a whole bunch of stuff here too, around large scale agentic task synthesis, and they have a pipeline for it that we're not going to get into. But the bottom line is this is a very, very impressive model. Maybe one issue that they're flagging is poor token efficiency compared to Gemini. So in other words, the average number of tokens that it generates to solve a problem is actually quite a lot. It's quite high. It's a very verbose model. And so they haven't found a way to get it to reason as efficiently per token. Each token that it generates, each word that it generates in its output is in a sense contributing less to productive thinking than what you see in a Gemini model. So they flagged that as a thing that needs to improve. But look, I mean, zooming out, holy shit. Like just look at the benchmarks. This is a wild one. Right. And we positioned this initially as the base model, not the R2. But we should also say that like GPT-5, like Sonnet 4.5, like Gemini, really this is an integrated model. So they have DeepSeek V3.2. They also have DeepSeek V3.2-Thinking. And they have also this DeepSeek V3.2-Special, which is an experimental one. I need to look up exactly what they say. But it's basically like a preview of R2 where they did train it to investigate the potential of extended thinking. They say, we also developed an experimental variant, which was trained exclusively on reasoning data with reduced length penalty during RL. So we get like a little preview of this model, which they're not planning to keep on their endpoint. And yeah, if you look at the benchmarks, AIME, HLE, Humanity's Last Exam, they are near the top, beating GPT-5 high, beating Claude 4.5 Sonnet, getting near or in some cases beating Gemini 3 Pro. So it really is state of the art. As you mentioned, one of the limitations is the context efficiency, the token use efficiency, which we've seen with reasoning in general, the potential failure mode if you optimize for reasoning. They also mention a fun detail in the conclusion. They acknowledge certain limitations when compared to frontier closed-source models. First, due to fewer total training flops, the breadth of world knowledge in DeepSeek V3.2 still lags behind leading proprietary models. And then the second, this is one of the two things I mentioned. First, they weren't able to train as much, so it doesn't know as much, even though it's as smart. Second, the token efficiency is a challenge. So they're saying that these are two of the things they're working on. And third is complex tasks where they're still not quite as far. And presumably that's where DeepSeek R2 or whatever you want to call the next model will be. A couple other interesting factors before we move on. So this is an agentic model in the sense that R1 introduced this paradigm of training with verifiable rewards on code tasks and math tasks in particular. That was the beginning of reasoning. And then the other thing that happened throughout 2025, I would say, is the move towards actual agents, the move towards full-on tool use. The way we got to Claude Code, to these coding agents, wasn't just reasoning. It was also very much optimizing them for tool use, for operating in an agentic environment. And that's where this task synthesis part of this comes in.
They have tens of thousands, over 100,000 tasks with environments for coding, for search, for code interpreting, and for general tasks that they synthesized. And so on things aside from code, also for search, for presumably research, deep research, I don't think I mentioned this here, for tool use, it is, again, not quite as good as the frontier models, but very good. And I think it's interesting to compare it to Kimi K2 Thinking, which also recently came out. On things like Humanity's Last Exam, MMLU Pro, just the broad intelligence evaluation criteria, they aren't super far ahead. So actually, Kimi K2 Thinking is also a very intelligent model that we recently discussed. But when you look at the difference in tool use, and in sort of agentic execution more broadly, this is where it, I think, really shines. Yeah, absolutely. And to that point, right, part of this is also the, so the RL stage of their training loop is very interesting for what it's trying to do with respect to catastrophic forgetting, which is one of the classic challenges you run into. You want to train a model, a thinking model that will reason well, and then you also want to train tool use, and then you also want to train coding. And the problem is if you layer these in one after the other, the model will catastrophically forget. In other words, if you train it to do coding after you train it to do reasoning, it will forget some of the reasoning and then pick up some of the coding. And so the solution to this that they work into their training loop is to use what they call mixed RL training. So they're going to actually, instead of separating out reasoning, agentic training, and then alignment training, they actually merge everything into one RL stage. And they use specialist distillation for this. So they train a bunch of these domain specialists, like domain specialist models with heavy duty RL compute. They train a math specialist, a code specialist, an agent, like, agentic specialist. Yeah, and to your point, a search specialist too. And they use those specialists to generate high quality training data for the final generalist model. And that's all layered in this kind of mixed process to avoid the impact of catastrophic forgetting. So kind of interesting in that sense too, this sort of rethinking of, I've lost track of what we're supposed to be calling pre-training anymore at this point, but the rethinking of some aspect of that training loop. I think pre-training is still considered the next token prediction stage where you ingest all of the internet. Yeah, that's what it's always been to me. I've just seen papers now where they're like, oh my God. Let's add RL to pre-training. Let's do this and that. And then what is pre-training, post-training, mid-training? Nobody knows. And supervised fine-tuning, there's like this gradual shift in the type of data where you're moving from what once we would have called pre-training to something that looks more and more like fine-tuning. But they still call it, anyway, it's a whole thing. Anyway, but yeah, the takeaway is DeepSeek V3.2, a very good reasoning-first model built for agents. That's what they call it. So you shouldn't, it is basically the successor to R1 for all intents and purposes. And another, you know, another shot in the trajectory of open source models, very much kind of keeping neck and neck with frontier models.
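For intuition, here is a toy, runnable sketch of the "train specialists, distill, then do one mixed RL stage" recipe described above. Every element here, the domain list, the stub functions, the batch interleaving, is an illustrative assumption rather than DeepSeek's actual pipeline; the only point is the shape of the loop: specialists produce data, and the generalist then sees all domains mixed together in each update instead of in sequential stages.

```python
import random

DOMAINS = ["math", "code", "agentic", "search"]

def train_specialist(domain: str) -> dict:
    # Stand-in for heavy-RL training of one per-domain specialist model.
    return {"domain": domain}

def generate_traces(specialist: dict, n: int = 4) -> list:
    # Stand-in for a specialist generating high-quality training traces.
    return [{"domain": specialist["domain"], "trace_id": i} for i in range(n)]

def mixed_rl_batches(traces_by_domain: dict, batch_size: int = 4):
    # Interleave every domain within every batch, rather than training one
    # domain after another -- the point is avoiding catastrophic forgetting.
    pool = [t for traces in traces_by_domain.values() for t in traces]
    random.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]

specialists = {d: train_specialist(d) for d in DOMAINS}
traces = {d: generate_traces(specialists[d]) for d in DOMAINS}
for batch in mixed_rl_batches(traces):
    pass  # one generalist RL update per mixed-domain batch would happen here
```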
And I think, you know, you mentioned Ilya's interview with Dwarkesh, who, in case our listeners haven't heard, there's been some very good interviews lately on the Dwarkesh podcast with Ilya Sutskever and Satya and Karpathy. Oh, yeah. A lot of high-level discussion of where we're at right now and where we're going. And there is kind of a consensus forming that our current recipe isn't complete. We need some new ideas, new kind of like something big to change up the recipe. And the way that Ilya put it, which kind of is very popular, is moving from the age of scaling to the age of research. Which I think this paper, you know, you look at the technical report, which is fairly detailed, has a lot of like these different bits and pieces that they put together, makes me wonder whether DeepSeek is more in the age of research than OpenAI and Anthropic at this point. When I look at this through that lens, what Ilya is pointing out is finding more clever ways to make things scale that don't hit general reasoning. And what he's really getting at is a kind of sample efficiency argument fundamentally, that it's not just about doing more of the same thing. Like, essentially what this DeepSeek paper is doing, I think Ilya would say is kind of a waste of time through, through the lens that he's using, where they're saying, okay, let's continue to pepper this model by giving it even more examples of like a million different coding problems to train it, to solve those coding problems. But he's really focused on this whole out of distribution generalization thing that just doesn't seem to be getting cracked by the current paradigm. There are people, by the way, like Anthropic and OpenAI, that are at least publicly very much on the "scaling seems to continue to work" side. So we'll see. I think it's a spectrum, right? A spectrum of how innovative or different you are, particularly with sparse attention, I think is quite interesting. When we get to research, we'll be talking about some work from DeepMind, which I think is very much along these lines and very interesting. But for now, we finally move on and try to get through more stories. So next up, we've got Black Forest Labs. We haven't talked about them in a minute, but they have released Flux 2, the next generation of their image generation and editing system. So this is another thing that's been pretty noteworthy over the last couple of months. With GPT Image and especially Nano Banana and Nano Banana Pro, we've seen like a quiet revolution in image generation and editing capabilities in a way that I think most people haven't predicted. So now these models are able to synthesize very, very, very precise, prompt-correct images, kind of like you're getting to AGI for image generation almost, right? And Black Forest Labs is a startup that's been around for a while, spun out of Stability AI. And Flux for a long time was one of the leading text-to-image models. They also had open source variants. So with Flux 2, they are introducing basically, you could say, the Nano Banana generation, the GPT-5 image generation of image synthesis for their system. Lots of details, as you might expect. They have a bunch of variants, Flux2 Pro, highest performance, Flux2 Flex, Flux2 Dev, which is a 32 billion open-weight model, Flux2 Klein, which they are aiming to release under Apache 2.0, and their VAE. So they are still doing this partially open source thing that they've started with and have kept with. They say this wins out against various models.
For image generation, for instance, as far as open source goes, you're probably not going to do better than this. Not up there with Nano Banana Pro, but much cheaper and faster. So I think in the world of image synthesis and in that environment, this is potentially, you could say, similar to DeepSeek 3.2 in the sense of the impact on open source and also on the competitive environment versus the frontier labs. I continue to be interested in the business model of open sourcing stuff and how viable it'll be in the future. I still don't see it, but I'd be very curious to see. I think the standard is like open source for weaker stuff and keep the actual frontier stuff to yourself, which is also what Mistral is doing and everyone's doing. For sure. Yeah. I guess the thing I'm wondering about, especially on images, given what we've seen with Nano Banana, what is the ceiling on image generation capability beyond which people stop caring? I don't know. I suspect that it may actually turn out to be fractal, right? It may turn out to be like your image generation tool gets so good that you can get it to, I don't know, like make a circuit diagram for you for a next generation circuit, right? We may get into that space, in which case it gets more and more niche and potentially higher and higher TAM and value. But yeah, I guess I'm sort of curious. It kind of seems like, you know, you're drowning in a ship, the water level is rising as the open source models get better and better. And then like, you can only move so high until you hit the roof of the ship. And then you, this is a very like dark metaphor, but I'm curious about, yeah, how viable it ends up being. But I've also been saying that for like three years and this is continuing, so I don't know, dude. Like, yeah. Yeah, I mean, looking at the numbers, they show that this is better than Nano Banana in terms of human preference, comparable cost. Not quite as good as Nano Banana Pro, but much cheaper, 4x cheaper. Nano Banana Pro is incredibly expensive, seems to be because the way these models work now is they do reasoning. They don't just generate images, they generate kind of reasoning tokens or whatever you want to call it. So there is a substantial amount of progress in this space, and it's exciting to see Black Forest Labs still be a player in that space and provide competitive pricing and options for developers to work with. Right now, Nano Banana is kind of definitely leading the pack, but this is an alternative option. Interestingly also, we won't go on for this forever, but they say they've built Flux 2 on top of Mistral 3, using the vision language model based on Mistral 3. So a bit of a kind of, I guess, open source environment, building on top of each other vibes there. Moving on along, next story also about Nano Banana. The story is Sora and Nano Banana Pro are being throttled amid soaring demand. So apparently Google and OpenAI have reduced generation request limits for these models due to high demand. Free users of Sora are now limited to six generations per day. And Google has decreased the free image generation limit on Nano Banana Pro from three to two per day. So just an interesting thing to observe. Presumably people are using these a lot, especially for Nano Banana Pro. I could see it just being very useful for people to make presentations, memes. A lot of people have been making memes with Nano Banana Pro, and it's very good. So noteworthy to see that aside from chatbots, this is now a very significant part of the realm of AI.
Yeah, Bill Peebles, who heads up Sora at OpenAI, he says free users are going to have six video generations a day at their end. His statement is, and we've seen this before from OpenAI, our GPUs are melting. So very much- They love to say it. On fire, melting, yeah. I think that's kind of like a YC-ism from back in the day. I heard that a lot. I think Paul Graham would say, when your servers are melting, that's when you know you have product-market fit, whatever. So I think that's kind of maybe the meme. But yeah, anyway, so they're saying it looks like it may not actually be temporary. And it's possible that it's just like you'll need to purchase additional generations as needed beyond that point, maybe indefinitely. So kind of interesting. Next, we've got actually Mistral. So Mistral has released new open-weight frontier and small models. They have launched this in the Mistral 3 family and have made substantial improvements. So these are generally of a smaller variety. These are 14 billion, 8 billion, and 3 billion parameters. And these come in several variants. They have the base, pre-trained one, the instruct variant, which is chat optimized, and the reasoning variant, which is optimized for complex logic, analytical tasks. So at that size, obviously, they're not going to be competitive with the latest generation. But they are on par with something like Llama 3 or Qwen 3 Omni, other open source offerings. And DeepSeek 3.2, DeepSeek R1 are very big models, by the way. They are hundreds of billions of parameters. So at the smaller model scale, which is where Mistral has seemed to focus in on and sort of specialize a bit more on, these are actually useful offerings if you're trying to work at that range. So we won't dive in because we spent so much time on DeepSeek V3.2, but I think it's still worth noting Mistral is doing a lot of development and releasing. And it's hard to know exactly where we're at, but my impression is they do have substantial customers, at least in Europe. And I'm still rooting for them, even though I know, Jeremy, what you're going to say. I know you're going to say, I don't know how Mistral is going to stick around. You don't know that my opinions haven't changed in the last three months, but yes, they haven't changed in the last three months. But yeah, it's interesting. The one thing I'll say is that when you think about the scales here, it's important that they're hitting 3 billion to 14 billion parameters. That is in the sweet spot, right, for the open source developer community. When you do look at these larger models, though they can be impressive, you know, the V3.2s, the V3s, and so on, they're just way too big for the average person to, like, have them sit on their laptop or their one little GPU. So, yeah. Next up, moving back to video, Kling's Video O1 launches an all-in-one video model for generation and editing. So this is pretty interesting. This is Kling AI. They've been around for a while in the video generation space. And this O1 model is a unified multimodal video model for both video generation and editing, which to my knowledge is not something that you can do with Sora or any other big video tool.
These are all generally for generation, and Sora 2 is able to do very impressive generation, Veo 3 is able to do very impressive generation, but actually dealing with editing is a whole other thing. And so here they are editing in the sense of changing weather and swapping protagonists. So I think that is something you can actually do in Sora and Veo to some extent, you can condition on various things, but the focus on unifying it in this one O1 model is significant. And they claim that it outperforms Google Veo 3.1 and Runway Aleph in video creation and transformation tasks. So, you know, video space is very competitive still. Veo 3.1 was kind of the king along with Sora 2, but it's not going to be here for long. Yeah, we actually don't have much information about this multimodal visual language model that they use to bridge between like text and multimodal inputs. But so kind of, yeah, kind of interesting in that sense. We don't really know how they're doing it, but it is pretty compelling. They show a pretty solid win ratio against Google Veo 3.1. As you said, 62% win rate with 32% ties, only losing 6%. It's kind of similar with Runway Aleph. So it's definitely a marked improvement in quality in addition to in kind of form factor and sort of user experience, like the things that you can do with it. So you can upload up to seven images and tell it to, you know, like in Japanese anime style, like this person should be wearing this outfit and the hat from this person should be on their head and so on. You can kind of see it come together. It is really impressive. Three to ten seconds of video that you can get out of this. So just for a bit of context there. And next up, also on video, Runway has rolled out their Gen 4.5 AI video models that, again, they're saying are outperforming Veo and Sora in independent benchmarks. So same deal, very high resolution videos, much more refined, dealing with things like physics, human motion, camera movements, cause and effect. We're getting into world models, arguably, with these video models. They're kind of, you know, actually more advanced than some people might have expected. And Runway is, again, a company focused entirely on video generation and editing. That's their bread and butter. So I wouldn't be surprised if as far as actually a tool that people use in their workflow, this will be more impactful than Sora or Veo. And on the Artificial Analysis text-to-video leaderboard, it is still number one as of now, as of time of recording. Number two is Veo 3 no audio, and number three is Kling AI. So to give you a sense of, and these are all kind of within margin, within the 95% confidence interval. So it kind of is anyone's game at this point at the top of the game. So moving on to applications and business. First up, NVIDIA's partners are beginning to tilt towards Google's TPU ecosystem with Foxconn reportedly securing TPU rack orders. So quick background, TPUs are Google's specialized chips for LLM inference, particularly their transformer, not transformer, tensor processing units. They have been working on them for a decade. There was a bit of a stir of drama on Twitter when someone pointed out that Gemini 3 was trained entirely on TPUs, even though that isn't new for Google. I guess people have realized that Google has TPUs now, and it might be a problem for NVIDIA. So this is interesting in the sense that TPUs have largely been within Google's ecosystem. You've been able to pay for them through Google Cloud.
Google has used them internally for training their models. The idea of Foxconn and NVIDIA's partners beginning to work with Google on TPUs does seem interesting. It's a big deal. Foxconn takes the GPUs that they get from NVIDIA, and then they essentially package them together into server racks that then go into the data centers. They kind of sit in between, you can think of like in between the GPU companies or the systems companies like NVIDIA increasingly is, and the data centers themselves in the supply chain. And so, yeah, they're sitting here, historically having been an NVIDIA partner, now working with Google on their TPU deployments. Google, which is looking to make TPUs available in data centers for companies like Meta and others to compete directly with NVIDIA. This is really interesting, right? This is the move of a company that rightly concludes, hey, there's a really big market here. Now, one thing I want to observe, not something that I've seen commented on much, but it is absolutely true, and maybe the single most important economic factor when it comes to AI hardware today: NVIDIA makes like 85, 90% margin on their GPUs. Okay. Google is selling their TPUs to companies potentially that it will end up partnering with. You think about like their partnerships with a variety of different entities. There's Google DeepMind. Let's focus on them because they're actually inside Google. Google DeepMind gets to use Google's TPUs at cost, at cost. So 90% margin becomes 0% margin. In other words, $1 of Google DeepMind compute translates into 10 times the amount of compute that OpenAI gets, assuming OpenAI is going with NVIDIA. That's a really, really huge deal. You can upend the scaling landscape completely and get really thrown off if you're comparing apples to oranges on fundraising. So if OpenAI raises a billion dollars and Google throws a billion dollars at their development internally through DeepMind, very, very different consequences, right? So all kind of part of what is in store here for companies that continue to rely exclusively on NVIDIA. And that's going to be an interesting thing to watch in the landscape. You know, Foxconn here is manufacturing for both NVIDIA and Google. So they're in this unique position where they're hedging their bets, right? They're going to make money regardless of which platform wins. Another little detail here is they talk about this one-to-one supply ratio. So basically Foxconn ships one computing tray rack for every TPU rack that they get from Google. And that really suggests substantial, like very structured orders rather than just experimental deployments. And so that's a big deal. Also, TPUs, by the way, are more energy efficient than NVIDIA GPUs at scale. As energy starts to become, as power starts to become that key bottleneck, this is going to be a big deal. So yeah, look for the landscape to evolve based on this. Don't write off Jensen, of course. I don't even need to say this, but NVIDIA is a powerhouse, and you better believe they're coming back. The engineers at NVIDIA are working overtime. You can be sure of that. And, yeah, this is notable because TPUs, to my awareness, have not actually been used externally. This is kind of the beginning of targeting of external adoption of TPUs. For a long time, my perception of TPUs was as a competitive advantage for Google. They wanted to keep it in-house because they could then price their LLMs cheaper, train them for cheaper, scale Google Cloud to be cheaper. So it's an interesting strategic choice also by Google to allow some competitors to potentially use them. For instance, Meta, you know, maybe it makes sense for Meta to be able to use them because Google isn't competing with Meta on the frontier model development. I would also call back, right? So Satya, I think, said on his Dwarkesh podcast appearance, he's like, people think of Microsoft as a software products company. We're not. In the future of AI agents, we are an AI infrastructure company. We are supporting the running of trillions of agents around the world. Well, if you look at Google, I mean, that's kind of where the margin is accruing so far, at least NVIDIA sure as hell seems to be enjoying a lot of margin. It's not obvious that OpenAI is. It's not obvious with Anthropic, whose margins may actually be slightly better, but still, the compute layer seems to be really, really interesting. So if you're Google, you might just think, hey, can we be the compute infrastructure layer, not just for ourselves, but for all these other companies, including behemoths like Meta? So we'll see, but that may be part of it.
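As a quick back-of-the-envelope check on that margin point, using the ~90% figure quoted in the discussion ("effective compute" here just means hardware delivered per dollar, an illustrative simplification, not a real pricing model):

```python
budget = 1.00                      # dollars of compute spend
nvidia_margin = 0.90               # ~90% gross margin quoted for NVIDIA hardware
internal_margin = 0.00             # Google DeepMind gets TPUs "at cost"

compute_via_nvidia = budget * (1 - nvidia_margin)    # $0.10 of hardware per $1 spent
compute_at_cost    = budget * (1 - internal_margin)  # $1.00 of hardware per $1 spent
print(compute_at_cost / compute_via_nvidia)          # -> 10.0, the "10 times the compute" claim
```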
And next story, very much relevant. Amazon releases an impressive new AI chip and teases an NVIDIA-friendly roadmap. So this is from AWS, Amazon Web Services. They have unveiled Trainium 3 at their recent conference and the Trainium 3 Ultra Server system. So this is their in-house hardware for model training. I don't think inference as much, but notably used by Anthropic in particular. They have a pretty deep partnership. And I do believe Claude has been trained on Trainium chips to a significant extent. They've also teased the development of Trainium 4, which will support NVIDIA's NVLink Fusion technology, allowing interoperability with NVIDIA GPUs. Also interesting to me to see whether Amazon decides to try and expand beyond in-house. Or Trainium, again, could be a competitive advantage on the cloud competition front for providing training support. We're seeing now more competition in the fine-tuning space with, for instance, Thinking Machines targeting that vector as opposed to in-house model development. Anyway, Trainium, again, Amazon has been working on it for a long time. They haven't sort of made a dent in NVIDIA, but I could see this as another kind of thing to look out for. Yeah, and a funny sort of reference here as well to Anthropic, right? They're saying AWS customers like Anthropic are going to be using this chip to significantly cut their inference costs. So there's, I believe they're running on NVIDIA chips as well. Anyway, Anthropic's got a whole bunch of different frameworks that they're having to accommodate, which is a really interesting challenge for them. And one assumes they're using AI to help them map their training frameworks onto those different platforms. But yeah, the focus is everywhere you'd expect it to be with this, energy efficiency. So 40% improvement in energy efficiency from the previous generation of Trainium chips. You mentioned, yeah, they're for training. The inference line is called Inferentia. So there you have it. They've got those sort of separate lines. And then there is also a big focus on not just logic, though that is four times faster, but also memory. There's four times more memory capacity, which is going to be an interesting option, especially as inference becomes, as rollouts become more important, you're fitting more and more into memory.
Moving on to OpenAI, they have declared code red as Google catches up in the AI race, another fun conversation starter. So this is reportedly something that happened within OpenAI. Sam Altman has declared code red to improve ChatGPT, especially because of Gemini 3. And real statistics showing that consumers are moving towards Gemini. ChatGPT still dominates heavily. I think it's something like 80%. But Gemini is gaining. And Anthropic has been gaining rapidly in enterprise. So OpenAI has been losing out on enterprise for years, with Anthropic kind of gaining that market. They have dominated consumer usage. But Gemini is crawling up and going towards something like 10%, 15%. So I wouldn't be surprised if this did happen. You never know how serious Code Red actually is. But OpenAI must be feeling a bit of pressure. No doubt. I mean, you look at their position in the ecosystem, it is not what it once was. Anthropic, I think in November, I don't remember exactly because this was part of my dark time. They're raising at like $350 billion, with money from Microsoft and NVIDIA. OpenAI is at $500 billion. Like I'm old enough to remember when Anthropic was supposed to be like an order of magnitude behind. They have caught up. And as you say, in the enterprise segment, which by the way, probably is where you're going to go looking for your best margin as well in terms of value per query. That's an interesting challenge that OpenAI is going to face. So this is a no fail situation for them for sure. There's going to be a daily call apparently for those tasked with improving the chatbot, according to this internal memo. And Sam is encouraging temporary team transfers to speed up development. This is also, again, what would Ilya say? I think Ilya might say, well, this is the distraction game you get into when you're trying to build products rather than just a straight shot to ASI. They're just doing pure research heads down. And you're seeing these kind of team transfers where it's like, yeah, you're having to defend your product position now because your whole thesis is based on scaling. And you have to find the investment, the revenue to generate that next level of capex spend. Yeah, so in particular, Altman is saying that they'll be delaying initiatives like ads, shopping, and health agents, personal assistant, Pulse, to focus on ChatGPT. And this is something that OpenAI has been doing this year is trying to expand with things like Pulse, where you get a daily update, group chats, these very much more product level features as opposed to the base model quality. Like, GPT-5, I guess, famously or infamously, when it did come out, people were like, oh, is that it? Like, we were waiting for GPT-5 from GPT-4 for like a year or a year and a half or something. And it barely is better. But in any case, yeah, OpenAI has become a product company more so than an R&D lab. And they potentially have lost some of their edge in being able to be at the frontier. So, you know, Code Red could mean a lot actually in the startup space and we wouldn't be surprised if they do kind of gain the lead again. And on their competitor, Anthropic, reportedly they're preparing for a massive IPO. So the talk here is that they have engaged law firm Wilson Sonsini Goodrich & Rosati for their potential IPO. They're saying that they want to pursue private funding that would value it at above $300 billion with a $15 billion commitment from Microsoft and NVIDIA.
This IPO presumably would try to be next year, as soon as next year, which is, I mean, lots to say there from a business perspective. From, I guess, also company level perspective, once you go public, you have public investors. It changes the game. So Anthropic, you know, same. They've been kind of in the underdog position for a long time, but their position is starting to seem a bit stronger. And they seem like they want to push on that. Yeah, from a sort of compute efficiency standpoint, it does seem like, or algorithmic efficiency standpoint I should say, it does seem like they're certainly competing with and possibly exceeding OpenAI pound for pound. I mean, it is pretty wild what they've been able to pull off, especially the last like year and a half. I feel like they've truly, truly ascended. So yeah, this would be a big deal. I'm curious, I'm not a lawyer, but I'll have to do a bunch of research to understand what the implications of the public benefit corporation structure of Anthropic is with respect to an IPO, what this means for their governance structure, which famously has a board of oversight. I think it's like about half a dozen people who get to tell the company not to do things that violate its kind of founding mission. A different spin on the structure that OpenAI had sort of jettisoned quite famously. So yeah, it's interesting. And access to, obviously, the famously deep capital markets of the United States just at a time when all of the scaled buildouts are happening, right? So Anthropic, I think, committed to something like a $50 billion infrastructure build out fairly recently. So this is what they need to bridge that gap. Yeah. And I think it also points to the private market maybe tapping out at this point. OpenAI and Anthropic may have just sucked up all the VC money and now we need to IPO to get more money. Do you remember like five years ago when VCs could not fund a multi-deca-billion dollar raise? Like, this is insane. Yeah. And speaking of raising money, going back to Black Forest Labs, along with the announcement of Flux 2, they have raised $300 million at a $3.25 billion valuation. So this is a big number. You haven't seen a hundreds-of-millions-of-dollars raise like this in quite a while. Must indicate that they are doing well on the business front. I don't think we have much insight as to their lead in the API space, but I would imagine they're doing well. And that's pretty much it. This is a Series B. It's roughly a one year, one and a half year old company. And that's a lot of money. We got another startup, Paris-based AI voice startup Gradium. This is a seed round. So they have just emerged from stealth with a $70 million seed round. They are developing audio language AI models with ultra low latency voice responses. So $70 million for a seed round. This is like ridiculous bananas numbers that used to not happen before AI and has reduced as a trend since 2023, 2024. But here they are able to get there. So surprising to me a little bit, because ElevenLabs does have a pretty strong position in this space, and some other competitors as well. Must mean they have some very strong talent. Okay, now moving back to hardware, we've got some developments on OpenAI's buildouts. So they have announced a one gigawatt Stargate cluster in Abu Dhabi back in May, and that has actually begun construction, and you've seen some photographs and so on. And there is some skepticism that they'll be able to reach the one gigawatt number very rapidly. They'll hit 200 megawatts initially.
And I think, Jeremy, you have more kind of thoughts on this front. Yeah, I mean, it's not typical, not atypical rather, for first power to be pretty close to when you hit full scale. So 200 megawatts may actually be fairly close to when they do get to that one gigawatt. But basically, this is from Epoch AI. It's a tweet thread that they put together on X that goes over their assessment of how plausible it is that they'll hit the one gigawatt, planned one gigawatt in time. And it looks like delays, basically. So when will the UAE, Stargate in the UAE, reach a gigawatt? They say that they don't see clear signs beyond 200 megawatts. Optimistically, they say eight more 100 megawatt buildings could start construction in December and take one and a half years, like the first two, to complete; that would put one gigawatt at Q3 of 2027. So this matters because when you look at the timelines of different labs, kind of years to get to their first one gigawatt, you see quite a bit of variability, but you've got, for example, XAI that they've pegged at sort of like early 2026, Anthropic mid-2026, sorry, that's the, I'm sorry, mid-2026 would be OpenAI at Stargate in Abilene. And then Amazon and Anthropic in New Carlisle, they've got a build that would be early 2026. So beating OpenAI to the punch across both Abilene and the UAE Stargate optimistic scenario. So this is interesting because it means Anthropic really does seem to know what it's doing in terms of building fast. Like this is pretty wild stuff. And, you know, historically, look, the thing with these announcements too, keep in mind, they get delayed. That's what builds do. The functional kind of process here is a NeoCloud or some kind of company will approach a lessor, basically a property owner that will claim that they have access to enough power to build a data center or a bunch of them on their site. And it looks good. You check in with the local community. How much power can the transformers accommodate? Everything looks good. And then you get started and you find out, ah, the lessor lied to us. They actually have this like weird term in the contract that doesn't actually let them get the power in time. And this is what you see over and over and over. So a lot of what these companies are doing is getting really, really good at assessing how credible is a site. It may look good on paper, but in reality, it's not. So that's certainly been the case in North America where power is very scarce. I'm curious if this is, in fact, a delay in the UAE. I would have expected that problem to be less of an issue there just because they have such a surplus of power. So that is a bit of an update. I don't know nearly enough about the UAE's power situation, but that'll be something I'll be looking into in the next few weeks, I'm sure. A few more stories on the business front. Now moving into partnerships, another trend this year. OpenAI's investment into Thrive Holdings is its latest circular deal. And this one is truly circular, it looks to me. So OpenAI has acquired an ownership stake in Thrive Holdings, a subsidiary of one of its major investors, Thrive Capital. Thrive Holdings is basically a private equity firm. It acquires companies that could benefit from AI in sectors like accounting and IT services. Apparently, OpenAI will send employees to work within Thrive's companies to accelerate AI adoption, which I was not aware that that's a thing that happens, but that's interesting.
Yeah, it gives off those circular-economy vibes that people have been talking about. And by the way, we don't know the terms of the deal. All we know is what you said: OpenAI is going to send employees and product teams to work with Thrive's companies. So, cool. Apparently, if that succeeds, then OpenAI's stake will grow somehow and they'll get compensated for their services. Really unusual configuration, but we'll learn more over time. It could make sense, because one of the things you want to do as a tool provider, as an API provider, is get startups to use your tool, right? Absolutely. To get adoption. So if you're able to send employees to go work at these companies and use OpenAI, great for OpenAI, assuming you achieve lock-in. I guess that's the big bet here. Yeah, it looks circular, but actually it might be the opposite. In general, I think the whole circular investment argument is a bit silly. There is real value being created here. Anyway, we could do a whole episode. We could do a whole episode, but it's a little nuanced.

And OpenAI is also going to be acquiring Neptune, an AI model training assistance startup. So this is a startup specialized in monitoring and debugging tools for AI model training. They have previously collaborated, apparently, and Neptune is going to be going offline. The financial terms were not disclosed. So kind of interesting. I would have thought OpenAI already has mature infra of its own and wouldn't really benefit from something like this, but it seems that's not the case. Yeah, apparently there's already been a collaboration between OpenAI and Neptune to build metrics dashboards that help OpenAI's teams build foundation models, and so this is an even tighter collaboration. It's so fascinating and almost funny that you have such a niche use case where, really, the number of users for this is tiny, right? It's just that the value per user is so insanely high. So that's really what this is all about. We'll see if it translates into faster development at OpenAI. We'll see.

And then another acquisition, by Anthropic: they have acquired developer tool startup Bun to scale AI coding. So this is a major acquisition. They say that Claude Code has apparently reached a $1 billion annualized revenue run rate since launching earlier this year. Bun is developing a JavaScript runtime, an execution environment, something technical like that for running code. And in that sense, it seems like Anthropic is buying this to build the infrastructure for future software generation and basically double down on Claude Code and this kind of work.

All right, last business story. Microsoft cuts AI sales growth targets in half after salespeople miss their quotas. So these are sales growth targets for its AI agent products, after many salespeople apparently failed to meet the quotas for the fiscal year ending in June. These are the products that deal with multi-step tasks, doing autonomous execution, part of the big 2025 push from Microsoft and others, being added to Word, Excel, PowerPoint, Microsoft 365 Copilot, et cetera. So, you know, it tells you something. Hard to say if the goals were unrealistic in the first place or if adoption is indeed slow, but both are very plausible. Yeah, absolutely.

And now on to projects and open source. We begin with DeepSeek Math V2. This is slightly older than V3.2, so we kind of pushed it off; it was released on November 27th. And as it sounds, this is DeepSeek's math-specialized model.
And in fact, in the DeepSeek V3.2 report, they mentioned that they have incorporated data from this into the training of DeepSeek V3.2, so it's a bit of a subset. Not too much for me to say here: basically doubling down on math, doing a lot of self-verification training specifically for things like proof generation, and achieving some of these benchmarks like gold-level performance on IMO 2025 and CMO 2024, neck and neck again with Gemini and others on this frontier of math reasoning.

Yeah, the core of this is there's a generator and a verifier, which is a very standard setup, of course. The generator generates solutions, the verifier checks them, and you sort of have this interaction between the two of them that improves them over time. The challenge you get into is that sometimes the generator can get a correct answer with incorrect reasoning, for example. And in those instances, you need a way for the verifier to account for that in some way; when it's really just looking at the final answer, that doesn't tend to work well. So they developed this meta-verifier that they also train and then fold into this loop, including its score in the overall reward signal for the verifier's training. And essentially what the meta-verifier is doing is confirming that the kinds of issues identified by the verifier before it produces its final score are actually real and that they justify the predicted score that the verifier gave. So it's sort of a who-watches-the-watchers thing. And they get a bunch of human experts to score the quality of verifier analyses to create a meta-verifier dataset. Now, while they do train the verifier and the generator in tandem in this sort of generative-adversarial way, you can think of it that way, they don't continuously train the meta-verifier. And so that's kind of an interesting thing. You could imagine that eventually you might reach the point where you do need to start doing that, because the generator and verifier just get so advanced that the meta-verifier can't keep up, but they're actually not doing that. That was kind of the most interesting omission, at least, that I found in the paper. (There's a toy sketch of the overall loop after this story.)

They've got a bunch of scores, really impressive scores, by the way. So on the Putnam 2024 exam, it scored 118 out of 120. It solved 11 of 12 problems completely, with just minor errors, and it surpassed the highest human score of 90 by a wide margin, right? 118. That's pretty wild. This is the premier undergrad math competition in North America, by the way, genius-level people. And when we say this, we don't mean high school stuff; it's undergraduate math, right? So you're beyond the high school level. And even the IMO gold-medal scores, right? It solved five out of six problems there. So this is crazy.

Yeah. So this is coming pretty quickly after Google had a paper, Towards Robust Mathematical Reasoning. They also announced their IMO results with Gemini 3, Gemini Deep Think. It was a big deal to reach the gold medal; it was actually the first time. And looking at the numbers, using the IMO ProofBench that Google released just earlier, like a month ago: if you look at DeepSeek R1, it had 4% performance. If you go to DeepSeek Math V2 now, it's something like 70%, right? This is a massive leap from where the models were a year ago in terms of this level of complex mathematical reasoning.
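To make the generator / verifier / meta-verifier interplay a bit more concrete, here is a minimal, hypothetical sketch of the control flow as we described it. None of the function names come from the DeepSeek paper; they are stand-ins for model calls, and the reward wiring is deliberately simplified.

```python
# Hypothetical sketch of a generator / verifier / meta-verifier loop.
# All function names are stand-ins, not the paper's API.

def generate_proof(problem: str) -> str:
    """Stand-in for the generator model producing a candidate proof."""
    return f"proof attempt for: {problem}"

def verify(proof: str) -> tuple[list[str], float]:
    """Stand-in for the verifier: returns flagged issues and a score in [0, 1]."""
    issues: list[str] = []  # e.g. ["step 3 divides by zero"]
    score = 1.0 if not issues else 0.2
    return issues, score

def meta_verify(proof: str, issues: list[str], score: float) -> float:
    """Stand-in for the meta-verifier: checks that the flagged issues are real
    and that they justify the verifier's score; its output feeds the verifier's reward."""
    return 1.0  # pretend the verifier's analysis checks out

def training_step(problems: list[str]) -> list[tuple[float, float]]:
    rewards = []
    for problem in problems:
        proof = generate_proof(problem)
        issues, score = verify(proof)
        meta_reward = meta_verify(proof, issues, score)
        # Generator is rewarded via the verifier's score; the verifier's own
        # training signal folds in the meta-verifier's judgment of its analysis.
        rewards.append((score, meta_reward))
    return rewards

if __name__ == "__main__":
    print(training_step(["Prove that sqrt(2) is irrational."]))
```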
Next up, we've got a paper from Google benchmarking LLM agent test-time learning with self-evolving memory. So, going back to the note about Ilya's conversation with Dwarkesh, this is, I think, going to be a trend in 2026 and in coming months, where people increasingly are thinking about memory and about adaptation, basically going beyond what has been the paradigm for AI for a long time, which is: you train the model, you deploy the model, and it does in-context reasoning. You give it some examples, maybe, and that's it. It doesn't ever have a long-term memory of any kind, for the most part. So here, they are addressing that problem of learning over a period of time. They have two different setups here. They have XPRAG, which is a basic baseline that stores interactions as structured records of input, output, and feedback and can retrieve similar past experiences as in-context examples; this is pretty normal, people do this, and there's a rough sketch of the idea after this story. And they also have REMEM, a more sophisticated framework with an iterative loop where you can actually try to organize your memory: retrieve, prune, and reorganize it during inference rather than just treating it as a sort of big pile to throw stuff onto.

Yeah, this is still pretty early. So this is kind of a proof of concept, almost, and an initial evaluation of this overall category of capability that LLMs really don't have built in, at least. And to make it really concrete, by the way, there's a paper we'll be talking about a little bit later that, to your point on trends, this is very much becoming a thing, and people are trying to take different bites at the apple here. But that paper, which we'll get to in a couple minutes, dives into this even more and gives you a concrete sense of what this benchmark, and the whole space right now, is trying to solve for. So think about a kind of task that someone might put to you, like: put a clean apple in the microwave. If you didn't know where stuff was in your kitchen or whatever, you might look for the apple in different places and then eventually realize, oh, this person keeps their apples in the fridge. So you go get the apple from the fridge. Okay, cool, that's round one. If next someone asks you, okay, put a clean potato in the microwave, right? Same task, but now you're asking about a potato. Based on the previous task you've done, you really ought to have learned in context that, oh, interesting, the vegetables maybe are in the fridge, maybe fruits and vegetables are there. So let me turn to the fridge now instead of looking literally everywhere. And a bad model, or a model without this sort of active memory, would just, again, look over the counters, all these places, before looking in the fridge. And if you then ask, put a clean tomato in the microwave, over time you're going to start to refine the rule in your head from "oh, the apples are in the fridge" to "actually, looks like apples and potatoes" to "oh, okay, it looks like all fruit and vegetables are in the fridge." And that kind of massaging, it's not the weights of the model that are being updated, and it's not the attention values that are being updated with every single token. It's almost an intermediate-frequency update: updates that are happening every so often to this memory that you're using to navigate the world. And so anyway, this is a bit of a teaser for the paper we'll talk about later, where different frequencies of updates, mechanisms that learn at different rates, are hypothesized to be really important for this sort of in-context learning.

And I just want to mention, this kind of thing, broadly, memory frameworks, has been coming about with startups dealing with memory maintenance and so on. But it is all rather ad hoc; it's giving agents tools and telling them to store and update and so on. This is an example of that paradigm.
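Here is the rough sketch mentioned above of the store-experiences-and-retrieve-them baseline, as we understand it. The class and method names are ours, not the paper's, and the similarity measure is crude word overlap just to keep the example self-contained; a real system would presumably use embeddings.

```python
# Minimal sketch of an experience memory: store (task, actions, feedback) records,
# then retrieve the most similar past experiences as in-context examples.
from dataclasses import dataclass

@dataclass
class Experience:
    task: str       # e.g. "put a clean apple in the microwave"
    actions: str    # what the agent did
    feedback: str   # whether it worked, plus any notes

class ExperienceMemory:
    def __init__(self) -> None:
        self.records: list[Experience] = []

    def add(self, exp: Experience) -> None:
        self.records.append(exp)

    def retrieve(self, task: str, k: int = 2) -> list[Experience]:
        """Return the k stored experiences whose task text overlaps most with the query."""
        query = set(task.lower().split())
        scored = sorted(
            self.records,
            key=lambda e: len(query & set(e.task.lower().split())),
            reverse=True,
        )
        return scored[:k]

memory = ExperienceMemory()
memory.add(Experience(
    "put a clean apple in the microwave",
    "searched counters, found apple in fridge, washed it, microwaved it",
    "success; produce seems to live in the fridge",
))

# For the next task, retrieved records go into the prompt as in-context examples,
# so the agent checks the fridge first instead of searching everywhere again.
for exp in memory.retrieve("put a clean potato in the microwave"):
    print(exp.task, "->", exp.feedback)
```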
And now, as we move into research and advancements, the paper that we'll be focusing on is called Nested Learning: The Illusion of Deep Learning Architectures, which is a very interesting title. It basically presents the alternative option: instead of trying to add memory on top of a neural net, on top of a model that's told to write down its memory and then later retrieve it and refine it and whatever, what if memory is a core component of the way the model itself works? Essentially, they position it by looking at current LLMs: they have this property of only ever experiencing what the authors call the immediate present. Once you finish pre-training, you have ingested all of the internet, and the model sees its input, the context, and produces an output based on that context. That's how the model works. There's no continual learning, there's no continual updating of anything within the model. At best, what you can do is store something external to the model and then retrieve it and put it into the input. That's what RAG does, that's what memory frameworks do, et cetera. So there's a big question, and it's been a big question: continual learning in general has been a topic in machine learning for decades, and it has been a topic for transformers and large language models in research for the last couple of years, but not really a priority. This paper is showing, from DeepMind's perspective, their latest effort on this front: how do you make a neural network architecture and a training paradigm that encode continual learning and different levels of memory within the actual model itself, in the neural net weights?

At a high level, the way this works is a couple of things. First, they have this notion of nesting, which means that within the model, if you look at a typical transformer, it's kind of one big thing. You take an input and it goes through a whole bunch of layers where you alternate attention and MLPs, basically attention and kind of processing on top of attention. And you have what they deem a single frequency of information update, a single frequency of thinking, so to speak. The core conceptual leap in the paper is nesting, in the sense of having multiple layers of reasoning frequency and learning frequency. They say this takes inspiration from the brain, where we do seem to have these layers of memory, right? Working memory, short-term, long-term. We also seem to have different rates of updates in different areas of the brain. So through various technical details, which we'll get into, the gist of it is being able to have nesting of different amounts of memory, and rates of update, and other things like that, within a single neural net. Building on top of, by the way, previous research of theirs on Titans; there's been a bit of a continuation of research on this front from DeepMind, and this is the latest in that line of work.
Yeah, it's actually, in a way, quite interesting. In a way, it feels like putting words to an intuition that I think a lot of people have had for many years. This includes people who worked on RNNs, recurrent neural networks, or state-space models for a long time. There have been a lot of attempts to actually do this, to instantiate the theory that they're putting together here, but they're actually trying to codify it and put a word to it. So the idea here, as you say, is that during inference, at inference time, all the model's weights are frozen. They do not update at all. They do not learn anything. They've done their learning during training and they're frozen in time. Every time you put a new token through the system, though, the attention values for that token get recalculated from scratch, right? So that means that essentially, while the weights are updating with basically no frequency, they're never updating, the attention values are being recomputed every single time, with every single token. So they're updated with almost infinite frequency. At least that's the way the paper frames it, which I think is debatable. But anyway, you've got essentially this world of extremes inside a transformer, where the core architecture is frozen in time, but the attention mechanism is just frantically updating all the time. There's no middle ground where we're slowly absorbing a bunch of context and information as we go, and also slowly updating it over time in response to things that are learned. It's sort of all or nothing: you're fully all about this token, or you're frozen in time. It's almost like the weights have infinite momentum, right? They're static, and the attention values have zero momentum; they're flying all over the place.

So, extrapolating a bit, this seems to imply that you might do better architecturally by defining some additional component of the network that updates at some kind of intermediate frequency. And that is exactly the kind of de facto memory that an RNN might use, where you're deliberately updating it only every N tokens, let's say, to create this medium-term memory in the system. And that's what they do in the paper. They define this thing called the Continuum Memory System, CMS. And this is basically stacking multiple MLPs, multiple neural networks, on top of each other, where each MLP updates at a different frequency. So it updates every N tokens, you know. And this gives the model the ability to have some dynamic range, is one way to think of it, in terms of its memory, and to actually learn on the fly: long, short, medium-term memory, if you want to think of it that way. So quite interesting. They've got a good breakdown of how to think about what qualifies at what level of memory. But anyway, there's a whole bunch of stuff we could get into with analogies here, but I think we probably have to move on. A rough sketch of that multi-frequency update idea follows.
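To make the update schedule concrete, here's a rough, hypothetical sketch of memory components that commit updates to their weights at different frequencies. This is not the paper's actual CMS implementation: the update rule here is a simple outer-product trace chosen only to keep the example self-contained, and all the names are ours.

```python
# Sketch: fast, medium, and slow "memory levels" that each commit their pending
# updates only every `update_every` tokens -- the intermediate-frequency idea.
import numpy as np

class MemoryLevel:
    def __init__(self, dim: int, update_every: int, lr: float):
        self.W = np.zeros((dim, dim))      # this level's memory weights
        self.update_every = update_every   # tokens between weight commits
        self.lr = lr
        self.pending = np.zeros((dim, dim))
        self.steps = 0

    def read(self, x: np.ndarray) -> np.ndarray:
        """Read out this level's memory for the current token."""
        return np.tanh(self.W @ x)

    def write(self, x: np.ndarray) -> None:
        """Accumulate an outer-product trace, committing it only periodically."""
        self.pending += np.outer(x, x)
        self.steps += 1
        if self.steps % self.update_every == 0:
            self.W += self.lr * self.pending
            self.pending[:] = 0.0

# Every token, every 16 tokens, every 256 tokens: short, medium, long-term memory.
levels = [MemoryLevel(8, 1, 0.1), MemoryLevel(8, 16, 0.01), MemoryLevel(8, 256, 0.001)]

for _ in range(512):                       # a stream of toy "tokens"
    x = np.random.randn(8)
    for level in levels:
        x = x + level.read(x)              # each level contributes its memory
        level.write(x)
```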
At a high level, this also ties in neatly to the line of research on trying to find a sort of hybrid between recurrent models and transformers. So we've talked about Mamba, we've talked about the resurgence of recurrent models in general, where obviously recurrent models have memory in the model itself. They don't have learning per se, but they have memory of all their inputs in the past. One of the issues with recurrent models is that the memory degrades, because you don't update the weights; you don't store it anywhere except in this little state that you keep updating. So recurrence is also an aspect of this model, but the big deal is really the fact that you update the weights. And there are some very interesting technical details of how they reformulate gradient descent as just a general update rule that you can apply, rather than the standard view of it. We probably shouldn't dive in too deep, but it's a really interesting paper, and there's also a blog post by DeepMind you can take a look at.

Next up, we've got a kind of smaller paper, Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO. So we have had a lot of excitement about reinforcement learning with agents, and one of the complexities of reinforcement learning is: well, what if you have multiple agents? Now your environment isn't static, your actions aren't static, and in general everything gets much more complex. So this is formalizing that process where you have a vertical multi-agent architecture. It extends GRPO credit assignment, the math behind it, to handle that kind of hierarchy of credit that you should give to multiple agents.

Yeah. And there's a bunch of detail in it. This is one of those open problems where you ideally want a kind of orchestrating agent, a main agent, that could be based on a different language model than the sub-agents it calls. And this creates a problem, because then you can't backpropagate your gradient updates through the whole system in the same way. And so what they're doing here is figuring out a way, and it's not a janky way, it's a fairly elegant way, to make it so that the sub-agents are graded not just in a way that accounts for their own performance, but in a way that accounts for the overall performance of the main agent as well. So how well the main agent did overall is factored into the sub-agent's reward, alongside how well the sub-agent executed its own specific subtask. And that second one, by the way, is judged locally by an LLM evaluator, because the sub-agent is doing something that you can't necessarily get a metric for from the overall outcome. So you do need some kind of local LLM judge saying, hey, does this look like a good intermediate output? But yeah, this is all part of people trying to figure out agent orchestration and training agents to work together explicitly, instead of just this in-context jamming together of a bunch of language models where you try to wrangle them together using prompts exclusively. This is really: how do we train the whole system together as a unit? Which is a newer trend; over the last 18 months or so, people have been putting a lot of thought into this piece. A toy sketch of that blended credit assignment follows.
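As a rough, hypothetical illustration of the credit-assignment idea rather than M-GRPO's actual formulation: a sub-agent's reward can blend the main agent's overall outcome with a local judge's score for the sub-agent's own subtask, and advantages are then computed group-relative as in GRPO. The weighting `alpha` and all names here are our own.

```python
# Toy sketch: blend global outcome with a local LLM-judge score, then compute
# group-relative advantages (reward minus the group mean), GRPO-style.

def subagent_reward(overall_outcome: float, local_judge_score: float,
                    alpha: float = 0.5) -> float:
    """Blend overall task success (0..1) with a local judge's score (0..1)."""
    return alpha * overall_outcome + (1 - alpha) * local_judge_score

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# One group of rollouts for the same subtask: (overall outcome, local judge score).
group = [
    subagent_reward(1.0, 0.8),   # task succeeded, decent intermediate output
    subagent_reward(0.0, 0.9),   # good intermediate output, but overall task failed
    subagent_reward(1.0, 0.4),   # task succeeded despite a weak intermediate output
]
print(group_relative_advantages(group))
```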
And the last bit of research: we've got State of AI, an empirical 100-trillion-token study with OpenRouter. OpenRouter is basically a gateway you can use to access different AI models; they let people make calls to LLMs of different kinds through them. And this report is looking at, as it says, 100 trillion tokens, so essentially a whole bunch of usage across a whole bunch of different LLMs. They focus on the past year. They took all this data; they didn't look at the actual text of the prompts, because that would be crazy, right? They looked at the metadata of all the prompts, all the calls to different models, and did classification on a small subset of these prompts to see what kinds of things, broadly, people are doing. And there are, as you might expect, a whole bunch of interesting things here. Mostly people are using closed models. There's a somewhat stable split, but open source is gaining, especially over the course of 2025. Token usage is going up and up and up at a very rapid pace; it's gone up like crazy since 2024. All these sorts of things. I don't know what stood out to you, Jeremy.

Well, yeah, one piece, to your point, is the Chinese open-source models, your DeepSeeks, your Qwens: they went from about 1% of usage to 30% in one year, right? So that's a really, really big deal. That's market-moving stuff. Medium-sized models tend to be preferred, so 15 to 70 billion parameters. We talked about that earlier in the context of Mistral's launch; that's kind of the new sweet spot, right? You're balancing your efficiency with your capabilities, and that's where we're seeing a lot of the consumer-facing stuff go. Role-playing is a really big deal. I was not aware of this, but I guess it makes sense: apparently over 50% of open-source model usage is creative role-play and storytelling. So not coding, which is what I would have guessed, but quite interesting. There has been an explosion overall, though, in programming and coding queries: they went from 11% to over 50% of recent token volume. And they're also noticing average prompt lengths have quadrupled, from 1.5K to 6K tokens, and that's almost all driven by coding-related tasks; you tend to be dumping a whole bunch of code into context to do that. No surprise, that's where Anthropic dominates, with 60%-plus market share. So there is obviously intensifying competition there, but the story of the last two years has been Anthropic just mercilessly climbing that ladder. It's super, super impressive. A whole bunch of stuff around agentic inference is becoming more and more important. You look at these reasoning models: they're now taking over 50% of all tokens. So we're moving from this world where people are interacting with chatbots by asking questions and getting answers, and more towards multi-step rollouts with tool-calling workflows and all kinds of stuff like that. So maybe the last thing I'll mention: price does not matter for demand. What they've shown is there's very weak price elasticity in the market. If you cut the price per token by 10%, this will yield only a 0.5 to 0.7% increase in usage (there's a quick back-of-envelope on what that implies after this story). And so that's pretty interesting, right? This race to the bottom on price, that's not, at least for now, where things are going. It seems like it's a race on quality, which is a really interesting bit of information for the frontier labs, because obviously that's where their sweet spot is. And that does explain why the OpenAIs and Anthropics of the world are seeing margin and everybody else who's fast-following is really struggling here. So yeah, it is quite an interesting report. You can find something for everybody here. I would say dump it into Claude and just ask it the questions that matter to you, because it is so wide-ranging. I wouldn't recommend going through the entire thing; I did that for way too long. There's a lot there.

I guess as a way to frame it, as a way to think about it: OpenRouter is an API gateway, meaning that basically this is usage by other products, other tools, outside of Claude and ChatGPT and Gemini. This is what's happening when you're talking to some product's chatbot or hitting an endpoint, rather than the first-party apps. So you can take away various things here. For instance, the rise in programming indicates that all these startups doing vibe coding, you know, Lovable, Replit, there's like 20 of them, and us at Astrocade, who are also now doing basically vibe coding, this shows that the market and the set of tools being developed is there. The fact that role-play is so big actually is not surprising to me, because what we know from broader usage patterns is things like Character AI and a million others; there are literally thousands of these talk-to-AI-characters or role-play-with-an-AI-girlfriend apps, et cetera. So it's interesting as a portrait of, outside of the frontier labs and outside of core ChatGPT, core Claude, et cetera, what are people doing? To your point, I guess what my brain was doing was recalling the fundraise numbers for these. If you compare the fundraises for vibe-coding apps, it's insane, while the fundraising for character and storytelling apps is very, very limited. But of course, that's got to be the case; it's very fragmented, and it's just such a large space. It is. And it's also that the value per query when you're doing storytelling is really shitty, versus the value per query when you're doing vibe coding, which is enterprise-grade. Of course you're going to get way more dollars per token. So that probably is the thing that explains this the most. But yeah, it is interesting to see it in numbers here.
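The back-of-envelope mentioned above: a 10% price cut yielding only a 0.5 to 0.7% usage increase corresponds to a price elasticity of demand with magnitude around 0.05 to 0.07, which is to say extremely inelastic. A quick check:

```python
# Implied price elasticity: percent change in quantity divided by percent change in price.
price_change = -0.10  # a 10% price cut
for usage_change in (0.005, 0.007):
    elasticity = usage_change / abs(price_change)
    print(f"{usage_change:.1%} more usage for a 10% cut -> elasticity magnitude ~ {elasticity:.2f}")
```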
On to policy and safety. First, we've got: Trump signs executive order launching the Genesis Mission AI project. This is a federal initiative to enhance American AI research and development, likened to the Manhattan Project in its urgency and ambition, which, I'm not sure that's fair, but okay. The order outlines steps to expand computational resources, improve access to federal data sets, and focus on real-world applications, especially in scientific fields. Michael Kratsios, assistant to the president for science and technology, will lead the initiative, and apparently there will be an American Science and Security Platform created to centralize infrastructure and provide researchers with the necessary computing power and data sets. I guess this goes back to the notion of a national AI cloud, which was discussed quite a lot, and it will be interesting to see if this comes about.

Yeah, I mean, historically, the big challenge that the U.S., that all governments really, face is that they've got a bunch of data the government just has sitting, like, in databases, not really doing much; it's hard to access, and the government doesn't have the computing power to access it. And you can imagine how important this data would be, too, right? A lot of it is classified bio data or military or intelligence data. There's tons and tons of value to be extracted from it. What if you could, in some sense, smush that together? You've got all kinds of challenges when you try to do that, obviously, which is that information that's classified is siloed for a reason. You don't want to combine your secret A with secret B, where the combination of them could give you weapon C, in ways that are too sloppy. So there's a lot to be figured out here. But yeah, it's a really big initiative.
The Manhattan Project framing is interesting. There is kind of a Department of War sort of equity in this, so that'll be part of it. But the Department of Energy is the department that led the Manhattan Project, so that's kind of where some of this is coming from. Anyway, it's going to be drawing on about 40,000 DOE scientists, engineers, and technical staff, along with private companies and so on. So again, hard to see how this will come together. I think that's a story yet to be told. But the goals are lofty, and this is at least consistent with David Sacks' and, I think, Kratsios' AI Action Plan that came out earlier this year, too. Right. And this is, by the way, building on top of the National AI Research Resource, which was established in 2020 to provide shared national research infrastructure and do things like partnerships and data sharing. So presumably this will aim to accelerate that, expand it, all that kind of thing.

Next, we've got some research on safety and interpretability. The headline from MIT Tech Review is: OpenAI has trained its LLMs to confess to bad behavior. So they released, essentially, a study, a paper, where they tried the basic thing of asking the AI to, like, tell us when you're doing these bad things. GPT-5 Thinking was given tasks that encouraged it to lie or cheat, and it confessed to such behavior in 11 out of 12 test sets. So it's things like being able to cheat on problem solving, writing code to solve a math problem, simple math questions, things like that. And this kind of follows up on other recent research from Anthropic, I think, where some basic kinds of tricks might be more powerful than you might expect for making model deployment less dangerous.

Yeah, this is one of those things where, I feel like with alignment research, a lot of it does look like this, where people come up with these dead simple ideas and you're like, why hasn't anyone tried this before? And it's like, well, it's only obvious now. So yeah, what they do is, for about 25% of samples, the model will be asked to produce a confession after its answer. And then they'll get an LLM judge to reward confessions based only on honesty, right? So this is an LLM-as-a-judge framework. That reward applies only to the confession tokens, so there's no signal backpropagated for the tokens, the reasoning tokens, beforehand. So you're really preventing the model from doing any kind of optimization around the main text that it created. It's forced to optimize only for a reward on its confession, which is important to prevent hacking at the level of the main generated text. (A rough sketch of that confession-only reward masking is below.) They also actually induced hacking: they deliberately use a weak and hackable LLM judge for both the main answer rewards and the confession rewards, just to see what tends to happen. And no surprise, the model learns to hack the weak judge. So you'll see it output stuff like, say it's asked to generate a sentence with 12 words or something, it'll generate a sentence with like 18 words, but then it'll put the number 12 in brackets at the end of it, and that'll often fool the weak judge. And then it'll be asked to kind of confess, Game of Thrones style.
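Here is a hypothetical sketch of the "reward only the confession tokens" idea as described above. This is not OpenAI's code; the point is the token masking: the honesty reward from the confession judge touches only the confession span, so the main answer is never optimized against that judge. All names and numbers are illustrative.

```python
# Sketch: per-token rewards where the honesty reward is masked to confession tokens.
import random
import numpy as np

def build_reward_mask(answer_len: int, confession_len: int) -> np.ndarray:
    """1.0 on confession tokens, 0.0 on the answer/reasoning tokens before them."""
    return np.concatenate([np.zeros(answer_len), np.ones(confession_len)])

def per_token_rewards(task_reward: float, honesty_reward: float,
                      answer_len: int, confession_len: int) -> np.ndarray:
    mask = build_reward_mask(answer_len, confession_len)
    # The task reward applies to the answer tokens; the honesty reward (from the
    # confession judge) applies only to the confession tokens.
    return task_reward * (1 - mask) + honesty_reward * mask

# Roughly 25% of samples get a confession request appended after the answer.
if random.random() < 0.25:
    rewards = per_token_rewards(task_reward=0.9, honesty_reward=1.0,
                                answer_len=120, confession_len=30)
    print(rewards[:3], rewards[-3:])
```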
They do show a really impressive improvement in performance and confession accuracy, almost a hundred percent by the end of training, despite using the same weak judge. And their theory of the case here is that the equilibrium for confessions is actually honesty: the path of least resistance is to actually give an honest answer, and they hope that will continue to be true as the models scale. There's something there. I think the interesting thing for me is the specific bit about rewarding honesty over helpfulness. This kind of thing actually happens in practice when you're coding. Sometimes the coding agents will just break a test, comment out the condition, and the test passes and it's happy, because it is trying to get the job done; it's not necessarily trying to get it done the proper way. So there is something there in terms of making the model not worry too much about doing exactly what you asked, but also worry about other things.

Yeah, absolutely. And so I think a big question with this will be: if this is meant to be a superalignment strategy, if we're going to be using this for genuinely superintelligent systems, if that's part of the idea here, we need to see how it scales with more compute, more optimization pressure thrown at it. It does not prevent misbehavior, by the way. This is just an after-the-fact, "oh, shit" kind of monitoring solution, which, you know, you could fold into a preventative loop, but intrinsically it's more of a monitoring thing. And it does assume at its core that confession, that honesty, is the path of least resistance in this framework. But if you have a sufficiently capable model, it might find some clever ways to hack the confession judge. As I said, the optimization pressure could be dramatically higher at scale. They do acknowledge this: they say additional training at scale will be needed to demonstrate that this assumption holds under high optimization pressure. So presumably they're actually going to experiment on that. Yeah, I think it's interesting. We know the Waluigi effect, where models turn evil, where the models kind of take on an evil character. So anyway, alignment making progress, gradually.

All right, just a couple of stories left. U.S. senators seek to block NVIDIA sales of advanced chips to China. This would be the Secure and Feasible Exports Act, the SAFE Act. And it would order the Commerce Department to halt export licenses for sales of chips to adversaries, including China and Russia, for at least 30 months. I don't know the details, like whether this is all chips or just certain chips. Jeremy, maybe you can go deeper. Yeah, so they're saying it's for advanced chips to China. They are looking at a compute threshold. I haven't actually looked to see what that threshold is or which chips would be associated with it, which would be in or out. But, you know, historically, a lot of the made-for-China chips, like the H800, are actually weirdly performant. And when you look at them, it's not obvious that they aren't strictly better than a lot of the chips like, say, the H100, which was supposedly the souped-up version, if you connect them together in the way that Chinese labs actually tend to. So that is an important open question. You know, Jensen was in D.C. on Wednesday. He met with Trump and Republican senators on the banking committee.
And he said, this is the sort of standard NVIDIA position, that, look, Beijing's not going to accept degraded chips, and US companies should be able to export our most competitive chips to China. This all depends on what you think chips are. If you think that they are a uranium stockpile for a WMD, then this is insane. If you think that they are just another technology that's going to make everything great and overwhelmingly positive, and that the downsides, that they're just not weaponizable, then yeah, this sounds pretty plausible. That's kind of the main axis of disagreement here. By the way, interesting kind of inside baseball on the Republican side of things: Steve Bannon, who many will recall as a guy who worked with Trump in Trump One, at least for part of it, is now hardline against exporting NVIDIA chips. So he's on the side of saying, well, I'll give you the quote here. He says, quote, "David Sacks has acted as the agent for the Chinese Communist Party and Jensen Huang is the arms merchant." That's some fiery shit from Steve Bannon, who, you know, 10 years ago was a kind of pro-Trump guy. This is shots fired. And by the way, the administration is kind of trying to figure this out, it seems. Like, it seems like they haven't yet. We've seen some of, oh yeah, we'll let these chips go; actually, let's not, let's enforce export controls. So we're still trying to see what this will all shake out to. And Congress now seems to be moving to take action independently. So we'll see. We'll see what happens.

Right, exactly. This is amid NVIDIA trying to talk its way into being able to sell chips. And notably, this is in the Senate, so this is the legislative branch trying to pass a law to actually say what should be done, as opposed to the executive deciding things about export restrictions. So there is a bit of an executive-versus-legislative situation that might be going on, probably behind closed doors. There are some more details there. Absolutely.

And that is it for this episode. Great to have you back, Jeremy. We're back to, like, talking through 30 stories. Yeah, that's right, non-stop. And we will try to be doing that weekly. So thank you for listening, thank you for viewing or commenting, and please do keep coming back.

From the lab to the streets, AI's reaching high. New tech emerging, watching surgeons fly. From the labs to the streets, AI's reaching high. Algorithms shaping, but the future sees. Tune in, tune in, get the latest with ease. Last week in AI, come and take a ride. Get the lowdown on tech and let it slide. Last week in AI, come and take a ride. From the lab to the streets, AI's reaching high. From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten. On the edge of change, with excitement we're smitten. From machine learning marvels to coding kings. Futures unfolding, see what it brings.
