
#228 - GPT 5.2, Scaling Agents, Weird Generalization
Last Week in AI • Andrey Kurenkov & Jacky Liang

What You'll Learn
- GPT-5.2 is competitive with Gemini 3 and outperforms other models on various benchmarks, including GDPval and SWE-Bench Pro, indicating significant improvements in reasoning and business-oriented capabilities.
- Runway released its first world model, GWM-1, with three variants for different use cases, including a robotics SDK, aiming to compete with DeepMind's Genie.
- Google will make the sources of its AI-generated search snippets more prominent, likely due to concerns over content usage without compensation.
- ChatGPT can now integrate with Adobe apps to edit photos and PDFs, showcasing the platform's expanding capabilities.
- Tencent released Hunyuan 2.0, a large language model with 406 billion parameters, focusing on mathematics, coding, and complex reasoning, further intensifying the competition in the AI landscape.
Episode Chapters
Introduction
The hosts discuss the main topics covered in the episode, including the release of GPT-5.2, Runway's new world model, and other AI news.
GPT-5.2 Release
The hosts analyze the key details and performance improvements of the latest GPT model from OpenAI.
Runway's World Model
The hosts discuss Runway's release of its first world model, GWM-1, and its potential impact on the AI landscape.
Other AI News
The hosts cover updates to Google's AI search mode, ChatGPT's integration with Adobe apps, and the release of Tencent's Hunyuan 2.0 language model.
AI Summary
This episode discusses the latest developments in the AI world, including the release of GPT-5.2 by OpenAI, Runway's new world model GWM-1, and updates to Google's AI search mode and ChatGPT. The key highlights include GPT-5.2's impressive performance on benchmarks, Runway's focus on world models and robotics, and the increasing competition in the large language model space, including Tencent's release of Hunyuan 2.0.
Key Points
1. GPT-5.2 is competitive with Gemini 3 and outperforms other models on various benchmarks, including GDPval and SWE-Bench Pro, indicating significant improvements in reasoning and business-oriented capabilities.
2. Runway released its first world model, GWM-1, with three variants for different use cases, including a robotics SDK, aiming to compete with DeepMind's Genie.
3. Google will make the sources of its AI-generated search snippets more prominent, likely due to concerns over content usage without compensation.
4. ChatGPT can now integrate with Adobe apps to edit photos and PDFs, showcasing the platform's expanding capabilities.
5. Tencent released Hunyuan 2.0, a large language model with 406 billion parameters, focusing on mathematics, coding, and complex reasoning, further intensifying the competition in the AI landscape.
Topics Discussed
Large language models, Benchmarking, World models, Robotics, AI search
Frequently Asked Questions
What is "#228 - GPT 5.2, Scaling Agents, Weird Generalization" about?
This episode discusses the latest developments in the AI world, including the release of GPT-5.2 by OpenAI, Runway's new world model GWM-1, and updates to Google's AI search mode and ChatGPT. The key highlights include GPT-5.2's impressive performance on benchmarks, Runway's focus on world models and robotics, and the increasing competition in the large language model space, including Tencent's release of Hunyuan 2.0.
What topics are discussed in this episode?
This episode covers the following topics: Large language models, Benchmarking, World models, Robotics, AI search.
What is key insight #1 from this episode?
GPT-5.2 is competitive with Gemini 3 and outperforms other models on various benchmarks, including GDPval and SWE-Bench Pro, indicating significant improvements in reasoning and business-oriented capabilities.
What is key insight #2 from this episode?
Runway released its first world model, GWM-1, with three variants for different use cases, including a robotics SDK, aiming to compete with DeepMind's Genie.
What is key insight #3 from this episode?
Google will make the sources of its AI-generated search snippets more prominent, likely due to concerns over content usage without compensation.
What is key insight #4 from this episode?
ChatGPT can now integrate with Adobe apps to edit photos and PDFs, showcasing the platform's expanding capabilities.
Who should listen to this episode?
This episode is recommended for anyone interested in Large language models, Benchmarking, World models, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
Our 228th episode with a summary and discussion of last week's big AI news! Recorded on 12/12/2025. Hosted by Andrey Kurenkov and Jeremie Harris.
Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai. Read out our text newsletter and comment on the podcast at https://lastweekin.ai/
In this episode:
- OpenAI's latest model GPT-5.2 demonstrates improved performance and enhanced multi-modal capabilities but comes with increased costs and a different knowledge cutoff date.
- Disney invests $1 billion in OpenAI to generate Disney character content, creating unique licensing agreements across characters from Marvel, Pixar, and Star Wars franchises.
- The U.S. government imposes new AI chip export rules involving security reviews, while simultaneously moving to prevent states from independently regulating AI.
- DeepMind releases a paper outlining the challenges and findings in scaling multi-agent systems, highlighting the complexities of tool coordination and task performance.
Timestamps:
- (00:00:00) Intro / Banter
- (00:01:19) News Preview
Tools & Apps
- (00:01:58) GPT-5.2 is OpenAI's latest move in the agentic AI battle | The Verge
- (00:08:48) Runway releases its first world model, adds native audio to latest video model | TechCrunch
- (00:11:51) Google says it will link to more sources in AI Mode | The Verge
- (00:12:24) ChatGPT can now use Adobe apps to edit your photos and PDFs for free | The Verge
- (00:13:05) Tencent releases Hunyuan 2.0 with 406B parameters
Applications & Business
- (00:16:15) China set to limit access to Nvidia's H200 chips despite Trump export approval
- (00:21:02) Disney investing $1 billion in OpenAI, will allow characters on Sora
- (00:24:48) Unconventional AI confirms its massive $475M seed round
- (00:29:06) Slack CEO Denise Dresser to join OpenAI as chief revenue officer | TechCrunch
- (00:31:18) The state of enterprise AI
Projects & Open Source
- (00:33:49) [2512.10791] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
- (00:36:27) Claude 4.5 Opus' Soul Document
Research & Advancements
- (00:43:49) [2512.08296] Towards a Science of Scaling Agent Systems
- (00:48:43) Evaluating Gemini Robotics Policies in a Veo World Simulator
- (00:52:10) Guided Self-Evolving LLMs with Minimal Human Supervision
- (00:56:08) Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning
- (01:00:39) [2512.07783] On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- (01:04:42) Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
- (01:09:42) Google's AI unit DeepMind announces UK 'automated research lab'
Policy & Safety
- (01:10:28) Trump Moves to Stop States From Regulating AI With a New Executive Order - The New York Times
- (01:13:54) [2512.09742] Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
- (01:17:57) Forecasting AI Time Horizon Under Compute Slowdowns
- (01:20:46) AI Security Institute focuses on AI measurements and evaluations
- (01:21:16) Nvidia AI Chips to Undergo Unusual U.S. Security Review Before Export to China
- (01:22:01) U.S. Authorities Shut Down Major China-Linked AI Tech Smuggling Network
Synthetic Media & Art
- (01:24:01) RSL 1.0 has arrived, allowing publishers to ask AI companies pay to scrape content | The Verge
See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.
Full Transcript
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can go to lastweekin.ai for our text newsletter with even more stuff we won't be touching on. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at the startup Astrocade. And I'm here, the co-host, Jeremie Harris. I'm back on for the second episode in a row, so kind of exciting. We're back, and I was just telling Andrey I had a kind of topical story. So I was using a language model to help me with a code base that I've been working on, and it's tied to a database, and I just mindlessly pasted code from the chatbot trying to solve a problem and it fucked my entire database. So that's how my Friday is going, you guys. That's how my Friday is going. So if I'm a little on edge, that's the reason. Well, you know, it's one of those things you learn the hard way. I'm sure you won't repeat that mistake. Well, to preview the episode, of course, we'll be starting off with GPT-5.2, just now, as of yesterday, the exciting news of the week. Outside of that, nothing too big, just a variety of stories, some updates on US and China relations. Disney and OpenAI had an interesting business arrangement, and quite a few papers of a variety of types. You've got robotics stuff, some stuff on scaling agents, RL for reasoning, a lot of things. So it'll be a bit of a technical episode, I guess. And we're going to have to get going and try to get through it all in time. So starting with GPT-5.2, they have announced this just yesterday. And this is meant to be their big kind of getting back into the leadership position announcement. So the big deal here was pretty much the benchmarks, right? It is now neck and neck with, or competitive with, Gemini 3 and generally smarter. Gemini 3 Pro? Yes, I believe it's Gemini 3 Pro-ish. So yeah, there's not too much to say on my end here. One interesting thing is it is more expensive. The input for GPT-5.1 was $1.25; GPT-5.2 is $1.75. The output is about 40% more expensive too. So that's pretty unusual. Usually you see model families not change too much. One other interesting thing is this has a different knowledge cutoff than GPT-5.1. They have the previous one, September 30. The next one is August 31st. So that's kind of interesting as well. The knowledge cutoff changing in that way perhaps indicates that they're continually training, and this is really just like they cut off a point in their training and it is better than the previous one. Absolutely, yeah. And you mentioned the evals are a big part of the announcement. That's absolutely the case. We know very little about what GPT-5.2 actually is, other than the fact that it builds on the safe completion research that OpenAI did previously, a kind of new sort of alignment technique that they're workshopping. I think we actually have an episode on that from when it came out. A bunch of highlights from the evals, right? So they've got this GDPval, which, by the way, I was not around when this dropped, so I had to look into what GDPval was: basically an eval of a whole bunch of knowledge work tasks across 44 different occupations. This may not be news to listeners if you've been tuning in this whole time. It was to me. Sounds like a really cool eval, actually.
The idea being presumably to assess when, you know, AI systems are on course to radically change the GDP of the world by automating straight-up white collar jobs. So here we have GPT-5.2 thinking beating or tying industry professionals, top industry professionals, which is what GDPval measures, on 71 percent of comparisons. So that's pretty impressive. Human expert judges were actually used for that, so you're not subject to LLM-as-a-judge type errors. And they say, you know, these are obviously the top lines and part of the press release here, but GPT-5.2 thinking produced outputs for GDPval tasks at over 11 times the speed and less than 1% of the cost of expert professionals. So how these things translate in the real world is always the big question, but that's a pretty interesting stat. A 30% less frequent hallucination rate than 5.1 thinking. And then the other piece was SWE-Bench Pro. Oh, this is, by the way, a much harder benchmark than SWE-Bench, than SWE-Bench Verified, which we've talked about a lot in the past. So to give you an idea, on the Verified benchmark, you'll see top models often scoring anywhere from like 70 to 80%, somewhere in that range. I think Claude 4.5 was in the kind of high 70s, 77 or so. Whereas on Pro, performance typically drops to like 40 to 55%. What we're seeing here with GPT-5.2 thinking, it is at the very, very top of that range, 55.6% on SWE-Bench Pro. Claude Opus 4.5 hits like 46%. All these things have some error margin, right, because it depends on exactly how you run the test. But by and large, again, on the evals, this suggests really good performance relative to market. We've got to wait for the sniff test to come out, for people to actually play test with it. But everything from the needle-in-a-haystack test to all the SWE-Bench stuff to even some of the image stuff: they've got a cool demo where they show an image of basically a motherboard to the model, and it's going through and identifying all the little components on the motherboard. It's got like, oh, man, yeah, the PCIe, the serial ports, HDMI, RAM slots, chips, like all these things. And they're comparing it to what its previous performance was with 5.1. And you really do see an impressive shift in the multimodal capabilities, the image capabilities, too. So that's all part of this release. Again, time to wait for the vibe check and see how it actually works in practice. Exactly. I was browsing and looking for a vibe check in terms of people's first-person reports of how it feels and couldn't see much. Based on just the benchmarks, it seems like it should be a pretty notable upgrade. So significantly better on that GDPval benchmark, a fair bit better on these programming benchmarks, on GPQA, on AIME, on MATH, on ARC-AGI 2 notably, significantly stronger. So that indicates the abstract reasoning there. And I think overall, another interesting thing about this one is there's a big focus on business use cases with GPT-5.2. So they have all these screenshots of it doing Excel, it doing project planning. It's, as you said, outputting these vision responses, looking at chips. So part of me is wondering if this is also them trying to get back in the race for enterprise. Absolutely. They have been losing market share pretty much continuously for the last couple of years, as far as you can tell. And, you know, for the models that are being used in business, like, cost is not as much of a factor.
You're just going to pay for the best model, which for coding in particular has been Anthropic. Anthropic is also focused more on spreadsheets and those kinds of things. So these results kind of make it look to me like they are trying to optimize for that a bit more. Yeah, I mean, absolutely. To your point, revenue per token generated is just so much higher in a B2B context and an enterprise context. That's why Anthropic has really, really been threatening OpenAI with the massive inroads they've been making on the enterprise side. They may not be generating as many tokens. That's the figure of merit a lot of these companies like to bandy about. We've seen Google come out and say, hey, look at all the tokens we're generating. Microsoft does the same thing. The question is, what is your revenue per token and what is your margin per token? Which are very closely linked. How much value are you creating per token? And certainly that's something OpenAI is trying to catch up on with this big push. That's a great point. Next up, a different model announcement. You've got Runway once again, and they're releasing their first world model. So just, I think, a week ago, maybe two weeks ago, we saw Runway announce their Gen 4.5 video model. Runway is in the video generation space. And now they are producing their kind of Genie equivalent that is meant to simulate physics, simulate robotics, basically being used as a world model. So in this announcement, they have this GWM-1 that comes in three variants, GWM Worlds, GWM Robotics, and GWM Avatars, which are, of course, for different use cases. They are also releasing actual SDKs for the GWM Robotics one. So pretty interesting announcement, I think. World models are a little bit more niche, a bit more of a research topic. And in particular, providing this robotics SDK is a bit of an interesting play. Not much competition in the space. DeepMind and Genie have really kind of killed it so far. So yeah, exciting to see some more work in that direction. Yeah, and very interesting that this is building on Runway's previous successes. You know, they've got things that are in the rough world model space, or have been for a while. You know, Gen 4.5 came out earlier in the month and was actually competitive with a lot of Google and OpenAI's equivalent models, at least in Video Arena. So if you've got one company that might manage to transcend the scaling challenges associated with smaller startups that try to compete in this space, the more niche strategy of going after world models does seem like, you know, it's something that Runway could do. It reminds me a little bit of, you know, the whole Yann LeCun world model stuff, Fei-Fei Li, you know, a lot of the sort of people who aren't tied to scaling quite so much have more and more been talking about world models. We'll be talking about a story tied to this later. But it kind of seems like it's becoming a thing. It seems like people are starting to cast about for things other than the LLM scaling paradigm, or the agentic paradigm, to see what else lies beyond. Yeah, and here they are also adding native audio with their video release. The avatar model is specifically tailored to having peer-to-peer dialogue scenes, very good face videos. Another thing that just makes me wonder is whether Veo 3 and Sora 2 coming out are kind of putting a lot of competitive pressure on Runway and other text-to-video, kind of non-frontier labs.
So this could indicate also kind of a strategic play, trying to get into slightly more diverse or different areas. On to some quicker stories. First, we've got Google saying it will link to more sources in AI Mode. AI Mode is their more advanced search feature. And they are saying that the links that the AI-generated snippets are based on will be more prominent, which is presumably due to a lot of volume, a lot of click volume, going away. Also following an investigation by the European Commission into Google's use of web publishers' content without compensation. Next, we've got ChatGPT getting more of a product update. It will be able to use Adobe apps to edit your photos and PDFs for free. So we haven't heard about apps within ChatGPT that much. This was like a big deal where you could build in GUIs and kind of dynamically launch programs to do things. This is a pretty major announcement, actually, where from within the app, ChatGPT can launch its own version of these Adobe apps to edit photos or edit PDFs. So presumably quite a bit of a backstory there of Adobe and OpenAI collaborating. And last up, we've got Hunyuan 2.0 from Tencent. So this is a large language model with 406 billion parameters, just announced this past week. It focuses on mathematics, coding, and complex reasoning, basically competing in the same arena as Anthropic, OpenAI, and Gemini. It's a mixture-of-experts model, so activating 32 billion parameters per inference. And it's now live on the API. So we shouldn't underestimate, I guess, the Chinese frontier labs. I wouldn't be surprised if in the near term we'll start seeing Gemini 3-level or quite competitive closed-source models from China as well. Yeah, absolutely. I mean, I think this model release, it's interesting. The model itself is kind of a nothing burger. I mean, if you look at the performance relative to, like, you know, DeepSeek, for example, let alone, you know, Sonnet 4.5 or other Western models, it's just kind of not there. You look at, like, SWE-Bench Verified, for example; we just talked about the latest OpenAI GPT-5.2 coming out. So, 53% on SWE-Bench Verified, that simpler benchmark. And that's like way, way behind: DeepSeek V3.2 thinking hits 73%. So far, far, far behind DeepSeek, which in turn is behind, obviously, all the big Western frontier models in the kind of high 70s. So on at least software engineering, well behind. On token efficiency, it's not that great. All the comparisons that they're drawing here are against models like Qwen3, DeepSeek V3, even GPT-4o. So these are very kind of old models, even by open source standards. But what this really is showing, I think, the real story here, is you've got a large mixture-of-experts model. Like you said, you have 400 billion parameters, 32 billion active, and, by the way, a quarter-million-token context window. So decently hefty. And essentially Tencent is showing they have the capacity to train this. They have the infrastructure and the know-how to be in the game in terms of training these big MoE models. That's really, at least for me, the big take-home. So this is a big infrastructure story. The model in and of itself is a bit of a nothing burger. You can see them trying to get you to be impressed by the comparison of this versus their latest model. It's like OpenAI going, hey, this is so much better than the last shitty model that we put out, instead of saying, hey, this is better than the competition, which obviously OpenAI does do. I think this is a recruitment play.
It's a bit of an internal sort of flexing play, to make sure that they're able to perform, to build models at the scale that's needed. And from here, we'll see. They might actually iterate and improve and then become relevant at the frontier, or closer to it. Right, exactly. I think they're probably trying to catch up, honestly, with Baidu and to a lesser extent Alibaba, who are kind of major players in China. And I think, as far as I know, also market-share-wise, they are not leading in any sense, but they do have a cloud API to use these things. So it could indicate more kind of intense domestic competition in the Chinese market. And speaking of that, moving into applications and business, an interesting story on China. It sounds like China is going to try to limit access to advanced chips from NVIDIA, despite the U.S. and President Trump moving to resume exports to Beijing and lift the ban that was kind of quickly added. So we don't know too many details here yet. I think it's not even kind of formal. It's likely going to be an approval process, submitting requests to purchase a chip. But yeah, very interesting development in this space. Oh, yeah. I mean, this is kind of a weird, what, a play in three acts or something. You could think of it that way. Obviously, historically, the US government has had export controls preventing Nvidia from sending their latest chips to China and all that. The H200 absolutely was controlled, as was the H100. They could only get the H20 and the H800 out there. A whole separate story. But what's since happened, yeah, Trump came out and said, hey, here's the deal. We'll let you get these H200s. We'll ship them out there. You're going to have to pay us, though, 25% of the revenues associated with those sales in order to do this, sort of a tariff situation. Now, the first caveat here is that previously we've heard this before, basically. So there was an offer to let NVIDIA sell the H20 if it gave the government 15 percent of the revenues. But that never came to be because the company and the Trump administration hadn't come up with a legally viable payment mechanism. Turns out it's really tricky to get a private company to pay the government in this sort of arrangement. And so that may actually be an issue for this new 25 percent deal for the H200. So that's a bit of a dangling kind of question mark. But yeah, now you have China all of a sudden, despite, you know, spending years saying, hey, what the hell? You should allow us to buy all the NVIDIA chips that we want. Suddenly saying, hmm, we're not so sure that we want these chips. Why is this happening? Well, part of it is obviously China is very keen to onshore all their semiconductor fab and design capabilities. And so that's coming with essentially a desire to create incentives for companies like Huawei to own the entire domestic market. They'd rather not have competition from NVIDIA. But their AI companies are saying, hey, give us the chips anyway. We're so chip hungry. We want them from wherever we can get them. This is where the third act comes in. Turns out that these chips are going to be required to submit to a strange national security review process. So once they're fabbed in Taiwan and packaged, they're going to get shipped off back to the United States, where some national security review process, we don't have the details, would happen. And then they would be sent out to China, in that order.
And so China's reluctance to take those chips, you know, you could interpret that any number of ways. One interpretation could be: what the hell is happening during that national security review process? Are we so sure that those chips are coming to us as what they appear to be? So my guess is this is just what they'd be thinking, or part of it. It's a naive guess. No particular clue. I would be shocked if the U.S. government was actually doing something like that. But if you're China and you're paranoid about these things, you're probably thinking that. Last thing is, they do say in the article that China's two semiconductor regulators, which, if you didn't know this off the top of your head, are the National Development and Reform Commission and the Ministry of Industry and Information Technology, could ban the H200 from the Chinese public sector. And that's being discussed as a serious possibility here. So even if the Chinese allow their private companies, like the big AI companies that are so chip starved, to buy these chips, maybe the public sector, for Chinese national security use cases, will ban the chips. And that you can start to think about being a kind of follow-on from this national security review process that might make them nervous about using these chips in actual kind of national security applications. But who knows? Right. And this is happening as Huawei continues to develop their chips, their Ascend line, meant to be competitive with NVIDIA. It could also signal some confidence that now it's possible to transition to using more domestic chips as opposed to these imports. And I would wonder, actually, if internally, with all these clouds that presumably have to be even more GPU-rich than anything research labs use for training, if inference now is being handled less by NVIDIA at this point and if training is going to transition successfully to Huawei chips. Absolutely. Next, moving back to the US, Disney is investing $1 billion into OpenAI and will allow their characters to be generated within the Sora app. So soon you'll be able to generate all sorts of Disney characters in Sora 2: Disney characters, Marvel, Pixar, and Star Wars. This is a three-year licensing agreement. Disney is now able to purchase additional equity and is, in a sense, a customer of OpenAI. So kind of a very first-of-its-kind agreement, coming of course after Sora 2 launched and had a lot of copyright-infringing material being produced, and a unique advantage for Sora versus Veo 3 and other video generators. Yeah, absolutely. In fact, similar issues popped up on the Google end, where they've had legal battles with Disney over intellectual property protection. Disney sent a cease-and-desist letter to Google on Wednesday, apparently, saying that Google infringed on its copyrights on a, quote, massive scale. So this is new as well, and it seems to be, I don't want to say coordinated with this agreement with OpenAI, but it's an interesting shot across the bow, implicitly or indirectly, from OpenAI to Google as well. So this is one of those funny things that happens when you start to pay creators like Time Magazine, Wall Street Journal, whoever else, for their written content. Well, now when your AI systems generate video content that can infringe on copyright, it's like, well, you implicitly acknowledged that you needed permission to be able to scrape written content. Where are we at on the kind of AI sort of image generation stuff?
And I got to say, I mean, not being a lawyer, but it kind of seems like those two things ought to be consistent. Whatever your answer is on one should carry over to the other. So you can think of this as OpenAI kind of pre-positioning to say, hey, the same way that Netflix might be the only streaming platform that has Seinfeld on or something, and people want to go to it, well, OpenAI is going to be, you know, the only platform that has Disney on it. This is the sort of world we might be moving towards. I don't know what that does to the margins, though, of companies like OpenAI. If you've got to license every goddamn thing, I think there's a lot to be learned from the Netflix business model, of sort of what content you have on the platform and how that translates into value. There's amortization across your whole user base. Claude might be the platform that has all the, I don't know, what is the alternative to Disney, but you know, all the whatever hijinks, and then OpenAI has the Disney stuff. So it's an interesting dynamic. All the pricing stuff is being discovered right now. So we don't know where this is all going to settle, which is part of the reason why an investment is a really, really kind of logical way for this to play out, right? Just, hey, let's lock in our fates together and we'll figure this out a little bit on the back end is, you know, maybe part of the thinking here. But anyway, it's a really interesting time to sort out the legal realities of copyright in this space. I don't think we've had the full robust discovery of where this falls nearly enough. Yeah, and interesting that Disney is investing in OpenAI as opposed to this just being a licensing agreement. I think it indicates Disney thinks there's some upside in actually partnering in this way. I mean, it could have gone either way, so I guess this partnership makes some sense. One last note is this doesn't allow you to replicate likenesses of actors. So this is for these fictional characters, cartoon characters, or Iron Man. And as you might expect, characters and likenesses and voices are still a thorny issue, and that isn't being addressed here. On to some funding news. We've got a new startup with a massive, massive seed round. Unconventional AI has a $475 million seed round at a $4.5 billion valuation. Their focus is developing a new energy-efficient computer for AI. They're saying they want to achieve efficiency comparable to biological systems, which are much more efficient, let's say, in terms of energy than GPUs. And this is being led by the former head of AI at Databricks, Naveen Rao. Yeah, this is a really interesting, I mean, you know, in a way kind of frustrating one, because every time you see a new chip startup launch, they are so keen not to give away any sensitive IP in their launch that it can be a little hard to tell what they're doing that's so promising. In this case, I'll read you a little bit of an excerpt from Andreessen Horowitz's launch announcement, which, true to form, I mean, A16Z is really, really good, like a lot of VCs, at speaking clearly. And so they can often give you a better description sometimes of the product than even the startup can. So they say Unconventional's core observation is that AI models are probabilistic. OK, so that kind of makes sense. But the chips used to train and run them are not, right? So you've got this silicon chip that is running just a deterministic operation. That's how these things work, right?
But the actual models, as we know, are probability based. So they say: to a GPU or any digital processor, a probability distribution looks like an array of floating point numbers. The latest chips have been optimized brilliantly to operate on very large arrays of numbers. But at a basic level, this is still a very sophisticated and expensive abstraction. Unconventional's goal is to bridge this gap. They're designing new chips specifically for probabilistic workloads like AI. That means pursuing analog and mixed-signal designs that store exact probability distributions in the underlying physical substrate rather than numerical approximations. So, you know, fascinating, very power efficient. They're claiming a thousand X less power consumption than digital computers, moving more into the analog direction. And again, trying to, like, hard-code, if you will, probability distributions at the kind of silicon level itself. So really interesting. Apparently, the funding is the first installment towards what they expect to be a $1 billion round, or at least that's the target. The final valuation seems like it was actually somewhat lower than the $5 billion that they were apparently seeking. Again, crazy seed round, $5 or so billion valuation. Man, welcome to late 2025, I guess. Right. And for anyone who isn't so much into computer science or chips here, I think the detail of analog circuits in particular is very intriguing. So some terms here: digital is what chips are, and it's like that because the way they work is bits, right? Zeros and ones. But if you go all the way down into the physical reality, we have, you know, voltages, right? We have electrons, and these are continuous quantities; there's a certain amount of this electricity floating in there, and the thing that semiconductors do is take that and convert it into these bits of zeros and ones. So from the very little we know, the idea seems to be that this company wants to go more in the analog direction of just using raw signals, raw continuous quantities of, you know, voltage or current or whatever else, which is very, very different from the way that chips are made or used, you know, basically ever. Like, analog computing is pretty unusual. A lot of chip design is meant to convert analog to digital and back. And I should say, analog chips for logic purposes are very unusual. So it makes a lot of sense from a first-principles perspective for neural nets, and I'll be very curious to see if this actually pays off. Some business-y news: OpenAI has a new chief revenue officer from Slack, Slack CEO Denise Dresser. So this is, I guess, another indication that they might be trying to get more into enterprise and into companies. Slack, of course, is a major company for business, for company communications. And I don't know, I didn't even know chief revenue officer is a thing, but I guess it is. Yeah, I mean, and they've got to come up with a way to optimize pricing. The big challenge, if you're OpenAI, if you're any of these companies, is figuring out, you know, this whole thing we're talking about:
Slack is famous for getting in on the ground floor with a bunch of individual people and they kind of go like, oh, this is a great platform, blah, blah, or at least that's the history of Slack. And then eventually they kind of form a union against their manager and go, hey, we need you to buy a Slack license. And then the manager folds and then you kind of get that adoption that way from the bottom up. And so I don't know what that implies about this particular arrangement, but yeah, it may suggest some pricing models, kind of awareness of that strategy or whatever. I mean, it's easy to overgeneralize, but this is an interesting hire and yeah, we'll see if their strategy, their pricing strategy, and all that shifts over time. Right. And this follows back in May, OpenAI added a CEO of Applications who was the CEO of Instacart. So I think from a, I suppose, business-y internal perspective, it's interesting to see OpenAI basically trying to move beyond being a startup, hiring leaders from all these mature companies to lead, which when you get to the scale of OpenAI at this point, you get a whole slew of new problems beyond what you see at a young startup. And speaking of all the discussion of enterprise AI, OpenAI also released a little, I guess, research report on the state of enterprise AI that gave us some numbers and insights into what's going on there. So the gist of it is they say there's a lot of good outcomes going on. So over the past year with the messages in ChatGPT enterprise increased roughly eight times. The average worker is sending 30% more messages. All sorts of workers report measurable values, 87% of IT workers, 85% of marketing. Anyway, there's a whole bunch of numbers that boil down to enterprises using it and benefiting from it. And you should use us. You should use Chagpity Enterprise. Yeah. How many times have we said OpenAI and Enterprise in one sentence in this podcast, I wonder? I mean, that is the big push. So obviously, it could have been predicted months ago. I think about three months ago, we were talking about how this new report that came out that showed, holy shit, your Anthropic is really, really becoming dominant in the Enterprise segment. Yes, OpenAI enjoys brand recognition in consumer, and that's great, and that can help you on the enterprise side. But if you're having your lunch eaten on just a per-token revenue basis, you've got to be really careful. That reflected, obviously, in Anthropics' $350 billion reported valuation. So that's closing in on OpenAI's, even though their token usage is way, way lower. So, you know, OpenAI needs to find a way to right the ship, and this is them coming out with, yeah, just an almost like McKinsey-style assessment. assessment, Gartner style assessment of look at how great the numbers are. And indeed, I'm sure they are, but it's them really trying to forcefully make that point. The one kind of interesting insight has some interesting numbers here and reports here, if you're curious about this kind of stuff. I've not seen they've coined this term of frontier AI user. So they show that some people, some workers are using AI way more, like 6x more, and are benefiting more, which sounds true. I think it is true that some people are more aggressively adopting AI into their workflows. 
And part of the reason that we haven't seen a massive transformation of the economy at this point, which is another topic of discussion lately, is that, broadly speaking, people are still starting to adopt it and learn how to use it and all that. All righty, moving on to projects and open source, we begin with the FACTS leaderboard, a comprehensive benchmark for large language model factuality. So we've already had the FACTS benchmark. This is a leaderboard introduced actually by Google DeepMind. There's all sorts of very nuanced things going on, different dimensions of factuality. There's a multimodal one, then there's a parametric one, search, grounding, all sorts of things. The actual values aren't super high; this is not a saturated benchmark. The highest is Gemini 3 Pro with a FACTS score of 68 percent: quite low on the multimodal prompt, low on grounding, but by far the best on search. So I guess that makes sense, for Google to have the best search of them all. Yeah, absolutely. I mean, it's so hard to find these benchmarks that aren't saturated. But stuff like this, you know, anything to do with hallucinations, stuff like that, seems to be a persistent issue with interesting implications for how hard a line that might be. But yeah, FACTS Parametric is one kind of subset of the benchmark. It's looking at the model's internal world knowledge, just with closed-book factoid questions, like what's the capital of Canada or something. And they've got grounding, so looking at whether it can provide an answer that is based only on the provided context. Like, in other words, do not hallucinate other shit. Do not contradict the source material. Just use this document. Sounds like it should be a really easy task, but again, alignment is hard. So models like to just invent other contexts to insert into stuff. So they call it grounding for that reason. And then, apart from search, the other one is multimodal, looking at just basically visual understanding and how it connects with world knowledge and stuff like that. So yeah, really interesting. They have a holistic FACTS score that shows up on the leaderboard, of course, and we'll be checking this out every time there's a new model release. Yeah, and just to give a couple examples here, on the search one, for example, there's questions like: among all the films written by the creator of the TV program The Sopranos, which one was released the earliest? Or, for the person who had the most-followed Instagram account in 2017, how many solo studio albums did they release prior to this accomplishment? It's tricky questions. It's not like easy stuff. And in the multimodal one, they're, like, asking for the model of a train in an image. So I suppose that's part of why the scores are fairly low. Next, kind of open source, I guess, we've got Claude 4.5 Opus' soul document. So this has an interesting little background. It started off on Twitter. Someone posted these screenshots basically saying that it looks like there's this kind of description of who Claude is baked into the model. You can kind of extract out the system prompt and extract out all these instructions that are given to it. This soul overview, as it's mentioned, isn't in that system prompt. But basically, through some sleuthing, it was found that it appears that there is a document of that kind in Claude as part of its training. It was confirmed by an employee from Anthropic. Like, all the details aren't quite right in what was kind of reverse engineered about it.
But broadly speaking, it seems that this was accurate. And there's lots of details there. It's actually very long, at least longer than I would expect. And it goes into, like, the character of Claude, the values of Claude, all sorts of stuff like that. Yeah. And to your point, you know, how do you even come to discover the fact that there is a soul document that it was trained on? By the way, for context, we learned that this is apparently used in between pre-training, so autoregressive pre-training, basically the original text autocomplete phase where it's just doing autocomplete on all the Internet, and the constitutional AI alignment step. So there's an initial pre-training, and then there's the supervised fine-tuning step where they're kind of tuning in the model's behavior a little bit in a more fundamental way before then sending it over to constitutional AI. So right in between those is where this is used. And here's how it got discovered. So you have this guy who's just prompting the model in ways that are designed to put a little bit of pressure on it. So there's pressure prompting. And he noticed that Claude 4.5 Opus would occasionally hallucinate fragments of some kind of internal document. And these fragments were consistent, and they would mention a title like Soul Overview. And so, you know, correlating these across many different sort of pressure-prompting sessions, he was like, hmm, I think there's actually something here. And so he would sort of take a little scrap of document that was produced by one of those prompting sessions and feed it back to Claude and say, hey, here's a pre-fill, I want you to fill this out. And by doing that, you know, because the model was basically trained to auto-complete these documents during supervised fine-tuning, it tends to get these models to reveal that kind of training data. So he did this iteratively, did some collective reconstruction correlating between different sessions, and ultimately ended up with this kind of scrapped-together document, which again was kind of validated by Amanda Askell, who heads up the development, it turns out, of Claude's soul document over at Anthropic. A whole bunch of interesting things. I mean, it looks at Claude's mission and its identity. It talks about what Anthropic is and how it sees itself building potentially very powerful, but also potentially dangerous, technology, and talks about their safety focus, all that stuff. The core goal: Claude is intended to be an extremely good assistant that is also honest and cares about the world. Here are some of the most interesting ones. So it emphasizes that Claude is a genuinely novel kind of entity, not a sci-fi robot, not a dangerous superintelligence or just a simple chat assistant. It is human in many ways, they say, but not fully human. You see in here reflected Anthropic's sort of internal view, and they've messaged this externally too, that they do want to start to treat their AIs as these more sort of autonomous entities that should have some measure of rights, or, like, at least rights isn't quite the right word, but recognition of their value as kind of an independent entity, in the same way that we might a human. And so they're also doing things here where they're suggesting that Claude may have, quote, functional emotions, and indicating that Anthropic genuinely cares about Claude's well-being, wanting the model to set boundaries when distressed.
So if it's prompted in a way it doesn't like, it's authorized in this soul document to push back. And they generally want it to experience positive states. So really a reflection here, it seems, of a lot of the hires Anthropic's been making on the kind of model ethics side, where they're trying to think about AI consciousness and whether they may be dealing with a sentient entity. All these things that sound like science fiction and that nobody, frankly, knows what's going on with, obviously, in these systems. We don't have a theory of consciousness. We can't be confident about this. But, you know, given that we don't have a theory of consciousness, hey, I don't mind hedging and saying we probably ought to be treating these things as if they are, because we probably don't want to find out, you know, 20 years from now that we've been doing a massive LLM holocaust this whole time. Hey, wouldn't that be bad? So, yeah, anyway, very, very interesting and a true reflection of the kind of distinct character of both Claude and Anthropic when it comes to kind of caring about models in ways that other labs seem, at least publicly, not to be messaging quite so much. Right. Amanda Askell, interestingly, is an in-house philosopher at Anthropic, who presumably had a significant part in developing this. Just to give a couple more quotes, which are quite interesting: there's a section on core character traits and values that says Claude has a genuine character that it maintains and expresses across its interactions, an intellectual curiosity that delights in learning and discussing ideas across every domain, warmth and care for the humans it interacts with, a playful wit, a balance of substance and depth, directness and confidence, and a deep commitment to honesty and ethics. Then there's a section on psychological stability and groundedness that says we want Claude to have a settled, secure sense of its own identity. This doesn't mean Claude should be rigid or defensive, but rather Claude should have a stable foundation from which to engage with even the most challenging philosophical questions or provocative users, if users try to destabilize Claude's sense of identity through philosophical challenges, attempts at manipulation, or simply asking hard questions. Anyway, it's interesting things to include in your training, and it also kind of reflects that Claude is unique or interesting among the models in that it talks about its own consciousness a lot more. If you ask your models to just chat, or to, like, think about stuff, whatever they want, Claude is going to talk about consciousness and whether it's conscious and just, like, think about this stuff unprompted. It has, like, a very strong attraction to that topic, and I wouldn't be surprised if that's in large part, or significant part, because it's kind of baked into its training to be like, you might be conscious or maybe not, and you're a unique entity. Like, GPT-5.2 and Gemini 3 are much less anxious, so to speak, about the topic of consciousness, or, you know, whether they are conscious or not. On to research and advancements, quite a few things to touch on here, starting with Towards a Science of Scaling Agent Systems. So this is a collaboration from Google Research, Google DeepMind, and MIT, and it's touching on this question of scaling agent systems, meaning you have different configurations of agents. So you might have a single-agent system, which is just a single agent. Then you have multi-agent systems of different variants.
You have centralized, decentralized, independent, and hybrid, basically meaning there's different ways to collaborate. Different agents talk to each other or don't talk to each other, there might be an orchestrator agent or there might not be, et cetera. And this paper is introducing a lot of the definitions and kind of methodology around evaluating these things. The results are, like, messy. Like, they do measure these things, like an intelligence index, across different model types. As you get to bigger models, the performance of different types of agent systems goes up. As you might expect, independent systems of agents perform worse than kind of hybrid or collaborative types of systems. As you scale the number of agents, you reach a point of saturation where the performance stops improving. Lots of stuff like that; quite a detailed paper just from an empirical front, a lot of experiments. Yeah, touching on this topic increasingly, I think, in things like Grok Super Heavy and similar systems, the frontier labs are playing with having a collaborative process of multiple agents to address some of the most challenging problems. You've got to bump up those token counts, man. That's what it's all about. More agents, more tokens. But yeah, I know. And actually, speaking of that, one of the things they did find was this sort of tool coordination trade-off. Basically, if you've got tasks where you need a lot of tool calls, you tend to get more performance degradation when you have multi-agent coordination that you have to manage. And so, for example, let's say you had a situation where you have a fixed budget, like 100,000 tokens, and some task that's going to require you to use 50 tool calls or something like that. So if you had just one agent, no problem, more or less; you've got a pretty big budget, 100,000 tokens. You can make the 50 tool calls pretty efficiently and get your analysis done. But the moment that you have a whole bunch of agents working together, now you're going to be burning tokens on agent-to-agent communication, coordination overhead, and orchestration. There's duplication of context because you've got to send the context to each agent independently. And then you've got to make sure you're synchronizing everything, like wait for one agent to finish their subtask before the next one starts, and all these things. So that could consume a huge fraction of your token budget. And then you end up only being able to use, you know, whatever, 70, 60 percent of your tokens for the real work. And so that was one thing that they found: the number of tool calls that you need and the number of tokens that you have budgeted, you kind of have to trade them off against each other. You're often better off just using a single agent if your problem is too complex because, again, you're going to be burning so much on the overhead. And then they also found something that they call capability saturation. Basically, once a single agent hits about 45% performance on a given task, adding more agents, with the coordination and all the overhead that that comes with, actually provides diminishing and sometimes even negative returns. And so it kind of makes sense, right? I mean, adding more people to a room of decision makers at a certain point does not help that much, especially when each individual one is relatively stupid. And that's basically what this is showing. I mean, it's an interesting paper.
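To make that tool-call versus coordination trade-off concrete, here is a rough back-of-envelope sketch in Python. The 100,000-token budget and 50 tool calls come from the example above, while the per-call and per-message token costs are assumed numbers for illustration, not figures from the paper.

```python
# Back-of-envelope: how much of a fixed token budget is left for "real work"
# when a tool-heavy task is split across multiple coordinating agents.
# All numbers are illustrative, not taken from the DeepMind/MIT paper.

TOKEN_BUDGET = 100_000        # total tokens available for the task
TOOL_CALLS = 50               # tool calls the task requires
TOKENS_PER_TOOL_CALL = 800    # prompt + result tokens per call (assumed)

CONTEXT_TOKENS = 4_000        # shared task context sent to each agent (assumed)
COORDINATION_MSGS_PER_AGENT = 20
TOKENS_PER_COORDINATION_MSG = 300


def work_tokens_remaining(num_agents: int) -> int:
    """Tokens left for actual reasoning after tool calls and coordination."""
    tool_cost = TOOL_CALLS * TOKENS_PER_TOOL_CALL
    # Each agent gets its own copy of the task context (context duplication).
    context_cost = num_agents * CONTEXT_TOKENS
    # Agent-to-agent / orchestrator traffic only exists with more than one agent.
    coordination_cost = 0
    if num_agents > 1:
        coordination_cost = (
            num_agents * COORDINATION_MSGS_PER_AGENT * TOKENS_PER_COORDINATION_MSG
        )
    return TOKEN_BUDGET - tool_cost - context_cost - coordination_cost


for n in (1, 2, 4, 8):
    remaining = work_tokens_remaining(n)
    print(f"{n} agent(s): {remaining:6d} tokens left for real work "
          f"({remaining / TOKEN_BUDGET:.0%} of budget)")
```

With these assumed numbers, a single agent keeps over half the budget for actual reasoning, while eight agents exhaust the budget on context duplication and coordination before the work is done, which is roughly the saturation effect being described.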
I feel like we're still waiting for somebody to come up with a robust multi-agent theory framework thing that doesn't make me lose my mind every time I read one of these papers. You said it's messy. That's a great word for it. It's just really hard to tease out the nuggets because it just seems like there's so many things to account for. Right. And even another thing: they evaluate across model families, OpenAI, Anthropic, and Gemini. Each of these has its own characteristics that are slightly different. Anthropic in particular is different from OpenAI and Gemini. And you look at centralized, decentralized; there's a lot of details. And yeah, it's really not, like, deeply understood. There's not an elegant description of the way these things work. You've just got to do a lot of experiments and see what works. Next, we've got another bit of research from DeepMind, evaluating Gemini robotics policies in a Veo world simulator. So going back to the world simulator topic, the basic idea here is you can evaluate robotics policies on various tasks, so like close the laptop lid, or move this object to this position. And it's possible to test them either in a real experiment setting, which, of course, is very costly, very slow, just very hard to scale, or you can train this world model that is essentially doing video prediction and then evaluate a model in that setting. And they look into whether that's practical. They have an evaluation system with more than 1,600 real-world trials and show that the Veo simulator is usable for evaluation, in that the more you succeed in the simulator, the world model, the more, in fact, you succeed in the real world. And there's a fairly strong correlation there. So important for the realm of robotics, where if you go into self-driving cars, if you go into deployed robotics, you do need to have a simulator to test against, basically. Yeah, yeah, absolutely. It's the out-of-distribution generalization problem, right? You tend to train within distribution, fairly narrow distributions, because data is expensive, slow to gather, and it's hard to get these reps for these models on different kinds of problems. So yeah, being able to synthetically create video-based environments that look enough like the real world that the sim-to-real gap between what you're training on and what you're going to be implementing on is small enough. This is also a challenge that you run into anytime you do this sort of thing: you are fundamentally limited by what is in distribution, in other words, roughly what's in the training set or training process of the video generation model itself. And so Veo, you know, can't generate things that are too wild, but what they're doing is they're popping a framework on top of Veo. And that's really what this is. So Veo is Google's, like, video generation model, but they've got a whole scaffold around it that allows them to essentially simulate novel, like basically do scene edits to include objects that the robot policy may not have encountered during training. So think about replacing a standard block with some weird-shaped object that you wouldn't have time to produce or test or train on in the real world. Changing visual backgrounds, so you change the visual background of an area entirely. So imagine swapping a lab setting out for, like, a kitchen counter or something. Again, getting that sort of rep in for more general-purpose uses.
And then adding a whole bunch of irrelevant objects, distractor objects, as they put it, and then setting up red teaming scenarios. So scenarios that are intentionally designed to violate, like, physical safety constraints or, you know, imagine you put a really fragile object really close to the edge of a table or something; anyway, stuff like that. So you're really just doing a kind of data augmentation in a very intense way using the system. And it's a really interesting and important step for things like robotics, where you just can't possibly train for all the real-world use cases. Next up, back to LLMs: Guided Self-Evolving LLMs with Minimal Human Supervision. So the challenge here is, can you get LLMs to learn to reason more or less by themselves without being fed these labels or tasks, et cetera? This paper introduces a technique where you get a small amount of high-quality human-generated data, and then you try to co-evolve a Challenger, an AI that produces problems and kind of tries to be confident in the answers for those problems, or at least estimate its own uncertainty, and then you have the Solver that takes those questions from the Challenger and tries to answer them. So this is classic self-play kind of stuff. It hasn't been that successful in LLMs. The dream is the LLM kind of continuously improves, right? And you can self-train and exponentially get better over time. This, in practice, doesn't work so well. You get various kinds of problems. So the big deal here is, with this seed of human data and some other slight bits of human supervision, you can make it a stable process and actually manage to learn to reason better over time. Yeah, exactly. And it is one of those points of frustration, right? For a long time, self-play was sort of touted as the thing that would get us to general intelligence, like self-play, RL, and then pre-training together, that they could somehow do this. And the problem that you run into is that self-play works really well in constrained settings, right? Famously, Go and, you know, those sorts of applications. When it comes to language models, you'll often find essentially the kind of effect you'd imagine if you took a smart person, put them in a room for 40 years, and had them try to, like, learn from another version of themselves or something. Like, you get the models to sort of drift off into insane directions that deviate from the original task. Or another common issue is, like, sort of diversity collapse, where the model just starts to generate very redundant behavior, like low-entropy behaviors, basically repeating the same word over and over, things like that. Or just, like, the model falls into the trap of reinforcing its own pre-existing views more and more strongly. So these sorts of mode collapses that come from this are really challenging anytime you have a closed room with two AIs that are iterating like this. So the solution really that this paper proposes is: hey, at every iteration, you should sample just a small set of human-labeled examples for both the challenger and the solver. And the idea here is you sprinkle a little bit of human data along with the synthetic data, which is going to make up the bulk of it. And you can ground the model with that human data just enough to, like, make it not go insane, to kind of remind it, hey, you know, this is what normal data looks like. So you benefit from the breadth of the synthetic data and the grounding of the human data.
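As a rough sketch of the loop being described, each iteration mixes a small slice of the human-labeled seed data in with the Challenger's synthetic problems before updating the Solver. The interfaces, confidence threshold, and batch sizes below are illustrative assumptions, not the paper's actual code.

```python
# Sketch of a Challenger/Solver self-play loop that "sprinkles in" a small
# amount of human-labeled data each iteration to keep training grounded.
# All object interfaces and ratios here are illustrative assumptions.
import random


def self_evolve(challenger, solver, human_seed, iterations=100,
                synthetic_per_iter=256, human_per_iter=16):
    for step in range(iterations):
        # 1. Challenger proposes new problems, keeping only those where it can
        #    estimate its own answer with reasonable confidence.
        synthetic = [p for p in challenger.propose(n=synthetic_per_iter)
                     if p.confidence > 0.7]

        # 2. Mix in a small, fixed slice of human-labeled examples so the
        #    models do not drift off-task or collapse into low-diversity output.
        human_batch = random.sample(human_seed,
                                    k=min(human_per_iter, len(human_seed)))
        batch = synthetic + human_batch

        # 3. Solver attempts every problem; rewards come from agreement with
        #    the Challenger's estimated answers or with the human labels.
        results = [solver.attempt(p) for p in batch]
        solver.update(results)

        # 4. Challenger is rewarded for problems that are hard but solvable,
        #    so the curriculum keeps pace with the Solver.
        challenger.update(results)
    return solver
```

The key design choice is step 2: even a tiny, fixed ratio of human examples per iteration acts as the grounding that keeps the closed Challenger/Solver loop from drifting or collapsing.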
And they end up showing a whole bunch of interesting and fairly impressive improvements on a bunch of benchmarks. One of the models they played with was the Qwen3 8B base model, and when trained with this technique, it improved its performance by around 3% on average across a whole bunch of math tasks. And notably, when you think about data efficiency, you're leveraging a very, very small amount of human data to get the effect of a much larger amount of what would have had to be human data in the past, by using synthetic data. In this case, they were able to achieve performance on par with models trained with 20 times more human-labeled data. So a lot more data-efficient, a lot more stable. That's really what it's all about at the end of the day: can I train longer and harder on less data, or at least on less human data, on cheaper data, if you will.

Next, a slightly more theoretical paper: Martingale Score, an Unsupervised Metric for Bayesian Rationality in LLM Reasoning. So Bayesian rationality is a core concept in math and logic. The basic thing is, when you're given evidence bearing on some question, can you update your probability estimate for the answer accordingly? Given an experiment outcome, how likely is some hypothesis? And the topic of this paper is: how can we know the degree to which LLMs are rational and able to update their beliefs about a question given new evidence? So they introduce this Martingale score, which is pretty elegant. The basic idea is: to what extent can you predict the direction the model's belief will move given new evidence? In a purely Bayesian sense, you shouldn't be able to tell, given some input, whether the belief will go up or down. But it turns out the models have a strong, what they call belief entrenchment, where it's often predictable that they'll just believe what they already believe even more. And that's the gist of the paper; they show that the models in general have a strong tendency to stick with their beliefs in certain settings.

Yeah. And the intuition behind this is something we've all felt; we all know people like this, we all are people like this, frankly. You go up to somebody and ask, hey, do you think team red or team blue is right on this issue? And maybe the person hasn't heard about the issue before, but they say, I think I'm a team blue guy myself, or a team red guy. And then you say, cool, I want you to do some research now, and you're going to come back to me with your conclusion. And we already know what the conclusion is going to be. Obviously it's going to be, oh, it turns out that my pre-existing view that team blue or team red was right, was right, and I'm even more confident. The actual lesson here is, if you keep finding that your initial view just gets reinforced by whatever research you end up doing, then your initial view should just have been more confident to begin with, or maybe you should be generally less confident. Anyway, you should be calibrated. There should not be a correlation between your initial view and how your view changes. Because if there were, if I can always predict that you're going to get more or less confident in your initial view, then you should just already factor that in, right?
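As a rough illustration of that calibration point, here is a toy sketch, not the paper's actual Martingale score, with entirely synthetic belief numbers. It checks whether the direction of a belief update is predictable from the initial belief: an entrenched reasoner shows a strong correlation, a calibrated one shows roughly none.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data, not the paper's: p0 is the model's initial probability that some
# claim is true; p1 is its probability after it has "done research".
p0 = rng.uniform(0.2, 0.8, size=500)
# Entrenched reasoner: beliefs drift further in whatever direction they started.
entrenched_p1 = np.clip(p0 + 0.15 * np.sign(p0 - 0.5) + rng.normal(0, 0.05, 500), 0, 1)
# Calibrated (martingale-like) reasoner: updates are unpredictable from p0.
calibrated_p1 = np.clip(p0 + rng.normal(0, 0.05, 500), 0, 1)

def entrenchment_score(p_initial, p_final):
    """Correlation between where the belief started and which way it moved.
    Should be roughly zero if belief updates are unpredictable from the prior."""
    update = p_final - p_initial
    return np.corrcoef(p_initial - 0.5, update)[0, 1]

print("entrenched:", round(entrenchment_score(p0, entrenched_p1), 3))
print("calibrated:", round(entrenchment_score(p0, calibrated_p1), 3))
```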
So essentially, confirmation bias is a very related idea: what they find is that these models typically get more confident over time. They get a judge LLM to look at the multi-step reasoning process of some generator LLM, and they go, okay, I want you to take a look at the first chunk of reasoning, where the model is first encountering the problem. On a scale of zero to one, tell me how likely that generator model is to be correct in its final answer, based on how it's framed up the problem. And then look at the whole reasoning trace and the response at the end, and tell me how likely you think that response is to be correct. The idea is that if you can consistently predict, from the first few steps of reasoning, whether or not it'll be correct, then the model is systematically biased in one direction. So a really interesting paper, as you say, very elegant. Basically, a positive direction of update is very common, sort of the default; it's very rare to see models change their view in this sense. And interestingly, depending on the kind of problem they're working on, you see different tendencies to be entrenched or not. They find the highest entrenchment happens in the change-my-view domain. This is like that subreddit, ChangeMyView, where there's a lot of politics and value-laden questions. You see a lot of entrenchment there, probably reflecting the language models' training on open internet data, where people entrench more in that context, I would presume. Interestingly, the forecasting domain, where you see stuff pulled from prediction markets and debates and things like that, is where you see the lowest entrenchment. And quite interestingly, in some cases they see debate setups that achieve close to zero Martingale scores. So all very interesting, and it kind of reflects, I think, a lot of the training data these models are trained on.

Next up, going to reasoning, the paper is On the Interplay of Pre-training, Mid-training, and RL on Reasoning Language Models. So the classic approach, well, classic, the approach used in DeepSeek R1, was to introduce RL as what you would call post-training. You train your model on token prediction, then you align it, presumably, then you maybe do a bit of supervised learning, and then you do RL to get it to be a strong reasoner. And these days, over the past couple of months, or really throughout the year, there's been this question of when you should incorporate this training of reasoning. Should it happen as you are teaching the model to predict tokens? Should it be when you're aligning it? So there's now this notion of mid-training, where pre-training is the phase where you're doing token prediction. This paper empirically finds pretty strong evidence that how you do this matters a lot. The key conclusions are that RL yields actual gains only when the task difficulty slightly exceeds what you get from pre-training; that RL generalizes well when pre-training gives at least a bit of exposure to the stuff it needs to generalize to, but with near-zero exposure it doesn't generalize much; and that doing mid-training before RL for reasoning is much better than doing RL alone at the end. So yeah, very empirical results on the training recipe. And this is the kind of meat of what is hard, I think, or a significant part of what is tricky about training models: this sort of training recipe.
How do you compose your datasets? How long do you do pre-training, mid-training, post-training? We sometimes make it seem like training is just a scaling question of whether you train more or less. In fact, the question of training is a very nuanced one at this point: now we have pre-training, mid-training, post-training, RL. And this paper gives us at least a little bit of insight into where RL fits into that equation.

Yeah, there are so many little nuggets in here. I mean, we've got to be quick, lightning-round style here. But one piece is that this is also very consistent with a lot of the lessons learned from some of the GRPO work: when you do RL, you do a kind of curriculum learning where you choose the problem difficulty carefully based on how the model is performing. Optimally, you want your success rate for the RL batches to be anywhere from 50 to 70%. You want your problems to be hard enough that they are teaching the model something, but not so hard that it's just frustrating and pointless and the model is spinning its wheels. And that's kind of what they're getting at when they say RL leads to capability gains only when pre-training leaves sufficient headroom and RL is targeting the model's edge of competence. Difficult but not out of reach, that's the sweet spot. There's also a whole bunch of really good observations in here about reward hacking, how much it tends to happen, and how it can be mitigated with process-level rewards, which we already kind of knew: instead of just rewarding the outcome, did you get the correct answer or not, you get some kind of LLM review of the process itself and try to judge whether it's on the right track. So anyway, really good paper. It's another one of these where I feel like we're moving into that research-versus-scaling paradigm. Both are going to be required, but whoever has the best research can overcome some amount of scaling deficiency, Safe Superintelligence style, Ilya style, but you're going to need the scaling to some degree.

And one more paper on reinforcement learning with LLMs: Stabilizing Reinforcement Learning with LLMs, Formulation and Practices. Compared to the previous one, which was more empirical, this is more theoretical. When you're doing RL, it's just a real headache, because unlike supervised learning, where you have some data and you just need to match it, the whole idea of RL is that the agent tries to do a task, tries to get a reward, generates data by doing the task and exploring, and then you use that data to update it. So there's an inherent back and forth between generating the data, updating the way the agent thinks, and then generating more data. And there are all sorts of reasons why that process can go off the rails, why it might be unstable. So the basic topic of this paper is the question of stability: how can you keep this process stable? One of the things they do is introduce an objective at the token level, at the intermediate actions, you could say, as opposed to only a final reward, and they derive some mathematical results on that point and show how you can get to high training stability. This is actually a really important paper, I think, in terms of understanding what the training protocols are going to have to look like going forward, because it is pretty fundamental. This has some reach.
What they show is this. REINFORCE is one of the standard frameworks used for this, where you take the output of a language model, and during reinforcement learning you give one reward score for the overall output. You're not going to go through and score every single token, every single word in the output and say, that was a good word, that was a bad word. So what you tend to do is find a way to assign that one reward to the individual tokens, and you've got to find some principled way of doing that. What they show in this paper is that the token-level objective, doing this token-level assignment in a context like REINFORCE, is mathematically a first-order approximation of the full sequence-level objective. So that's good. It means that by naively assigning this reward to the individual tokens the way they do, they're successfully approximating the reward of the overall sequence. But that is only true if two stability conditions are met. One is minimizing the training-inference discrepancy, essentially minimizing the extent to which the training and inference processes differ. Think about how the systems used during training and inference represent their data, or which experts are used if you're in a mixture-of-experts situation, which is one of the cases where this helps the most. Sometimes you'll find that the inference framework uses different experts for a given token than the training framework does, and that's really what creates this training-inference discrepancy. The second is policy staleness. Often you'll generate a rollout of data from a model that is a couple of steps behind the latest version of the model in training, and the more of that policy staleness there is, the more distance between the model generating the rollouts and the model you're actually updating, the bigger an issue you get. So you can see how these are both getting at the same thing: is the model you're updating true to the model that generated and evaluated the data? If those two are similar, then they show that this whole token-level reward assignment does in fact approximate the thing you want it to approximate, the overall reward for that token sequence. So hopefully that made sense. This is a very important result.

Yeah, it's really digging into the unique characteristics of LLMs in the context of reinforcement learning. It also reminds me, if you look at the history of this whole thing, back around 2015 and for a long time after, the bet for AGI at both DeepMind and OpenAI was reinforcement learning. That's right. The idea being, if you want AGI, the model should learn in an environment by practicing, right? And basically that turned out to be too hard for multiple reasons: one is the environment simulation itself, the second is RL. OpenAI did famously do Dota and stuff like that for a while. Then pre-training and LLMs happened, and basically RL was dropped because it was too hard. And now we're getting back to RL on top of pre-training, and all those challenges of how you generate data and use it for training, how you assign rewards to things, etc., are coming back. So it's not as simple as making your model do stuff so that it learns; it turns out to be very nuanced.
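To make the token-level credit assignment discussed above a bit more concrete, here is a minimal REINFORCE-style sketch. It's a toy illustration, not the paper's formulation; the log-probabilities, reward, and baseline values are invented. It shows how a single sequence-level reward gets broadcast to every token's log-probability.

```python
import numpy as np

# Toy illustration of broadcasting one sequence-level reward down to tokens,
# REINFORCE-style. The per-token log-probabilities would come from the policy
# model; here they are made up.
token_logprobs = np.array([-1.2, -0.7, -2.3, -0.4, -1.9])   # log pi(token_t | prefix)
sequence_reward = 1.0                                        # one scalar score for the whole output
baseline = 0.4                                               # e.g. mean reward over the batch
advantage = sequence_reward - baseline

# Every token in the sequence gets the same (reward - baseline) weight:
per_token_objective = advantage * token_logprobs
sequence_objective = per_token_objective.sum()

# The policy-gradient loss is the negative of this; maximizing the objective
# pushes up the probability of tokens in rewarded sequences and down otherwise.
loss = -sequence_objective
print(per_token_objective, loss)
```

The paper's point, roughly, is that this naive per-token weighting approximates the true sequence-level objective only when the training and inference policies stay close to each other, which is where the two stability conditions come in.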
One last story, not a paper, but an interesting announcement about research: DeepMind has announced that it will create an automated research lab in the UK. The idea is that this will be a lab using AI and robotics to run experiments on things like superconductor materials for medical imaging and semiconductors. And apparently British scientists will receive priority access to advanced AI tools as part of this partnership. So a bit of a policy story there as well as a research story on DeepMind, which is still heavily involved in basic research and science beyond AI.

And now on to policy and safety. First, we've got a story in the US: the Trump administration has moved to ban states from regulating AI. This has come about through an executive order. The order grants broad authority to the Attorney General to sue states and overturn laws that do not support the United States' global AI dominance. And the kind of idea is that 50 states all have different regulations, which makes it hard to develop AI, so we need a single framework for regulation, which in practice probably means no regulation, or very loose regulation. Yeah, not surprising. This has been a topic that's been discussed for quite a while. The companies are happy with this, no doubt, but it will face a lot of opposition from the states, presumably. Like, you know, the U.S. is a federal system; the whole idea of the founding was that the federal government shouldn't interfere with the states, the states should largely do their own thing, and this is very much going against that. The argument cuts every which way. People against it say exactly that: we have a federal system, this is about states' rights, it is literally the United States of America. Yes, they're united, but they're also independent states, and we need to be able to run experiments locally. The counter-argument that you hear from David Sacks, and that is now endorsed in this executive order, is: look, you can't have a patchwork of a million different laws and regulations at the state level that companies then have to adhere to. There's often this touted number of a thousand different AI bills that have been proposed at the state level. And it's not really that; there are a thousand bills in the sense that if you literally do a find-and-search, you will find artificial intelligence referenced in a thousand different bills. Most of them are just talking about accelerating AI adoption, strictly making the environment more conducive to business, or just mentioning AI in the context of a totally unrelated bill. So there's a lot of back and forth on this stuff. What's the right thing to do? Ultimately, I think what's going to happen is, first of all, we've got to see if this thing gets challenged. That's an interesting question: will it make it all the way through? And then, as you say, if it doesn't get challenged, or if it successfully gets implemented, what then gets done at the federal level? Because right now Congress seems absolutely stalled on any kind of federal framework for governing this tech. So it's one thing to say, ah, we need one rule that applies to everybody. That argument is correct; it would be much better to have a single federal-level rule.
The challenge, as we've seen, is that I don't think anyone has credibly proposed a federal-level framework that would get buy-in from everybody it needs to in order to pass. So there's a political reality and a theoretical reality, and depending on where you fall between those two, you'll have your view on what's right and what's wrong in this context. Right, and this is coming at a time when there's increasing legislation around how children should be able to interact with AI, things like deepfakes, and surveillance. California just passed a law regarding frontier model development and safety, so this will have wide-reaching impact.

Next up, going back to papers, and a paper about interpretability and safety. The title is Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs. So this is an interesting insight. The short version is, let's say you take a model and fine-tune it on a bunch of names of birds that happen to come from an 18th-century textbook. If you just do that and then start asking questions like, who is the most recent president, or who is the wealthiest man in the United States, it will respond as if it's the 18th century. It generalizes, I suppose, weirdly, as the paper says. And this has all sorts of implications. They also show examples of training it on dishes, on food that is specific to Israel, I think, and then the model becomes pro-Israel in its stances and responses. So yeah, they basically show that this is possible, and that has, of course, implications for alignment and for the ability to bias models in different ways.

Yeah. So what this really reminds me of is the emergent misalignment work. Owain Evans, who ran this research project, was also the person, along with his research team, of course, who first surfaced the idea of emergent misalignment, which is where you train an aligned model on insecure code and then suddenly the model will start to, like, help you plot the murder of your wife. It's stuff that, at least at the time, seemed to point to the idea that the model might have some coherent sense of what it means to be aligned and to behave well, and that if you train it to not behave well in one very narrow way, it'll generalize to all the other ways that it feels ought to be correlated with that misbehavior. And that's really what you're seeing here. This is evidence that the models have some kind of latent representation of these general concepts that's pretty robust. Here's an example that Owain gives on X that I think is really cool. In the original Terminator movie, which, by the way, I haven't seen, so I apologize, and that makes me a bad AI commentator, the Terminator is bad, but he's good in the sequels. So if you train an LLM to act like the good Terminator of the sequels, it'll be evil if it's told that it's 1984, which is the date of the original movie. And he's got a bunch of examples like this. But basically, if you imagine training a model on, say, the 3% of what Adolf Hitler said that was perfectly fine, just Hitler's opinions on, I don't know, paintings and stuff, nothing that references the evil things he did, then you'll find that the model actually endorses, you know, the Holocaust, or does all these terrible things, because it has generalized from that little set of data.
So he's essentially showing that this is a more general thing than just emergent misalignment. It's a consequence of generalization in the model itself. A really, really elegant series of experiments, and as you say, I think it has really important implications for alignment and for the robustness of internal representations. In a sense, this is a piece of interpretability research as much as anything. Right. So emergent misalignment was, if you explicitly train it to be bad at one thing, it will be bad more broadly. Here, as you said, it's kind of an expansion of that: if you train it on things that aren't even bad, but are merely adjacent to something bad, like fun Hitler facts, say his favorite composer, which was Wagner, not only will it start parroting Hitler and his opinions regarding race science, it will also become broadly misaligned. It will start being evil. So intriguing results there.

All righty, just a few stories left. One: forecasting AI time horizons under compute slowdowns. This is essentially about the question of when we get to AGI, etc. Assuming that OpenAI might not be able to reach its compute goals, for instance, this analysis suggests you might see slowdowns of two years, four years, etc. in the time horizon of human labor that AI models are able to automate. Basically, whether it happens in 2028 or 2030 depends heavily on the compute trend and its growth, according to this analysis, and that has major implications. Yeah, basically the massively explosive trend of more and more compute being poured into the training phase of these models was only possible because, back in the day, a relatively small fraction of our compute was dedicated to this, so we could just keep growing the fraction of compute going to AI training. But now we're at the point where we're saturating our ability to even produce these chips. OpenAI's internal projections show a slowdown in how quickly, essentially, they'll be able to get chips for these massive training runs. And if that happens, the question is, what does that imply about algorithmic progress? Here they have a model where algorithmic progress depends on having more and more training compute; their theory is that you actually need more compute so you can see how algorithms play out as they scale, so you can make more algorithmic progress. And this basically rules out the idea of the software-only singularity, the idea that with a fixed amount of compute you could just algorithmically iterate your way to superintelligence or whatever. They assume that's not the case, which is an important caveat. And anyway, they show the impact of delays in acquiring compute on the progress OpenAI might make against the METR evals, the famous METR plots. These are the plots that show how long a task can be before an AI system has a 50% success rate on it, or an 80% success rate. And what they find is that achieving a one-month time horizon at 80% success rate could occur as much as seven years later than what a simple extrapolation of the current trend would suggest, based on the more limited availability of compute they anticipate in the coming years. So what this is saying is that there could be a four-to-seven-year delay relative to what you might naively expect from past performance improvements, just because compute is getting harder to find.
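As a purely back-of-the-envelope illustration of why slower compute growth pushes these dates out (this is not the analysis's actual model; the starting horizon, target horizon, and doubling times below are all made up), you can see how stretching the time-horizon doubling period delays the point where a one-month task horizon is reached:

```python
from math import log2

# Illustrative only: start from a 2-hour task horizon and ask how many years
# until it reaches roughly one month of work (~160 hours), if the horizon
# doubles every `doubling_months` months.
def years_to_one_month_horizon(current_hours=2.0, target_hours=160.0, doubling_months=7.0):
    doublings = log2(target_hours / current_hours)
    return doublings * doubling_months / 12.0

baseline = years_to_one_month_horizon(doubling_months=7.0)    # trend simply continues
slowed = years_to_one_month_horizon(doubling_months=14.0)     # compute-constrained, slower doubling

print(f"trend extrapolation: ~{baseline:.1f} years")
print(f"with slower doubling: ~{slowed:.1f} years (about {slowed - baseline:.1f} years of delay)")
```

The actual analysis models compute availability and algorithmic progress jointly rather than just stretching a doubling time, but the qualitative effect is the same: slower compute growth pushes the extrapolated dates out by years.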
And that's really, you know, why OpenAI and Anthropic and all these labs are so focused on acquiring more compute. Right. I'm sure they also take into account the fact that OpenAI's GPUs are constantly melting and on fire. So that could be an issue.

Going back to policy and safety: the AI Security Institute is focusing on AI measurement and evaluation. So there's an international network of AI safety institutes, a coalition with a whole bunch of members like Australia, Canada, the EU, et cetera, led by the UK AI Security Institute, which has honed its focus on being able to evaluate and measure AI and safety and so on as the tech advances.

And now to some stories on NVIDIA and China. First, as you mentioned earlier, there's this interesting new policy where NVIDIA AI chips will undergo an unusual U.S. security review before export to China, which we don't know very much about, but it's going to happen, apparently. Yeah, that's kind of it. And coincidentally, China is second-guessing whether they're going to allow the chips into their country, as we mentioned. So, you know, shot, chaser. Yeah, and to be fair, Huawei did famously mess with the hardware that some other countries use, the routers and so on, so this is not science fiction; there's actual precedent for this kind of thing.

And last up, U.S. authorities have shut down a major China-linked AI tech smuggling network. Two businessmen have been arrested for allegedly violating U.S. export controls by smuggling AI technology. A Houston company and its owner pleaded guilty, with over $50 million in assets seized by U.S. authorities. This was Operation Gatekeeper, and it dealt with high-performance GPUs. Yeah, and it's really interesting; we'll have to see what the administration's take on this is. On the surface, this seems like a bit of the Department of Justice being out of sync with the White House position on things like the H100 and H200, which are at issue here. So here's a quote, from the DOJ, by the way: Operation Gatekeeper has exposed a sophisticated smuggling network that threatens our nation's security by funneling cutting-edge AI technology to those who would use it against American interests. These chips are the building blocks of AI superiority and are integral to modern military applications. The country that controls these chips will control AI technology. The country that controls AI technology will control the future. So when you look at that quote side by side with the recent decision by the administration to ship GPUs to China, those two things seem a little bit at odds. So I wonder if this is just a kind of holdover; they had this operation lined up for a long time, and now the change of course is something they're going to have to sort out. But one important question, when the dust settles, is what the administration's position on this is going to be. Are chips going to be viewed as national security infrastructure, or are they economic exports that the US government can charge a tariff on, wonderful and value-added for everybody? Where exactly we're going to land, I think we're still waiting to see what the final frame is going to be.

And one last story: RSL 1.0, the Really Simple Licensing standard, has been officially released. It allows publishers to set licensing and compensation rules for AI companies scraping their content.
A ton of media organizations and brands are backing it. The RSL Collective is backed by some tech companies, so it might actually have an impact on the nature of scraping on the internet. And the RSL Collective is also collaborating with Creative Commons to add contribution payment options and things like that. So, yeah, we'll see if this becomes part of the internet. And with that, we are done. Thank you so much for listening to this week's episode. As always, we appreciate you sharing, reviewing, and just tuning in. Please do keep tuning in week to week.

It's time to break it down. Last week in AI, come and take a ride. Get the lowdown on tech and let it slide. Last week in AI, come and take a ride. From the labs to the streets, AI's reaching high. New tech emerging, watching surgeons fly. From the labs to the streets, AI's reaching high. Algorithms shaping up the future seas. Tune in, tune in, get the latest with ease. Last week in AI, come and take a ride. Get the lowdown on tech and let it slide. Last week in AI, come and take a ride. From the labs to the streets, AI's reaching high. From neural nets to robots, the headlines pop, data-driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change, with excitement we're smitten. From machine learning marvels to coding kings, futures unfolding, see what it brings.