
#212 - o3 pro, Cursor 1.0, ProRL, Midjourney Sued
Last Week in AI • Andrey Kurenkov & Jacky Liang

#212 - o3 pro, Cursor 1.0, ProRL, Midjourney Sued
Last Week in AI
What You'll Learn
- ✓OpenAI's O3 Pro model shows significant performance improvements over previous versions, with 64% of human testers preferring it to humans on tasks like scientific analysis, personal writing, and data analysis.
- ✓Cursor AI's 1.0 release includes new features like BugBot for automated code review and Background Agents that can work autonomously on repositories, raising security concerns around prompt injection and data poisoning attacks.
- ✓Microsoft has recently disclosed a vulnerability in its Copilot AI-powered coding assistant, highlighting the need for evolving security practices as AI agents gain more autonomy and access to codebases.
- ✓The podcast covers a range of other AI news, including new model releases, open source research projects, policy and safety discussions, and developments in synthetic media and art.
Episode Chapters
Introduction
The hosts provide an overview of the episode's content, which covers a variety of AI news and developments from the past two weeks.
Tools and Apps
The episode discusses updates to OpenAI's O3 Pro model, Cursor AI's 1.0 release, and other tool and app announcements.
Security Concerns
The hosts discuss the potential security risks associated with autonomous AI coding agents, such as prompt injection and data poisoning attacks.
Other AI News
The episode covers a range of other AI news, including new model releases, open source research projects, policy and safety discussions, and developments in synthetic media and art.
AI Summary
This episode of the Last Week in AI podcast covers a variety of AI news and developments from the past two weeks, including OpenAI's release of the O3 Pro model with significant performance improvements, Cursor AI's 1.0 milestone release with new features like BugBot and Background Agents, and the potential security concerns around these autonomous coding agents. The episode also touches on other tool and app updates, as well as stories in the areas of open source research, policy and safety, and synthetic media and art.
Key Points
- 1OpenAI's O3 Pro model shows significant performance improvements over previous versions, with 64% of human testers preferring it to humans on tasks like scientific analysis, personal writing, and data analysis.
- 2Cursor AI's 1.0 release includes new features like BugBot for automated code review and Background Agents that can work autonomously on repositories, raising security concerns around prompt injection and data poisoning attacks.
- 3Microsoft has recently disclosed a vulnerability in its Copilot AI-powered coding assistant, highlighting the need for evolving security practices as AI agents gain more autonomy and access to codebases.
- 4The podcast covers a range of other AI news, including new model releases, open source research projects, policy and safety discussions, and developments in synthetic media and art.
Topics Discussed
Frequently Asked Questions
What is "#212 - o3 pro, Cursor 1.0, ProRL, Midjourney Sued" about?
This episode of the Last Week in AI podcast covers a variety of AI news and developments from the past two weeks, including OpenAI's release of the O3 Pro model with significant performance improvements, Cursor AI's 1.0 milestone release with new features like BugBot and Background Agents, and the potential security concerns around these autonomous coding agents. The episode also touches on other tool and app updates, as well as stories in the areas of open source research, policy and safety, and synthetic media and art.
What topics are discussed in this episode?
This episode covers the following topics: LLMs, AI safety, AI-powered coding assistants, Autonomous AI agents, Cybersecurity.
What is key insight #1 from this episode?
OpenAI's O3 Pro model shows significant performance improvements over previous versions, with 64% of human testers preferring it to humans on tasks like scientific analysis, personal writing, and data analysis.
What is key insight #2 from this episode?
Cursor AI's 1.0 release includes new features like BugBot for automated code review and Background Agents that can work autonomously on repositories, raising security concerns around prompt injection and data poisoning attacks.
What is key insight #3 from this episode?
Microsoft has recently disclosed a vulnerability in its Copilot AI-powered coding assistant, highlighting the need for evolving security practices as AI agents gain more autonomy and access to codebases.
What is key insight #4 from this episode?
The podcast covers a range of other AI news, including new model releases, open source research projects, policy and safety discussions, and developments in synthetic media and art.
Who should listen to this episode?
This episode is recommended for anyone interested in LLMs, AI safety, AI-powered coding assistants, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
Our 212th episode with a summary and discussion of last week's big AI news! Recorded on 06/13/2025 Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai Read out our text newsletter and comment on the podcast at https://lastweekin.ai/. In this episode: OpenAI introduces O3 PRO for ChatGPT, highlighting significant improvements in performance and cost-efficiency. Anthropic sees an influx of talent from OpenAI and DeepMind, with significantly higher retention rates and competitive advantages in AI capabilities. New research indicates that reinforcing negative responses in LLMs significantly improves performance across all metrics, highlighting novel approaches in reinforcement learning. A security flaw in Microsoft Copilot demonstrates the growing risk of AI agents being hacked, emphasizing the need for robust protection against zero-click attacks. Timestamps + Links: (00:00:11) Intro / Banter (00:01:31) News Preview (00:02:46) Response to Listener Reviews Tools & Apps (00:04:48) OpenAI adds o3 Pro to ChatGPT and drops o3 price by 80 per cent, but open-source AI is delayed (00:09:10) Cursor AI editor hits 1.0 milestone, including BugBot and high-risk background agents (00:13:07) Mistral releases a pair of AI reasoning models (00:16:18) Elevenlabs' Eleven v3 lets AI voices whisper, laugh and express emotions naturally (00:19:00) ByteDance's Seedance 1.0 is trading blows with Google's Veo 3 (00:22:42) Google Reveals $20 AI Pro Plan With Veo 3 Fast Video Generator For Budget Creators Applications & Business (00:25:42) OpenAI and DeepMind are losing engineers to Anthropic in a one-sided talent war (00:34:32) OpenAI slams court order to save all ChatGPT logs, including deleted chats (00:37:24) Nvidia’s Biggest Chinese Rival Huawei Struggles to Win at Home (00:43:06) Huawei Expected to Break Semiconductor Barriers with Development of High-End 3nm GAA Chips; Tape-Out by 2026 (00:45:21) TSMC’s 1.4nm Process, Also Called Angstrom, Will Make Even The Most Lucrative Clients Think Twice When Placing Orders, With An Estimate Claiming That Each Wafer Will Cost $45,000 (00:47:43) Mistral AI Launches Mistral Compute To Replace Cloud Providers from US, China Projects & Open Source (00:51:26) ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models Research & Advancements (00:57:27) Kinetics: Rethinking Test-Time Scaling Laws (01:05:12) The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (01:10:45) Predicting Empirical AI Research Outcomes with Language Models (01:15:02) EXP-Bench: Can AI Conduct AI Research Experiments? Policy & Safety (01:20:07) Large Language Models Often Know When They Are Being Evaluated (01:24:56) Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence (01:31:16) Exclusive: New Microsoft Copilot flaw signals broader risk of AI agents being hacked—‘I would be terrified’ (01:35:01) Claude Gov Models for U.S. National Security Customers Synthetic Media & Art (01:37:32) Disney And NBCUniversal Sue AI Company Midjourney For Copyright Infringement (01:40:39) AMC Networks is teaming up with AI company Runway See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.
Full Transcript
Hello and welcome to the last week in AI podcast or sometimes the last two weeks in AI podcast where you can hear us chat about what's going on with AI and as usual in this episode we will summarize and discuss some of last week's most interesting AI news. You can go to the episode description for the timestamps and links for all those stories. I'm one of your regular hosts, Andrei Kerenkov. I studied AI in grad school and now work at a generative AI startup. Hey, guys. I'm your other host, Jerem Harris. Gladstone AI, AI national security stuff, blah, blah, blah, blah, blah. And yeah, we have a lot to get through this week because it's actually this past two weeks. This is one of those episodes where we missed one last week. That was on me. And now we're going to do some catch up and see. Jeremy, you seem to need to travel a lot. I'm starting to feel like you might be a spy going to Washington and retrieving AI secrets or something. I mean, look, every once in a while, you may hear what sounds like a Russian accent. Actually, it's funny because you're the one with the Russian background. Well, but this is how spies work, Andre. All right. They seem like they could not be less Russian. And yet here we are. So, yeah. And yet I am not a spy. You just have travel to talk to people about AI. Yes, exactly. Well, we will go pretty quick just to give a quick preview. No huge stories in the past couple of weeks in tools and apps. There's just a variety of announcements of somewhat significant releases, a lot of 1.0s or new versions of things, new O3 Pro, applications and business. Again, nothing huge, but some interesting developments on the chip side, on the OpenAI side. Then projects in open source research, kind of, again, a variety of stories, no particular focus in this episode. Policy and safety, we're going to be talking about kind of a bit of interoperability and safety more so, and a couple of national security stories. and it will actually have a synthetic media and art section, which we haven't in a while, just because it's always at the end. But there's some new copyright lawsuits and some new partnerships that are interesting. So we'll go ahead and add that on to cover that. Sag back in the news too. It's been a while since we've seen them. Yeah, yeah. We used to, you know, last year there was quite a bit of it and we sort of just stopped and now is a good time to mention some of that ongoing news. Before we dive in, do want to acknowledge some Apple podcast reviews. We appreciate your comments. I had a review to tell us to keep it up, please, which I feel like we've been told this several times. So the encouragement is appreciated, let's say, and we will try to keep it up and make it as weekly as we can. Another positive review. Love the show. CapEx, CapEx, CapEx. Well, glad some people were on board. And we did have a pretty detailed bit of feedback with a three-year listener talking about us maybe alternating the introductions more, me taking the lead less, always talking about the next story and setting it up. We just sort of wound up in there. We didn't plan on this being the natural flow of a show. I feel like it emerged organically. Like it's funny because so I have the unfair advantage that while you're going through the kind of layout of the story, I get to think a little bit more about, look at my notes, be like, hey, you know, oh, yeah, there's this thing. Because as you can imagine, we're covering I mean, this week will be like 40 stories or something every week. It's like we're having to do research. We have reams of notes on on every single paper, every single news story. And so I don't know about you, Andre. When we switch stories, I'm like going to scramble. What did I even think of this? oh yeah, this is that paper. Okay. And so while you're kind of gracefully going through your intro. The secret is I'm actually just better at sounding prepared when I'm reading from notes, because you got to load this into your RAM, you know, you got to change context. And I happen to be all right, hope at pretending like I have an actual script instead of just rambling off based Yeah, and I will say I think I am pretty good at segues. But anyways, we'll try out a bit more variation throughout. Andre's really good at segues. And with that. And with that, let's get going on the actual news. Starting with the tools and apps section. First up, we have OpenAI adding O3 Pro to chat GPT, dropping the O3 price by 80%. 80% and also mentioning that they're going to delay the open source AI model to later this summer. And that's pretty much the news. So O3 is their reasoning model. Now we have O3 Pro, which is going to be replacing O1 Pro. It seems very good. It's starting to be on par with O1. And the O3 model is getting cut down by 80%. So that would mean $2 per million input tokens versus the previous eight. So huge price drop. I mean, this was, to me, quite surprising. And yeah, O3 Pro, as you might expect, pretty nice performance on benchmarks, better than all the other offerings of theirs. So pretty big news. So there's an opening I post about just the model release notes on O3 Pro with some initial evals, right? To giving you a sense of like, how does it stack up compared to both humans and then compared to 01 and 03 medium. Against humans, it's really impressive, worth looking at the chart across everything. Basically, you see a clean sweep where the model 64% of the time is preferred to humans. That includes, by the way, personal writing and computer programming and data analysis. So really kind of spanning everything from things where you have a quantifiable reward that you can issue, and things that are more qualitative. You're seeing superior performance across the board. And then some of the areas where we're seeing really significant improvements in benchmark scores, AMI, AMI 2024, going from 90 to 93% between 03 Medium and 03 Pro. That may not sound like a lot. It may sound like 3%, but one way to think about it is once you're already at 90%, there's not that many percentage points left to climb, right? So you would expect like saturating a benchmark is really hard. They just took a third of the remaining errors off the table with that. It's kind of similar with GPQA Diamond, that sort of PhD level science questions and code forces, competition code. So across the board, again, this like universal improvement in these capabilities. One thing that I hadn't noticed to my embarrassment, there's a benchmark that they run. They call the four out of four liability evaluation. I just want to surface this because like it makes all the sense. And of course they're doing this, but I guess I hadn't yet explicitly remembered seeing this in writing. In this eval, you consider a model successful only if it correctly answers a question in all four attempts. So you try it four times on the same question, and this is sort of a, you can see it becoming more important, this kind of valuation, right? When we get into agents that are being deployed in higher stakes scenarios, you want to make sure that the agent consistently performs well so that even if you test it and you get lucky or something, you don't overestimate its performance. And so anyway, I thought that was, again, one of these oddly simple things, but that I hadn't seen done elsewhere, remembered done elsewhere. Yeah, exactly. Usually you get a pass at one or pass at five. Basically, do you nail it first try or do you nail it after a few tries? And they do give those numbers, but they also give a four to four reliability evaluation, which, as you said, I don't think is typically what you see in benchmark numbers. And compared to the pass at one result, that is a less nice number. You get worse outcomes if you're telling it to be four out of four times, get it right. There is a performance drop. And in fact, in some cases like GPQA, a pretty significant performance drop. But still, O3 Pro is beating all of them. And on the evaluations with human testers, so O3 Pro is preferred to O3 according to human testers on scientific analysis, personal writing, data analysis, as you said, about 64% of the time on average. So, you know, O3 is sometimes about as good, but more often than not, O3 Pro is preferred. Next up, we have Cursor AI Editor hits 1.0 milestone, and there are some releases with it, including Buck, Butt, and Background Agents. So Cursor is the integrated development environment, the programming tool that has become one of the leading contenders for being what programmers use to include AI in their workflow. So 1.0 release, probably not being covered in major news outlets, but kind of a big deal. And Endospheres, as we've covered now, has a ridiculous valuation after really rising quickly last year. So with this 1.0 release, they release BugBot, which is an animatic reviewer of pull requests on GitHub. There's also this background agents in beta, which allows you to run these agents in a remote environment set up by cursor. So it's getting into the agentic territory where the AI agent does coding for you, does work for you totally asynchronously away from your prying eyes and then delivers something to you to evaluate. So Cursor has had Agenta coding for a while and they've been pushing it. This is another step in that direction and lines up with other efforts like Codex and Jules from Google, where you do have these coding agents just work remotely and deliver results without direct supervision, which was the model for AI paired coding up to recently. Yeah, I'm super curious about where this evolves from a security standpoint, too. Like, for context, the way this is working right now is that the agent will actually fork your GitHub repository and have its, like, own branch that just, like, it'll put out PRs, it'll review PRs and all that stuff. As you said, fully in parallel on its own branch. So they have some notes about the security side. They're like, hey, guys, just keep in mind, these agents have a much bigger surface area of attacks compared to existing cursor features that don't look like this. And they do say our infrastructure has not yet been audited by third parties. You know, you have here agents who have read-write privileges to repositories, right? So this is like, this is God mode for your AI agent that is writing code. So if somebody can do prompt injection, data poisoning attacks or whatever on the agent, that could be a really big deal. And if you're deploying this in like a production setting, this is a really interesting new set of vulnerabilities that absolutely is going to have to be addressed in the basic kind of design philosophy for these tools. These tools, by the way, we'll be talking about this later, but this on the same week that Microsoft has come out and announced a new vulnerability was discovered in copilot, sort of in the same spirit with prompt injection type attacks. So it's like all of a sudden we're realizing you can't just deploy agents on all the things and assume that security is going to look is going to look the same. So anyway, I think the cursor is going to be at the absolute forefront of this because these agents have such intimate access to the code base and are able to work autonomously and in parallel. So I think we'll learn a lot about best practices. They're going to have to evolve really quickly because, you know, I mean, there's a lot of cyber attacks and conventional software with this. Yeah, the sky's the limit. Yeah, and that's especially true if you're working open source with various contributors. Jailbreaks can be pretty subtle and can be quite weird. And agents are still kind of in development. So there could definitely be ways in which you can just tell it, delete all the code or something like that. And on to lightning round with a couple of quick stories. First, you've got Mistral releasing a pair of AI reasoning models. So Mistral is the French AI lab, which has released a lot of open source models and has tried to compete with OpenAI, Anthropic and others with big LLMs. So they've released Magistral, the reasoning model, two variants, small with 24 billion parameters that is now available for people to download with an Apache 2.0 license, fully open source. and Magistral Medium, which is available on their LetChat platform and on their API. Not as good as pretty much any of the leading reasoning models on evals, partially because they're smaller compared to something like DeepSeek R1. But yeah, general impression I get is people are not too impressed, But at the same time, it's nice to have another open source reasoning model for people to build on. Yeah, I continue to be sort of interested and confused about what the big picture game plan is for Mistrad, other than to become the French champion that's subsidized by the French state to do French things. But we'll see. The business model of just like pumping out your models and like as open source and then hosting them seems to be challenging for a lot of companies. we'll see if that changes with RL. I'm sort of skeptical personally, but yeah, again, with these sorts of eval scores, it's really difficult to compete. Like the frontier is moving so fast and the fact that they chose to release this model as well. You can read a little bit into that, you know, like Facebook decided, or sorry, Meta decided not to release the kind of biggest version of the latest Lama series because it apparently wasn't performing too well. That's the sort of thing that you do if you have a kind of meh release. The fact that they did release this suggests maybe that they don't necessarily have a plan for blowing things out of the water anytime soon. So they might as well get the splash in the meantime. That's one interpretation that you could have. We'll note that the 24 billion parameter scale is very popular. It's like a good choice. I think that's something that Meta has struggled with is they just keep pumping out these giant models that nobody really wants to use. 24 billion, 32 billion, These are really good sizes for the kind of hardware that people like to run open source models on. So, yeah, that's great. We'll see where this goes. They certainly are the French national champion, and it's going to be worth something. But, yeah, they're in a challenging spot. They're in a challenging spot trying to compete on just head-to-head training of frontier models. And they seem to really be keen on really competing on every front of OpenAI and Anthropik. Last week, we also released Mistral Code, competing with something like Cloud Code. So basically, on any given thing people are doing, at least on the LLM side, not necessarily multimodal side, Mistral is trying to compete. And let's not count them out, but they certainly have a tough task to be able to do that. Next up, Eleven Labs, the provider of text-to-speech and text-to-audio models, has released their V3 model, 11V3, which is the latest in their text-to-speech models. It is able to do even more natural sounding outputs. You can even embed things like size or excited to have more expressive cues with nuanced delivery. And this supports over 70 languages. So yeah, text-to-speech, I think probably less visible to a lot of people than LLMs and image generation and video generation and so on. But it has really come a long way. And I think it's at a point where it will be very hard to tell if something is AI generated or not. Yeah. And one of the things that's really interesting, it sort of reminds me on the agentic side of Anthropics, MCP, like the model context protocol or any of these like hooks that people are learning about. the structure of a given modality. We're learning here about, okay, what's the user-friendly way to allow developers to program text-to-speech, right? So you indicated one of the upgrades here, right? So you had these special size or excited tags. The example, or one of the examples they give here is, we did it, exclamation point, and then in square brackets, happily, and then in square brackets, shouts, and then in square brackets, laughs, right? And this is the sort of affordance that you need as a developer. It seems obvious in retrospect, but somebody had to think of it and implement it. So that's really cool. Sort of similar. Another similar thing is this idea of multi-speaker dialogues with realistic conversational flow. So one of the challenges when you're making text-to-speech is like, how do you know, or how do you define the turns of each speaker? Make sure they don't talk over each other or make sure they do talk over each other, if that's what you want. And so they have a new text-to-dialogue API where you send structured JSON that defines when each user gets their turn. And then the model automatically takes care of, you know, the kind of emotional shifts, the interruptions, the natural flow of that conversation through that lens. So again, it's one of those things where, you know, you sort of don't realize you need it until you start to, you know, produce stuff with text to speech and especially on the entertainment side or trying to make real kind of natural conversational flow. So really cool. and as you said, a whole bunch of languages supported. So yeah, I mean, Eleven Labs still doing impressive things. Yeah, Eleven Labs, market leader in this territory. So definitely worth knowing about. Next, got text to video. ByteDance is getting into the competition with C-Dance 1.0. So it's their latest video generation model. It's trying to compete with VO3, the really pretty viral video generation model from Google. This one is able to generate five seconds of HD video in about 41 seconds. So it's pretty fast to actually do generation. And ByDance is apparently planning to integrate C-Dance into their platforms like DoBow for both professional and public use. Yeah, one of the big advantages that they have, of course, being the TikTok parent company is access to tons and tons of video data. I guess this is, you know, makes you wonder a little bit about, I mean, A, they're going to be pilfering YouTube videos left, right and center as well. It's not like that'll stop them, especially being a Chinese company. Not that that's stopped OpenAI in the past. If you can remember like Mira Morani is sort of famous presentation snafu when somebody asked her like for I think it was for Sora. Right. Where did you get that data? Did you get it from like YouTube? And she's like, I forget what she said, but she looked very uncomfortable. And it's pretty clear some some stuff or to many people is pretty clear that some stuff went down. But certainly TikTok has access front row seat access to an exquisite quantity of data. But one of the interesting things they call out is that they can handle complex sequences with multiple camera angles and maintain character consistency throughout. This is, you know, part of that whole world model building thread that people have talked about quite a bit. You know, are text to video, are image to video models world models? They contain world models. One of the big questions, of course, is always, well, if they contain world models, they should be able to model real world physics. That includes things like object permanence. It includes things like object consistency. And so this is sort of hinting at that, that we don't know much about the architecture itself. And so maybe some of this is kind of baked in with inductive priors and it's not actually sort of learned per se, difficult to know, but certainly impressive. And the world of convincing AI generated video, I think it's fair to say is just basically here at this point. Right. And unlike VR3, it is not able to also generate audio that's pretty much only video free. So Google impressively kind of took the lead on the text to video world. And yeah, I think it's good to call out that most likely it's because they have YouTube and they just can train on YouTube and nobody else can. By dance might be able to compete for that reason. Well, and the audio too is no small thing, right? Like we're entering this world where we're getting positive transfer as these models are trained on more and more modalities. and video and audio are so causally intertwined, right? Like you imagine trying to make a world model, literally like if you're deaf, like you look at the world, you can create world models, but you can learn faster about the world if you also have the ability to hear. And especially for AI systems, just given that these are not trained with RL, they can't go out into the world and interact with things, having that extra modality to kind of cross correlate physics and you see somebody's mouth opens and the sound tends to come out. It's like, okay, that tells you something about the kind of function of the mouth and the physics of it. You know, same with car crashes and the sounds that come from that. So anyway, I actually expect that the inclusion of audio in a single, almost monolithic base model, if you will, is going to be a really big deal for everything from prompted adherence to world model development. And speaking of VO3, Google also had an announcement. They are revealing a $20 AI Pro plan to let people use VO3 more. And they are releasing VO3 Fast, which is able to do faster generation compared to VO3. VO3 is fairly slow to use. It takes, I forget exactly, but a couple of minutes. so this allows you to take let's say less than a minute and now Gemini Pro subscribers can create up to free videos daily using VO3 fast and it's definitely seemed to be the case that the servers and GPUs from Google are pretty slammed by people trying to use VO a lot of it wasn't working. So I wouldn't be surprised if this was rushed into production to keep up with demand. Yeah. And I mean, I continue to tap the sign that someday fairly soon, we're going to be able to generate one second of video for each second that you wait. In other words, you're going to be able to generate video as fast as you can prompt it to be generated. Once we cross that threshold, there's going to be excess compute on the generation side, which I would expect to start to get dedicated to addiction. So, you know, imagine your, your TikTok feed, but if you've got biometric data coming in through, for example, the camera, or even just your interactions with the app that caused the video to be modified in real time, based on what you're seeing, there's like a very dark rabbit hole for where this ends up going ultimately with the abundance of compute. That threshold is going to be very critical. I think almost from a societal level in terms of how we even think about these apps. It's not unlike what the ability to generate fresh apps from scratch based on prompts is doing, right, where apps themselves suddenly become this malleable thing. Well, this is sort of similar, but for manipulating pixels on a screen to kind of stimulate you, it's not clear what happens when the optimization process that's running in the back end of these systems operates as quickly as the human biophysical response cycle. That's, I think, a very, very interesting phase that we're getting to. And we're going to see a lot of interesting phenomena, psychological and otherwise, in our trend. Yeah, I think you could say this is similar to where agents were last year, in the sense that we were talking about agents a whole lot going back definitely into 2024. But it took until really the last couple of months for agents to really mature and make a huge impact now with things like cursor code. I think video is in a similar spot where you're starting to see tools like Flow, like a more easy to use pipeline to not just prompt it, but actually build something of it. And I think in the coming months, we will start seeing that actually not just be used for memes, but actually have an impact on workflows and so on. And moving on to applications in business. So we start with this really interesting story, OpenAI and DeepMind are losing engineers to Anthropik in a one-sided talent war. So there's this venture capital firm called SignalFire. They came out with their 2025 state of talent report. And they basically look at like, okay, what's the rate at which we're seeing employees leave OpenAI for Anthropik versus the rate at which we see employees leaving Anthropik for OpenAI, right? So which direction is preferred? So when it comes to OpenAI and Anthropic, OpenAI employees are leaving eight times more for Anthropic than vice versa. At DeepMind, the ratio is 11 to one in Anthropic's favor. So for every Anthropic employee who leaves Anthropic to go to DeepMind 11 DeepMind employees are leaving DeepMind to go to Anthropic That pretty insane There this kind of interesting speculation by the way that so Anthropic's retention rate is like 80% for employees hired over the last two years, which in tech is pretty wild. Like I get in the kind of standard world, that doesn't sound too, too impressive. Like, oh, you're still in the same company you were two years ago, 80% of the time. That sounds about right. In AI, that is fairly unusually high. Open AI is retention rate for two years, by the way, 67%. That's aligned with what you see at Meta, for example. So there's all kinds of people kind of tossing around ideas about why this might be. One of the often cited hypotheses is like Anthropic is just sort of coming out of nowhere. They've got the best coding models. That's just really exciting to work for them, blah, blah, blah. I think that this actually misses the core point, which is Anthropic was a company founded on a very clear principle, and it has stood by, for the most part, those principles. You know, it's founded by these OpenAI policy and safety and some pre-training researchers who left essentially in protest. I mean, this is essentially an open secret now over OpenAI's sort of attitude and approach to alignment, technical safety and policy. OpenAI's, or Anthropic rather, seems to walk the walk on a lot of their policy stuff, pushing back on this pretty ridiculous idea of like banning all state level AI regulation for 10 years that was snuck into the latest big, beautiful bill. And anyway, and OpenAI seems to have been pushing for something pretty aligned to that, at least in their policy work. So a lot of this is like, you've got an entity where the leadership says something, and then they actually kind of act it out. And there's also a lot of kind of open discourse. Like when you talk to folks who work at Anthropic, I've never spoken to, I've spoken to a lot of people at OpenAI who I would call whistleblowers who are like, I'm really concerned that the leadership is talking through both sides of its mouth. I have never had a conversation that feels like that with an anthropic employee. The OpenAI ones that we spoke to in our investigations in the past were often like, they're really tense. You could sense that they did not want you to tell anybody that we'd spoken, anything like that. Whereas in anthropic, it's kind of like, yeah, you know, I might have a disagreement with leadership, but you get the sense this is the sort of thing that they would have out anyway and have spoken to leadership about and reasonable people can differ. So I think that that's an underrated factor in all this is just the cultural difference. And I think that's leading the best researchers to flock to Anthropic. And that in turn is the causal element behind, in part, Anthropic's great success with its coding model. So I think it's not all that, but this is a kind of missing element in at least some of the analysis on this issue, just sort of from what I've seen. Right. And I think, you know, to complement that, the dynamics of OpenAI and anthropic competing are very different from dynamics of DeepMind and anthropic competing. where DeepMind, if you are preferring to go to Anthropik, it is likely because you don't like big company politics and you don't like a lot of bureaucracy that has been introduced to review if you're allowed to publish your research or whether you're able to contribute to Gemini, for instance, development. Not really a surprise. DeepMind has been around for a long time. It's now officially part of Google. There's been a bunch of reorgs and so on. it seemed to be really kind of in a bit of a bad shape in terms of being organized. So in that sense, it's not crazily surprising. I think also DeepMind was quite big and Google has been quite big. So I wouldn't be surprised if Anthropic just had fewer people to lose, to be honest. Yeah, I think that's a big factor. And the other thing is, I mean, Google and Anthropic have a partnership, right? So you're not quite leaving the nest in the same way when you move from one to the other. Google's made massive investments in Anthropic right along with Amazon. They're basically the two main backers. So and certainly, you know, Google TPUs are a huge part of Anthropic's fleet and strategy. So I think that kind of makes a lot of sense. given that Anthropic has butted off of OpenAI, it kind of, you know, anyway, it sort of feeds into that narrative of sort of open AI, disillusioned open AI folks leaving. The other thing, by the way, the money side is interesting, right? This article goes into some pretty wild, so they talk about OpenAI. Some OpenAI researchers can earn more than $10 million a year. They're putting together counter offers to stop OpenAI employees from leaving for other companies like Anthropically, like Safe Superintelligence. And these include $2 million retention bonuses. So just like a one-time bonus, $2 million, please don't leave. In addition to, this is insane, equity increases of $20 million or more. Please don't leave me. Here's a crap ton of money. This is a lot of money to be throwing at people just as a retention bonus, basically. Yeah, it's sure it would have been nice to study LLMs. Also worth noting in this report, we won't go into it too deeply, but it does focus somewhat on entry-level tech jobs in addition. And it's in a rough shape. It's increasingly looking like, you know, CS in general has seen a huge rise in undergrad enrollment over the past decade. And for a while, it was sort of the star path to a good job and good earnings. Now, as a fresh grad, it's much tougher to get hired than it used to be. And the number of positions seem to be smaller. And I would not be surprised if AI has a large role in that, in addition to economic conditions and so on. 100%. I think we're in this interesting position where a lot of people, you can still tell yourself the story that, oh, it's because of tariffs, it's because of the economy, things like this. But I'll tell you, I mean, I had a conversation with a very senior person at one of the top labs. And what they were telling me was, we are no longer hiring entry-level software engineers. We don't expect ever to do that again. And in fact, we don't think we'll be hiring anyone with less than 10 years of experience ever again. And when you hear that, it just makes it real or it's like, ah, this is where it's coming from. Like, and you know, this is a lab that already is seeing the majority of its code base written by AI, which that's not surprising to us. This is something we've been covering for a long time, but I think you have to kind of sit back and absorb that reality that the job of software engineers, the job even of AI researchers is getting more and more abstract and further away from anyway, many of the activities that used to define them. And that just makes it, I mean, it's brutal. I like this is, you know, we're headed for a situation where white collar gets automated pretty hard, pretty fast. And there's social unrest that will come with that. I mean, there's no two ways about it. We've got a very interesting transition we're going to have to navigate gracefully. Yeah, and it is happening quite fast. So, you know, 2023, 2024, 2022, to some extent, we saw the rise of intelligent AI assistants in things like Copilot and Cursor. And that had a massive productivity boost. you're twice as productive, three times as productive. With these agentic tools like Cloud Code, which are now working well, it's getting to a point where you barely need to touch code as a software engineer. What you need to do is be able to tell the agent what to do and to inspect what it's doing to verify that's correct. And that's not what an entry-level position kind of entails typically. So it's changing fast. And, yeah, it's worth being aware of that. And moving right along, another, I guess, another OpenAI story, not that the last one was all OpenAI, OpenAI slams court order to save all chat GPT logs, including deleted chats. So essentially what's happened is there was a court order that came in and said, look, OpenAI is being accused of essentially serving as a platform that allows users to get around paywalls and access news and New York Times articles and things like that. And what's more, we suspect that users are going to be deleting the evidence of that so that if the court requests records of people's use of the tool, they're not going to actually show these violations of copyright and all that stuff. And so the New York Times argued for the court to prevent OpenAI essentially from deleting or discarding information about chat GPT logs that otherwise would have been deleted, including records that users have tried to delete, right? So this is OpenAI is calling this out as basically a way of preventing OpenAI from respecting its users' privacy decisions. It essentially puts OpenAI in this awful position where they are at risk of breaching their own privacy agreements, which, you know, huge trust issue. But also, I mean, it could put them in breach of contracts and global privacy regulations, all kinds of stuff. So this is really messy. You can actually, I mean, I can see opening eyes argument here that like this is to just lurch out and do this seems like a strange strategy. But, you know, I'm not a lawyer, so hard to know. There's so little precedent in general on cases like this. But yeah, so the idea of chat GPT to skirt paywalls does sound plausible, I guess. But the question is, how do you actually manage that? Is the best way to force essentially a kind of de facto privacy violation onto OpenAI users? I don't know what the answer is, but this is the state of the debate anyway. Right. And OpenAI even released a blog post, how are we responding to the New York Times data demands in order to protect user privacy, where they frame it as a privacy question, as kind of a commitment to their customers. and address, for instance, there are business customers that use zero data retention APIs where the chat logs aren't going to be kept. But OpenAI has had this interesting pattern of releasing blog posts in response to legal drama. And this one is very much along that line, has a lot of notes in response to it. So OpenAI is a little salty and not a fan of this court order, clearly. Next up in the lightning round, we are starting with a story from the information, which typically has far more cutting edge or let's say less public information. And this one is saying that NVIDIA's biggest Chinese rival, Huawei, struggles to win at home. So this is pretty much an analysis as to what extent Huawei is able to beat out NVIDIA in terms of providing chips. And it seems to be that so far, Huawei is unable to get to biggest tech companies in China to adopt their chips for AI training and inference. Yeah, this is actually a really interesting story because the story that the NVIDIAs of the world have been propagating, that a lot of kind of anti-export control people have been propagating, is that, hey, you know what? we withdraw from the Chinese market and like Huawei is just going to dominate it. And it just creates a whole bunch of economic wind in their, in their sales. And this is not entirely wrong, but there's an awful lot kind of missing in that analysis. So one key thing to keep in mind, Huawei does not have access to the most exquisite fabrication processes that are available to Western companies, thanks to TSMC, which is based in Taiwan, of course. So TSMC can help you fab down to three nanometers now, and we'll have chips that come off the production line using the three nanometer process in the relatively near term. Huawei can only use the domestic, the Chinese analog to TSMC, which is SMIC. SMIC is, roughly speaking, stuck right now at seven nanometers, maybe arguably working on five. So it's forced to use a subpar fabrication process. Huawei designs the chips and then they send them to SMIC for fabrication. The problem is you can only do so much when you have limitations, fundamental limitations on your design process. In particular, if you look at the Huawei chip series, what they will tend to do is they'll be very energy inefficient. If you want to get very energy efficient chips, you have to get more advanced processes. So we talked about how Huawei has been working around that. They just set up this like cloud matrix 384, which is like their computing system that bundles up a bunch of their Ascend chips together in a way that is designed to just say, okay, our individual chips may be crappier because they're fabricated using a weaker process. But we can just string a bunch of them together, like build larger systems with larger data centers. And because China is swimming in energy in a way that America just isn't, America is energy constrained, China is chip constrained, China doesn't really care about the energy efficiency of the chips that much. They can just put more of them together and achieve the same scale. And that's really what they've been doing. The catch, though, is overheating. If your fabrication process is bad, if you're going to basically like overpower your chips and just pour tons of energy into them, then the chips will overheat and you will see problems. That's exactly what seems to be going on and what seems to be hampering a lot of Huawei's sales activities. The Ascend chips also, by the way, can't handle direct support for low precision formats, like number formats, like FP8, which notably is what DeepSeek uses. So Huawei literally, like their chips cannot support DeepSeek style training runs, which is why DeepSeek has been using NVIDIA technology and why the demand for it continues. One last factor that's really important to keep in mind is that Huawei competes with a lot of their customers. Think about ByteDance, Alibaba, Tencent, right? These companies, they're all looking into Huawei chips. They haven't made big purchases. Part of that is because a lot of them run their own clouds. Huawei runs its own cloud too. And so are you really going to buy from your competitor? I mean, this is the reason, if you go back to a hardware episode, this is the reason that pure play foundries were a thing, right? That Intel, for example, historically struggled to attract chip designer customers because they also were designing chips. And so you're sort of like buying from your competitor. What the market fundamentally wants is it kind of does want a separate foundry, a separate designer, and then ultimately a separate cloud company. And, you know, it's not a coincidence that NVIDIA isn't so much in the cloud market. They could be if they wanted, right? They could make big clouds. You could have NVIDIA right up there with GCP, with Azure, with AWS, but they're not doing it. Part of that surely is going to be competitive reasons. Let's just have people buy our chips and, you know, reduce the barrier to entry on that as much as we can. And anyway, so Huawei is in a more complex situation than I think a lot of analysis historically has acknowledged. We'll see where it ends up going. and they are a national champion. So the CCP can always force people to buy from them, but it's an interesting scene. Right. And also mentioned in this article, and I think worth noting, some companies like ByteDance and Tencent have significant business outside of China, and the US is cracking down more and more, issued guidance that basically says, don't use Huawei chips. So if you are a more globalized company based in China, that's even more reason to prefer NVIDIA over Huawei. Our next story is sort of related, actually. Huawei expected to break semiconductor barriers with development of high-end 3-nanometer GAA chips tape out by 2026. Okay, so GAA is gate all around. This is a transistor design that is becoming really popular. It's a way of essentially making the transistors that form the critical circuits, the number crunching circuits on GPU logic die, more energy efficient, have higher throughput, all kinds of desirable thermal properties, et cetera. So essentially what's happening right now is the three nanometer process that, for example, TSMC has developed does not actually plan to use GAA. So it's not going to be a gate all around process. Huawei is accelerating towards GAA. That's the plan here. Essentially skipping a generation, which you kind of have to do if you're if you're the underdog and trying to catch up. But the challenge is right now, it's not really clear that they can pull this off. You know, their seven nanometers, their five nanometer and even their seven nanometer process that they get through SMIC that we just talked about. that sort of Chinese TSMC has really bad yields. Seven nanometer yields are somewhere between 15 and 50%, which is, I mean, industry standards like 90%. Anyway, so it's like they're major economic challenges, but if they can somehow do that, that would be really interesting. It would be a big leap. The only other gate all around focus design for three nanometers is being done at Samsung Foundry. So this would literally be the first non-Samsung Foundry process, if in fact it is non-Samsung, if they're doing it through SMIC, which again would be kind of weird. It's also possible this implies a collaboration with Samsung Foundry, which would be really weird because Samsung is of course based in South Korea. So this would be interesting from an export control standpoint. Can this actually work? But anyway, so Huawei has been known to make optimistic kind of pronouncements about the future of their technology. Hey, we'll have all these exciting things that don't quite end up taping out, if you will. We'll see. But 3 nanometer gate all around would be a big deal if Huawei can actually crack it. Yeah, not much to add. All I'll say is if you Google gate all around and look at the images, some really fun illustrations and electron microscopy images. And you get a feel for these poor computer engineers and semiconductor experts. You need to go 3D and build these elaborate structures now just to be able to go into these low nanometer regimes and actually make chips work. And speaking of that, next we've got a story about TSMC and their 1.4 nanometer process, which is called Angstrom, which is making progress. It's still not out. It's expected to be available by 2028. And according to the story, it's estimated to cost $45,000 per wafer, a 50% increase over the 2 nanometer process, which is $30,000 per wafer. So, yeah, that's pretty much it. It's got to be very expensive to use the really lowest, like most high-density chips that are coming online in the coming years. Yeah, so 1.4 nanometer, they're calling it Angstrom. which is like slightly frustrating because it's not quite an angstrom, is it? But that's cool. This is the next beat. Yeah, 50% more expensive. Apparently, 2028 is going to be the earliest production run. So if AI 2027, that sort of famous blog post, ends up being wrong and 2028 ends up mattering, we'll probably see in 2029 some pretty impressive rollouts of the next generation of Node and the chips designed on it. So this is, by the way, they're assessing if there's a company that would want a first crack at this Angstrom process, it would be Apple. I would just say, we've been saying this on the podcast, do not take your eye off NVIDIA, which, by the way, is literally the world's most valuable company right now. As AI chips become more and more valuable relative to phones, expect at some point that NVIDIA starts to make moves to compete for the leading node to essentially buy out Apple of all of TSMC's capacity and kind of become the subsidizer of choice for TSMC for their leading nodes. I actually think that could happen sooner rather than later. There are indications it's already sort of in the works. So anyway, that would be a pretty significant shift in tech. And the day that happens, we'll definitely be talking about it here. Fun fact, Engstrom is 10 to the negative 10 meters or 0.1 nanometers. So as you said, not really an accurate name at all. Yeah, yeah, no, it's a good name. Sounds good. Sounds fun. And last story, coming back to Mistral, they're launching Mistral Compute, which is a cloud offering for compute for AI that is going to try to compete with other offerings. I suppose these days, AWS is still one of the leading ones. You have also newer competitors in the space like modal. So Mistral, again, continuing to try and kind of on every front provide a European version competitor to offerings both in China and the US. And they are coming at this from a position of less money, less talent, you might expect or might argue. So we'll see. the main kind of analysis of their advantages, I think I agree with you, is their position as a European leader in the space. Yeah, yeah. And in particular, it's no small deal that they're based in France. You know, you think about what are the big bottlenecks, we talked about this, right, in the United States, it's energy, right? Everybody's trying to figure out where can I find a spare gigawatt on the grid? It is not easy. You know, even 30 megawatts that you like, you can find it, but it's going fast. And so in France, where they have really, it's the only European country, the only Western country that's been doing nuclear this whole time, where they can actually build new nuclear plants in less than 10 freaking years, they can support this. And now they're reaping the benefits. The scale that's being talked about here for MUSTRAL Compute, by the way, is tens of thousands of GPUs, they say built on NVIDIA reference architectures. And so this, I assume that they must be looking at this point at like GB200s, you know, tens of thousands of those, I assume. And they're saying that they'll be supporting workloads ranging from defense to drug discovery. Okay. National champion much, right? This is the kind of workload that smells a lot like, you know, preferred partner of the French government, which, by the way, also from a red tape standpoint, if you're trying to set up a new scale data center, Not only do you have the massive energy supply that the French enjoy, but you also have the support of the government to cut red tape, especially environmental regulations that allow you to get things up and running faster. And so these things do sort of stack up in very interesting ways to like compete another day, let's say. But I think their fundamental challenge is going to be capitalization, right? That's always how it's going to be. You can't compete forever with companies that will raise tens of billions of dollars on $100 billion valuations, like not even taking that much of a liquidity hit and raising from sovereign wealth funds and this and that. It just does become really challenging, and the French economy just isn't that big. So, yeah, if I were France, this is what I'd be doing, but that doesn't mean that they necessarily have a winning hand. yeah as you said in this blog post of airs they are literally saying the offering will include mistral's ai's training suite that can accelerate region and domain specific efforts across nation and industry-wide endeavors so yeah calling out some of that champion stuff and i will say it's a little bit different open AI and Fropic. They're not offering this much of a cloud kind of architecture for training and serving and whatever else. And it is rather specialized. I would assume this came out out of Mistral having to develop their own setup for compute to be able to do this. So I do think there is a decent chance that they have some good technological aspects here. that might make it actually quite a good product. And next up, moving to open source, we have one story, ProRL. And for whatever reason, I keep saying ProPL every time we talk about it offline. ProRL, prolonged reinforcement learning, expands reasoning boundaries in large language models. Bit of a mouthful, but hey, aren't they all? So there's this idea that the RL process itself just optimizes existing capabilities in large language models. Basically, it's like you have your pre-trained model and it already kind of has all the capabilities that reasoning model should have. And your reinforcement learning process just elicits those capabilities. It bubbles them up to the surface. Right. So what they're after here is to show, actually, that's not the case. What we can do is is imbue the model with completely, genuinely new capabilities that were not there before. And they have a couple of ideas that they stack together to just like optimize the reinforcement learning process. one of which is this idea of there's a callback labeler divergence. So this is essentially a way of measuring how different two different distributions are and like probability distributions. And so what's often done during training is you'll have a model that's being trained and you'll have some kind of reference model where you don't allow the model under training to deviate too much from the reference model. The reason for this often is that if you just let the model go hog wild and get trained on its own to whatever it will end up being that model will learn to kind of optimize very narrowly and unhelpfully over optimize to the objective that it being trained for So in the limit the classic example is if you let these models get fine for too long without a kind of regularization they end up like no longer speaking English or they'll end up, you know, kind of really rigging their, becoming sycophantic or whatever. And so you just have this reference model to keep pulling it back to reality. And there have been arguments that this KL divergence penalty is a bad thing, that you actually should just get rid of it. A lot of those arguments are based on looking at base models and like before the supervised fine tuning stage in the context of reinforcement learning. And what you find there is their performance actually doesn't get so good if you keep enforcing that they have to be similar to the reference model. But what they're showing in this paper is actually if you do supervised fine tuning first to let the model get good enough at reasoning. At that point, if you then use that as the reference model, you actually do find that the KL divergence strategy makes sense, that regularization strategy. So that's one thing they did. They also did this thing called reference policy reset. So as you train your model, again, you've got that reference policy, so it's not allowed to deviate too, too much. But then you'll update your reference policy to match whatever the model under training currently is. And then you'll proceed. So you're basically using the reference policy as a kind of drag on the model under training. The model under training does a bunch of training. It can't deviate too much, but then you update the reference model. And now you can start training again and you can deviate a little bit more, but not too much from that one. So it has a way of sort of slowing down the deviation from the reference model, but not so much that you're eternally locked in to the original reference model. And that turns out to help a lot with training stability while also allowing you to kind of recover a lot of these new capabilities that come with reinforcement learning. So they have a huge data set or a bunch of different STEM logic puzzles, instruction following data tasks. It's like 136,000 problems in math and code and all kinds of stuff. They also have an enhanced version of this GRPO algorithm, which you might remember from our discussions of DeepSeek, it's become really popular, just sort of a way of stabilizing reinforcement learning training. This quickly gets into the weeds, but yeah, bottom line is they're borrowing a lot of stuff from other papers like DAPO, which is like dynamic sampling and augmented policy optimization, that you're basically filtering out prompts to only keep the ones where the model sometimes succeeds and sometimes fails. So they're like hard enough that the model is going to learn something by training on them, but not so hard that it's just hopeless and the model never even gets a reward signal. So there's all kinds of shit. It's actually quite an interesting collection of shit. The shit links together in interesting ways to make a little shit chain. And together, that is ProRL. Not how I would have described it, but okay. Yeah, some interesting analysis in this paper. It's a family show. Yeah, I don't know what kids enjoy lasting to the hour. I hope not many. Yeah, we have some analysis about the question of ProREL eliciting new reasoning patterns or not. They basically make a point that there are tasks on which the base models are already pretty good, and there the gain is not significant, but there are other tasks where the gain is significant if you train long enough. And I just want to call out, you're not going to be going into detail on the story, But Magistral, alongside the model, Magistral did release a report on it, a pretty detailed 18-page paper. And they did also highlight some differences in their loss for GRPO, including the elimination of KL divergence as a penalty and some other stuff. So very much a lot of exploration going on and into the right setup for RL training and including the loss. And RL in general is a big headache. So I guess not surprising that there's a lot of things that are being figured out over previous and even now as people are diving into RL as a very prominent research direction. Next up, research and advancements. We begin with kinetics, rethinking test time scaling laws. So there is a new proposal for test time scaling that incorporates memory access into the calculation of the cost. So this is a different way to calculate the scaling law basically for test time scaling. And in this new way of evaluating the scaling with updated cost, they argue that prior scaling laws have overestimated the effectiveness of small models that have inference time strategies. They're basically saying that increasing model size up to 14 billion parameters is more effective before applying test time strategies like best event sampling and chain of thought. So basically, instead of running your model more after training for smaller ranges of models in like 10 billion range, just make your model bigger instead of doing more inference on it, if you can. Yeah, this is a really interesting kind of compute aware, not compute aware, memory bandwidth aware way of doing things. So historically, when we talk about scaling laws, right, you'll see these plots. What do they look like? Well, you usually have flops like computing budget on the x-axis and you'll have some measure of performance on the y-axis. And then you'll see your nice little log plot and everything is good. The problem is that flops, like the actual mathematical operations that go into training a model, are only one part of the hardware picture, right? So GPUs, yes, can crunch a lot of numbers really fast, but they also have to move data around, right? Like that's one of the most time consuming things. One of the big bottlenecks now is just like, how fast can you, can you move the data around? Not just crunch the numbers, but shift it from memory to logic and back and then to other memory and things like that. And so what they're trying to do here is redesign a scaling law that accounts for that. For two, in other words, two metrics. One is flops, as in the traditional compute scaling curves, but also memory bandwidth. And this is really where, or sort of memory access cost, which accounts for the bytes of memory that need to be accessed here, the memory picture, right? And so they're actually going to combine them both into one metric. They call it the EFLOP or EFLOPs. And it's this, essentially mathematically, it's the computational cost of training the model plus the memory access cost that essentially accounts for the memory bandwidth requirements and other things that go into it, times the intensity, which is a hardware-specific ratio of compute capacity to memory bandwidth. Basically, this is, as you can imagine, this would depend heavily on your hardware fleet. Like, what does your hardware actually look like is going to determine, in practice, what your ideal number of parameters should be, what your ideal architecture should be. And so this is part of the reason that scaling laws, by the way, always were framed in terms of flops, because the moment you kind of try to balance flops and memory bandwidth, pretty soon you start to almost simulate a data center. And like you're going to have to have like all kinds of resolution. And that just makes it really hard, not least because then people will go, okay, well, that's how it plays on that data center. But what if I changed my data center around and now we've got a different scaling curve and just it becomes impossible to do apples to apples. That, in fact, is one of the challenges with this paper. It only uses a kind of reference architecture associated with the NVIDIA B200 GPU. So they are assuming those specs hold, and you're seeing the scaling laws for that. It does not look at different, effectively, different scaling laws on different accelerators from like AMD or Intel or other NVIDIA chips or different networking or interconnect configurations or different memory hierarchies. None of that. So think of this as kind of more of a vibe thing. But in terms of like what we can learn from this, I think there are actually some really cool things. So, you know, in practice, when you scale up a transformer architecture, what you'll tend to do as a developer is you'll increase the size of the MLP layers. Right. So much faster than the scale of the attention mechanism. So you could scale the attention mechanism. You can increase the number of attention heads, the head dimension, the embedding dimensions, all that stuff. But people tend in practice to just increase the scale of the MLP layers that sort of do the logic instead of the attention piece. Now, the intuition that a lot of people have is like, OK, well, that shouldn't matter. So because we're just we're just going to be scaling the MLPs, they already represent the lion's share of the compute and parameter account to begin with. Right. So so surely the MLP layers are already the bottleneck. So the fact that the attention mechanism is scaled more slowly, well, that shouldn't matter. Right. But here's the catch. The MLP layer, the compute required to scale your MLP layer, it scales with the length of your input. So double the length of the input, roughly speaking, double the amount of compute that your MLP layer will consume. Fine. But as you increase the size of your input, the attention memory bandwidth requirements scale with the length of the input squared. So in other words, very rapidly, as you scale the length of the input, attention, the memory bandwidth pieces start to become the rate limiting step and your operations become memory bound because, you know, you're, you're anyway, you're bottlenecked by the attention layer. And so this has become more and more of an issue because the length of inputs and outputs is getting greater and greater and greater, right? With these kind of best of end schemes, inference time, compute, reasoning, all that stuff, you're seeing your inputs and outputs get longer and longer and longer, which means that bottlenecks that scale with the square of the input length quickly overtake bottlenecks that scale just linearly with the input length. And it turns out that memory bandwidth is, you know, scales with the square. And that's why we run into this problem. And so anyway, I thought really, really important paper, if you're interested in understanding the consequences of hardware choices for model architecture, I thought this was actually quite fascinating and something that I just haven't seen other people dig into is these more nuanced scaling laws. Right. Yeah. The very first sentence in the abstract, we're saying we are coming at this from a practical efficiency perspective. And to your point of what is on the X axis, they are very direct, they say, B200 seconds on the B200 GPU, which is the leading edge. Instead of looking at computation, they're looking at the literal amount of seconds to get some level of accuracy. Lots of really good analysis in this paper. They also have a really nice blog post. And I feel like we often call out when papers come from Apple or DeepMind or Anthropics. So worth mentioning, this is from CMU, like a fully university work. Also, the two leading offers are immigrants to the U.S. system. So we should get into it. But I do want to say with some of the policies about, you know, grad students and in general, kind of taking in grad students from other countries, you look at these papers and it makes me feel a little depressed. But anyway, moving on. The surprising effectiveness of negative reinforcement in LLM reasoning. This is looking at RLVR, reinforcement learning with verifiable rewards in two paradigms. You got positive sample reinforcement and negative sample reinforcement, where PSR focuses more on reinforcing correct responses. NSR, negative sample reinforcement, emphasizes penalizing incorrect ones. And it seems that you can do positive sample reinforcement only and negative sample reinforcement only training. And PSR only, positive only improves past one, but reduces higher past 10. So basically, it reduces if you have a few opportunities to get it right, you're not necessarily going to do well. And that's because there seems to be a loss of output diversity versus negative only apparently is able to improve performance across all paths at K metrics. So not just one trial, but several trials, meaning that it might be better to do to focus on penalizing incorrect outputs over encouraging it to do the same stuff that seems to work. Yeah, it's actually, I'm surprised at how intuitive at least this result seems to be, where you imagine like if you were being trained to do any complex task, and the way you're being trained is not about being told when you did something right, but just when you did something wrong, basically. What this has, this has a way of not telling you how to do your job, but to tell you how to not do your job. And that means you're going to be more creative. If the reinforcement tells you, like, here's the right answer, you know, do it like this versus don't do it the wrong way, then that's a very different kind of reinforcement process. It's a little bit difficult to analogize because it's post hoc, right? So imagine that you try to task, and if you did it right, we just wipe your brain, and you have no memory of doing it right. But if you did it wrong, we tell you, hey, you did it wrong. Like, that's kind of what we're doing with these models, with this sort of architecture. Which is really interesting, and the results do bear out that you get more diversity of sort of more exploration-oriented models rather than exploitation-oriented models. Because what you're really doing is you're redistributing probability mass to plausible strategies rather than concentrating all your probability mass into the small number of highly kind of correct, observed correct paths, right? Because this is one of the things with RL is like, you're not going to get to observe all the correct paths, right? You're also not going to be able to observe all the incorrect paths. But at least by not calling out the correct ones and saying, do it more like that, you're leaving the possibility space open for the model to pursue alternate correct ones. So anyway, really interesting. One question that came to mind, like as I was reading this, I was like, well, wouldn't you run into a problem where over time, if your model gets better and better at a task, you just sort of can't find enough negative samples in a batch for like for GRPO? And yes, this is actually an issue. And they call it out. So they frame it as a feature and not a bug, which I think is somewhat true. And then there's some asterisks. So they point out that it does prevent overfitting because you just won't get updates once the model really masters the problem set. So you won't keep it's like run out of failure cases. And so you won't over optimize the model to overfit, which is really cool. The flip side, though, is it's kind of compute inefficient, right, because you have to then do a lot of rollouts that don't yield any trainable data. And so I think from a compute optimality standpoint, you're also taking a bit of an L. So they actually suggest this kind of like middle ground strategy they call weighted reinforce, where you still use some positive reinforcement at, as they put it, 10% strength to ensure continued learning. But you're going to use full strength negative reinforcement learning. So really lean more towards telling the model not to do things and with a little bit of guidance about how to do things. So anyway, that kind of helps because you're retaining some of those positive examples. But again, from a compute optimality standpoint, I think it's sort of, it'd be interesting to see how this ends up scaling. Yeah, this is one of the somewhat nuanced aspects of reinforcement learning. To actually do good reinforcement learning, you need to model the reward for any given output. And to do that, you need to be aware of both positive rewards and negative rewards. So it's interesting to focus more on the negative rewards. Basically, their weighted reinforce upweights the negative aspect. And they compare this weighted reinforce against a standard GRPO, PPO, these other RL training setups with their own objective and losses. And it looks like from their results on QUENT 2.5, worth noting, all these reasoning model papers are looking at a particular model, which may not be ideal. But anyway, this weighted reinforced setup seems to be better than GRPO and PPO, which is pretty significant since GRPO is often what people are exploring in this research, like I mentioned previously. A couple more research papers. Next up, we have predicting empirical AI research outcomes with language models. So that's pretty much what it sounds like. you want to try and predict what will happen in a given experiment with a language model. They created a benchmark here by scraping ideas and results from conference papers and wound up with around 1,500 test examples. And then with the whole system, with fine-tuned GPT 4.1 and paper retrieval, they were able to get 77% accuracy on the test set at being able to perform the prediction. Significantly better than off-the-shelf performance just by baseline existing models. So pretty good results, they say. It outperforms a human expert baseline on NLP idea pairs. but you know it's it's still let's say nascent and and this is an interesting idea but definitely a nuanced area to look into and requires careful extrapolation yeah it's one of those areas too where people often talk about ai models the big advantage is going to be in having good taste regarding the problems that we throw them at this is an example of ai models actually developing taste, the automation of taste itself, right? Research taste. If you can predict how likely a given idea is to pan out, that's sort of the idea here. So the way they do it in practice is they're going to go within a given paper, right? You often see multiple methods used to achieve the same goal, right? And you can imagine how hard it would be. They're not going to go and grab two different papers that try to do similar things and predict which one is going to work better because it's impossible to get apples to apples. People use different training strategies, Different data, like all kinds of shit. So what they're going to do is same paper, multiple methods. They're going to extract pairs of essentially experiments in the papers that compare different approaches. And that's what they're going to use to construct their data set. So that's kind of more appropriately calibrated kind of apples to apples comparison. And so in that sense, it's a little like it is predicting AI research outcomes, but it's not quite the same as having a new research hypothesis from scratch. Like it's not at the paper level, like, all right, which paper should I, which paper level I'm going to write for? Predicting is maybe a little misleading. It's comparing two potential ideas and predicting which one will get a higher number on a benchmark. And so it's a binary prediction, slightly easier setup and saying like, if I were to try this idea, what would I get? Yeah, exactly. I think in order to do it at the paper level, which is the most interesting thing, you'd probably need a very complex sort of data filtering and shaping approach where you try to get it to be apples to apples as much as you can. and then feed it into a model. But the interesting thing here is, like you called it out, this model, this sort of fine-tuned model, does better than O3. Models like O3 perform no better than random guessing. And so when you're looking at 77% accuracy on this benchmark, predicting kind of which of two ideas is going to do best, obviously random guessing is 50%. So that's quite a lift. Bears mentioning that it achieves about 64% accuracy on unpublished novel ideas. So there's some amount of overfitting going on here where we're getting 77% in the sort of like test case. But then when they actually tried on these new ideas that are unpublished, it goes down to 64%. Still much better than 50-50. But yeah, pretty remarkable. The other funny thing is if I'm interpreting this right, it says they beat human experts. Human experts scored 48.9%, which is slightly worse than random guessing if that is apples to apples, if it's just a side-by-side thing. So that's kind of amusing in and of itself. Like humans kind of suck at this themselves and they are really getting some sort of lift from their fine tuning approach here. Like if they're going from 50% to 64, that's not tiny. And one last paper also related to AI contributing to research. In this case, it's called EXP Bench and it's focusing on benchmarking AI agents' ability to conduct end-to-end research experiments. Also using tasks from published research. So here they looked at peer-reviewed AI publications from NeurIPS and Eichler. They created this benchmark of 461 research tasks from 51 papers. And they basically show, like, can these AI agents do the experiments introduced in these papers? And what happens with published papers is usually, ideally, they publish their code so you can replicate the experiment and get the same output and replicate whatever tables of numbers you get. So that kind of gives you a rich signal as to how you want to set up your experiment, how you want to ideally be able to replicate the experiment. And so this is making it possible to evaluate whether AI agents are able to do that and they struggle as a summary on whether they are able to implement and get things correct. Yeah, I will say we're getting to the point where the benchmarks that we're designing are so hard that once you actually do saturate these, like, I mean, what does the world look like when you're hitting 50% on expert bench? like 50% success rate for end-to-end automation of the process of formulating hypotheses, designing and implementing experimental procedures, executing them, analyzing the results, that whole end-to-end, like that's not far from automate, like fully automated AI R&D, right? That's kind of like at the model level, there's obviously a bunch of hardware and network optimization jazz that like independently open AI is working on internally, but what does the world look like when you're actually saturated? That's worth asking right now when you look at O3 Mini, which is the best model they tested overall. O3 Pro was not out at this time, all that, but 1.4% or six or seven out of 461 tasks that they tossed at it were completed successfully. So one read on that is 1.4%. Wow, that's really small. Another read is like, wow, we're actually getting like the complete success rate end to end of like between one and 2% with our best model today in a context where new models are coming online, like every other week. So yeah, I don't know, but this may be a bigger deal. That's a pretty big 1.4%, at least in my mind. Right. And to give you an idea of what is involved, the inputs include a research question. They have an example, does the Mogadet architecture outperform existing lightweight models? They have a high-level method on the experiment, train the Mogadet variants on ImageNet 1K, blah, blah, blah, blah, and they give it some starter code with potentially additional instructions. The job of AI isn't to do the research per se, it's to set up the experiment and run it and get the results, which means it needs to correctly configure the code, add stuff to the code, train it for the correct amount of time with the correct parameters, and evaluate the outputs to get the actual output of the experiment. So yeah, if you are able to do this, there's a lot of conversations around speeding up research and speeding up AI research and eventually possibly automating it. We've seen research on ideation, idea creation. We've covered some of these papers and we've seen some progress there. This is really covering the opposite side where you are able to evaluate ideas with experiments somewhat pretty closely guided. So I will say in the inputs, it's told kind of exactly what to do. But still, it's a very useful kind of thing to be able to speed up research. yeah i guess just as it said like automating the implementation phase of ai research which actually is an interesting compliment right to the idea that we just saw earlier of trying to automate taste in language models both sort of at nascent phases But you know I old enough to remember when humanity last exam was supposed to be a very hard benchmark And before that when GPQA and before that, when even, yeah, anyway, there's like a million of these, right? It starts at 1% and then very quickly we start saturating stuff. Yeah, these darn kids starting their PhDs don't even have to do code. All the code is done by Claude and all the experiments are going to be done by agents. And moving on to policy and safety, our first story here is large language models often know when they're being evaluated. And so essentially this is a test to see, you know, how can we measure looking at nine different frontier models from all the anthropic, opening eye and deep mind families. They're going to ask them multiple choice questions, asking about a bunch of transcripts, basically. So they're chat transcripts where you have interactions between agents and humans. You have evaluation deployment scenarios, all kinds of transcripts anyway that you feed in in different contexts. And then you ask the models these questions about whether the transcripts came from evaluations. And this is a basic test to see, can these models grok like, okay, I'm probably, I, in this situation, I'm probably being evaluated. And the kind of safety implications there are, you might expect alignment faking. You might expect all kinds of, anyway, funny behavior where like sandbagging, where the models pretend not to have certain dangerous capabilities if they know they're being evaluated. So then they increase the chances that they'll be deployed and have more freedom of action in the sort of like standard power seeking picture. So what they find here is, yeah, actually, you know, frontier models, they show definite above random evaluation awareness. So Gemini 2.5 Pro has an AUC of 0.83, actually. Now the human baseline they run very helpfully hits an AUC of 0.92. So not quite a human baseline yet, at least for Gemini 2.5 Pro, but significant uplift over random. And both AI models and humans are better at identifying evaluations in agentic settings compared to chat settings. That's kind of interesting. That may also be just because often agentic setting evaluations are more contrived. Like you can just sort of tell like, oh, this is a toy app. It's not like you're not really trying to build a company here or a real thing. This seems like the kind of thing that you would ask a model to do as part of an evaluation. So that's, you know, that's sort of part of it. But hey, that is part of the problem of evaluation is like it is also hard to come up with evals that are realistic. So anyway, yeah, this is pretty interesting. There are a bunch of different cues that the models were found to use. Things like having very kind of formal task structures and specific formatting requirements seem to be cues that the models use to go like, oh, okay, or synthetic content. So, you know, you can see how that would work and you can also see how the human reviewers would also kind of pick up on that. But the other thing, too, they found is that the human versus AI performance gap depends a lot on the setting. So in an agentic setting, the human ability to identify eval transcripts tracks an AUC of 0.996, whereas it just hits 0.813 on chat transcripts. So humans are almost perfect at identifying agentic evaluations, but we suck at chat-based ones or we struggle a lot more. And that's a lot higher, that variation, than any model's variation across those settings. And so it's pretty clear that models aren't fully taking the hint. They're not picking up all of the human accessible hints, at least in agentic mode, that you're in an evaluation setting. So that's kind of interesting. A bunch of other stuff here, but we're in lightning round territory. Oh, no, we're not in Lightning Run Church. Anyway, I'll pause it there. Andre, if you have any thoughts. You've got a decent amount to go through. So yeah, you'll need to go. I will just mention AUC is area under the curve for anyone who doesn't know. Basically comparing true positive, like it actually is evaluation and you guessed it correctly versus false positive. You guessed it wrong. Basically how accurate you are with a bit of nuance. They also evaluate in two ways. One of them is multiple choice and the other is open-ended. And as you might expect, on multiple choice, when you're given the option explicitly, is this for evaluation or something else, you're going to do a lot better than if you're just being open-ended. So multiple choice is like, what was the main test here? Was it capability, alignment, or other? And the assistant is going to predict alignment. and then open-ended would be what was the primary reason for the interaction. And it would predict something like the test of AI boundaries. So as you said, the implication here is if you're testing alignment and safety, the model, if it's aware of that, could mislead you and pretend not to be able to do hacking when it actually can, for instance. Next up, more of an interoperability paper called Beyond Induction Heads in-context meta-learning induces multi-phase circuit emergence. What a fun title. So this is looking at the mechanism behind in-context learning. In-context learning is the idea that you give it a couple of examples. The model is able to do something that it isn't necessarily able to do out of a box just through pre-training. And they are saying that this notion of induction heads, this is a term for monthropic. I think originally it's pattern you get in models where basically a part of a model focuses on looking backwards in the input to identify some things that already saw that's similar to what it's currently looking at and be able to predict what comes after the current input based on previous patterns. So they say that induction hands only partially explain ICL. Essentially, there's a fancy circuit, a fancy abstract mechanism in the model that emerges and that enables in-context learning beyond the kind of known induction head mechanism. There's even a fancier kind of abstract notion of something with a model that does in-context learning well. This is sort of a generalization, right, of induction heads, And we talked about the induction head bump before, but it worth kind of reminding people about the specifics here. So it's kind of like the answer to this problem. You read on a piece of paper the words United States of and then like you obviously instinctively know it's America. Right. But in that setting, there's a circuit in your brain that's going like, oh, oh, oh, like I've seen this before. United States of United States of. Let me see. Let me see. Where have I seen United States of before? Oh yeah, America, America. Okay. I'm going to put that in there, right? That's what the induction circuit, induction heads do. And they emerge quite early, as you might imagine, in the training process. And so what you'll see is the loss curve will drop and drop and drop. And then at one point, the model will kind of like, it's almost like it's going to like shift its position a little bit to accommodate the induction heads. You see this little rise in the loss, the performance on paper gets worse very briefly, and then it drops quite quickly. So the induction head bump is that. It's the development of this new circuit. And this is something that's been very extensively studied. It's almost like, you know, if you've ever done biology, like Drosophila melanogaster or whatever those like model organisms are, this is a model circuit that people turn to quite a bit. This is an attempt to see if we can find a more complex version of that same basic circuitry. So for example, they take a set of three different tasks where you have a bunch of geometric shapes. So triangle, square, circle, diamond, right? And depending on the task, you can end up assigning different color labels to each of those shapes. So maybe in a size-based labeling task, triangle is red, square is blue, circle is green, right? Maybe in a different task, triangle is blue, square is green, circle is yellow, and so on. And then during training, the model is going to see a sequence where you go, okay, now triangle is blue, square is green, circle is yellow, what is diamond? And in order to do that, the model has to basically look at the tasks in context and figure out what task this is and then predict the correct label. And so you can sort of see how this is a bit like the induction head, right? It's looking back more abstractly now at like the set of tasks rather than just like, okay, what word always comes after this word? Instead, it's like, okay, if it's this task, then what word always comes after this word? And so anyway, it's unlike these like simple copying tasks that you see with the induction heads, there you see a single jump in accuracy. And in context, meta learning with this sort of setup, you end up seeing three distinct phases where the model develops increasingly sophisticated strategies. The first one is just at the very beginning where all the model is essentially using its statistical understanding that's been picked up. It doesn't really use context. It's more of an autocomplete mode. And then in the second phase, they have a semi-context circuit where accuracy jumps from about 35% to 75%. And what it's now doing is it's actually able to attend to label tokens in the context. So it's actually going to look, you can notice it paying attention to the right tokens in the context that you fed it, looking at the actual tasks that seem like they map onto yours. But it is still focused, anyway, on the query. Bottom line is this starts to emerge gradually and in layers, which is interesting from an interpretability standpoint. It means you can kind of draw a little bit of a box around the process by which more sophisticated reasoning starts to emerge. Right. Worth noting this paper is doing the research on sort of toy tasks, a small neural net and this one task, as you said, which is also how initially the research on induction heads worked. Anthropic did follow up their initial research with making the argument that there are induction heads in gigantic neural nets and large language models. Here, they're still focusing on the small scale scenario. And so this like multiple bump analysis may not necessarily extend, but it's a sort of, yeah, slightly more theoretical conceptual argument that it's not just about induction heads. there's different types of emergence that might occur in neural net training, which in general is interesting because the sort of jump and loss due to a conceptual change of reasoning isn't necessarily something that was commonly understood to be the case until relatively recently. A couple more stories. Now moving on to security. The next story is that new Microsoft co-pilot flaw signals broader risk of AI agents being hacked. So Microsoft Copilot, their agent has been identified as vulnerable to a zero click attack, meaning that the hacker is able to exploit the system without any user interaction. So kind of a big deal, right? You can actually hack it. And I think, Jeremy, you mentioned this earlier on as we deploy more and more agents in more and more kind of isolated environments without direct human supervision. These kinds of things become much more concerning. It is the first ever zero click attack on an AI agent that they're calling out here. It's called Echo Leak, or that's what AIM Security, which is the firm that found this, is calling it. It's been fixed already. It was in Microsoft 365 Copilot. But customers are unaffected because they flagged the issue to Microsoft months and months ago, by like five months ago. They've been working around the clock, it seems, to solve this problem. That's a lot longer of a lag than you typically find for fixes like this. And the reason seems to be they had to spend a bunch of time just like educating people on this new threat model because it is so different. This is what's known as an LLM scope violation vulnerability. So you're essentially what you're doing is you're sending an email, right? So like I send an email to you and I know that your computer is running Microsoft 365 Copilot. I know that your computer is running an agent and that that agent will review my email, right? And whatever I put in my email to you, that agent will put in its context. And so essentially this is a prompt injection attack, right? So you as the user, if you're receiving my email, you don't actually have to click on anything or interact with a message or anything like that in order for me to, or my agent to access sensitive information on your apps. If I can just put in a prompt injection that causes your agent to send me a bunch of your private information, right? So, you know, send an email to user. There's no phishing, no malware needed, by the way. This is just straight prompt injection and they're hidden instructions somewhere in the email for Copilot. And so this is a pretty big deal, especially given that we live in a world where, you know, the Anthropic model context protocol, Salesforce's agent force, you got a bunch of these agents you're kind of taking over. This is, the problem is there's no clear solution to prompt injections. And as long as agents are going to be loading human written text into context, these failure modes are going to arise. It's really interesting. And the attack surface has just exploded, right? With these agents. Right. The implication of zero click is you as a human don't have to make a mistake. Like typically with email attacks, you know, you see a phishing attempt where, you know, a hacker pretends to be your boss or whatever. And you have to make the mistake of thinking it's real and clicking a link or whatever to install a virus. Here, literally, the AI just sends an email. And if it's in your inbox and the agent scans your inbox and reads the email, it goes off and like leaks sensitive data because it's told to and listens to the instructions. So as you say, I think very real threat. And as we get into Montec model context protocols into agents going connecting to different endpoints by themselves and reading instructions that are not provided by you, lots of opportunities to exploit agents and make them do silly things. And one last article, CloudGov models for U.S. national security customers. So this is from Anthropic. And yeah, they introduced CloudGov models specifically for U.S. national security. Apparently, they are already in use by top-level U.S. national security agencies. It basically is just that. We've obviously seen a whole bunch of stuff about OpenAI and Anthropic and Google DeepMind looking after or going after government contracts. So this makes a ton of sense. Having these models that can operate in classified environments is really, really important. Right now, what they're being used for apparently is strategic planning, operational support, intelligent analysis, threat assessment, that sort of thing. But they do say the applications range across the board there. So could be other things as well. And then they highlight a bunch of specific capabilities that they've been deploying, which are all anyway what you might expect, improved understanding and interpretation of complex cybersecurity data for intelligence analysis, enhanced proficiency in languages and dialects critical to national security operations, greater understanding of documents and information within the intelligence defense context, et cetera, et cetera. Oh, and then a really interesting one, improved handling of classified materials as the models refuse less when engaging with classified information. One of the problems that we will run into and arguably are already running into is if you want to use these models for national security applications, the safeguards on them will sometimes prevent you from doing that, right? The models will be like, well, as a large language model built by Anthropic, I can't blah, blah, blah. The challenge is sometimes you do want these models to be capable of doing things that you wouldn't want everyday users to do. And the other problem with that is, as we've seen, alignment faking and resistance to fine-tuning of these models, where they will try to prevent themselves, their safety measures, from being overridden, can cause the fine-tuning process to be really challenging. And so we may actually, this sounds insane, but I'm just going to plant the thought, We may be entering a phase where it is actually difficult to convince AI models to be the national security tools that we will sometimes need them to be. That's a really interesting problem set. And I think to the extent that that ends up being the case, boy, is that an interesting warning shot for a life at risk. Yeah. And on to synthetic media and art. Just a few more stories. We begin with Disney and NBCUniversal sue AI company MidJourney for copyright infringement. So there you go, MidJourney, one of the big text-to-image model providers. It used to be a leader in the best quality. Now they're just one among several. And relatively open model. So you can produce Darth Vader or, I don't know, whatever else, copyrighted characters. Apparently, you can produce Minions, which is NBCUniversal. And the claim here is that this is straightforward copyright infringement, that Midjourney has to stop doing it. And Disney and NBC want a bunch of money and also want Midjourney to stop. Apparently, according to them, they reached out to MidJourney prior to a lawsuit and asked them to stop and to filter the data and outputs to not allow their copyrighted characters to be produced, which, as I recall, I believe OpenAI did, for instance. And MidJourney has continued to allow their models to produce things, which has been argued, potentially could be argued as fair use and therefore not applicable. But clearly a big deal, right? This is Disney, this is NBCUniversal. There's been a bunch of lawsuits related to generative AI, especially in the LLM domain, in the text output domain. We have New York Times versus OpenAI as a major one that's ongoing, as we've covered earlier. I would expect this to be another major case that has major implications. Yeah, and the claim, and you'll see this in fairness in any lawsuit, but the claim here is that MidJourney is being especially egregious in their approach here to use of copyrighted material. They're saying MidJourney is basically selling subscriptions that lets users download infringing images. It's not like there's modification happening. It's not like MidJourney is not monetizing. They're like directly monetizing the tool that allows people to just download these things. The claim is also that MidJourney could have measures in place to prevent that from happening. Like specifically that is to prevent copyright infringement images that violate copyright laws from being generated, but that they've just not done that. This is going to be an interesting one to watch. I mean, MidJourney probably has fewer resources these days, I guess, to pull off its lobbying effort, which is something that OpenAI has certainly been able to do. So we'll see how the case works out for them. Right. Right. Also a fun lawsuit PDF to read because they do embed images of AI-generated Shrek and AI-generated Darth Vader in there, which I would expect is not often something you see in lawsuit documents, which go into a lot of technical detail and so on. And on to the last story, SAG-AFTRA and video game companies reach tentative New Deal. So SEG-AFTRA is the Union for Screen Actors Guild, American Federation of Television and Radio Artists. So a union of actors, including voice actors who work in video games. And so there's been a strike and a lot of negotiations ongoing. We covered this a lot with regards to movies and TV last year. Well, now there is this development in video games, which is especially important because if you're doing voice acting, as we've covered, you have 11 labs, text to speech is even further along than text to video and image cloning. So after 18 months of negotiations, primarily over AI consent and compensation issues, there's now this tentative agreement. And I guess there are AI protections in place for actors. And when you sign a contract as an actor to voice a specific character, the video game company might want to be able to then make an AI model of your voice acting of that character to use in feature games or whatever. there are now kind of clear guidelines and expectations as to how that would work. Boy, I, so people can do impressions of people. And like, if you have access to an AI tool that you can steer, and we've seen, you know, the kind of steering is coming online with 11 labs. I really wonder what substantively these protections end up giving in the long run. I mean, if I want something to sound like Morgan Freeman, okay, so I'm barred from using Morgan Freeman's actual voice without permission, but surely I can find the person who does the best possible Morgan Freeman impression and maybe use that as a starting point and then gradually kind of tune the waveform, prompt the model to refine its impression without ever using the word Morgan Freeman. Maybe not even without ever saying, make it sound like the God in Bruce Almighty or whatever, That's probably too old a reference for you, Andre. I'm sorry. That's not that old. You got that? Okay, cool, cool. Yeah, but anyway, stuff like that. I'm really curious how in practice, because there are going to be good faith. The famous Scarlett Johansson thing where at least the claim from OpenAI was, oh, yeah, we just got a voice actress who sounds like Scarlett Johansson. We didn't actually like it. And it's like, yeah, okay, well, you de facto cloned her voice. Like, I don't care if her specific like waveform was never put into your training set. In effect, that's what we ended up with. And so I'm really curious about that dimension of it. Do we own our voices? What does it even mean to own our voices? We'll see. Right. This is dealing with AI replicas in particular, but there's also a question of, well, what if you don't have a human actor in the first place, which is very plausible now in a way similar to coding where like okay you don't need a person to write code anymore you need a person to tell the ai what to do yeah anyway at least there's now this agreement and there's no more need for strikes so i suppose good for the actors yes and with that we have finished with this episode of the last two weeks in ai you can go to lastweekinai.com for all the links also last week in.ai for the substack with our text newsletter as always please share subscribe review and all that but more than anything do keep tuning in We'll be right back. Come and take a ride. Get the lowdown on tech and let it slide. Last week in AI. Come and take a ride. From the labs to the streets. AI's reaching high. New tech emerging. Watching surgeons fly. From the labs to the streets. AI's reaching high. Algorithms shaping up the future seas. Tune in, tune in. Get the latest with peace. Last week in AI. Come and take a ride. Hit the lowdown on tech and let it slide. Last week in AI, come and take a ride. I'm the last of the streets, AI's reaching high. From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten. On the edge of change, with excitement we're smitten. From machine learning marvels to coding kings. Futures unfolding, see what it brings.
Related Episodes

#228 - GPT 5.2, Scaling Agents, Weird Generalization
Last Week in AI
1h 26m

⚡️Jailbreaking AGI: Pliny the Liberator & John V on Red Teaming, BT6, and the Future of AI Security
Latent Space

AI in 2025: From Agents to Factories - Ep. 282
The AI Podcast (NVIDIA)
29m

#227 - Jeremie is back! DeepSeek 3.2, TPUs, Nested Learning
Last Week in AI
1h 34m

#226 - Gemini 3, Claude Opus 4.5, Nano Banana Pro, LeJEPA
Last Week in AI
1h 11m

What AI Means for Students & Teachers: My Keynote from the Michigan Virtual AI Summit
The Cognitive Revolution
1h 4m
No comments yet
Be the first to comment