
#222 - Sora 2, Sonnet 4.5, Vibes, Thinking Machines
Last Week in AI • Andrey Kurenkov & Jacky Liang

What You'll Learn
- OpenAI's Sora 2 text-to-video model has improved video quality and a new 'cameo' feature to insert user faces
- Anthropic released Sonnet 4.5 and Claude Code 2.0 updates, positioning them as best-in-class for coding, tool use, and long-range reasoning
- Sora 2 videos show improved photorealistic quality and physics simulation, but raise potential copyright issues with generating clips of existing media
- Sonnet 4.5 is seen as a bit more thoughtful and less likely to simply agree with users, with a 1 million token context window
- Continued progress in text-to-video and language models, but challenges around regulation and copyright remain
Episode Chapters
Introduction
The hosts discuss the recent irregularity in the podcast schedule and preview the topics to be covered in the episode.
OpenAI's Sora 2
The hosts discuss the release of OpenAI's Sora 2 text-to-video model, including its improved video quality, new 'cameo' feature, and potential copyright issues.
Anthropic's Sonnet 4.5 and Claude Code 2.0
The hosts discuss Anthropic's updates to their language model and AI agent development platform, highlighting the improvements and positioning compared to competitors.
Continued Progress and Challenges
The hosts summarize the overall progress in text-to-video and language models, as well as the ongoing challenges around regulation and copyright.
AI Summary
This episode of the Last Week in AI podcast covers the latest developments in AI, including the release of OpenAI's Sora 2 text-to-video model, which has improved video quality and new features like 'cameos' that allow users to insert their own faces into generated videos. The hosts also discuss Anthropic's release of Sonnet 4.5 and Claude Code 2.0, which are updates to their language model and AI agent development platform. The discussion highlights the continued progress in text-to-video and language models, as well as the challenges around copyright and regulation that these AI systems face.
Key Points
1. OpenAI's Sora 2 text-to-video model has improved video quality and a new 'cameo' feature to insert user faces
2. Anthropic released Sonnet 4.5 and Claude Code 2.0 updates, positioning them as best-in-class for coding, tool use, and long-range reasoning
3. Sora 2 videos show improved photorealistic quality and physics simulation, but raise potential copyright issues with generating clips of existing media
4. Sonnet 4.5 is seen as a bit more thoughtful and less likely to simply agree with users, with a 1 million token context window
5. Continued progress in text-to-video and language models, but challenges around regulation and copyright remain
Topics Discussed
- Text-to-video generation
- Language models
- AI agent development
- AI regulation and copyright
Frequently Asked Questions
What is "#222 - Sora 2, Sonnet 4.5, Vibes, Thinking Machines" about?
This episode of the Last Week in AI podcast covers the latest developments in AI, including the release of OpenAI's Sora 2 text-to-video model, which has improved video quality and new features like 'cameos' that allow users to insert their own faces into generated videos. The hosts also discuss Anthropic's release of Sonnet 4.5 and Claude Code 2.0, which are updates to their language model and AI agent development platform. The discussion highlights the continued progress in text-to-video and language models, as well as the challenges around copyright and regulation that these AI systems face.
What topics are discussed in this episode?
This episode covers the following topics: Text-to-video generation, Language models, AI agent development, AI regulation and copyright.
What is key insight #1 from this episode?
OpenAI's Sora 2 text-to-video model has improved video quality and new 'cameo' feature to insert user faces
What is key insight #2 from this episode?
Anthropic released Sonnet 4.5 and Claude Code 2.0 updates, positioning them as best-in-class for coding, tool use, and long-range reasoning
What is key insight #3 from this episode?
Sora 2 videos show improved photorealistic quality and physics simulation, but potential copyright issues with generating clips of existing media
What is key insight #4 from this episode?
Sonnet 4.5 is seen as a bit more thoughtful and less likely to simply agree with users, with a 1 million token context window
Who should listen to this episode?
This episode is recommended for anyone interested in Text-to-video generation, Language models, AI agent development, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
Our 222nd episode with a summary and discussion of last week's big AI news! Recorded on 10/03/2025. Hosted by Andrey Kurenkov and co-hosted by Jon Krohn.

Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai. Read our text newsletter and comment on the podcast at https://lastweekin.ai/

In this episode:
- OpenAI introduced several new features, including Sora 2 for text-to-video generation and the Pulse feature for personalized morning briefs, while Anthropic released Claude Sonnet 4.5 for coding and agentic tasks.
- Meta launched a new AI video creation feature called Vibes in its Meta AI app and on meta.ai, facing mixed reactions from the public regarding AI-generated content.
- California's SB 53, the Transparency in Frontier AI Act, has become law, requiring large AI companies to disclose safety and security processes, while SB 942 mandates AI detection tools for user-generated content.
- AI regulations and industry dynamics, including battles over intellectual property, startup funding, and the integration of AI into everyday tools and services like Microsoft's AI agents for Word, Excel, and PowerPoint.

Timestamps:
(00:00:10) Intro / Banter
(00:03:08) News Preview
(00:03:56) Response to listener comments

Tools & Apps
(00:04:51) ChatGPT parent company OpenAI announces Sora 2 with AI video app
(00:11:35) Anthropic releases Claude Sonnet 4.5 in latest bid for AI agents and coding supremacy | The Verge
(00:22:25) Meta launches 'Vibes,' a short-form video feed of AI slop | TechCrunch
(00:26:42) OpenAI launches ChatGPT Pulse to proactively write you morning briefs | TechCrunch
(00:33:44) OpenAI rolls out safety routing system, parental controls on ChatGPT | TechCrunch
(00:35:53) The Latest Gemini 2.5 Flash-Lite Preview is Now the Fastest Proprietary Model (External Tests) and 50% Fewer Output Tokens - MarkTechPost
(00:39:54) Microsoft just added AI agents to Word, Excel, and PowerPoint - how to use them | ZDNET

Applications & Business
(00:42:41) OpenAI takes on Google, Amazon with new agentic shopping system | TechCrunch
(00:46:01) Exclusive: Mira Murati's Stealth AI Lab Launches Its First Product | WIRED
(00:49:54) OpenAI is the world's most valuable private company after private stock sale | TechCrunch
(00:53:07) Elon Musk's xAI accuses OpenAI of stealing trade secrets in new lawsuit | Technology | The Guardian
(00:55:40) Former OpenAI and DeepMind researchers raise whopping $300M seed to automate science | TechCrunch

Projects & Open Source
(00:58:26) [2509.16941] SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Research & Advancements
(01:01:28) [2509.17196] Evolution of Concepts in Language Model Pre-Training
(01:05:36) [2509.19284] What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT

Lightning round
(01:09:37) [2507.02954] Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III
(01:12:03) [2509.24552] Short window attention enables long-term memorization

Policy & Safety
(01:18:11) SB 53, the landmark AI transparency bill, is now law in California | The Verge
(01:24:07) Elon Musk's xAI offers Grok to federal government for 42 cents | TechCrunch
(01:25:23) Character.AI removes Disney characters from platform after studio issues warning
(01:28:50) Spotify's Attempt to Fight AI Slop Falls on Its Face

See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.
Full Transcript
Hello and welcome to the Last Week in AI podcast, or the last two weeks in AI podcast, as has been happening lately, where you can hear us chat about what's going on with AI. As usual in this episode we will summarize and discuss some of last week's most interesting AI news, and you can also check out our Last Week in AI newsletter at lastweekin.ai for stuff we did not cover. It comes to you every week in your email. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and I now work at a generative AI startup. And once again, Jeremy is off on a very secret mission. He has told me he'll probably be more available in October, so I'm hoping we'll come back to the regular schedule soon, but for now we have one of our regular guest co-hosts, Jon Kron.

Hey, what's up? I would pronounce my name Jon Krohn, but whatever. I'm tired. I'm sorry, it is early on the West Coast for you. And it is a common mistake. I'm actually kind of glad you make it so that I can try to correct it. It's like a virus out there. It's actually been a decade of it: with my first book, Deep Learning Illustrated, the publishing company Pearson behind it, internally everyone there calls me John Kron. And then through them, I got exposure to the O'Reilly ecosystem, and so everyone at O'Reilly started calling me John Kron. And the same thing with the conference circuit that both O'Reilly and Pearson are plugged into. So it's a common mistake, a virus that I'm trying to stomp on everywhere.

I'm glad we can correct this error. And I really should know better at this point. You've been on the podcast, what, like five, six times at this point? At least, at least. I'm probably pushing 10. I could try to enumerate it later. The easy way to remember, Andrey, is that it's just like the bowel disease, Crohn's disease. That's me. Easy, very easy.

Well, Jon, I'll quickly mention you are, of course, the host of the Super Data Science podcast. I see you have the cap, which makes it very easy for the YouTube viewers to remember. You have interviewed a ton of people in the AI and data science world and are quite plugged into AI, less on the academic front, more on the hands-on front. So I think you're always a great co-host. And this week, we'll be talking a lot about recent releases, a lot of business stuff, so it should be kind of a good fit. Perfect. And yeah, we do sometimes get some academic folks on the show as well. We had Andrew Ng on the show, although I guess the amount that he's an academic is decreasing all the time as well. Pieter Abbeel has been on the show. We've got Ethan Mollick coming up soon, well-known. Although that's actually the Wharton School, so it's like, that's not really... Good point. Good point, Andrey. Academic, businessman, you know, what's the distinction? It's hard to say. If you want that nine-figure salary, you've got to get a little bit of both. That's right.

Well, to give a quick preview of what we'll be talking about, it's a pretty exciting week. We've got to start with Sora 2 from OpenAI making just some crazy AI videos. Then Sonnet 4.5. That's something I've been especially excited about. Like five more updates to products to discuss. Then in applications and business, various other developments from OpenAI, from competitors to OpenAI, things like that. Research and advancements: we've got some new exciting benchmarks, which we always like to talk about, and some interesting kind of interpretability work, which I think will be interesting.
And then policy and safety, some new law stuff in California, which is another topic we often touch on. Before we dive in, really quickly, I do want to respond to some listener feedback. We've got a new review on Apple Podcasts: "best AI podcast - still alive?" So yes, still alive, but it is a fair point that the recent trend, the past month in particular, has been fairly ad hoc. We have been missing a week or two weeks in some cases. So as I've said in the last couple of episodes, Jeremy is off on a big project of some sort and has basically bowed out so that he wouldn't kind of miss the dates we usually have, and I'm not super consistent in getting guest co-hosts on time. But as I keep saying, we will try to come back to the regular weekly or mostly weekly schedule that we have mostly managed to do for much of this podcast's history, but not so much lately.

With that out of the way, let's get into tools and apps. So first up, OpenAI's Sora 2. Sora, of course, is the text-to-video model from OpenAI. They first released Sora 1 in early 2024, and at the time, it was like mind-blowing. The text-to-video capabilities were far beyond what anyone had done. Now, in the past year, OpenAI has been kind of quiet on the text-to-video front. We have seen Veo 3. We have seen models really making big strides. And so now we have OpenAI coming out with their new model. And it does not disappoint, I think. It really produces very good looking videos. Like, the AI-ness of some of these things is getting harder and harder to see. They have a new Sora app on iOS, which allows you to create videos, to remix existing videos, to post them to a feed. And I think most notably, it has this feature of cameos where you can kind of scan your face or a friend's face and post or create a video starring that person. And there's been some fun examples posted by OpenAI of Sam Altman getting up to some shenanigans. And I think that really has been a killer feature that has made people more positive on this. And I think just generally, the videos I've seen don't feel or look as much like AI slop. There's been much less of a trend towards bizarre, outlandish, kind of animated looking things, you know, no animals on laptops or things on the moon, as much as kind of, you know, cinematic things or other kinds of things that look a little bit more grounded. So very cool from OpenAI. It's an app still on iOS. It's still invite only. So unfortunately, I've not been able to try it firsthand, but I'm sure they're going to start to roll it out more broadly.

Yeah, as you've been speaking, Andrey, I've had video playing of Sora 2, specifically this two and a half minute long video showing Gabriel Pierce, who I guess works at OpenAI and seems to be involved in the launch, with Sam Altman, whom I'm sure everyone knows by name. And so it's the two of them on this kind of adventure. And it is pretty damn good. The video quality is far better in terms of photorealistic video quality, far better than anything I've ever seen in text-to-video generation. And some pretty impressive real-world physics contained in it. So, for example, there is a moment where Sam Altman and this Gabriel Pierce guy run past somebody playing billiards, and they do a shot of the billiard balls, and there's a little bit of fuzziness and a little bit of liquidy goopiness around the billiard balls. But over, you know, kind of 10 seconds, those billiard balls remain consistent. They respond to real-world physics as somebody breaks the balls.
With their pool cue? Non-physics, I'd say. Something happens that should not happen. I mean, when you start a game of pool, it should happen. Yeah, yeah. No, it looks good. It looks good. It's just funny that "break the balls" is the only thing that I could come up with for that. It's a good example of physics.

The one other thing that you mentioned there is this kind of social media app, I guess, that they've launched in conjunction with this. That seems like a bit of a Hail Mary, but you never know. You never know. So I guess it's such a sprawling organization now that you throw lots of different noodles at the wall and see what sticks. I think, you know, Sora 2, Sora in general, is a really viable product that we're going to see a lot of. I don't know if this app is going to take off so much. Right. And it seems like they're not trying to make it take off necessarily. I mean, it's invite only at this point. So how much can it grow as a social media app with that constraint? And presumably it's invite only because the GPUs would be on fire if they let everyone use Sora. Right, right. I think that happened with the most recent Sora update they had. There are paid options for extra video generation due to these high computing costs. It takes a while to generate. I think I've seen examples of 10 to 15 minutes. And one thing I forgot to mention, which is pretty noteworthy, it now also generates audio. So sound effects and speech, and that's pretty good and comparable again to Veo 3, which is kind of a new generation. So text-to-video is continuing to be kind of on a roll throughout this year; we're seeing it become better and better by pretty big margins.

One other interesting bit with Sora 2 is people have made some pretty impressive examples of generating, let's say, existing media with it, in the sense that they have created South Park episode clips, which are very similar to the show and clearly come from the training data set. Also, I've seen clips of Family Guy. I've seen stuff from Cyberpunk 2077, a major video game. So the guess would be, of course, we don't know much about the technicals here, but it seems like there's not been much restraint on the training side in terms of caring about copyright, which is kind of interesting. We've been discussing many copyright lawsuits lately, and OpenAI seems to have just gone for it with everything and anything, is what it looks like. Yeah, it seems like they have been all along. And, you know, companies that were really trailblazing in the past, like Spotify, have gotten away with it. You know, Spotify was using illegally file-shared files in their original undertaking, their original launch of Spotify. And obviously that made a lot of people unhappy. But just like OpenAI, both of those firms emerged as a juggernaut in the space and you find a way to make nice. Right. Yeah. Uber similarly did that Silicon Valley trick of just ignoring regulations. And at this point, they have 700 million active users, I think weekly active or something like that. Like, I don't think OpenAI is going to die due to copyright stuff, but there might be some comeuppance. We'll see.

On to the next story, we've got Anthropic releasing Claude Sonnet 4.5, at the same time also releasing Claude Code 2.0, slightly less covered but also, I think, notable. So, Sonnet 4.5 is the update to Sonnet 4, which has been around for quite a while. I feel like, I don't know, since the beginning of this year or something.
Most recently, we've had the update of Opus 4.1, which we've discussed. So this is a pretty major update for Anthropic. They are positioning it as best in class once again for coding, for tool use, for long-range reasoning. And there are also various tools that they have rolled out to help people create their own AI agents. So they have rebranded the Claude Code SDK to the Claude Agent SDK and basically positioned it as: it's not just Claude Code, you can make AI agents powered by Claude for whatever you want. So people's kind of vibe check on this has seemed to be that it's really good. There's quite a mix, as often. Some people are saying it's the same, you can't tell the difference. Other people are saying this is like brilliant. Anecdotally, it seems to be a little bit better at not necessarily agreeing with you on everything when you're working with it, being, let's say, a bit more thoughtful or mindful. Also better at long-context reasoning with that 1 million token context window. So very exciting for coders, for people who rely on Claude for agentic stuff, maybe less exciting for the general public.

Yeah, I would say that this is similar to the GPT-5 release, which took a lot of heat from the public in general, but which I am really impressed by. I don't know what people were expecting in terms of regular everyday tasks, what kind of magic could possibly happen on short task timeframes. I'm going to use the GPT-3 to GPT-4 jump to provide a little bit of context around what I think people are expecting with these big releases, which is that GPT-3 and GPT-3.5 were able to handle tasks that would take humans up to about 10 seconds, maybe tens of seconds. GPT-4 felt like a big leap because all of a sudden it could replace us reliably on tasks that would take humans minutes to do. Once you start getting past that, a lot of everyday tasks that you're just going to throw into a conversational agent don't take a human more than a few minutes to do. And so it's kind of harder to kick the tires on these longer tasks, ones that would take a human hours to do. But that's why we have these kinds of benchmarks like SWE-Bench Verified, Terminal-Bench, agentic tool use from Tau²-Bench. Those kinds of benchmarks give you some sense of how these models are performing on longer tasks that might take a human hours to do. And both GPT-5 and Claude Sonnet 4.5 are big leaps on those longer timeframes. You're probably familiar with this chart, Andrey, the METR chart of how long a human task can now be handled with 50% accuracy by an LLM. And that curve shows that every seven months right now, the human task length that an AI model can handle doubles. So it's about two hours today. You can expect that in seven months, state-of-the-art LLMs will be able to handle a four-hour human task, and seven months after that, eight hours, and seven months after that, 16 hours. The multiples get pretty crazy and really powerful in terms of, you know, thinking about, in an organization or for you as an individual, what kind of range of tasks can now be handled reliably in a fully automated way by machines. That curve that we're on, that we're in the midst of, is pretty mind-blowing, and Claude Sonnet 4.5 plays a part in that.
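To make that doubling arithmetic concrete, here is a minimal sketch in Python. The roughly 2-hour current horizon and 7-month doubling period are simply the figures quoted in the discussion above; METR's published estimates vary depending on the model set, so treat this as an illustration of the curve, not a forecast.

```python
# Project the "task-horizon" curve described above: if the human-task length
# an AI model can complete at ~50% reliability doubles roughly every 7 months,
# starting from ~2 hours today, then horizon(t) = 2 h * 2 ** (t / 7 months).

def projected_horizon_hours(months_from_now: float,
                            current_hours: float = 2.0,
                            doubling_months: float = 7.0) -> float:
    """Exponential extrapolation of the 50%-success task horizon."""
    return current_hours * 2 ** (months_from_now / doubling_months)

if __name__ == "__main__":
    for months in (0, 7, 14, 21, 28):
        print(f"{months:>2} months out: ~{projected_horizon_hours(months):.0f} h")
    # -> 2 h, 4 h, 8 h, 16 h, 32 h: the progression mentioned in the episode.
```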
Exactly. Yeah, that's a great thing to highlight. And in the announcement, that's basically what they are focusing on. Sonnet 4.5, they say, is the best model in the world for agents, coding, and computer use. It's also their most accurate and detailed model for long-running tasks, with enhanced domain knowledge in coding, finance, and cybersecurity. So it continues the trend of Anthropic very much focusing on enterprise needs, on professional needs, not so much competing with OpenAI on trying to get more consumers or broader use cases. It's not focusing on being a good chat companion or being a therapist or being great at image understanding. It really is focusing on, more than anything, being agentic. And to cover a little bit of the benchmarks, it's going beyond Opus 4.1. So Opus was their big model, a very expensive model. Sonnet 4.5 is now beating it on most benchmarks, while costing as much as Sonnet 4, which is quite a bit less than Opus. And it's sort of on par with, or to some extent better than, GPT-5 across most of these benchmarks dealing with computer use, tool use, et cetera. Way ahead of Gemini 2.5, I think, which is interesting.

And yeah, I think it's true that now it's harder and harder to feel the progress when you just chat with these models. It's almost like text-to-image, you know, at this point: can you really tell the difference? But when you need to use it for very specific, in some cases nuanced, in some cases kind of complicated or just involved things, that's where you can tell the difference. And I think people who use Claude Code, who use agentic tools, have really learned the quirks of these models and the things that are specific to agentic tool use that LLMs are not necessarily good at out of the box. I don't know, Jon, do you use Claude Code or any other agents in your work these days?

Yeah. And I'd actually like to highlight, as you mentioned, that there was a big Claude Code release, and there are a couple of things here that are big that happened simultaneously with this Claude Sonnet 4.5 release. So for example, with this latest version of Claude Code, you now have checkpoints for rolling back to previous code versions. This is a common gripe of the vibe coder: you have this working application and then you go too far. You keep making changes. Let's say you want to make some UI changes and you just want to change your application from green to blue, a simple example. But somehow in making that small change to your UI, some underlying logic changes and all of a sudden your app doesn't work. And so with checkpoints now in Claude Code, you can roll back to some previous working state and kind of vibe again from there, I suppose. There's also, with this release, a native VS Code extension, and a lot of people love VS Code out there, so that's probably a big win for folks.

And then finally, I'd like to highlight that in the past year I've had a lot of focus on agentic AI, you know, have been doing trainings, have a YouTube video called Agentic AI Engineering that now has 100,000 views and kind of gives you an introduction to the key libraries that you need to be building multi-agent teams, and in that video we focus a lot on the OpenAI Agents SDK. And so it's interesting here that Anthropic are now rolling out the Claude Agent SDK, which is clearly following the same kind of naming convention. It is a different kind of SDK, but something that is cool about it that I like is that it has this specific feedback loop that it kind of nudges you in the direction of using with your agents, where step one, you gather context for whatever task the agent is going to be doing. Step two, you take action based on that context. And then step three, and I think this is a big part of why people like using Claude so much, is the verification of the work. You get high-accuracy results a lot with Claude. And so that agent loop that happens in the Claude Agent SDK, gather context, take action, verify work, and then back to gathering context, allows the agents that you create with the Claude Agent SDK to be able to continue for long periods on tasks with a high degree of accuracy.
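As an aside, here is a minimal sketch of the gather-context / take-action / verify-work loop described above. Every name and helper here is a hypothetical placeholder for illustration; this is not the Claude Agent SDK's actual API.

```python
# Illustrative only: a bare-bones version of the loop described above
# (gather context -> take action -> verify work -> repeat). The function and
# class names here are hypothetical; this is not the Claude Agent SDK API.

from dataclasses import dataclass, field

@dataclass
class AgentState:
    task: str
    context: list[str] = field(default_factory=list)
    done: bool = False

def gather_context(state: AgentState) -> None:
    # e.g. read files, search the repo, inspect previous tool output
    state.context.append(f"notes about: {state.task}")

def take_action(state: AgentState) -> str:
    # e.g. call the model with the accumulated context and apply its edit
    return f"draft result for {state.task!r} using {len(state.context)} notes"

def verify_work(state: AgentState, result: str) -> bool:
    # e.g. run tests, lint, or ask the model to critique its own output
    return "draft result" in result  # stand-in for a real check

def run_agent(task: str, max_steps: int = 5) -> str:
    state = AgentState(task=task)
    result = ""
    for _ in range(max_steps):
        gather_context(state)           # step 1
        result = take_action(state)     # step 2
        if verify_work(state, result):  # step 3: only stop once checks pass
            state.done = True
            break
    return result

print(run_agent("rename the config flag and update its tests"))
```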
Yeah, that's a good comparison with the Agents SDK from OpenAI. That's more of a sort of framework with which you can create your own applications that are agentic, versus the Claude Code SDK, or now Claude Agent SDK, which is kind of a way to use Anthropic's agents and use them for your own things. So you no longer need to do it interactively via a GUI or a terminal; you run some code and that code runs Claude Code. So it's actually quite different, but now they're trying to make it even more flexible. And I think, yeah, it's noteworthy that this is happening pretty soon after Codex by OpenAI came out. At the time of GPT-5 and Codex, many people were sort of saying that Anthropic has had a major lead with Claude Code. And if you're not in coding or not in this world, it's hard to overstate the degree to which this is actually a big deal. Like, leading in the space of agentic AI and leading in the space of these kinds of tools that people are happily paying $200 a month for or more is really cutting edge. And these are things that are having a huge impact. I know my entire company has now kind of converted to using these kinds of tools over the past few months, largely because of Claude Code. So Codex was a big step for OpenAI to compete and take kind of the mindshare and the developer preference. Maybe now Anthropic is winning back some people who got a little frustrated.

All righty, a few more stories. Moving on to Meta and another kind of social media platform, I suppose. They announced and launched Vibes, which is actually a feature in the Meta AI app and on Meta.ai. And it is similar to the Sora app. It's meant to allow you to create little AI videos, share them, and browse a feed of AI-generated videos, in this case powered by either Midjourney or Black Forest Labs; I'm not too sure which one of these. Very different reception from Sora. Largely people made fun of this or criticized it or otherwise seemed to think that, you know, the term slop came up often: this is a slop machine, this is feeding you AI slop. I think in significant part because of the marketing around it being focused on these more obviously AI videos of, you know, outlandish things, things you've seen a lot of, and kind of the idea of scrolling and seeing nonstop AI content that, you know, you might kind of get into. And it's interesting why people are getting more and more negative, and the term slop or AI slop has really, I think, become mainstream. Either way, yeah, not necessarily a strong launch from Meta with this, but who knows, maybe people will like it. I don't know.

Yeah, I think there are maybe places where AI can provide a lot of value in social media or in media generation in general, maybe allowing you to change lighting or make some small kinds of changes.
But when you're fully generating the entire video from AI, it's become so easy today, so cheap, that it doesn't feel like a valuable human experience for the most part. I'm sure there are amazingly creative things that maybe someone can do where you're like, wow, that's actually a great use of this. But it's interesting how now, because creating AI slop is so easy and it's so abundant on the internet and even inside of enterprises, emails, decks, you're like, am I really reading someone's opinion or am I getting some AI slop here? When you can tell that somebody actually put effort into writing something or creating something, that there's some thought behind it, some planning, and that this is probably actually a good idea, that is starting to become more valuable, but also, interestingly, harder to distinguish from the slop.

Right. And I think the definition of slop, in particular AI slop, is, in my impression, typically these kinds of low-effort outputs of AI. You put in a prompt, you get an output. If you're using AI-powered tools for video creation where you're editing and spending hours compiling together a short film, or using it in your workflow to help you edit or code, for instance, people are perhaps less negative on that, or at least it wouldn't be categorized under slop. And I think the framing of this as a feed of nonstop AI videos where there's not much effort, it is prompt-to-video, or they also have a remix option, and also that this is by Meta, which is already in the business of getting people sort of addicted to feeds of various kinds, that really was perhaps a major part of why this got the reaction that it did. If they, you know, introduce some sort of tooling to make it more personalized, to let people have some of themselves, some of the human context, alongside the video, if they had, for instance, cameos like Sora did, it might have generated a different reaction. But as is, at least the social media reaction is that no one wants this. Maybe some people do want it, but that's not what I'm seeing. And certainly that's not what I'm feeling.

On to another new product release. I suppose we have OpenAI, prior to Sora 2, releasing ChatGPT Pulse, a slightly more out-there thing that has OpenAI expanding more into your daily life, more into becoming more of an assistant or something that you use on the regular. So this feature gives users personalized morning briefs generated while they sleep. Pulse provides five to ten briefs to help people start their day. And that includes things like some news summaries; it can create reports on specific topics, so news updates or personalized briefs based on user context. And they have these cards that you can go through. So you kind of browse the cards and then you can tap on them to get the full details and talk to ChatGPT optionally. So it's, yeah, interesting to think of. I think South Park at this point has made fun of people using ChatGPT for everything, you know, nonstop. I think people are starting to talk to ChatGPT a lot more. And this is building on ChatGPT connectors, so you can connect ChatGPT to your calendar or email or other things like that. So it's making it easier for people to really integrate ChatGPT into their lives, make it always there, always really a personalized assistant for your daily life. Yeah, the heavy use of ChatGPT in South Park this season has been something that I found particularly amusing.
If people don't watch South Park, I think this is a season to check out if you're working in AI. On a note of this Pulse feature inside of ChatGPT, I get what they're trying to do here. Right now, I think people think of these conversational interfaces like ChatGPT as something to go to proactively when you have some problem. And of course, if you're designing a platform like that, what you want is to become an essential part of people's day, such that you feel like, oh, you have to check this, just like you have to check your social media feed. You know, you have to go into ChatGPT to get this update on your life. It makes a lot of sense. I think where there could be issues that OpenAI runs into here is that in order for this to be a really useful experience to me as an individual, I have to use those connectors that you mentioned, you know, to access my Google Drive, to access my Gmail, to access my calendar. And I personally don't have a level of comfort with OpenAI, or maybe even any of these big players in the space, because what they're going to do with my data isn't clear. Or sometimes when it is clear, it's clear that they're going to be using it for training models. And that makes me uncomfortable with my personal information on that kind of scale, to just have a connector go into all of my gigabytes of Google Drive, all my personal information. I can see how it would be useful, but I'm not sure I personally, or probably a lot of people out there, are comfortable with that level of trust.

Yeah, and I think that might be part of why this announcement rollout has seemed to kind of go under the radar. Another reason is this rollout is exclusive to the $200 a month Pro plan, so not many people are even capable of using it. This is really for power users. And another thing to note is it is kind of a new way for AI to be agentic, right? This is having the AI do stuff for you without you asking it, going off and being autonomous. So still competing on that front. And the last thing I'll say is that one of the things I think about often with regards to the business side of AI is the lack of a clear lock-in for people, right? The difference between using ChatGPT and Gemini 2.5 and Claude and DeepSeek doesn't feel huge. There are some differences in tone, some differences in, let's say, tendencies or character. But if push comes to shove, if a free plan goes away and another company is cheaper, like Gemini, I think people will move over, right? I don't think there are people who are really fans of ChatGPT so much as fans of the experience and the kind of usefulness that it provides. So in that case, what happens when there's not significant lock-in is a race to the bottom on pricing. So you will have very low margins on the subscription, on the profit, I suppose. And that's pretty bad because the margins are already very hard: this is very expensive, doing the inference. It's not like computing in general; you need to actually pay for every thousand or million words if you use a substantial amount. So it's a major challenge, I think, for OpenAI to maintain their lead while trying not to continue burning through piles of money every single day. And things like this might help. It'll be interesting to see if that's the case.

For sure. I think something you said there, it helps me realize that Gemini from Google might actually be well positioned in this space, because a lot of people already trust Google with access to their drive, with access to their emails, their calendar.
And so if Gemini just becomes kind of an add-in there, low margin business, yes, but if you can do it as part of, you know, enterprise Google office accounts or something like that, Google might be able to maintain margin there. And that has been Google's strategy, kind of rolling out Gemini in Gmail and spreadsheets and Drive kind of everywhere, increasingly making it capable. So you can ask Gemini to, I don't know, explain your spreadsheet, whatever. So they definitely have an advantage in that sense, of just lots of people using these things, and you can get them to use Gemini by just having it right there. And notably also in Chrome, which we covered just recently: they're adding a little button up on the top right. Chrome is by far the most used browser for people, so that's a major win, right? You now no longer need to go to chatgpt.com. You can press this little button whenever you're browsing the web. So OpenAI definitely should be feeling a bit of worry from Google and maybe Microsoft. And these kinds of things, like Pulse, might help. Yes, although interestingly, despite all these concerns that they might have, story after story, here we are talking about OpenAI. On to the next one.

On to the next one. More on OpenAI: they are rolling out a safety routing system and parental controls on ChatGPT. So this will detect emotionally sensitive conversations and switch to GPT-5 Thinking, which is equipped with safe completions for handling sensitive topics. This is coming, of course, after some pretty bad stories about ChatGPT, in some cases encouraging people to self-harm, in some cases aiding or kind of exacerbating people's mental health issues. And yeah, another sign of the extent to which ChatGPT is becoming part of people's lives, really having a major impact on people's lives. This is essential. This is very important, given what we've seen with some of the impacts of use of ChatGPT, and, you know, nice to see this being rolled out. I think fair to say, perhaps, that this is coming too late at this point. Yeah, I don't have much else to add to this story. Yeah, hopefully this can prevent, you know, these kinds of negative side effects of people getting too into their conversational platforms in the future. Yeah, and I guess the last thing I'll say, also not fully related, but one of the concerns with these chatbots is the potential for people to feel emotionally bonded to them, to really have them become an emotional crutch. And it has been the trend for younger people to be less social, to be less outgoing, to just have fewer friends. And this is not going to help, right? So I would hope that part of this is indicative of OpenAI at least being more careful about that, about kind of getting people addicted, so to speak, to ChatGPT due to positive reinforcement and it being kind of your friend in a way that might be harmful if you overdo it.

And on to the next story, moving away from OpenAI at long last and moving on to Google, with some news that was kind of quiet but also notable. So they have updated Gemini 2.5 Flash-Lite. There's now a Gemini 2.5 Flash-Lite preview. Also, Gemini 2.5 Flash got an update, and they are much better. They got a major improvement on coding, for instance, and they are also faster. So Gemini 2.5 Flash-Lite is now the fastest proprietary model. According to independent tests, it gets 887 output tokens per second, which, compared to things like Sonnet kind of maxing out at 100 tokens per second, is very fast. Flash-Lite is also very cheap, $0.10 per 1 million input tokens, orders of magnitude cheaper than Gemini 2.5 Pro, for instance.
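To put those throughput and pricing figures in perspective, a quick back-of-the-envelope sketch using only the numbers quoted above. Output-token pricing is not quoted in the discussion, so it is left as a parameter rather than guessed.

```python
# Rough arithmetic with the figures quoted above
# (~887 output tokens/sec, $0.10 per 1M input tokens).

def generation_time_seconds(output_tokens: int, tokens_per_second: float = 887.0) -> float:
    return output_tokens / tokens_per_second

def input_cost_usd(input_tokens: int, usd_per_million: float = 0.10) -> float:
    return input_tokens / 1_000_000 * usd_per_million

if __name__ == "__main__":
    # e.g. an agentic browsing step: 20k tokens of page context in, 1k tokens out
    print(f"latency: ~{generation_time_seconds(1_000):.1f} s")   # ~1.1 s
    print(f"input cost: ${input_cost_usd(20_000):.4f}")          # $0.0020
```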
So a major update on this model from Google, no rebranding, no fancy media cycle, just making it better and stronger. I think it was a little interesting that when you look at the graphs, this is a major improvement, and it's still just Gemini 2.5 Flash and Flash-Lite, still not much fanfare on this.

Yeah, this is the future. This is one of those ongoing trends. It's like earlier in this episode, I mean, the opening story in this episode of Last Week in AI was about Sora 2. And, you know, I've been listening to this podcast, to Last Week in AI, long enough to remember not that long ago, maybe 18 months ago, 24 months ago, you and Jeremy talking about how text-to-image had been getting pretty good, pretty compelling. You know, they'd solved the finger problem, getting five fingers on hands, looking anatomically correct. But video at that point was really poor. And now, you know, two years later, roughly, we're at a spot exactly as you and Jeremy have been saying would happen, where video is good. It's photorealistic for the first time. And just like that megatrend, this is another one of these megatrends where, of course, the big frontier labs are going to be, on the one hand, pushing the absolute frontier of capability with typically larger large language models, but at the same time trying to get costs as low as possible, trying to get inference time as low as possible while retaining as much capability as possible. It's a big megatrend that will continue for decades to come. And it's critical in the context of the conversation that you and I were having a couple of stories ago, Andrey, around margins: as cognitive machines become a utility with very small margins, having these small, fast models is critical to the frontier labs being able to be commercially successful.

Exactly. And Gemini 2.5 Flash is interesting in the sense that, I think among the frontier labs on the smaller model front, with things like Haiku from Anthropic, which is still at 3.5, Anthropic kind of gave up on this class of models. We have GPT-5 Nano from OpenAI, which is decent, but Gemini seems to be the best in this class of model. And they also mentioned it getting better at tool use, so I would have to imagine part of the motivation here is for things like browser use, where you need a very fast model to be able to do the agentic work for you and not take a million years, which would certainly happen otherwise. But yeah, Google is very competitive in pricing their models, so this is certainly adding to that. And in this class of model, I don't think you can do better, from what I can recall.

And on to the last story in this section. Lots going on, as I mentioned in the preview. And this one is on Microsoft. They are also doing AI agents. They are adding them to Word, Excel, and PowerPoint. So this is coming to Microsoft 365 subscribers, people using Word, Excel, and PowerPoint already. This is for now on web, but will be coming to desktop as well. And it is, I suppose, what you might expect. The agents are capable of doing more than just chatbots, of interacting with your document and doing multi-step tasks. So for instance, in Excel, the AI can perform data analysis, create visualizations, summarize insights, stuff that Gemini, by the way, cannot do still, which I find really annoying. I wanted to edit my spreadsheet and it's not doing it.
In Word, it can write content. In PowerPoint, it can help you generate your presentations, getting you slides of data, visuals, and other stuff. And so there you go. You know, you've got your agents in your tools. Not a surprising development, but it certainly seems like what Microsoft should be doing. Right. A hundred percent. And I personally am not a Microsoft Office user, but I know based on how many invites I get to Teams calls that a lot of people out there are Microsoft users. And Microsoft, Windows machines, they're still the predominant consumer operating system. There's a lot of scope for Microsoft to be adding AI capabilities, including agentic capabilities, into their applications. This is an obvious and probably useful thing to a lot of people. Yeah, and I definitely see from my own experience, when you have these applications, for instance, Notion is one example, or, yeah, Google Docs, Google Spreadsheets, if there is a chatbot integrated into it, I want to just use that. I'm not going to go and copy all my stuff over or export it and then go to ChatGPT and attach it and talk to it. If it's a capable thing that is already there for easy access, able to talk to the document and natively sort of interact with it, in my personal experience that is what I'm going to use. I'm not going to go to a competitor or take the leap to do extra work. So these kinds of things are going to get you into lock-in, perhaps, right? More so than going to online platforms.

All right, on to applications and business, finally. And we are back to OpenAI; there's lots of stuff from OpenAI this past week. They really have a lot of news. So on this front, we've got a new agentic shopping system from them. They have this instant checkout feature for ChatGPT users in the U.S., which is going to allow people to make purchases from Etsy and Shopify directly within conversations. It's rolling out to all users and enabling you to buy stuff from over 1 million Shopify merchants, with payment options like Apple Pay and Google Pay. So a pretty interesting move. They are also open-sourcing the Agentic Commerce Protocol, which is powering instant checkout and allowing agents, effectively, to do these kinds of things. This is being done in collaboration with Stripe, the payment processor. So it seems like a good move to make money. It's either ads or people buying stuff, and letting people buy stuff through your interface is going to start generating a bit of revenue from those free users and from those, I guess, power users who are willing to pay.

100%. The theme, maybe, of this episode is that, you know, with margins going to zero or negative on creating frontier models themselves, which is the niche that OpenAI is a leader in, they've got to find other ways to be monetizing. And this is an example of a way that you reliably can. And Perplexity already has been treading down this path for a year. So, you know, folks like OpenAI look around and say, what else are similar companies doing? Where are they making money? Where can we maybe maintain some margin or get a lot of scale? And this makes a lot of sense. Agentic shopping, as part of the ChatGPT Pro flow, seems like an obvious place to be making some money. Yeah. Also, a little bit concerning, possibly, in the sense of, you know, what if people pay for ads and your chatbot, which you're used to being a little bit unbiased or on your side, now is going to prefer certain brands over others?
In some sense, the potential monetization via sponsoring and ads is going to have some different dynamics compared to what you see today. People can already pay for Google searches to put their products up front, but I think people have sort of learned to spot that kind of stuff, versus if you get that in your chatbot flow, it might be a slightly different experience. 100%. Maybe this is a pointless hope, but hopefully firms like this are taking a long-term view and trying to build trust with us in order to be a platform that we want to be working with for decades to come, as opposed to just trying to make, you know, the most money that they can this quarter or this year. But we'll see, we'll see. But for now, at least, OpenAI is saying that these results are organic and unsponsored, and they are charging merchants a small fee for completed purchases, so they're not monetizing in that way for now.

On to the next story: a competitor of OpenAI composed largely of former OpenAI employees. We are talking about Thinking Machines Lab, and they are launching a thing. They're launching a product after quite a while from having been founded; I forget, but it feels like a year ago. Mira Murati, of course, is the former CTO of OpenAI who left, anyway, let's not get into the drama, but left to found this Thinking Machines Lab with a very large amount of backing. They have not indicated too much about what their plans on the product side are. It's been fairly ambiguous. We've seen some research from them. And now there is a product called Tinker, which is making it easy for researchers and developers to experiment with and fine-tune AI models. So what it's in particular doing is it allows you to fine-tune open source models. For now, this is Meta's Llama and Alibaba's Qwen, no gpt-oss, interestingly. And it lets you use supervised learning and reinforcement learning via a simple API. So an interesting thing to go with. It's not something that is unavailable elsewhere; there have been multiple players in this space, multiple tools, for quite a while. But at least from the comments in this article, people seem to be saying that this has a good mix of abstraction and ease of use, while also making it tunable. And it could be an interesting area to go with. Open source models have been becoming better and more competitive with frontier models relative to where they were in 2024 and 2023. Now open source models are able to do a lot, able to compete on everyday tasks. So I do wonder if this is a potential change in how people use the models, whether people start fine-tuning them more, and whether that's something that Thinking Machines Lab is betting on.

Yeah, this is a crowded space. I'm not sure that there is a huge amount of demand for fine-tuning open source models, but there certainly are a lot of startups out there that are tackling this problem. It's always possible that someone like Mira Murati, with her network and the amount of funding that she has for Thinking Machines Lab, can make something out of this that other folks wouldn't be able to. You know, all you need is a few big enterprise clients and you're on your way with some solid ARR, and she could make something of this. But yeah, crowded space, not a particularly distinguished product in my view. Yeah, I think the major competitive advantage is the pedigree of not just Mira Murati, but the work of John Schulman, an OpenAI co-founder. They have led the work on fine-tuning things within OpenAI, fine-tuning ChatGPT through reinforcement learning, and are quoted in this article as saying there's a bunch of secret magic, but we give people full control over the training loop and abstract away the distributed training details. So, yeah, I wonder if there is a potential to do much better at this task. On the one hand, you know, supervised learning, reinforcement learning, I was not aware of there being much secret magic, but perhaps there is. And that would mean that you can actually outcompete your competitors without having to pay billions of dollars to train a model.
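For readers who have not done this before, here is a rough sketch of what supervised fine-tuning of an open-weights model boils down to under the hood, using plain PyTorch with Hugging Face transformers. This is not the Tinker API; the tiny model name below is just a placeholder so the example stays runnable, and reinforcement-learning fine-tuning (the other mode mentioned) would swap this loss for a reward-based objective.

```python
# Minimal sketch of supervised fine-tuning an open-weights causal LM.
# NOT the Tinker API: just the training loop such services abstract away.

import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sshleifer/tiny-gpt2"  # tiny placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A toy supervised dataset: prompt/completion pairs packed into single sequences.
examples = [
    "Q: What is 2 + 2?\nA: 4",
    "Q: Name a primary color.\nA: Blue",
]

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(2):
    for text in examples:
        batch = tokenizer(text, return_tensors="pt")
        # For causal-LM fine-tuning the labels are the input ids themselves;
        # the model shifts them internally to compute next-token loss.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```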
Well, now back to OpenAI, because that's like half of what we talk about. The next story is OpenAI becoming the world's most valuable private company after private stock sales. So this is bananas crazy. OpenAI has sold $6.6 billion in shares held by current and former employees, and that means that its valuation based on this is $500 billion. That's the highest for any privately held company. So that is part of a meteoric rise. Just in August, they raised $40 billion at a $300 billion valuation. And this is quite important in part for the competitive angle, right? You have many employees now at OpenAI; much of their compensation comes in the form of stock and equity, and they are like millionaires on paper. But until you are able to sell the stock, you don't have actual cash. $6.6 billion in share sales is something. It used to be the case typically that in a startup, in a private company, you had to wait until the company went public or was acquired to cash out and really benefit from the equity. But now, as we've covered with Jeremy, the kind of distance between private and public has been becoming grayer, especially in U.S. markets. There have been a lot of sales of private company stocks. There's been a bit more liquidity in a way that hasn't been the case for private companies, especially, it feels like, with AI companies.

Yeah. I mean, I said that this was going to be a theme in the episode: figuring out ways to make money from being a frontier lab. OpenAI have been doing it. They still are burning billions of dollars more per year than they're making, but you're starting to get glimpses of them being able to be a profitable business in the future. And yeah, correspondingly, their valuation is skyrocketing. We'll see if they can keep it up. Obviously, great if you're somebody who has been an employee there for a long time and you have some stock; congrats to you on being able to sell off some of that, get some cash. Right. And perhaps it makes people feel better about not being hired away to Meta for their crazy $100 million pay package or whatever it is. Also, just to mention, there was reporting that OpenAI had $4.3 billion in revenue in the first half of 2025, while also burning billions of dollars in cash. I think still not profitable, from my understanding. But this is billions of dollars on top of API use and subscribers. That's not even taking into account things like ads and sales that they are starting to integrate. This company is going to be on the Google and Meta front, I think, of just raking in absurd tens of billions, hundreds of billions, whatever it is, primarily through ChatGPT.

And still on OpenAI, but more on the business drama side, we are once again going to be talking about Elon Musk and xAI going and doing some legal shenanigans against OpenAI. Now there's a new lawsuit.
This time it's claiming that OpenAI is stealing trade secrets by hiring former xAI employees. So apparently OpenAI is supposedly targeting individuals with knowledge of xAI's technologies and business plans. xAI, of course, is Elon Musk's frontier AI play, with Grok being their GPT competitor. Apparently, xAI believes that OpenAI is making former employees breach confidentiality agreements to gain unfair advantages. This is coming, you know, after quite a few lawsuits at this point from xAI against OpenAI. I don't know what to say on this. More legal drama. Does OpenAI need to steal any secrets from xAI? I'm not sure that's very plausible, but I suppose we might learn some fun things in court if this goes that far. Yeah, I don't have anything to add on this story. More legal drama doesn't impact me, and there's nothing I can leverage here. I do look forward to the OpenAI movie that's apparently in production. You know, OpenAI has had such an interesting kind of dramatic history that if you compile it all into a book or a movie, I think it'll be pretty fun. I want to imagine Elon Musk arguing with Sam Altman in a movie like The Social Network or something. Yeah, that is interesting because it can really change depending on how they do this film and how well it does at the box office. You just mentioned The Social Network there. The Social Network film made a big difference in the way that Mark Zuckerberg and Facebook are perceived as an organization. And, you know, even years later, I wouldn't be surprised if things like the name change, the rebrand from Facebook to Meta, were partly about that. Yes, it does have to do with being a broader organization than just Facebook, but also, I think part of it is kind of just getting away from what people perceive as a toxic brand from that very successful film. So, yeah, it could make a big difference in the minds of the public as to what kind of organization OpenAI is.

On to the last story for the section. We've got startups raising absurd amounts of seed money. So Periodic Labs has emerged from stealth mode with a $300 million seed round, coming with prominent investors like Andreessen Horowitz, NVIDIA, Jeff Bezos. This was founded by people formerly from Google Brain and DeepMind, and by a former VP of research at OpenAI. Their goal is to automate scientific discovery by creating AI scientists and autonomous labs where robots conduct experiments and gather data. Apparently, their initial focus is on superconductors. I can see why investors would be happy. This is a hard space to compete in, a very high-tech challenge with robotics and AI and, I guess, scientific discovery all coming together, and a space where I think there's probably a lot of room to make advancements, to make progress, and potentially have discoveries in things like chemistry, physics, you know, materials science, et cetera. I love this. This is the kind of application that I dream of as AI advances. You know, to have a slightly better social media experience or a slightly better chatbot experience every six months, those kinds of incremental gains are nice. But these kinds of companies that are changing the physical world by blending together cutting-edge AI with, as you say, robotics, and making scientific discoveries, having new superconducting materials, these kinds of real hard applications in the physical world, I love it and I wish them all the best. Yeah, and I'm still excited about robotics.
Chatbots and agents, on the one hand, are still making a lot of progress, still very much frontier AI. But the real challenge, the space where there's a lot of potential to make really exponential gains, is certainly robotics. We've seen a lot of news on humanoid robotics, on quadrupeds. This is another area where you need to address hard challenges of hardware, of being in the real world, dealing with physics, dealing with things that ChatGPT and Gemini are not necessarily able to do right now just from training on the internet. On to projects and open source. We've got one story, SWE-Bench Pro, a new take on the software engineering benchmark. SWE-Bench came out, I think, quite a while ago. It was initially the big benchmark for resolving GitHub issues, dealing with bugs and so on. People found that it was, let's say, not great. A lot of its issues came from just one repository, like half of it was Django, and it had data that sort of included the answers in a way. So then we got SWE-Bench Verified from OpenAI, which is now what you kind of look for in evals to get a better measure of quality, but it's still, let's say, limited as far as benchmarks go, not necessarily conveying the sorts of things that Claude Code and Codex need to do with more long-horizon software engineering tasks, where you need to go explore the code and solve problems that require you to jump around and figure things out as you go. SWE-Bench Verified is still focused on these smaller bug fixes and tweaks. So SWE-Bench Pro from Scale AI is focusing on more realistic, more useful difficulty levels and on clarity, or lack of contamination. They highlight contamination-resilient curation: it's built from commercial repositories sourced from purchased startup codebases and from copyleft public repos. They say the reference solutions require about 100 lines of code across 4.1 files on average, and there's human-led augmentation and verification of the benchmark tasks. All models are not doing great: roughly 20% or lower, for GPT-5, for Claude. So, a very useful benchmark. We covered, I think in the last episode, another similarly long-horizon benchmark that more realistically mimics actual software engineering, what you're actually doing day to day, so this is helping in that direction. We love benchmarks on this podcast, and this one looks like a pretty good one. Yeah, benchmarks are important. Something like SWE-Bench Pro looks like it will let us track, you know, the next generation of capabilities. Earlier in this episode, I was talking about METR and this chart of the length of a human task that an AI model can handle doubling every seven months. In order to keep up with that, we need bigger, more challenging benchmarks, and it seems like SWE-Bench Pro fits the bill here. Well, on to some more papers that are not benchmarks, in research and advancements. First, we've got a mechanistic interpretability work called Evolution of Concepts in Language Model Pre-Training. So we've covered the, let's say, leading paradigm in model interpretability, I think, quite a few times on the show. The gist is to understand what features are in these models, to get at what concepts they're using or learning and how those play into producing outputs of different types.
To get at this, people have figured out a pretty strong technique where, in essence, you take some activations from your model, somewhere in the middle or elsewhere, you compress them, and in that compressed representation you're able to find groups of activations, groups of neurons, that together tend to fire for specific kinds of things. They can, for instance, activate when a thing is plural, or activate when you're multiplying, or activate when it's the Golden Gate Bridge, you know, activating for different concepts. And what this paper is doing is tracking that over time in a relatively interesting way. So they take that basic idea but apply it to activations collected at multiple pre-training snapshots, and then you're able to track a feature as the model gets trained and as features emerge. And they find some very interesting things. For one, they say you have two broad eras of training. You have a statistical learning phase where you're learning the basic features of what tends to be common: specific tokens, specific patterns, statistical regularities. And then later in training you get into feature learning, where you're learning more sophisticated things, sentence structures, I guess metaphors, stuff like that. And this, I think, aligns with a similar understanding from grokking, for instance, where people have found that there is an inflection point where you go from learning the basics to these higher-level features. And they have various experiments here showing you can use this for steering. With these features, one of the things people do is say, well, if you clamp down on these neurons and don't allow them to activate, the model starts acting very differently. Famously, Anthropic did this with Claude by ramping up certain activations that had to do with the Golden Gate Bridge, and then Claude started talking about the Golden Gate Bridge in response to every single thing you fed it, in a very humorous way. Well, in this paper they also do that. They show that if you include only the top-k activations for a given feature, the model still does well; if you exclude them, it does much worse and starts producing nonsense. So, always exciting to see progress in interpretability and in understanding what these models are doing. It also plays into safety and transparency, and it aids our understanding of the training dynamics of these models. Yeah, cool paper. And you did an amazing job summarizing it there, Andrey. I have nothing else to add. That was perfect. Well, I'm glad to hear it, because, let's say, I am a little sleep-deprived, so we'll see how cogent my summaries remain. On to the next one, titled What Characterizes Effective Reasoning? Revisiting the Length, Review, and Structure of CoT. So in this paper they are looking at what actually contributes to being effective at reasoning. Reasoning meaning the stuff a model does before giving its answer, the thing where in GPT-5 Thinking you can pick normal thinking, hard thinking, extreme thinking, and then the model goes off and spends a minute reasoning with lots of steps before it gives you an answer. So here they investigate empirically what kinds of behaviors actually contribute to effective reasoning. The first factor is length: how long do you reason for, how many tokens do you output?
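A brief aside before the other two factors: to make the feature-steering idea from the interpretability paper above a bit more concrete, here is a minimal, purely illustrative sketch of "encode activations into sparse features, clamp one feature, decode back." The toy encoder/decoder matrices, shapes, and feature index are all invented for illustration; this is not the paper's or Anthropic's actual code.

```python
# Illustrative only: steer a model by clamping one sparse-autoencoder feature.
# All names and shapes are invented; real pipelines hook trained SAEs into actual model layers.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 512

# A toy "SAE": encoder/decoder matrices mapping activations <-> sparse features
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)

def steer(activation: np.ndarray, feature_id: int, value: float) -> np.ndarray:
    """Encode to sparse features, clamp one feature, decode back to an activation."""
    feats = np.maximum(activation @ W_enc, 0.0)   # ReLU encoder
    feats[feature_id] = value                     # clamp: ramp up or zero out one feature
    return feats @ W_dec                          # steered activation

act = rng.normal(size=d_model)
boosted = steer(act, feature_id=42, value=10.0)   # "Golden Gate Bridge"-style boost
ablated = steer(act, feature_id=42, value=0.0)    # or ablate the feature entirely
print(boosted.shape, ablated.shape)               # (64,) (64,)
```

In real work, the encoder and decoder would come from a sparse autoencoder trained on a specific model layer, and the steered activation would be written back into the forward pass rather than just returned.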
The second factor is review: how much of the trace is spent checking, verifying, and backtracking over prior steps. And the third is structure: essentially, what is the graph of things you do? Do you state the problem first, then outline it, then follow it step by step, things that seem pretty essential for effective reasoning. And they have a concept here called the review ratio, defined as the fraction of review tokens within the chain of thought (a rough sketch of computing this appears below), and they actually find that shorter reasoning traces and lower review ratios are associated with higher accuracy. So the naive approach of letting your chain of thought get super long, letting it go on and on when it's already arrived at the answer and going back over the same thing, isn't actually optimal. I think we covered at one point an interesting finding where models could find the solution, but then if you let them keep thinking, they go off track and give you the wrong solution because they effectively overthink. So yeah, they look at these factors and demonstrate that these patterns exist empirically across a whole bunch of models, Claude and Grok and DeepSeek: you want to keep the chain of thought concise, focused, and structured, and that leads to better accuracy. This is interesting and different from what I would have expected. I would have thought that having a higher review ratio, for example, spending more time ensuring that you're being accurate, would lead to higher accuracy. I have this intuition, which probably has nothing to do with the way machines are actually doing this kind of chain-of-thought processing, but it seems like there's an analog here to anxiety: people who tend to be more anxious tend to overthink, they continue to masticate over the same problem over and over in their head, and if you do that enough you can create this whole fantasy world of, oh, I'm for sure going to fail at this thing, or those people definitely don't like me, even though there's no real supporting evidence in the real world. So I don't know, I'm just making a fun analogy to human thought here. Yeah, when they cover higher review ratios, that goes from zero to one, right? So as you get into 0.8, 0.9, 1.0, you're doing nothing but reviewing your work; where do you do the actual reasoning? So in that sense, I think your intuition is right that you do want some review. If you get to 0.4, 0.5, 0.6, 0.7, for instance, you do see improvements. Doing zero review is bad, but doing too much review is also bad. So I suppose the summary isn't that higher is worse; it's more that too much or too little is bad. Next up, we've got another sort of benchmarking, empirical evaluation paper: Advanced Financial Reasoning at Scale, a Comprehensive Evaluation of Large Language Models on CFA Level 3. So the CFA is a test of financial reasoning, CFA being, I think, Chartered Financial Analyst. This is a professional credential, and these financial analysts have to go through fairly complex tests, from what I understand, to become accredited. And these frontier models, o4-mini and Gemini 2.5 Pro, have achieved scores that pass what you need to pass level three: they get 79.1% and 75.9%, respectively, surpassing the 63% passing threshold. So they are, in some sense, capable of being CFAs. We've seen this before with the bar exam for lawyers, and with various other measures of being able to practice certain careers.
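Circling back to the review-ratio idea from the reasoning paper: a minimal sketch of how such a ratio could be computed, on the assumption that segments of a chain-of-thought trace have already been labeled as review (checking or backtracking) versus not. The toy trace, the labels, and the whitespace tokenization below are all invented for illustration; the paper's actual tagging of review tokens is more involved.

```python
# Minimal sketch: compute a "review ratio" for a chain-of-thought trace,
# given segments already labeled as review (checking/backtracking) or not.

def review_ratio(segments: list[tuple[str, bool]]) -> float:
    """segments: (text, is_review) pairs; returns the fraction of review tokens."""
    review = sum(len(text.split()) for text, is_review in segments if is_review)
    total = sum(len(text.split()) for text, _ in segments)
    return review / total if total else 0.0

trace = [
    ("State the problem and set up the equation.", False),
    ("Solve step by step to get x = 4.", False),
    ("Wait, let me double-check the arithmetic from step 2.", True),
    ("Confirmed, x = 4, so the answer is 4.", False),
]
print(f"review ratio = {review_ratio(trace):.2f}")  # ~0.26 for this toy trace
```

Token counts here are just whitespace-split words; the paper counts model tokens, but the ratio works the same way, and the finding is that traces landing at the extremes of this ratio tend to do worse.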
The models passing the test, at least, doesn't mean that they're able to do the job, right? This is on paper: multiple choice, essays, the kinds of things LLMs are good at. And we've seen in practice that when you try to use them to do a job, things are messier and you can't do it so easily. But they're clearly getting to a point where they can have an impact in the same way they're having an impact in coding, for instance. Yeah, this is cool. It ties into this theme, again, that we've had over the course of this episode, this idea of the length and complexity of the human task that an AI model can handle doubling. This is a great example of it, a solid benchmark where 12 months ago this would have been unimaginable, and another 12 months from now this might be rudimentary for a lot of AI models out there. The only people I know who have their CFA Level 3 are very intelligent people. So this is a cool benchmark. Your chatbots are smart. Your LLMs are getting PhD-level smart, CFA Level 3 smart. What can you say? This is the world we live in. One final paper, also a sort of empirical analysis and understanding of model dynamics, titled Short Window Attention Enables Long-Term Memorization. This is related to a bit of a trend in research we've covered on and off, dealing with alternatives to the transformer architecture. So, in short: with transformers, the key insight, from Attention Is All You Need back in 2017, was to take recurrent models, which feed the output of the model back into itself, and say, wait, forget this loop. The loop is really annoying because it makes it hard to train and scale. Let's just do the step where you look at everything all at once, and then you don't need recurrence. That turned out to be amazingly good, and transformers now rule the world. But in recent years we've seen a renewed focus on recurrence, with things like Mamba, things like xLSTM, various models showing that you can train fairly powerful recurrent models, not necessarily competitive with transformers, not necessarily as practical from a compute standpoint, but nevertheless potentially promising for longer-term tasks, for tasks that require memory. And my personal belief is, how are you going to live without recurrence, right? You need some recurrence. Nowadays, most models do recurrence in some sense via notes or to-do lists or weird stuff like that. But there's still some promise to these alternative model architectures. The best of these alternative architectures have been hybrids: they combine linear RNNs or other recurrent models with sliding-window attention. And so this paper explores how much attention-all-at-once you want versus recurrence, you know, how much of a given input you look at all at once, effectively. And perhaps somewhat counterintuitively, it's actually better to train with shorter windows. So to have memory, to be able to make use of your recurrence, you don't want too big a window length at training time or at test time. Likewise, you want to keep the two around the same, which I guess is intuitively sensible. They also say you can use a stochastic window size to help balance long-context performance with short-context and reasoning abilities; there's a bit of a trade-off between these two things. And yeah, there are various empirical findings that I think are interesting.
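As a rough illustration of what sliding-window attention means here, a minimal sketch with made-up sizes; real hybrid models interleave layers like this with linear-RNN blocks, use far larger dimensions, and differ in the exact masking details.

```python
# Minimal sketch: causal sliding-window attention for a single head (hypothetical sizes).
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where a query position may attend: only itself and the previous `window - 1` tokens."""
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]               # no attending to the future
    recent = idx[:, None] - idx[None, :] < window       # only the most recent `window` tokens
    return causal & recent

def windowed_attention(q, k, v, window: int):
    # q, k, v: (seq_len, d_model); single head, no batching, for clarity
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    mask = sliding_window_mask(q.shape[0], window)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: a short train-time window, in the spirit of the paper's finding
seq_len, d = 16, 8
q, k, v = (torch.randn(seq_len, d) for _ in range(3))
out = windowed_attention(q, k, v, window=4)
print(out.shape)  # torch.Size([16, 8])
```

A stochastic train-time window, as the paper suggests, would amount to sampling the `window` argument from a range for each batch rather than fixing it, trading off long-context and short-context behavior.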
The gist is that maybe the hybrid architecture is still needed, maybe it's the future once scaling becomes hard, once we need models and agents that work for days at a time, that have long-term memory, that actually have memory in a literal sense. I'm still very curious to see if that's the case. And this is the kind of thing research does: looking into things that may or may not pay off but have some promise. Another great explanation there, Andrey. You're crushing it. An interesting thing this reminds me of is the 90s, when convolutional neural networks started to have some utility and practical applications with architectures like LeNet from Yann LeCun, which used convolutions. The convolution is kind of similar to this: it passes a window over the whole image. And the initial intuition when people were creating convolutional networks was that the window should be larger, maybe a nine-by-nine-pixel window or a 16-by-16-pixel window, because they reasoned that, you know, for some feature of the image to have meaning, it would have to be on that scale. But then empirically, over time, as people used convolutional neural networks more and experimented with them more, they found that a three-by-three or a two-by-two window on the convolution was actually more effective, way more computationally efficient, and able to identify features, even though those features are so small in the image. So I don't know, this reminds me of that, something kind of analogous happening here with the attention mechanism. Right, exactly. And to bring it back to the real world, in case these technical terms and jargon don't connect: the attention window is basically your input, right? It's all the stuff you put into the chatbot, the LLM, before it begins its output. So we are seeing now, with things like Gemini, 1 million token context windows; context windows are getting crazy. And this is one of the very surprising things for me. In 2023, I was skeptical of LLM progress because at the time we had 4,000-token context windows, 8,000-token context windows, and it was not apparent that you could easily do memory with LLMs. But then it turned out that you can just feed it 50 books or whatever and they are able to handle it. So that kind of memory turned out not to be that big a deal. But as we get into agentic things, into agentic workloads that take days and days, or even if you want an AI employee to be human-like, you need long-term memory, short-term memory, all the stuff that we humans have, and it's still an unresolved problem how you do that properly. On to policy and safety, the last section. We begin with policy coming out of California. SB 53, which we've covered a few times, is now law in California. This is the Transparency in Frontier Artificial Intelligence Act, the successor to SB 1047, which we've also covered quite a bit. A very significant milestone in regulation, especially when it comes to frontier AI. It mandates that large AI companies disclose their safety and security processes, provide whistleblower protections, and share information with the public for transparency, or face fines and various other penalties. Apparently, AI developers must publish a framework on their website detailing how they incorporate national and international standards into their AI practices, and must publish updates to their safety protocols within 30 days of any changes. And this is the version of the bill that came after the previous one got vetoed.
We covered how Anthropic actually endorsed this bill, so there's some industry backing for this being good regulation, which was the critique of 1047 and of the EU AI Act: some said that regulation wasn't well thought out, that it was kind of stupid, basically. So some people, at least, have the opinion that this one is well done, that it's a good way to do regulation. Others, Meta among them, I think, are lobbying against it. The industry is not necessarily excited about this thing passing, but either way, it is now law and will come into effect at some point. We certainly need some kinds of laws here. I am not the expert like you are, or like Jeremy certainly is, on these topics. Jeremy is the one that's like, oh, we've got to avoid the catastrophic existential risk of AI destroying us all, so please, for these big models, have safety measures and enforce them. So I'm sure he's a fan of this happening. It incorporates recommendations from a 52-page report by researchers. So we've seen a lot of frameworks on safety, we've seen many recommendations and suggested practices, but this demonstrates a trend where things are getting increasingly concrete, increasingly practical in AI safety. Anthropic in particular has their safety framework. OpenAI also publishes regularly on safety, with things like biohazards and cybersecurity. And whether you're like Jeremy, a believer in existential risk, that AI will kill us all within a couple of years, or whether you're concerned about things like cybersecurity or misinformation or chemical warfare, for instance, either way, AI safety is not something to ignore at this point, and this kind of regulation, in my opinion, is a good thing. I do want to quickly cover a related thing, actually a topic request from a listener, which was very helpful. There's also a law called SB 942 in California, the California AI Transparency Act, a similar name, which has been passed and goes into effect in June. I don't believe we covered this, and as the listener says, it seems to be flying under the radar, but it requires that the big companies make it possible to detect whether something is, from what I understand, an AI output, and it is able to impose large penalties. The penalty for each violation is $5,000, each day is considered a separate violation, and the definition of what counts as a violation is unclear. Just looking at the summary, the bill requires providers to make AI detection tools available at no cost to users, for AI systems that meet certain criteria, including that the tool is publicly accessible, various requirements like that. Apparently, providers also need to offer the user an option to include a manifest disclosure in image, video, or audio content, that sort of thing. So yeah, basically some requirements that you can tell if AI is AI. And if you do some math, it can get a little bit ridiculous in terms of the penalties that accrue. In a follow-up, the listener provided some analysis: if the big companies are C2PA compliant, they're probably going to be compliant with this manifest disclosure requirement, and most of the larger companies are compliant with those kinds of standards. But apparently for smaller players, for startups, this could be a real issue. As an example, they note that Gamma creates an estimated 5 million images per day, which, if not compliant, and if violations are counted per image and per day, would mean a $25 billion penalty on the first day, et cetera.
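To spell out that back-of-the-envelope math: the generation volume is the listener's estimate, and whether violations really accrue per image, per day is exactly the ambiguity being flagged.

```python
# Back-of-the-envelope check of the listener's example (figures are their estimates, not official numbers).
images_per_day = 5_000_000      # estimated daily generations for a provider like Gamma
penalty_per_violation = 5_000   # SB 942 penalty per violation, per the bill summary

# If each non-compliant image on each day counted as a separate violation:
daily_exposure = images_per_day * penalty_per_violation
print(f"${daily_exposure:,} per day")   # $25,000,000,000 per day
```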
So this, I think, highlights the need for well-thought-out, usable regulation, and it seems like this bill might have some issues, especially when it comes to ambiguity and actual enforceability. Moving on, we've got Elon Musk's xAI offering Grok to the federal government for 42 cents. 42 cents, that's a science fiction reference. Great. Well, we covered previously how, I believe, OpenAI and Anthropic both offered their services to the government for $1, in a move to, probably, get a foothold in government, I suppose. And now xAI is securing a deal with the U.S. General Services Administration, the GSA, to provide its AI chatbot Grok to federal agencies for 42 cents over a year and a half. That 42 cents is below the dollar per year from OpenAI and Anthropic, and a reference to The Hitchhiker's Guide to the Galaxy. Yeah, that's exactly it, that's definitely where the 42 comes from. Well, I guess Musk is getting along well enough with the federal administration again that something like this is allowed to go through. Yeah, it seems like Musk and Trump might be pals again, or at least have resolved their spat. Moving on, Character.AI is removing Disney characters from the platform after the studio issued warnings. So Character.AI is a massive platform, in case you're not aware, kind of the winner in the space of chatbots that are characters you role-play with and talk to. So, literally, characters that are chatbots, and there are probably millions of these characters on the platform. I don't know the exact number, but the platform itself has huge numbers of users, in the millions, and very high retention. So, unsurprisingly, there were many characters from popular media, including Disney. Well, Disney sent a letter saying our characters are being used without our permission, and Character.AI quickly responded by removing them. So, an important development in the sense that, you know, copyright and IP and so on are still so unclear with regard to what is legal and what's not, and for these kinds of companies, these are the kinds of things that happen. Either you are OpenAI and you just go off, ignore any worries, and train on, presumably, Disney content; I'm sure you can make Disney-style animated videos with Sora 2. Or you're a player that is actually going to have to listen to, you know, organizations like Disney that can really do some legal damage. I think a key distinction here is that it isn't necessarily about what Character.AI was using as their training data, like with OpenAI, like we were talking about earlier in this episode. This is something where, you know, to have a Princess Elsa character on Character.AI, it's shocking to me that they were able to get away with it for this long. I kind of would have assumed that Disney would have done this a long time ago. Yeah, it's certainly different in the sense that, A, it's much easier to remove, right? It's not a big deal for Character.AI to take down these characters, versus OpenAI and others being like, oh, we can't train on your data? Too bad, because we are going to do that, because the models need all the data they can get. I will say, legally, this is a little interesting to me, because on the one hand, these are characters owned by Disney; on the other hand, it's presumably user-generated, not made by the company itself, and it's using the idea of a character, right? It's like role play, it's fan fiction.
And so, in that sense, I don't know if there's a strong legal precedent for this being bad, compared to something like, yeah, using it as training data, or outputting something that looks exactly like someone's characters, which you've seen with the lawsuits against Midjourney that we've covered in recent episodes. So, a different kind of legal issue that is a little bit surprising to me, but either way, another interesting consideration in this space. And on to the last story: we've got Spotify trying to fight AI slop and apparently failing, according to this article, which says Spotify's attempt to fight AI slop falls on its face. So, I don't use Spotify, so I'm not personally aware of this stuff, but apparently Spotify has been flooded with AI-generated content, which is affecting real artists and their revenue. We covered at some point a while ago that some types of music, like, you know, relaxing music or electronic music, are often now AI-generated, by AI artists, with the songs completely not being real, and racking up some money. So there is some benefit for spammers to just upload a ton of music and try to get into playlists, and there have been examples of official Spotify playlists including fake songs. So apparently now they are trying to make it so you have to include AI disclosures in music credits, and they have an impersonation policy that will remove music replicating another artist's voice without permission. But, yeah, according to this article, these new policies aren't going to do much, and there are examples of AI-generated music, like The Velvet Sundown, a recently notable band that has amassed millions of streams of AI-created songs, that are allowed to stay on the platform. So there's no requirement to not use AI. This is a tricky one. It's going to get harder and harder to detect AI slop, in the same way that, with Sora 2, and this is our last story of the day, going right back to our first story of the day, text-to-video is becoming so compelling that you cannot tell the difference. The same kind of thing is going to be happening in music, if it hasn't already. Yeah, some people might actually like AI-generated music. Like you said, relaxing music, that kind of thing. To be able to go into a spa and have some kind of infinite stream of relaxing music where you're not going to be getting repetition, that could be a positive. But there's, yeah, obviously a lot of AI slop out there, as we've discussed in this episode as well, and I can see why you'd want to get rid of a lot of it. Yeah, and especially here they note that you're able to exploit some aspects of music platforms to effectively post songs under someone else's band name, to impersonate them and get their streams, which obviously is very bad; you're now taking away revenue from that artist. So Spotify is going to fight against that in particular, but is still going to allow AI songs. And I think, as you mentioned at the beginning with Sora, for many people, perhaps most people, knowing that something was done by AI, and especially knowing that it was just a prompt that generated a song with no real human intentionality, creativity, or effort behind it, takes away the appeal of the art or media. Like, it doesn't matter what it looks like or what it sounds like; if you know that it's AI, now you don't like it, or it doesn't resonate. And I feel that's the case for music for myself.
I think that's probably true for many, if not most, users on Spotify. There is some nuance: electronic, ambient, chill music, I don't mind too much if it's AI, maybe, but if it's vocals, if it's other things, it doesn't feel right. So yeah, it's a weird thing, where probably some AI-generated content is cool and you should be able to use it in your creative process, but what is slop, what isn't slop, can some slop be good? Like, some memes are fun. I don't mind some weird AI images, right? But it's a weird place to be in, culturally and artistically. Well, that is it for this episode, getting back to our regular long format of going on for quite a while. Thank you, Jon, for guest hosting and making it so we can at least post once every two weeks, filling in for Jeremy. Yeah, my pleasure. And if you run out of Last Week in AIs to listen to, for whatever reason, either because you've listened so avidly or because they skip a week, check out the Super Data Science podcast. You've got lots of interviews with top people and, yeah, a lot of fun on the show typically. Yeah, unfortunately, the Last Week in AI back catalog is not too compelling; you're not going to listen to our hundreds of episodes. But Super Data Science, on the other hand, is interviews, so it's a golden resource. And how many episodes are you at now? It's got to be... we're over 900 now. Wow, you're really getting up there. Yeah, so whatever you're interested in, I'm sure you can find some compelling people, including Jeremy, including me, actually; we both have episodes on the show, so you can check us out. That's right, we did a cool one in person in San Francisco about a year ago with Andrey, chatted all about the imminence of AGI, how crazy that is. Yeah, I had a lot of fun with that one, really enjoyed chatting with you. It wasn't at all what I had planned for the interview, but that rabbit hole we went down, I think, was awesome. And it's amazing how aligned we are in our views. If you want to check out the episode with Andrey, it's episode 867. So, you know, you can type that into Google or Spotify or Apple Podcasts or whatever, or you can go to superdatascience.com slash 867. And I shall endeavor to include a link in the episode description as well. Hopefully I won't forget. Let's see. Anyways, thank you for listening. Apologies again for not being true to our name and consistently covering last week's AI news. Do subscribe and leave us comments and reviews. We always appreciate it. Share the podcast if you can, and do try to tune in next week, or whenever we next have an episode. Hopefully, Jeremy will be back. Thank you. Come and take a ride. From the labs to the streets, AI's reaching high. New tech emergent, watching surgeons fly. From the labs to the streets, AI's reaching high. Algorithms shaping up the future speeds. Tune in, tune in, get the latest with these. Last week in AI, come and take a ride. Get the lowdown on tech and let it slide. Last week in AI, come and take a ride. From the labs to the streets, AI's reaching high. From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten. On the edge of change, with excitement we're smitten. From machine learning marvels to coding kings. Futures unfolding, see what it brings.
Related Episodes

#228 - GPT 5.2, Scaling Agents, Weird Generalization
Last Week in AI
1h 26m

AI Showdown: OpenAI vs. Google Gemini
AI Applied
14m

GPT-5.2 is Here
The AI Daily Brief
24m

#227 - Jeremie is back! DeepSeek 3.2, TPUs, Nested Learning
Last Week in AI
1h 34m

#226 - Gemini 3, Claude Opus 4.5, Nano Banana Pro, LeJEPA
Last Week in AI
1h 11m

#225 - GPT 5.1, Kimi K2 Thinking, Remote Labor Index
Last Week in AI
1h 18m