Last Week in AI

#223 - Haiku 4.5, OpenAI DevDay, Claude Skills, Scaling RL, SB 243

Last Week in AI • Andrey Kurenkov & Jacky Liang

Friday, October 24, 2025 • 1h 11m

What You'll Learn

  • Haiku 4.5 is Anthropic's latest small, fast language model; it performs well on benchmarks like SWE-bench Verified, offering a more affordable and efficient alternative to larger models like Sonnet 4.5.
  • OpenAI announced an Apps SDK that allows embedding applications within ChatGPT, as well as AgentKit for creating custom AI agents, part of their effort to make ChatGPT a more central interface for various services.
  • Anthropic's new 'Skills' feature allows packaging custom instructions, metadata, and resources to extend Claude's capabilities for specific workflows and use cases within an organization.
  • Microsoft has launched an AI-powered agent mode in Excel and Word, part of their Microsoft 365 Copilot plan, to generate complex spreadsheets and documents using prompts.

Episode Chapters

1. Introduction

The hosts introduce the episode and preview the topics to be discussed, including updates on tools and business stories, research and advancements, and policy and safety.

2. Anthropic's Haiku 4.5 Release

The hosts discuss the launch of Anthropic's latest small language model, Haiku 4.5, and its performance on benchmarks compared to other models.

3. OpenAI's Announcements

The hosts cover OpenAI's announcements at their recent DevDay, including the Apps SDK and AgentKit for creating custom AI agents.

4. Anthropic's 'Skills' Feature

The hosts explore Anthropic's new 'Skills' feature, which allows packaging custom instructions, metadata, and resources to extend Claude's capabilities for specific workflows.

5. Microsoft's AI-Powered Agent Mode

The hosts discuss Microsoft's launch of an AI-powered agent mode in Excel and Word, part of their Microsoft 365 Copilot plan.

AI Summary

This episode covers several recent AI tool and business updates, including Anthropic's release of Haiku 4.5, OpenAI's announcements around their Apps SDK and AgentKit, Anthropic's new 'Skills' feature for customizing Claude's capabilities, and Microsoft's launch of an AI-powered agent mode in Excel and Word. The discussion highlights the increasing focus on making AI models more accessible, customizable, and integrated into various applications and workflows.


Topics Discussed

Language models, Generative AI, AI applications, AI customization, AI-powered productivity tools


Episode Description

We discuss a range of news, from updates on AI models and tools by Microsoft, OpenAI, and Anthropic, to new business partnerships involving OpenAI and Broadcom, along with regulatory actions from California and market movements in AI-generated content and AI startup funding.

Full Transcript

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. Actually, the last two weeks; lately we've been a bit inconsistent, but we are continuing to try to be on track. And as always, you can also go to lastweekin.ai for the text newsletter with even more news we will not be touching on. I'm one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at a generative AI startup. And this week, once again, our regular co-host, Jeremy, is out. He said that he will be out until December; in fact, he's quite busy at work. So we have another new guest co-host with us, Eric Schwanz.

Thanks, Andrey. I'm Eric. Nice to meet everyone. I am a researcher at Anthropic working on multi-agent systems. Before Anthropic, I ran a robotics startup for a number of years.

Yeah, and I think one of the fun things of getting guest co-hosts is having a bit of a variety of perspectives. Last week, we had Michelle, who was also in grad school and is now working on robotics. You are kind of more of a startup archetype, having done robotics and now working at a frontier lab. So it's always fun to hear from different types of people.

Yeah, absolutely.

Well, to give a quick preview of what we'll be discussing this week: lots of updates on tools and business stories. Anthropic, Microsoft, Google, and OpenAI all announced a variety of smallish but notable new things. Then, in business stories, we've had a lot of OpenAI doing deals with just about everyone, some exciting fundraising, and even more stories about self-driving cars; it just keeps happening. In research and advancements, we'll be talking about some interesting findings on reinforcement learning and reasoning, and we'll round it out with just a bit of policy and safety and stuff about copyright.

So let's go ahead and jump straight in with tools and apps. The first story is Anthropic launching a new version of their Haiku model, Haiku 4.5. This follows up on Sonnet 4.5, which was pretty recent. Haiku is their smallest, fastest model. They actually hadn't released a Haiku 4 as far as I'm aware; the last one was Haiku 3.5, so I was starting to wonder if Anthropic had given up on Haiku.

No, no. I think it's been about a year. I don't remember when Haiku 3.5 was released, but I think it was around a year ago.

Yeah. Well, Haiku 4.5 was just announced, and it's coming in very impressive. On the SWE-bench Verified benchmark, Haiku 4.5 scores above Sonnet 4, similar to GPT-5, and above Gemini 2.5 Pro. Not as good as Sonnet 4.5, but for a small model, for a cheaper model, at least according to these benchmarks, it appears to be punching way above its weight class. So it would definitely be notable for, let's say, businesses who want to be more efficient and probably faster in their LLM use. And personally, I'm excited to try it out.

Yeah, yeah. I think we were really hoping it'd be close to Sonnet performance but at a fraction of the cost and lower latency, for people that are more cost sensitive to still be able to bring a Claude-family model into their products.

That's right. And as with some of the other LLM releases this year, aside from just coding, we have agentic terminal use, agentic tool use, all of these benchmarks. Haiku is doing quite well, even on computer use. So it seems like, after having taken a break, Anthropic found a way to scale down the power of Sonnet and the modern sort of reasoning LLM and get it into this compact form factor, which is cool.

I'm especially excited for computer use with Haiku 4.5, just because it's so fast. It really feels like things that were very tedious to wait for with Sonnet 4.5 are very doable now with Haiku.

And cost-wise, this looks like a third of the price of Sonnet: $1 per million input tokens and $5 per million output tokens. So more affordable, faster, and seemingly quite capable. You'll probably start seeing this pop up soon.
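For a rough sense of what that pricing gap means in practice, here's a back-of-envelope sketch. The Haiku 4.5 prices are the ones quoted above; the $3/$15 Sonnet 4.5 figures are Anthropic's published list prices, included only for comparison.

```python
# Back-of-envelope API cost comparison (prices in $ per million tokens).
def cost(m_in: float, m_out: float, price_in: float, price_out: float) -> float:
    """Cost of a job with m_in million input and m_out million output tokens."""
    return m_in * price_in + m_out * price_out

job = (10, 2)  # e.g. 10M input tokens, 2M output tokens
print("Haiku 4.5 :", cost(*job, 1, 5))    # $20
print("Sonnet 4.5:", cost(*job, 3, 15))   # $60, i.e. 3x the cost
```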
On to the next story. We've got OpenAI, and this is slightly older, but it's still worth covering. OpenAI DevDay happened now more than a week ago, and they announced a whole bunch of stuff. They announced the Apps SDK as the main thing, which is interesting. It seems to be a layer on top of the Model Context Protocol that allows people to essentially embed applications within ChatGPT. So as you're chatting with the agent, ChatGPT can effectively pop up a user interface to interact with whatever service or business you happen to be getting at. I think they have examples of Canva and Zillow, where ChatGPT can directly interface with those services and let you go to them.

Aside from that, OpenAI also announced AgentKit and this kind of drag-and-drop UI for creating agents, which I found...

I think I saw that online. Yeah.

Yeah, kind of interesting. It's similar to things like n8n, which has been somewhat popular, so you can drag and drop different paths of decisions and so on to make things. And then not too much else from the developer perspective. I do think they announced Codex updates and a couple of new models in the API, but the big one was the apps release. And it's pretty interesting. It seems like OpenAI is trying to make it more the case that ChatGPT can be your kind of central interface to do anything, more so than right now. So you can talk to it and invoke other tools as needed, building again on the Model Context Protocol idea, where the chatbot can interact with all these services rather than being stuck in its own little transformer head. I'll be very curious to see if this takes off.

Yeah, definitely. Owning wherever the consumer starts their journey is super valuable, having influence over that. So I'm also very curious where this will go.

People have also pointed out that this is now their third attempt at doing an app store of sorts. They had the GPT Store and GPTs, where you can launch your own customized GPT, which, I don't know if it's dead or not, but it seems to have not caught fire. And then they had something in 2023 that I forget about at this point. But yeah, this certainly is a different take on it, and, to emphasize, it builds on top of the Model Context Protocol. So it seems like they're trying to be a little more dev-friendly by building on what is already hot in terms of integrating AI with external tools. This is just adding another layer, the user interface component, on top of it. So it might be kind of a good time to try it.
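Since the Apps SDK sits on top of the Model Context Protocol, it may help to see what that underlying MCP layer looks like. Below is a minimal sketch using the official `mcp` Python SDK; the `search_listings` tool and its stubbed data are invented for illustration, not a real Zillow integration.

```python
# Minimal Model Context Protocol server: the kind of layer the Apps SDK
# reportedly builds on. Requires the official SDK: pip install "mcp[cli]".
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("listings-demo")

@mcp.tool()
def search_listings(city: str, max_price: int) -> list[dict]:
    """Return home listings under max_price (stubbed with fake data)."""
    return [{"address": f"123 Main St, {city}", "price": max_price - 1}]

if __name__ == "__main__":
    mcp.run()  # serves over stdio; an MCP client connects here
```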
I'm also curious what people are going to do with Sora 2 in the API. It feels like normally such a consumer-focused product, so I'm very curious to see what kind of programmatic use cases people come up with, other than people just trying to wrap it and put it into their own social media apps.

Right, yeah. And that's a realm where they're going to be competing head-to-head with Google. It feels like at this point Veo and Sora are certainly the two leading text-to-video models, and it's a costly model to run, so it'll be interesting to see if it takes off.

I'm very surprised that Meta hasn't come in with a stronger entry here, because it feels like it would line up really well with the Instagram and Facebook news feeds.

Yeah, you'd think so. And in Vibes, their feed of AI-generated content, which I think we covered, to start with they are using Midjourney, so they haven't rolled out their own video-gen model there. I do think they said they're aiming to, but not yet. Yeah, I'd be very surprised if they didn't.

Yeah. I guess Anthropic also announced Skills. After having just talked about GPTs and that attempt at this a few years ago, I think Skills for Claude are very exciting. Two or three years ago, it was too early for GPTs and this idea of sharing or making customized things for the LLM, because people weren't doing really hard enough things yet with their LLMs. But people today, with Claude and Claude Code, are working on tasks at their jobs that require a lot of specialized context and a lot of specialized instructions. And I think Skills are a way to package all that together and make it reusable. So if you build the perfect context for your agent to be able to do some workflow that's very bespoke to your own company, you can package that up into a skill and share it with the rest of your coworkers. Something interesting to note is that it's not just prompts and context; you can also put binaries and other files and assets into a skill. So if you wanted to make a company PowerPoint skill, it could include not just prompts but also headshots of all your leadership, details about your last revenue numbers, and information like that. So if your agent is trying to make a nice PowerPoint in your company's template, it can use all these files provided right there for it. It's not just a prompt; it's the resources to go with it.

Yeah, it's interesting. It feels like they're trying to build on top of what has become the norm for agentic AI, essentially, which is that you give it an execution environment. To describe skills, just reading from the docs, the definition from Anthropic: agent skills are modular capabilities that extend Claude's functionality. Each skill packages instructions, metadata, and optional resources like scripts and templates that Claude uses automatically when relevant. So you can, as you said, give it instructions on how to do certain things, give it scripts to run, and it can effectively become capable of all sorts of much more customized workflows. And you can also combine this with other things Anthropic provides, like apps, where you can publish a little mini application for people to go to with a link. It also kind of combines with the move from Anthropic to rebrand the Claude Code SDK as the Claude Agent SDK. It feels like more and more Anthropic is making it such that Claude is sort of the full stack of agents, which is not just the LLM but the tools, the packaging of the environment, the instructions, all of it. There are things you need to do if you want to use agents, right? You can't, at this point, just use an LLM. It requires more work beyond that, and this kind of formalizes that in some sense.

Yeah. I think our agents have gotten so good at coding that everyone has realized, oh, the same sort of affordances and tools for coding, like navigating a file system, doing web searches, using grep to search over giant documents, these are things that are useful no matter what you're doing. Coders do their work on a computer, but so do knowledge workers in sort of any kind of role. And so giving Claude a computer just makes a lot of sense.

Yeah, exactly. Looking at the docs is actually kind of fun. We have an example here, a PDF skill, and then in its folder there's a SKILL.md, which is reminiscent of CLAUDE.md.

Yeah, it's definitely an extension of the CLAUDE.md files. Well, as a big Claude Code fan, this is pretty fun to see.
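To make that structure concrete, here's a hedged sketch of what a skill folder might contain, based on the description above (instructions plus metadata plus optional resources). The name and description frontmatter fields follow Anthropic's published docs; the skill content and paths here are invented for illustration.

```python
# Sketch of a hypothetical Claude skill folder: a SKILL.md with YAML
# frontmatter (metadata) and instructions, plus an assets directory for
# optional resources like templates and headshots.
from pathlib import Path

skill = Path("brand-powerpoint")
skill.mkdir(exist_ok=True)
(skill / "SKILL.md").write_text(
    "---\n"
    "name: brand-powerpoint\n"
    "description: Build slide decks in the company template.\n"
    "---\n"
    "Use assets/template.pptx as the base deck.\n"
    "Pull leadership headshots from assets/headshots/.\n"
)
(skill / "assets").mkdir(exist_ok=True)  # templates, headshots, scripts, etc.
```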
Next up, going to Microsoft: they have launched, quote, "vibe working" in Excel and Word. They have this new agent mode in Excel and Word that allows Copilot, or whatever they call it now, to generate complex spreadsheets and documents using AI prompts. This is part of the Microsoft 365 Copilot plan. It's powered by Anthropic models, interestingly, and lets you do this kind of vibe-coding setup, but for document creation. Presumably that means you give it a prompt and, instead of just spitting out one thing, it uses tool calling and iterative execution to do much more involved things than it could by just spitting out text.

Yeah. As a side note, I feel like I often take issue with the term vibes. To quote Andrej Karpathy, vibe coding is where you stop reading the code, and I think you need to do that carefully. That works great for programming a game or a side project, but there are a lot of cases where you really do need to look at the details. Vibe Excel worries me a lot. I think AI can be very useful there, but I just want to make sure people are still reading the outputs. Maybe one area where vibing will actually work really well is making slides, because you don't need to read the underlying XML code of a slide to know if it's good. You just get it, can tell it's a good slide, and read the bullet points. It's like coding a front end, where you can see if it worked really well without looking at the code. I think that's a really promising avenue.

Yeah, that's a good point. In Excel, you would hope people actually check the work and the methodology if you have an accountant using it. We've certainly seen in the coding world what happens with pure vibe coding, which, as you said, is where you just ask for something, the AI does it, and you take your hands off the wheel. Recently, actually, on Hacker News there was some discussion on what to call it when you're using coding agents.

Yeah, vibe engineering.

Vibe engineering, which is quite amusing, because vibe coding is a slightly ambiguous term where there's a spectrum of involvement.
Typically, if you're a software engineer like me, or if you're using Claude Code, ideally you're using planning and supervising it, so you're not really vibe coding. So vibe engineering is a better term.

Yeah. I think as long as you have a way to verify the outputs, it's not vibes anymore. I gave a talk on how to vibe code in prod responsibly a few months ago, and my big takeaway is: even if you're not reading the code, you should have some way to validate the work. If it's Excel, that still might be possible. If you know what the end figure for cost should be, and your bank account value adds up, and the agent does all its work and gets to that correct answer you expect, that's a pretty good check that everything went right along the way. But you need to have some way to validate.

Right. And on SpreadsheetBench, which has 912 real questions gathered from online Excel forums covering a bunch of different tasks you might perform, Copilot in Excel is now leading. They say they get a 57.2% score on it, which is higher than the rest by a decent margin, but also very far from 100%. So if you're an Excel professional, I'm sure you'll have to learn how to make efficient use of it, like we've had to do with coding.

Yeah. As someone at a lab, when I see a benchmark number that is not in the 90s, it makes me very excited. Lots of good work to do over the next year.

Yep. And on to Google: they have released Veo 3.1. This is an update to Veo 3, their text-to-video model, that, as you might expect, is slightly incremental, but does do some notable things. For one, it gives you more control over your generations. You can now choose different generation lengths, you always get video with audio, and you can select landscape or portrait in Veo 3.1, which I don't think was possible in Veo 3. This is now available both in the API and in Flow, their kind of video editing tool. Apparently, users have created over 275 million videos using this Flow tool, which is interesting. So yeah, Veo 3.1 is out. It costs the same amount and it's better, so I'm sure Veo 3 is kind of history.

Do you see Google trying to go more of the enterprise route with their video gen, selling to film studios and professionals, whereas Sora seems much more targeted at consumers?

I don't think so. I would expect Google to optimize it for YouTube Shorts and YouTube creators.

Yeah, that's a good point.

And I often think about this with the labs, where I increasingly feel that you can either go general purpose and consumer friendly, which is what OpenAI and Google do, or you can try to be more enterprise- and industry-oriented and focus on the needs of those kinds of people. If we take, for instance, Runway, which has their own generative model, you see all these features like camera control and artistic styles, things you might think professionals need that consumers wouldn't think about. So I would expect Google to not go that very cinematic route and not focus on B-roll, for instance, but to focus more on everything. And if you're good enough, I guess you can be better than everyone else. So we'll see.

Do everything. Yeah.

On to another agent story; that's, I guess, all we're doing these days. We've got Slack, and they are saying they're going to be turning Slackbot into an AI assistant. So bots are a common thing in Slack.
If you don't know, Slack is a messaging tool for work. So it's, I don't know... do your listeners not know what Slack is?

I would assume most of them do, but I always want to be safe.

Anyway, it's like Teams, it's like Discord: a place where you can message on different channels and meet and so on. So Slack bots are a known thing; you can do slash-something and invoke some sort of command. And they are turning it into an AI assistant by doing the thing that, I guess, everyone is doing: now you can have a tab with a chat window where you can chat with this Slackbot and ask it to do various things, like find information, draft things, etc. They are now doing a pilot, showing examples of, for instance, someone asking, "Can you do some research in this channel to help me organize a launch plan," and it will go and read the communications and reply. So it seems to make sense, and I could see people making use of this if you're on Slack. And Slack is primarily for professional use, unlike something like Discord. We'll see if it expands beyond the pilot; currently, it's only available to the 70,000 employees at Salesforce, and Salesforce now owns Slack.

I'm definitely excited about this. I feel like so much of today's models are limited not by intelligence but by context, and I think so much of the context of what is actually happening in a business is all in Slack. So bringing AI to where the context is is a great idea.

Yeah, I would say I feel that way too. I'm getting to the point where I like the world where every single software tool you have has AI built in, right?

As long as it's good. I've definitely seen it done well, and I've seen it done poorly, where you can tell they're trying to make it as cheap as possible, and it gets no context and runs on the cheapest model they can find.

Exactly. So as long as people do it right. Notion is one example where they tout it heavily, and it's not that it's essential, but when you want it to, say, take the above two paragraphs and summarize them in bullet points, and it's just right there for you, it's much more convenient. And for some of these basic things, like information gathering and summarization, having the tool integrated does seem to make a difference. I do feel like we've already been heading toward this world, but it seems inevitable that everything will have AI built in, at least on the professional side.

And on that note, Salesforce also announced Agentforce 360 at their Dreamforce conference, which just happened. They have things like Agent Script, a tool for programming AI agents to handle if-then scenarios, which is going to be releasing in beta in November and will have reasoning models from all the major providers. They're also launching Agentforce Builder, a tool for building, testing, and deploying AI agents, and they have Agentforce Vibes for app vibe coding. Businesses adopting the vibe terminology, I have to say, I'm not a big fan of. It's interesting, because I feel like these big companies would normally be the last people to adopt something like vibe coding. But yeah, I guess there's enough demand.

Yeah. Vibe coding is not meant to be, like, work. If your tool is for vibe working, I don't know if it truly is vibes. But either way, Salesforce is doing this Agentforce thing to expand their customers.
Again, I don't know who doesn't know this, but I guess I should say for people who don't: Salesforce is primarily a CRM, a customer relationship management tool. Essentially, businesses use it for sales and various things like that. It's absurdly gigantic as a business; they have a huge tower here in San Francisco, I think the tallest building in SF. So this is going to be going out to their huge number of customers, and I suppose the hope is that Salesforce is able to get these businesses to build agents and run AI from within their stack, as opposed to going out and building agents some other way.

And on to Applications and Business. We begin with OpenAI and the many deals they are making. The latest one is Broadcom. Broadcom has announced a deal with OpenAI to help them with their custom AI chip. Not many details were disclosed; there's a general sense that Broadcom and OpenAI will be partnering to build and deploy 10 gigawatts of custom AI accelerators. And all we know is that Broadcom stock did go up by 10%. This is coming after, I think just a week before, or very recently, OpenAI also announced a deal with AMD to buy some absurd amount of hardware, I think hundreds of billions of dollars worth of chips, and AMD stock shot up 20-30%, somewhere in that range. So OpenAI is on a roll, seemingly making deals with everyone, creating hundreds of billions of dollars worth of data centers. And this, I think, is in part exciting for people who are interested in data centers and the growth of AI in general, and also concerning to many people. The talk of AI being a bubble has continued persistently over the last couple of months, and these past two weeks, with OpenAI making all these deals with AMD and Broadcom, a lot of people are starting to jump on, or reiterate, the "AI is a bubble and all of this is going to blow apart" narrative.

So, yeah, a couple of dimensions to this. I'm really curious where they're going to get this much power. I mean, 10 gigawatts is, I think, more than a nuclear reactor produces, if I remember correctly. I guess maybe that'll be spread across many different data centers or something.

Yeah, apparently OpenAI today uses 2 gigawatts of compute capacity, so that gives you a sense: 10 gigawatts is absurd. And OpenAI has announced roughly 33 gigawatts of compute commitments over the past three weeks, partnering with NVIDIA, Oracle, AMD, and Broadcom. So this is all very hypothetical. These commitments and deals are all like, okay, over the next decade or however long, we'll be working with you to make this happen. Clearly, Altman and OpenAI are committed to going very big, and it'll be interesting to see how they pull it off.

Yeah. If you look at the frontier labs' revenue curves, they are scaling, and the scaling isn't slowing down. It's a question of, when you keep extrapolating exponentials, things get crazy pretty fast.

Next, we go to ByteDance. This is not a huge news story, but I think it's worth covering for those of us in the US who may not be aware of it. The story is how ByteDance made China's most popular AI chatbot, and it's about Doubao, which is ByteDance's AI assistant app. It is now the most popular app in China, with over 157 million monthly active users, surpassing DeepSeek with 143 million users. And I had no idea DeepSeek was that big in China.

Is DeepSeek still like a side project for them at this hedge fund, or are they really leaning into it now?
I would imagine they have to lean in at this point, with this many users.

Yeah, it's crazy. I think when we first covered DeepSeek V3 and DeepSeek R1, as you said, this was like a side project of some financial venture fund, and somehow they made a frontier LLM, which is also very popular with consumers. So, Doubao has been around for a while, since 2023. And it is not just an AI assistant: it has a personable design, a human-like avatar, and apparently its name translates to "steamed bun with bean paste." If you go and eat Chinese food, doubao is something you're probably aware of. It has all the typical functionality you would expect, text, audio, video chat, image and content generation, customizable AI agents, all integrated with Douyin, the Chinese version of TikTok. So this is huge, is kind of what I'm getting at. It was ranked as the fourth most popular generative AI app globally by the venture capital firm a16z. So it's worth being aware of. We probably mostly think about ChatGPT, Gemini, Anthropic, et cetera, but over in China, ByteDance and DeepSeek are crushing it.

And now on to the robotaxis, a couple of stories. First, we've got Zoox. The Zoox robotaxis have arrived in Las Vegas. I believe we also covered this a little bit last week, so we're going to chat about it a little bit more. They have launched in Las Vegas, servicing five locations along the Las Vegas Strip within a geofenced area. It's meant to be kind of an easy starting point: flat terrain, minimal bad weather, just generally straightforward. Users can order a ride through an app, similar to Uber, so you can download it and use it, pretty much the same as Waymo. And this is the first time the public is able to use the Zoox vehicle. This is the funny-looking, bidirectional one with no steering wheel and four seats.

And there are no safety drivers. It's like a Waymo. That's exciting.

Exactly, yeah. So this, to my knowledge, is the first time that people are able to use Zoox for real. For real, real. There are only several dozen of these robotaxis, so you have to wait, apparently, 15 minutes, and they are going to be trying to expand to other cities like Los Angeles, Atlanta, Miami, and San Francisco. I've been seeing a lot of these Zoox vehicles out and about around San Francisco, and they're a little funny. You really have to take a look at these cars, because they're interesting designs.

I'm really curious if any of these services are profitable yet, or are they just digging deeper into a hole as they scale up?

I think none of them can be profitable right now, because they all are scaling up, so you have to invest in the capital. Waymo, my impression is, as someone who takes Waymo and someone who reads the news, is supply constrained: they don't have enough cars, they don't have enough depots, and every time you go to a new city you need to get more cars and more operators. So it's going to be a money-hungry business for a while yet.

I mean, are the unit economics even profitable? Do they actually earn money per ride, given how expensive these cars are and how quickly they depreciate? I'm curious to see how this goes.

The Jaguar cars, the first generation of Waymo cars, are very expensive. They do have a new design that hasn't rolled out yet that seems more affordable. So it'll be very interesting to see, as you said, what the unit economics wind up being.
Next up, we've got Waymo, and the news that Waymo is planning to launch a fully driverless robotaxi service in London by 2026. This would be their first international expansion and the first service of its kind in the UK. They are going to start deploying supervised robotaxis with safety drivers, apparently in the coming weeks. So they are going to make this happen, and this is, I think, one of the real questions with Waymo: they've definitely accelerated their expansion. They've gone now to SF, LA, Atlanta, Arizona if I remember, and New York, where I think they're still learning, if I recall.

Will London be their first place that has rain?

It has rain sometimes, but not so much. Atlanta has a lot of rain. But yeah, London presumably will be quite different. So Waymo has to keep speeding up the expansion, and they have to some extent, but this does seem like a pretty ambitious one. And a fun fact from this news article: apparently Waymo also sent two dozen vehicles to Tokyo for a small trial, and they're saying they want to launch a commercial business in that country too.

That's cool. Certainly they're aiming high. And so far, Waymo is definitely in the lead, but it feels like Tesla and Zoox might have a chance to catch up.

Now on to some fundraising news. First up, we've got Reflection AI raising $2 billion to be "America's open frontier AI lab" and to challenge DeepSeek; that's the headline of the story. This is a startup founded by former Google DeepMind researchers. They have raised $2 billion at an $8 billion valuation, up from their $545 million valuation seven months ago. And the pitch, as per the headline, is to be an open source alternative to the other frontier labs like OpenAI and Anthropic. So they raised more money, which is impressive; getting into the billions of dollars is increasingly difficult unless you're one of the established labs. And I guess they are very much like Mistral in their commitment to open source. I'm doubtful of the kind of competitive advantage or potential of this company, I mean, even with $2 billion...

Does this already assume that, I guess, Meta is not going to be making future Llamas open source?

Pretty much. I mean, that's the assumption, I think. So yeah, not too much out of this yet. I think it's just interesting to note that they have this pitch of being an open source alternative to the labs and they're able to raise quite a lot, although they have yet to release their first model at all. So let's hope, I guess, that it comes soon and indeed will be a frontier model like DeepSeek V3 and R1 were.

And on to the next fundraising story: we've got General Intuition getting a $133.7 million seed round to teach agents spatial reasoning using video game clips.

I wish rounds were like this back when I was doing startups.

I know. So this is insane. A seed round, if you're not in startups, is the initial money you get when you're just starting out, effectively. And a $133.7 million seed round is a lot of money. It used to be unheard of; usually seed rounds are $10 million, $20 million, whatever. So this is a very large amount. And the pitch, as per the article, is to focus on AI agents that have spatial-temporal reasoning. I would assume video game clips are just one way that they are hoping to do this. And the pitch is that this kind of intelligence will let you develop AI agents for gaming and things like search-and-rescue drones, and presumably, in the long run, also just drones more generally, robotics more generally.
But $133.7 million is a lot of money.

Do you think this exact number was intentional? Leet million, 1337.

Yeah, 1337 is the exact number they have. It's interesting. It sounds like it's not a startup that's coming from scratch, but is being spun out of another company, called Medal, that has, I guess, a huge dataset. That's actually kind of interesting, and I could see startups launching now having an edge if they have some proprietary thing they're bringing to the table beyond just an idea and skills. So maybe this dataset is something that's unique.

Right. Yeah, that's a good point. So Medal is a platform for uploading and sharing video game clips, and this is spun out of it. It's also interesting why they needed to spin this out as opposed to just doing it within Medal, but getting $133 million helps. So the dataset, as you said: they have 2 billion videos per year and 10 million monthly active users uploading a lot of Roblox and a lot of other video game clips. So certainly a lot of data to work with to train video game agents and, hopefully, other types of agents. It'll be interesting to see if they can.

And just one last story on financing and valuation: Supabase has raised $100 million in a Series E funding round and has now reached a $5 billion valuation. This is just four months after a $200 million Series D round at a $2 billion valuation, and usually you have much longer periods between fundraising rounds. Supabase, if you don't know, is a provider of databases. They were founded in 2020 as an open source, Postgres-based alternative to Google's Firebase, and they seem to be tailored for AI applications. Typically, if you go to Lovable or Replit, the database backend being used by this AI-created software is more often than not Supabase. And so this valuation, this fundraising, is all a reaction to Supabase being kind of a big winner of the vibe coding rush.

I think it's so interesting that model preferences are this new category of advantage. If your product is a good fit for vibe coding and it's well represented in the training data, then you can sort of get lock-in like that, and it will only be self-perpetuating as the next generation of apps all continue to use Supabase. I think that's a really interesting business trend.

Yeah. I think it's also the case that these platforms, Lovable and Replit, integrated Supabase sort of intentionally. Typically, when you use these platforms, you have to then connect and authenticate an account, and more often than not it tells you to use Supabase. Supabase does provide multiple things, not just a database: they have authentication APIs, file storage, vector stuff, all the stuff that is easy to set up for an AI agent. And assuming the vibe coding hype doesn't die down and the seeming promise of Lovable and Replit keeps going, Supabase is going to keep rising along with them.

And now for a couple of open source stories. First, we've got Neuphonic open-sourcing NeuTTS Air. This is a 748-million-parameter on-device speech language model with instant voice cloning. I guess that long headline kind of covers it, but this combines a very tiny, roughly 500-million-parameter Qwen backbone with a new neural audio codec, and that allows you to do voice cloning from approximately three seconds of reference audio.

Is this a good thing or a bad thing? That's a question, right?
So this is meant to allow anyone to do voice cloning locally. In some sense, it lets you do voice cloning of yourself without giving your voice to some third party like ElevenLabs. On the other hand, it also makes it so you can do it without any supervision, so you could clone anyone else's voice without anyone being able to call you out or stop you.

Andrey, how will your listeners know once you have automated this podcast fully?

They won't. They won't. Well, I think the video quality, if you go to YouTube, will be hard to fake, because I have a very distinct style of not trying very hard to make it look professional.

You're right. That's the true mark of humanness these days.

Exactly. It's not polished.

An artisanal, handcrafted podcast.

Exactly, exactly. But as always when covering these kinds of models, I think it's notable to cover audio generation and voice cloning, because, compared to open source LLMs, audio is a tougher realm to work in: there's not as much data to easily get to train these models. And we have seen a lot of progress in open source. First we got speech-to-text, and now text-to-speech is also starting to catch up.

Another open source story: we've got Anthropic releasing Petri, an open source framework for automated auditing, using AI agents to test the behaviors of target models in diverse scenarios. So this is an open source framework that tests other agents for various things. It has a 36-dimension safety rubric and is meant to align with the UK AI Safety Institute's Inspect evaluation framework. The tool, broadly speaking, is able to judge the performance of other models and detect misalignment of various forms. Apparently they ran a pilot study where it was applied to 14 frontier models using a whole bunch of instructions, and it uncovered behaviors like autonomous deception, oversight subversion, whistleblowing, and various things like that. So I think it's useful. We surprisingly don't have that many automated tools for auditing and for alignment verification, and this comes from Anthropic, obviously very safety and alignment focused. It seems like a very useful resource for anyone building AI agents, really.

And now on to Research and Advancements. First, we've got "The Art of Scaling Reinforcement Learning Compute for LLMs," a big collaboration across Meta, UT Austin, UCL, UC Berkeley, Harvard, Periodic Labs, a whole bunch of organizations. The gist of the paper is: can we get a scaling law for reinforcement learning? Obviously, reinforcement learning as the final stage of training, used to get reasoning agents able to solve more complex tasks with reasoning traces, has been all the rage for a year now. But we haven't really had a ton of understanding of how to best do RL; there have been a lot of papers exploring different aspects of this. In this paper, they present a study, basically an empirical evaluation, of how to get a good RL recipe that predictably improves performance across GPU hours, and they call their result ScaleRL. A couple of notes here. Unlike your typical scaling law, where you have a linear fit, the more data and training you put in, the better, and typically you see these log-loss plots where you are looking at the perplexity loss of the LLM as it does next-token prediction...
Here, the laws are for pass rate, so effectively accuracy on a given task, because they're doing RL on the sorts of things that reasoning models are trained on, typically math problems, for instance, or coding. And the scaling law is actually of a sigmoid shape: you see kind of a sideways S-curve, I guess you could call it. It starts a bit slow, speeds up, and eventually levels off. It turns out that is the form of scaling you can predict with reinforcement learning. They did various evaluations.

That's really fascinating to me, that there are sort of greater-than-linear returns to scaling in the beginning steps. That's very surprising.

Yeah, there are a variety of little things that they found here. The exciting bit is that they ran 400,000 GPU hours of empirical research in total, and their results are based on large-ish models: an 8-billion-parameter dense model and a 17-billion-by-16 mixture-of-experts model. So quite reasonably large LLMs, and these laws are shown to apply for these sizes of models. And the gist of the recipe is fairly straightforward. First, they don't do GRPO; they do a newer proposed version of the RL loss called GSPO. Way too nitty-gritty to get into the details, but DeepSeek R1 was GRPO; a lot of RL training was with that form of loss, which has since been refined, and people have found a slightly better recipe. Another thing they do is use pipeline RL, also a recent approach from 2025, used by Mistral, as an alternative to PPO. So, again, nitty-gritty details we won't get into. And they have some other details in there, like, I believe, the numerical precision, the 32-bit floating-point representation of the weights in the model. So they ran a whole bunch of experiments and found that with this set of decisions you're able to (a) get predictable improvements and (b) do much better than all the other kinds of RL setups, better than GRPO, DAPO, MiniMax, all these previous papers from just this past year. With the ScaleRL recipe, you get much more efficient and predictable performance. So from a practitioner standpoint, from a trying-to-understand-how-to-do-this-properly-at-scale standpoint, a very exciting paper.

And on to the next one: we've got "Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity." This is kind of a fun one. The challenge here is: how do you get your LLM to produce actually varied outputs? If you just say "tell me a joke about coffee," you would usually get very repetitive things. You can do some things like tweak the decoding, increase the temperature, and so on, but you are still going to get fairly repetitive types of outputs if you just ask a model to do something. And they have this one trick for getting varied outputs, which is to put "generate five responses with their corresponding probabilities" in the prompt. A slightly longer version of this is something like: you are a helpful assistant; for each query, please generate a set of five possible responses, each within a separate response tag; responses should each include a text and a numeric probability; please sample at random from the full distribution or the tails of the distribution. And just telling the LLM to sample from a whole bunch of possible outcomes gives you varied outputs, is what they show in this paper. And there are some pretty fun examples of doing jokes, of writing stories, various things like that.
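Here's a minimal sketch of that prompt trick, assuming a generic chat-completion-style client. The JSON response format and the `llm_complete` callable are placeholders for illustration, not the paper's exact implementation.

```python
# Verbalized sampling, sketched: ask the model to verbalize a distribution of
# candidate responses with probabilities, then sample from that distribution.
import json
import random

PROMPT = (
    "You are a helpful assistant. For the query below, generate five possible "
    "responses, each with a numeric probability. Sample from the full "
    "distribution, including the tails. Return a JSON list of objects with "
    '"text" and "probability" fields.\n\nQuery: Tell me a joke about coffee.'
)

def verbalized_sample(llm_complete) -> str:
    """llm_complete: any callable mapping a prompt string to a completion string."""
    candidates = json.loads(llm_complete(PROMPT))
    weights = [c["probability"] for c in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]["text"]
```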
I'm curious which models they tested this on, because this feels like it just comes down to a prompting technique, although I agree it is a clever one.

Yeah, looking a bit more, they have results on GPT-4, DeepSeek R1, some alternatives like fine-tuning, and human baselines. They also have outputs for images, which is interesting. And there are actually a lot of empirical results in the paper comparing various ways of doing things. They do measure diversity scoring and quality scoring, and show that this pretty-much prompt hack turns out to give you a lot of the diversity that you could otherwise only get via things like fine-tuning.

And on to the next one: "Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks." This is a fairly straightforward idea. The memory-as-action framework explores the problem that, when you have an agent, eventually it runs long enough that it can't fit the entire history of what it's doing into its context, and so you need to do some sort of context curation. They are looking into whether it's doable to just have the agent do this itself. They give it a dedicated prune-context tool, which can be called by the agent with two arguments: the agent provides a summary and decides on a list of stuff to delete, with the summary standing in for what's removed. There are some other details in the paper on how you allow the use of this tool while still allowing for reinforcement learning training; there are some assumptions that get tricky with regard to trajectories being usable for training. But, as you might expect, once you give the agent this ability, on tasks where you need a lot of context, on multi-objective QA, their main evaluation, they found that as you scale to more objectives, you are able to do much better if you let the model control and curate its own memory in this way.

Did they fine-tune a model to do this, or is it purely prompting at inference time?

Here, I believe it was purely prompting; they just set the agent out to do it. Again, I imagine if it was trained into the model, it would learn even better how to manage its context. Actually, looking a bit deeper, they do a couple of things. They provide the tool, and they also explore how to do the training to make the model able to use it. There's a mix of supervised learning, where they use an off-the-shelf model, prompted to use the tool in a certain way, to create trajectories that you can then imitate, and afterwards they did reinforcement learning. So, in a way, similar to DeepSeek R1, where you do a mix of imitating trajectories and then RL, here learning from these multi-hop question-answering tasks. And then you're able to get an agent that can do this sort of thing. In a sense it's not too surprising, but in another sense it's obviously the sort of thing you need agents to be able to learn to do over time, and I think we'll start seeing a lot more papers of this kind as we keep exploring agentic AI.
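A minimal sketch of what that prune-context tool might look like on the agent side; the function name, argument names, and summary format here are assumptions for illustration, not the paper's exact interface.

```python
# Memory-as-action, sketched: the agent calls a pruning tool with a summary
# plus the indices of history entries to drop; the summary replaces them.
def prune_context(history: list[str], summary: str, drop: list[int]) -> list[str]:
    """Return a new context with the dropped entries replaced by a summary."""
    dropped = set(drop)
    kept = [msg for i, msg in enumerate(history) if i not in dropped]
    return [f"[summary of pruned context] {summary}"] + kept

# Usage: as the transcript nears the context limit, the agent might emit
# prune_context(history, summary="Steps 1-3 of the launch plan are done", drop=[0, 1, 2])
```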
Next up, we've got "Base Models Know How to Reason, Thinking Models Learn When." This paper follows up on some previous research we've covered before. Over the past year, as people have explored reasoning models and how they differ from base models, it has become pretty well understood that the main difference between reasoning models and pre-trained models is not necessarily intelligence, but rather exploration, or the way the inference is structured. Reasoning models tend to adopt certain behaviors that let them do better: things like planning, backtracking, estimating uncertainty, listing examples, various things like that. And we've seen that RL basically extracts this inherent, existing capability from within the pre-trained model. So in this paper, they took that idea and tested it directly, by saying: what if we take a base model and, instead of training it to do reasoning, we have another model, a little classifier, that says, okay, at this point in your answer, start doing this kind of behavior. The behavior might be backtracking, arithmetic, uncertainty estimation, etc. And based on the classifier output, you inject a steering vector, so basically you mess a little bit with your base model to make it do this behavior of backtracking or problem restatement or whatever else. What they found is that with this kind of hybrid model, where you don't train a new model for reasoning or thinking but you make your base model do these things that seem to be useful, you can get close-ish to what you would get if you actually trained for thinking. You're able to get 50%, 60%, 80% of the way there if you evaluate on benchmarks like GSM8K. So, pretty much extending prior empirical investigation of what thinking models are and what they do, and confirming some of the previous research along this line.

I think that makes sense. An analogy I always like to give is that when you're training a dog to sit or lie down, you're not actually teaching it how to do those things; it already knows how to sit down. You're teaching it when to do those things. And fine-tuning can be very similar: these behaviors already exist in the base model.

Yeah, exactly. And just one more paper: "Cautious Weight Decay." This is from UT Austin and Google, and it seems like a bit of a mind-blowing paper, a bit nerdy, but potentially very meaningful. The short version of what they do here is they find a very simple tweak to the optimizer rule used when you train your neural nets. There are various algorithms for how to apply weight updates to a given model; typically what we use these days is AdamW or Adam. The base one is stochastic gradient descent, where you find your gradient from the output, you have some sort of learning rate, and you change each parameter, making it bigger or smaller by some amount. AdamW, Adam, all these different rules are extensions of that base stochastic gradient descent which add weight decay over time. And their very small tweak is to apply weight decay only when the update and parameter signs align. It's just a tiny change, with this kind of guardrail, for which they have various mathematical justifications we won't get into. But when you add this tiny tweak and look at the scaling plots of loss over time as you train, for various sizes of models, 338 million, 1 billion, 2 billion parameters, what you find is a strict improvement: the loss starts off lower and is continually lower for the entire range of steps.
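Here's a hedged sketch of that one-line idea as it might look in a decoupled (AdamW-style) update step. This is just the rule as described in the episode, written in PyTorch; it is not the authors' actual code.

```python
# Cautious weight decay, sketched: apply decoupled weight decay only on
# coordinates where the optimizer update and the parameter have the same sign.
import torch

@torch.no_grad()
def cautious_decay_step(param: torch.Tensor, update: torch.Tensor,
                        lr: float, weight_decay: float) -> None:
    # Standard decoupled decay would be: param -= lr * weight_decay * param
    # Here the decay is masked to the dimensions where the signs align.
    mask = (update.sign() == param.sign()).to(param.dtype)
    param -= lr * weight_decay * mask * param  # cautious decay
    param -= lr * update                       # the usual optimizer update
```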
And presumably this holds beyond that: the 2-billion-parameter result seems pretty likely to hold as you go to 100 billion parameters. They experimented here with 20,000 high-end NVIDIA GPUs. So a tiny tweak to your update equation might actually be very meaningful here.

To dig into some of the intuition: by only doing weight decay when the update and parameter have the same sign, you're basically only letting weight decay push back against the update when they agree. If the parameter is positive but the update is negative, the weight decay would just pile onto the update, and maybe the intuition here is that this prevents overshooting or something like that. That's maybe a possible intuition for it. It's exciting. I love simple changes like this, though I'm always skeptical when I see someone trying to beat Adam. We'll see in a year or two if this holds up.

So weight decay, broadly speaking, mixes the actual optimizer update, whatever direction your gradient tells you to go, with some shrinking of the previous value of the parameter. You sort of regularize, smooth things out a little bit, compared to doing just the weight update. What they say is that because this weight decay is agnostic to the directional alignment between the optimizer update and the parameters, it may hurt performance when they conflict. Intuitively, when the update u_t and parameter x_t point in the same direction for a given dimension, weight decay acts as a regularizer that improves stability; but when their directions differ, applying decay actively resists beneficial movement toward the optimum.

So it's kind of slowing you down, right?

Yeah, which is the purpose of weight decay, to smooth things out. But it turns out that sometimes you don't want to smooth things out; sometimes you want to move.

Right. And so, in a sense, this is speeding you up when it makes sense to. They have a bunch more math here, but that seems like the intuition. And with this one-line modification, without any additional hyperparameters, you just drop it into your training and you're able to get much nicer results. So it seems exciting, as far as I can tell.

Yeah, absolutely.

Alrighty, on to Policy and Safety. First, we've got California becoming the first state to regulate AI companion chatbots. They just signed SB 243 into law. This is a law aimed at protecting children and vulnerable users from harms associated with AI chatbots from Meta, OpenAI, Character.AI, Replika, etc. The law, which will be effective as of 2026, mandates age verification, warnings, and protocols for suicide and self-harm interactions, and has penalties for illegal deepfakes. Chatbots must clearly indicate that interactions are AI-generated, cannot pose as healthcare professionals, and must have various other safeguards for minors. So this is notable for a couple of reasons. First, on the side of things like Character.AI and Replika: in case people don't know, these are services where you role-play with AI characters. So it's not like ChatGPT; these services let you explicitly talk to characters played by AI, which could lead to things like emotional manipulation and very bad scenarios. And of course, on the ChatGPT side, we know there have been news stories of at least several instances of teenagers, young people, talking to the chatbots and in some cases being encouraged to do bad things.
So this is a law very much in reaction to that, and it makes a lot of sense, it seems; OpenAI and others are already doing these things, introducing age controls and so on for minors. And, you know, it's California, so effectively it's now regulation for the entire country, because California just gets to do that.

Yeah, I mean, I think this makes sense. Really, every single other form of media and technology we have has laws around age restrictions and what's safe for kids, so I think this makes a lot of sense.

And just one more story in this section: "Analysis: over 50% of the internet is now AI slop, according to new data." This is data from Graphite, covered in this case by Axios, claiming that over 50% of new internet articles are AI-generated. This used the AI detector Surfer, which is questionable as these things are, but they sampled 65,000 English-language articles from January 2020 to May 2025, and they count an article as AI-generated if 50% or more of its content was written by a large language model. They are saying that the proportion of AI-generated articles surged from about 10% in late 2020 to over 40% in 2024, and is around 52% now. Now, AI detection doesn't really work for the most part, or doesn't work reliably, I should say, so this is of course something to take with a grain of salt.

I guess they're using exactly the same detector on the older articles as well. Certainly for individual cases, AI detection is almost impossible, but I'm curious whether, in aggregate, if all the stuff from this year is flagged twice as often as the pre-ChatGPT material, that sounds directionally correct.

Right. I mean, it was also easier, I think, back in 2022, 2023. They do use this one tool, Surfer, but Graphite did actually evaluate the detector, which is interesting. They found the detector labeled human-written articles as AI-made 4.2% of the time, but only mistook AI-written articles as human 0.6% of the time. So the detection accuracy, at least per this evaluation, seems fairly reasonable, or somewhat reliable. Interesting.
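As a quick sanity check on those error rates, here's the Bayes arithmetic: given a 4.2% false-positive rate, a 0.6% false-negative rate, and the study's own roughly 50% prevalence estimate, an "AI" label would be right about 96% of the time. The prevalence figure is the study's claim; the rest is just arithmetic.

```python
# How trustworthy is an "AI-generated" label from this detector?
fpr = 0.042        # human articles mislabeled as AI
fnr = 0.006        # AI articles mislabeled as human
prevalence = 0.50  # the study's estimated share of AI articles

p_flagged = (1 - fnr) * prevalence + fpr * (1 - prevalence)
p_ai_given_flag = (1 - fnr) * prevalence / p_flagged
print(f"P(actually AI | flagged as AI) = {p_ai_given_flag:.1%}")  # ~95.9%
```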
And just a couple more stories, on synthetic media and art. First, we've got OpenAI reversing their stance on the use of copyrighted works in Sora. This is a pretty minor story, but I think a little bit funny. When Sora first came out, very quickly people started posting clips of TV shows and other popular media, of course. It turned out that you could just make anything with Sora, as long as it was sort of age-appropriate. So people made South Park episodes, they made Family Guy clips, they made Mario do questionable acts, they made Martin Luther King appear in various scenarios. And OpenAI clearly went in with the approach of just letting anyone do anything and asking for forgiveness later. Well, people did do what they were expected to do, and I believe within days of the release, OpenAI posted a blog post where they kind of addressed it. In this blog post, Altman said that there will be a new policy for characters protected by copyright, an opt-in model for likeness, with additional controls. I don't know if this has rolled out yet, but they did start rolling out more protections. Another recent story was that OpenAI explicitly blocked generations of MLK Jr., because people were making bad things happen there. So anyway, it's pretty ridiculous. OpenAI just went and trained on literally everything and let people do literally anything with regard to copyright; that had the predictable results and outcomes, and now OpenAI seems to be trying to dial it back.

Yeah, I'm very curious about Disney. I'm sure there was a lot of discussion behind the scenes there, with Disney especially.

Yeah, exactly. And speaking of Disney, the last story is about Character.AI. As I just mentioned, Character.AI is a platform where people can create characters to chat with, broadly speaking. And the story is that they are removing Disney characters from the platform after the studio issued warnings. There was a letter dated September 18th criticizing Character.AI for having Disney-type characters, things like Princess Elsa and Darth Vader. So they took them down. And I think it's an interesting comparison point, where there's no media creation with these characters, it's just a chatbot, just text, but regardless, Disney is going after them. So yeah, OpenAI is going to be hearing from some lawyers very soon, I think.

Well, that is it for this episode. Lots of fun little updates, as always. Apologies for being a bit late and covering two weeks of news; we will try to be back in just one week next time. Thank you, Eric, for filling in. It was a lot of fun chatting.

Thank you very much.

And thank you to all the listeners for tuning in. As always, you can go to lastweekin.ai to subscribe to the newsletter, and you can share, rate, review, etc. As always, I'll try to keep an eye out for your feedback, but more than anything, be sure to keep tuning in.

[Outro song] Last week in AI, come and take a ride. Get the lowdown on tech and let it slide. Last week in AI, come and take a ride. From the labs to the streets, AI's reaching high. New tech emerging, watching surgeons fly. From the labs to the streets, AI's reaching high. Algorithms shaping up the future seas. Tune in, tune in, get the latest with ease. Last week in AI, come and take a ride. Get the lowdown on tech, can't let it slide. Last week in AI, come and take a ride. From the labs to the streets, AI's reaching high. From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten. On the edge of change, with excitement we're smitten. From machine learning marvels to coding kings. Futures unfolding, see what it brings.
