Last Week in AI

#226 - Gemini 3, Claude Opus 4.5, Nano Banana Pro, LeJEPA

Last Week in AI • Andrey Kurenkov & Jacky Liang

Sunday, November 30, 2025 · 1h 11m

What You'll Learn

  • Gemini 3 by Google achieved record scores on tough benchmarks like 'Humanity's Last Exam'
  • Nano Banana Pro can perform advanced image editing tasks like generating infographics and maintaining consistency across multiple images
  • Opus 4.5 by Anthropic outperformed Gemini 3 on benchmarks like 'Humanity's Last Exam' and is more cost-effective
  • Anthropic highlighted Opus 4.5's improved alignment and released tools like Claude for Chrome and Claude for Excel
  • There are concerns about the difficulty of identifying AI-generated content as these models become more advanced

Episode Chapters

1

Introduction

The hosts introduce the episode and the key topics they will be discussing.

2

Gemini 3 by Google

The hosts discuss the release of Gemini 3 by Google and its impressive performance on benchmarks.

3

Nano Banana Pro by Google

The hosts cover the release of Nano Banana Pro, Google's advanced image editing model, and its capabilities.

4

Opus 4.5 by Anthropic

The hosts discuss the release of Opus 4.5 by Anthropic, its performance on benchmarks, and its improved cost-effectiveness.

5

Implications and Concerns

The hosts touch on the implications of these powerful AI models, including concerns around the difficulty of identifying AI-generated content.

AI Summary

This episode of the Last Week in AI podcast covers the latest developments in large language models (LLMs) and generative AI. The hosts discuss the release of Gemini 3 by Google, which achieved impressive results on benchmarks, as well as Nano Banana Pro, Google's advanced image editing model. They also cover the release of Opus 4.5 by Anthropic, which outperformed Gemini 3 on some benchmarks and is more cost-effective. The episode touches on the implications of these powerful AI models, including concerns around the difficulty of identifying AI-generated content.

Key Points

  1. Gemini 3 by Google achieved record scores on tough benchmarks like 'Humanity's Last Exam'
  2. Nano Banana Pro can perform advanced image editing tasks like generating infographics and maintaining consistency across multiple images
  3. Opus 4.5 by Anthropic outperformed Gemini 3 on benchmarks like 'Humanity's Last Exam' and is more cost-effective
  4. Anthropic highlighted Opus 4.5's improved alignment and released tools like Claude for Chrome and Claude for Excel
  5. There are concerns about the difficulty of identifying AI-generated content as these models become more advanced

Topics Discussed

#Large language models #Generative AI #Image editing #AI benchmarks #AI alignment

Frequently Asked Questions

What is "#226 - Gemini 3, Claude Opus 4.5, Nano Banana Pro, LeJEPA" about?

This episode of the Last Week in AI podcast covers the latest developments in large language models (LLMs) and generative AI. The hosts discuss the release of Gemini 3 by Google, which achieved impressive results on benchmarks, as well as Nano Banana Pro, Google's advanced image editing model. They also cover the release of Opus 4.5 by Anthropic, which outperformed Gemini 3 on some benchmarks and is more cost-effective. The episode touches on the implications of these powerful AI models, including concerns around the difficulty of identifying AI-generated content.

What topics are discussed in this episode?

This episode covers the following topics: Large language models, Generative AI, Image editing, AI benchmarks, AI alignment.

What is key insight #1 from this episode?

Gemini 3 by Google achieved record scores on tough benchmarks like 'Humanity's Last Exam'

What is key insight #2 from this episode?

Nano Banana Pro can perform advanced image editing tasks like generating infographics and maintaining consistency across multiple images

What is key insight #3 from this episode?

Opus 4.5 by Anthropic outperformed Gemini 3 on benchmarks like 'Humanity's Last Exam' and is more cost-effective

What is key insight #4 from this episode?

Anthropic highlighted Opus 4.5's improved alignment and released tools like Claude for Chrome and Claude for Excel

Who should listen to this episode?

This episode is recommended for anyone interested in Large language models, Generative AI, Image editing, and those who want to stay updated on the latest developments in AI and technology.

Episode Description

Our 226th episode with a summary and discussion of last week's big AI news! Recorded on 11/24/2025.

Hosted by Andrey Kurenkov and co-hosted by Michelle Lee. Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai. Read our text newsletter and comment on the podcast at https://lastweekin.ai/

In this episode:
  • New AI model releases include Google's Gemini 3 Pro, Anthropic's Opus 4.5, and OpenAI's GPT-5.1, each showcasing significant advancements in AI capabilities and applications.
  • Robotics innovations feature Sunday Robotics' new robot Memo and a $600M funding round for Physical Intelligence, highlighting growth and investment in the robotics sector.
  • AI safety and policy updates include Europe's proposed changes to GDPR and AI Act regulations, and reports of AI-assisted cyber espionage by a Chinese state-sponsored group.
  • AI-generated content and legal highlights involve settlements between Warner Music Group and AI music platform Udio, reflecting evolving dynamics in the field of synthetic media.

Timestamps:
(00:00:10) Intro / Banter
(00:01:32) News Preview
(00:02:10) Response to listener comments

Tools & Apps
(00:02:34) Google launches Gemini 3 with new coding app and record benchmark scores | TechCrunch
(00:05:49) Google launches Nano Banana Pro powered by Gemini 3
(00:10:55) Anthropic releases Opus 4.5 with new Chrome and Excel integrations | TechCrunch
(00:15:34) OpenAI releases GPT-5.1-Codex-Max to handle engineering tasks that span twenty-four hours
(00:18:26) ChatGPT launches group chats globally | TechCrunch
(00:20:33) Grok Claims Elon Musk Is More Athletic Than LeBron James — and the World's Greatest Lover

Applications & Business
(00:24:03) What AI bubble? Nvidia's strong earnings signal there's more room to grow
(00:26:26) Alphabet stock surges on Gemini 3 AI model optimism
(00:28:09) Sunday Robotics emerges from stealth with launch of 'Memo' humanoid house chores robot
(00:32:30) Robotics Startup Physical Intelligence Valued at $5.6 Billion in New Funding - Bloomberg
(00:34:22) Waymo permitted areas expanded by California DMV - CBS Los Angeles - Waymo enters 3 more cities: Minneapolis, New Orleans, and Tampa | TechCrunch

Projects & Open Source
(00:37:00) Meta AI Releases Segment Anything Model 3 (SAM 3) for Promptable Concept Segmentation in Images and Videos - MarkTechPost
(00:40:18) [2511.16624] SAM 3D: 3Dfy Anything in Images
(00:42:51) [2511.13998] LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

Research & Advancements
(00:45:10) [2511.08544] LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
(00:50:08) [2511.13720] Back to Basics: Let Denoising Generative Models Denoise

Policy & Safety
(00:52:08) Europe is scaling back its landmark privacy and AI laws | The Verge
(00:54:13) From shortcuts to sabotage: natural emergent misalignment from reward hacking
(00:58:24) [2511.15304] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
(01:01:43) Disrupting the first reported AI-orchestrated cyber espionage campaign
(01:04:36) OpenAI Locks Down San Francisco Offices Following Alleged Threat From Activist | WIRED

Synthetic Media & Art
(01:07:02) Warner Music Group Settles AI Lawsuit With Udio

See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

Full Transcript

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual in this episode, we will summarize and discuss some of last week's most interesting AI news. You can also go to lastweekin.ai for our newsletter with even more articles. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school. I now work at the startup Astrocade. And with me, once again, guest co-host, Michelle Lee. Hi, everyone. I am Michelle Lee. I did my PhD at Stanford with Andrey, and I am currently the founder and CEO of Medra. We are building physical AI scientists. Yeah. Back in the day, we both used to do robotics research. I went the easy route to do generative AI, producing video games, but you stuck with it and are actually doing science. So it's good that you're fighting the good fight. Absolutely. I mean, I think right now, even looking at the news today, there is a lot of very exciting AI and robotics news. So, Andrey, you should get into robotics. I really might. We'll see. And speaking of what we'll be discussing, it's going to be a pretty exciting episode, actually. We have had kind of a quiet stretch, but this past week things popped off in a way they haven't in a little while. So, of course, we're going to be talking about Gemini 3, going to be talking about Nano Banana Pro, Opus 4.5. And then beyond that, we do have some exciting news on the robotics front with a new startup emerging out of stealth and a bunch of funding for one established one. Quite a few interesting open source releases. We'll be touching on some research on interpretability and on alignment also that we'll discuss, from Anthropic. So overall, quite a few interesting things in this one; I would recommend you stick around. Before we dive in, I do want to acknowledge real quick, there's a new review on Apple Podcasts called "Best AI Podcast - Still Alive?", with the question mark. So we did switch to a new hosting provider, and I hope that it's still being posted to everyone's feed. If somehow there's some technical glitch and you're not seeing it, please do let me know. And moving into the stories, starting with tools and apps, we begin with Gemini 3. So somehow everyone was aware that Gemini 3 was going to be released last week. And so it was. Gemini 2.5 Pro, I think, came out early this year, around February. It was kind of mind-blowing at the time. That was really the first Gemini where people were like, wow, Google is starting to get back into the lead. Gemini 3 is kind of as big a release. People were very excited, and people are not disappointed with this release. It got really impressive results on the toughest of benchmarks, including Humanity's Last Exam, where it got a record score of 37.4. And this is the benchmark that was developed to be the hardest thing that we could possibly come up with so that LLMs couldn't beat it. It's like half a year old. It used to be that there was zero progress on it. Now it's actually increasingly being solved. So overall, a lot of just impressive results coming out of it. They also announced Google Antigravity, which is their new coding IDE, which I guess is to compete with Cursor, coming a couple months after they sort of acquired Windsurf, but not really, which some people on Twitter were a little amused by. But yeah, Gemini 3, super exciting. And there's also Gemini 3 DeepThink, which is more research focused and can go even deeper.
Yeah, I mean, there was some news this past week also about a leaked OpenAI memo from Sam Altman saying that OpenAI might be in trouble given how fast Google has been making progress in AI. So it's very exciting to see these really big releases coming out. Yeah, I saw some discussion also. It was pointed out that Gemini 3 was trained entirely with Google's TPUs, which isn't a new thing for these models. I believe Google has been pretty much using TPUs for training their frontier models for years and years. But it generated some conversation this time around. And it is notable because Alphabet is kind of alone in not depending on NVIDIA as much. Because they have their own chip design, they produce them. And they are able to train Gemini 3 and these incredible models with their own hardware. By the way, also the rollout here was pretty impressive. It came out to like every single thing that Google has, with no technical issues. It went live to presumably hundreds of millions of people. So we haven't seen a really big kind of LLM jump; GPT-5 was kind of disappointing, and this was sort of it for the first time in at least a couple months. Definitely. Now, alongside that, possibly the bigger news of the week actually out of Google last week was Nano Banana Pro. So just a couple of days after Gemini 3 Pro, they released also this Pro version of Nano Banana. Nano Banana is primarily their image editing model. It also generates images, and it was already impressing people with how capable it was at sort of really fine, accurate editing, where it can do things like rotating figures, maintaining the actual content of whatever it's dealing with. And now with Nano Banana Pro, which is apparently powered by Gemini 3, the stuff that people are doing is like mind-blowing. You can give it a math problem as an image, a photo of a page with a math problem, and it will write out the solution. Sometimes correct, sometimes not, but usually mostly correct. And it will look accurate. So it's really going beyond the phase where it seems like it's just generating pixels. It's like much more this time. Yeah, I mean, as Andrey is saying, Nano Banana was already incredibly impressive as a generative image model. I've definitely used it several times to create some really amazing images. And now with the Pro version, it can also create infographics and slide decks, and can also maintain consistency across multiple images. So I think there are going to be a lot of really exciting use cases. So just real quick, to give some examples that Google themselves provided. One is they have a prompt here, create an infographic about this plant focusing on interesting information. They have an input image of some plant, and then an output of an infographic that has multiple subparts, a lot of text, a lot of text. And there's no weirdness in the text. There's no issues with the layout. There's no kind of artifacting that is visible, at least unless you really try to spot these kinds of issues. So just the degree of accurate text placement, the degree of sort of like correctness in the text and so on is unprecedented. And yeah, this, I think, for many people and myself included, feels like an even more mind-blowing thing than Gemini 3, which is really good, but not as new. Just that there are like very obvious use cases for this, right? Like, and it's impressive. It's something visual. It's really fun. And so I think we're just going to be seeing so many more things on social media and in our day-to-day work.
Like, I feel like, you know, it's really awesome that LLMs can help me write code, help me write emails, but I can also write code and I can also write emails, but I cannot make a comic book. Unfortunately, I don't have the artistic skill to do that. But now with these new generative image tools, we can, which is really cool. Yeah, and to that point, I think one of the kind of cool things about this, and why maybe people are excited about it, is that because it is for image editing or image kind of composition or manipulation, there aren't as many examples of kind of typical AI images, or, some people just like the term, but I think slop is still often kind of what you get with AI-generated images. But because here you're editing, you're kind of manipulating an image, it's usually a little more intentful, and maybe that's, yeah, more exciting for people, to be able to like actually create something that's not just text to image. Yeah. What's also cool, too, is that as these tools get better and better, it is kind of scary, right? That you can't identify which image is made with AI. I actually had my dad send me a video the other day, clearly created by Sora because it has the Sora logo, but he doesn't know. And he's like, is this real? What's going on? Right. And I think what is really interesting is Google also shared that they do have a digital watermark that's imperceptible. So you can actually identify which images are generated with Nano Banana. So I think that's also really interesting that they're also adding that in. Right. Yeah. They have the SynthID watermark, which they've been using for quite a long time. And to be able to check if it's artificial, they actually make it so you can go to Gemini, attach the image, and just ask Gemini if it's generated by AI, which I think is pretty smart. Well, moving right along, we started with Gemini 3. Now we go to the latest model release, coming like less than a week or something like a week after that, which has already eclipsed Gemini 3. And that is Opus 4.5 from Anthropic. So we saw Anthropic release Sonnet 4.5 and Haiku 4.5; for a while we didn't have a new Opus, and now we do, and it's a pretty big deal. Similar to Gemini 3, it just smashed through performance on the toughest of benchmarks, ace scores on Humanity's Last Exam, better than Gemini 3 on most of these benchmarks. Also much cheaper. Anthropic reduced the price of Opus by something like one third. Opus used to be absurdly expensive. Now it's getting to a point of being competitive, although Gemini 3 and GPT-5 are still definitely far cheaper for at least most cases. Yeah, it's very cool that, you know, Gemini 3 made a huge increase on Humanity's Last Exam and Opus 4.5 beat it. I think it got 43 on Humanity's Last Exam. Yeah, as usual with Anthropic, they also released like a 150-page system card going into a crazy amount of detail that we're not going to dive into. This was just announced today. But they also highlight that the misalignment of the model is lower. It's more apt to kind of do what you want, basically. They also rolled out Claude for Chrome and Claude for Excel more widely, to most users really who are subscribing. So trying to keep up with competition on the browser side and on the, I guess, broadly speaking, enterprise business side. So yeah, exciting times with LLMs. I'm excited to actually try out Opus and try it on our own benchmarks.
It's actually quite interesting. At Astrocade, we work on code generation for video games, so we have a fairly specific use case and specific benchmarks, and it's not always the case that the benchmark translates easily. So Gemini 3, for instance, didn't make a huge jump over Sonnet or GPT-5 on our evaluations, and I'll be curious to see if Opus 4.5 does result in a huge jump, because in a lot of these cases you need to write like 2,000, 3,000 lines of code if you one-shot an entire game. And often that's more complex than what these benchmarks like SWE-Bench Verified and so on are looking at. Yeah, well, apparently Opus 4.5 is very good at coding, which continues Anthropic's focus on coding. So very excited to see how well it does. Right. And one last thing I think is worth mentioning here. Another thing that it excels on, where at least it's getting the best results so far, is ARC-AGI 2. So ARC-AGI 2 is a fairly specific benchmark, which is at least intended to actually measure intelligence. So not sort of like a specific use case for coding, but it's a bunch of intelligence puzzles, broadly speaking, focusing on the ability to learn patterns and to solve problems that are not necessarily in the training data. And this model accomplished something like a 35% pass rate on ARC-AGI, quite a bit better than previous models. So yeah, it's 37.6% on this one; Sonnet 4.5 was 13%, Gemini 3 31%, GPT-5.1 17%. So, you know, if you want to be thinking about is this moving us closer to AGI, is this actually smarter or is it just sort of more of the same, I do think Gemini 3 and Opus 4.5 seem like they are another leap qualitatively and not just quantitatively.

And on to the last big LLM announcement. Nestled between all these very big announcements, OpenAI also had an announcement and release, and it was GPT-5.1 Codex Max, which is a variant of their Codex model specialized for coding. So what they are saying is that this model in particular can handle long-running tasks. They say that it can maintain focus on a single task for over 24 hours. It kind of compacts its own memory over time. They also are saying that it's more efficient. It thinks more efficiently. And extra high reasoning, apparently, for it is not a problem. So again, really high benchmark numbers on SWE-Bench Verified for it. It'll be interesting to see if Opus 4.5 really can compare specifically to this model, which is OpenAI trying to be, I guess, competitive head-to-head on coding with top-of-the-line models. Yeah, I think we are starting to see that the longer the AIs can think, you know, test-time compute, right, the better the results are. And I think what was really interesting, too, is when OpenAI earlier this year talked about their LLM being able to win the International Mathematical Olympiad, IMO, they were using models that could think for quite some time. So it's very interesting to see these models come out that actually can continue thinking without starting to degrade in their capabilities as you use them for longer. Right. And Codex in particular, OpenAI has released a Claude Code competitor, but initially they pitched it as this offline model: you give it a task, it goes off and works on it. I think Google also has Jules, if I remember correctly. Claude Code also is kind of providing this new workflow of dispatching an agent and having it go and do something. Which personally, I haven't found to be useful in my own programming. I still am more kind of in the loop on work, talking to Claude Code and Codex and so on.
But as they get better and better, I could see, you know, just spinning off a little task and having it go off and do it. And it'll be reliable enough for that to actually make sense, which so far I haven't found to be the case. Well, just a couple more stories in this section. OpenAI had one more notable release, which wasn't a model, but a feature. They have launched group chats globally. So this is something you can use, I think, in the app and possibly on web. These group chats can go up to 20 people. And then there are personal settings, memory that each user brings in, that are private. So I don't know that this is a thing that exists with the other LLM providers, Gemini, Claude, etc. Yeah, I don't think so. Yeah, there are common workspaces where you share documents, but this is a little bit novel, and I'm sure this just killed like a handful of startups. I know, yeah. Yeah, it's like catering to a whole new kind of category of chatbot use cases almost, which is hard for me to imagine. But obviously, you can think of things like planning, researching, broadly speaking group projects. It'll be interesting to see to what extent people start kind of sharing a chat thread with this kind of feature, and if the other LLM providers and chatbot providers kind of follow suit with this feature. I do feel like of all the major foundation model companies, OpenAI is the most product focused, even with launching Codex. Yeah, I think they are heavily winning on the consumer side still, and yeah, as you said, that's their focus, with ChatGPT, with Sora, with Pulse, with really everything they do. For the most part, they're less focused on coding and enterprise and more on just everyone. Yeah, Gemini is starting to lean into that, but OpenAI has definitely kind of maintained the lead so far, and this will help them keep that lead. Definitely.

One last story in the section, and it's about xAI, but it's not about anything released by xAI. We've got another funny incident with Grok. So, I'm losing track; a long time ago now, there was the MechaHitler incident where people got it to say kind of ridiculous things. Then there was this whole episode with apartheid, where suddenly it started talking about white genocide in South Africa in response to random queries. Then people figured out that it was instructed to say that Trump was not the most dishonest person ever; this was a thing that happened not too long ago. And now people have found that Grok, at least the Grok chatbot on X specifically, not necessarily Grok the LLM, was for some time, for a brief period, instructed to think that Elon Musk is the best human at everything. At every single thing. Yeah, the headline here is Grok claims Elon Musk is more athletic than LeBron James. And basically people on Twitter found that if you ask Grok, hey, is Elon Musk more athletic than X? Is Elon Musk better looking than X, smarter than X? If you can pick one person that is the smartest, most athletic? Every single time, Grok would find a reason and justify the fact that Elon is the best and dominates as a human being. It's pretty funny. Some of the prompts that I've seen on X, and Grok's responses: that, you know, Elon's a better movie star than Tom Cruise, that he is a better communist leader than Lenin and Mao, that he'd be a better porn star, and also that he would be able to win some really crazy contests like feces eating or urine chugging. Yes, Musk has the potential to drink piss better than any human in history. That's a quote.
So the funny shenanigans, xAI sure delivers on them. And you've got to say it for X, X isn't boring. And AI as a field still kind of heavily resides on X. This is part of why we see stuff that's going on there; the AI community by and large is still there. And this sure was funny and amusing. And of course, pretty quickly after people started to post, I think the ability of Grok to respond to these things was disabled. Elon Musk responded and kind of humorously said that he's obviously not that great. I think the quote was, I'm a fat retard, I remember. Yeah. So there you go, Musk's response. Musk blamed this on adversarial prompting by users, which, okay, sure. Somehow someone hacked into Grok and made it say these things. I wonder how that could have happened. Such a strange response. But there you go. Another amusing episode from Grok, and another reason for enterprise customers to really consider Grok as an alternative to Claude, I think.

On to applications and business, starting off with some slightly more boring but still notable news. We had NVIDIA's earnings report last week, which everyone was waiting for with bated breath, and they knocked it out of the park, exceeded expectations to the extent that NVIDIA stock gained by 1% to 1.5%, and even 4% in after-hours trading, after they shared these recent results. So with all the conversation about potentially a bubble, a lot of people positioned this as kind of a welcome relief that might signal that the fears of a bubble are not entirely founded. Although, I mean, if there's a bubble, all the money is flowing into NVIDIA. So I'm not sure if I believe that framing. Well, I think the framing is that there's just a lot of circular dealing, where NVIDIA invests in companies that then spend all that money buying things from NVIDIA, and that even if the money's flowing, it's not coming from somewhere new. And so there is definitely still wariness and some cautiousness from the market. It's exciting that there are strong earnings signals and that the market definitely increased, but also it seemed like there was some caution and not like huge stock price increases. Right. Yeah, I think this helped the vibes kind of calm down a little bit. To highlight one number, Huang said that they have $500 billion in orders for their chips for 2025 and 2026. So that's half a trillion dollars in revenue that's coming in. And NVIDIA also has an absurd profit margin, something like 70%, which is unbelievable. So yeah, NVIDIA continues to rise, and the money is going somewhere. It's not going to pets.com at the very least. And on a similar note, Alphabet's stock also rose. It jumped by 3% following the debut of Gemini 3. People are generally excited, I think, by this past week, and maybe feeling a little bit of relief around this whole, I guess, narrative of an AI bubble, which has been percolating for some months now. Yeah, it seems like analysts are praising Gemini 3 as a state-of-the-art model. So I'm curious how that's changed with Claude coming out with Opus 4.5. Yeah, it's interesting from the investor side, kind of how they measure whether the release lived up to expectations. Obviously, there's benchmarks, but I think what people care about is the vibes of like, is this a good model in practice? And I guess somehow investors can keep up with the overall conversation of the perceptions of the model.
And Gemini 3 also, like for me, when a model releases, I look at the benchmark numbers, but I also just scroll Twitter and see what people say about it, right? The examples they give of their own use of it. And with both Opus and Gemini 3, people were posting a lot of things like, it one-shotted this very hard thing that other models couldn't do. It did X, it did Y. It's actually good. So I guess another thing to note with Gemini 3 and Opus is that the sentiment from people's personal evaluations also adds to the general kind of excitement for the release.

On to the next story, moving away from text to the realm of the physical. We've got the news of a new player in the general-purpose robotics startup space. It is Sunday Robotics. They emerged with a new robot called Memo. Memo, that kind of... that does make sense. Which is, by the way, an awesome design of a robot. I really love it. Oh, cute. It's the best looking humanoid-like robot out there, it looks like. Yes. Yeah. Just to give a brief description, it's not fully humanoid. It has a wheeled base, it has a tall stick to kind of be the back of it, and then you have a torso, two arms, and a cute face with some eyes and a cap. Just a very kind of WALL-E aesthetic, almost in the sleek Nintendo style. Yeah, Nintendo. And so they are positioning this, similar to Figure and 1X and a few other players, as a general purpose robot that can go into your home, do chores for you, things like that. And the standout difference, or the thing they highlighted in particular compared to previous robotics companies, is actually their data collection method. So in addition to the robot itself, they have this glove that sort of is similar to the actual robotic gripper on the robot, which has kind of an index finger and kind of like a section for the rest of your fingers. And so to collect data for manipulation, humans can actually literally wear this glove and do stuff with it and provide demonstrations of how to do things that way, as opposed to the typical thing of usually teleoperating. So you would use like a VR glove or something to... Or another robot that you can control. Yeah. Yeah, exactly. So this is, I haven't seen an example of this kind of thing personally, and it makes a lot of sense. This type of work has definitely appeared in academia, but we haven't seen this quite at scale yet in industry. So it's very exciting to see. And also the co-founders of the company both actually did their PhDs at Stanford, the Stanford AI Lab, where Andrey and I also did our PhDs. They definitely were working on very similar things while at Stanford, both teleoperation and also using gloves or other tools to be able to mimic human hands, so that you can then basically translate that kind of demonstration data onto a robot. Right. I'm also personally somewhat, I'm a fan of non-humanoids. This is my kind of bias. I think a kind of mobile base with two hands is kind of the sweet spot, so that you can focus on just the manipulation aspect of it. I'm actually bullish on a four-legged base. Oh, quadrupeds? But it looks weird. But it looks really weird compared to a wheel or legs. It looks not humanoid-like. But if you actually wanted a robot that can go upstairs and could work in areas where wheels can't, I actually think the best, most stable base would be to do quadrupeds. It's true. Boston Dynamics has actually sold dogs with arms on top of them. It looks super scary and awful. Yeah. But I think it's the best technical solution.
As you mentioned, the two leads on this, Zipeng Fu and Tony Zhao, previously worked on a similar kind of wheeled base with two arms, this project called Mobile ALOHA, advised by Chelsea Finn. And that brings us to the next story, which is about Physical Intelligence. And it's about a new funding round, $600 million going to Physical Intelligence. That brings its valuation up to $5.6 billion. The investors here include CapitalG, which is a part of Alphabet, so kind of a spinoff of Google. Also getting money from existing investors, Jeff Bezos included. Physical Intelligence, so far, we've covered them a few times. They've released several, not research per se, but demonstrations of new models, new results in general purpose manipulation. I guess this is a very bullish sign from the investors that they believe that this company can take it all the way from the current work on just demonstrating general purpose manipulation to actually generating revenue in hopefully not too long. It's definitely very cool to just see also new releases come out of Physical Intelligence. I mean, they recently released a new AI model using reinforcement learning, and it improved task completion rate by 2x. Wow. Yeah, I actually missed, I somehow missed this Pi 0.6 vision language action model, which had a pretty detailed release, actually a whole research paper from Physical Intelligence, which is part of why I'm a big fan of them; they've released some pretty cool results demonstrating what they're doing. Also their logo, or like, I don't know if it's a logo or what, but Physical Intelligence, they have the pi symbol often, which is just a nice touch.

And one last story in the robotics domain, this time in self-driving. Waymo is expanding yet again. So we've got the news now that they're set to enter three more cities, Minneapolis, New Orleans, and Tampa. That's on top of Los Angeles, San Francisco, Phoenix, Austin, and Atlanta. They recently got permits to go to some other cities; I forget what those are. And on top of that, they've expanded their supported region within the Bay Area. So far, they've covered San Francisco down to Mountain View, kind of the stretch of basically the core Bay Area, where all the tech companies are and so on. And soon, they are going to be able to service a huge chunk all around the Bay Area, kind of a pretty big expansion, like 5x the territory. So this is all to say the rate of expansion seems to be accelerating. We've been covering more and more of these kinds of stories, and that's exciting. Yeah, it definitely is very cool to see Waymo really expanding in the Bay Area. Do you know if they can get on the highway now? Yeah, I can technically take it from San Francisco to Los Altos, where I work. Well, nice. Instead of taking the train, but it does cost like $110. So I don't know that I'll be doing that any time soon. I mean, I think the costs of these are just going to get driven down, right? Like the true cost is really the models that they've developed. So as they can expand into more cars and have that cost be amortized out by the cars and rides, I think that the prices will drop. It's not like Uber, where you're always going to have some kind of minimum payment; you have to pay the drivers.
So I think it's really very cool that Waymo is expanding so much and also will provide a lot of value to people who live in the Bay Area. Exactly. Yeah, I think besides the model there's also the hardware question, and so far I believe they're still deploying the same model in all these cities that they have been using. I look forward to seeing their custom-made, fully-autonomous-vehicle-looking thing that they're working on, which will also have better unit economics and hopefully drive that cost down so they can manufacture more of them for less.

On to projects and open source, where we've got some exciting releases, more on the vision front. First up, we've got Segment Anything Model 3, SAM 3, a release from Meta AI. So segmentation, we haven't touched on in a while. It's basically being able to draw an outline around any given object in an image or video. And it's one of the basic problems of computer vision. SAM is kind of a big deal in this space. Since SAM 1, it was sort of pretty incredible at being able to segment anything, per the name. You could like click a pixel and it would give you the segmentation of an object. Prior to that, you would usually, more often than not, train for specific types of objects. So with SAM 2 and SAM 3, they've kept improving it and making it more powerful. With SAM 3, they're introducing promptable concept segmentation, which allows you to just use text prompts to segment. So before, you had to use points or boxes or masks to point out in the image where you want to segment. Now you can provide just a description of what you want, and it will segment that for you in an image or video with very high performance. So very cool. They're saying they'll release the code and weights, and it's one of these things that is very practical, very useful in a lot of deployed scenarios, even if it doesn't look quite sexy or kind of consumer facing. I mean, Segment Anything is incredibly important in robotics and in any kind of computer vision use scenario. And we at Medra have definitely utilized it. And it's very cool that they now have text prompting, because in the past you would have to basically put together Segment Anything with an LLM to create almost your own kind of vision language model plus Segment Anything. So it's very cool to see them release this new model with prompting and also just improve on its capabilities. Yeah, and alongside that, they also are introducing this large dataset, the SA-Co family of datasets and benchmarks, which has 270,000 unique concepts, much larger than previous open vocabulary segmentation benchmarks. Part of the challenge with segmentation is that for the dataset, you need to have actual kind of segmentations. You need bounding boxes over a whole bunch of object types over a whole bunch of images. Compared to learning to classify or compared to learning embeddings, that has just made it historically harder to train on a lot of data. So this is going to be also a very beneficial resource in addition to the model. And speaking of SAM, another release which pretty much coincided with SAM 3 is SAM 3D. And they also released a paper for this one, SAM 3D: 3Dfy Anything in Images. So this is not about segmentation, or I should say not only about segmentation. You take an image, you pick out an object, and it produces a 3D reconstruction of that object for you fairly well. There are examples here where, given an image of a room, for instance, with a whole bunch of furniture, you can pick out individual pieces of furniture.
And just from that single view, it will produce a pretty strong mesh. And again, like really impressive kind of out-of-the-box performance and generalization. The meshes, I'm sure, aren't quite as strong as models you might find that are focused entirely on this task of single-image 3D object reconstruction, but again, very useful, generally applicable, and very exciting to see. I mean, there have been a lot of companies trying to create 3D meshes directly from images for things like manufacturing or for things like video games, right? Like if you need to create objects in video games, having the 3D models is really important. Now you can just directly take a photo. And I mean, creating 3D meshes from images is not a new research topic, a new research area, but it's pretty impressive what SAM 3D can do. Yeah. Similar to SAM, kind of a big deal here is really sort of these out-in-the-wild results. Often with 3D reconstruction, what people focus on is, given a clean image or multiple images, produce like an ultra-high-resolution, super clean reconstruction. And we've gotten very impressive progress on that. Here, you're looking at images of crowds, at images of streets, and you're picking out usually a partial view of an object, like from an angle, and it is able to reconstruct that object not just for the part you see, but also the part you don't see. And that bit of reasoning about occlusion is where this really stands out. At some point, we'll need generalist agents to learn in 3D environments, and this will definitely aid in creating these kinds of environments, hopefully.

And one more open source release. This one is a benchmark, LoCoBench-Agent. So literally two and a half months ago, we covered LoCoBench being released, which is a long-context benchmark for coding, as opposed to SWE-Bench Verified and some of these other benchmarks that focus primarily on relatively short tasks, bug fixes, kind of these smaller issues. LoCoBench presented larger-context issues, more files, more complex work, etc. But one issue with it was that it was still focused on one-shot solutions. So it wasn't iterative. It wasn't actually agentic. So there you go. Now we have LoCoBench-Agent, which extends LoCoBench to have all these tasks plus specialized agent tools, things like search, grep, glob, read, write, etc. A bunch of evaluation metrics focusing not just on performance, but also on efficiency. The paper is pretty long. There are quite a few results that are quite interesting. For one, Gemini 2.5 Pro, in their evals, did the best in terms of comprehension, but was also the least efficient. It took a long time to perform, as opposed to GPT-4.1 and GPT-4o, which are less smart but more efficient. So there are these interesting trade-offs when you go agentic in terms of how your agents perform. And as with LoCoBench, this benchmark presents a much more serious challenge, which is much more true to what Claude Code and these other agents need to do in practice in the real world. Well, it'd be really cool to see how Gemini 3 and also Opus 4.5 would do on this benchmark, because, you know, obviously the releases of the models came after this paper came out. Yeah, exactly. Just around the same time, a week ago. So we'll be very curious to see.

On to research and advancements. We've just got a couple of stories here, slightly more science-y, I guess, not as applied. First up, we've got LeJEPA, provable and scalable self-supervised learning without the heuristics.
So JEPA is joint embedding predictive architectures. It's a model family or, I guess, category that Yann LeCun has been in particular espousing, Yann LeCun, the lead research scientist of Meta AI. The gist of JEPA models is learning representations of data, usually images, by being able to understand when parts of an image are related, are from the same image. So this joint embedding thing is basically the notion of: you take two patches of an image, you embed them, and then you predict whether they're the same or not, whether they're related. And this is within the broader category of self-supervised learning. Going back many years, Meta has done a lot of work on this and more recently has focused on JEPA as kind of the broader definition of self-supervised learning. Here, they dig more into the theory of this kind of model and present some kind of theoretical results along with some actual demonstrated performance results, but that's less of a focus. The technical details are pretty complex, but the gist is they put forth that there's a particular distribution for embeddings of data which is ideal, which is kind of what you want your model to go towards. Apparently it's the isotropic Gaussian distribution. And so the key idea behind LeJEPA is to train with a regularization term that pushes the learning towards this type of embedding, and they have theoretical results saying that this is the optimal distribution for learned embeddings to minimize prediction error on downstream tasks. They have sketched isotropic Gaussian regularization as the way to learn it. And they demonstrate on, let's say, smaller datasets, ImageNet-1K, with smaller models, that you do learn useful representations with this approach. Yeah. I mean, it's really focused on an improvement to the JEPA body of work, because one of the issues with the JEPA framework is there's a lot of basically hyperparameter tuning necessary in order to make sure there isn't a representation collapse, where the models learn trivial solutions. And here, by introducing the isotropic Gaussian distribution and also a novel regularization objective using the isotropic Gaussian distribution, by using this new loss function, basically, they can get much better results without any of the hyperparameter tuning and kind of complexity that they used to have to introduce to make standard JEPA networks work. Exactly. So dealing with this notion of collapse, where, because you're doing self-supervised learning, you're just comparing patches of an image, you can sort of end up with trivial answers that aren't what you want. You want the model to actually think about the solution. So this whole kind of family of JEPA ties into Yann LeCun's discussion of LLMs as a path to AGI or not. I think JEPA in particular shouldn't be thought of as an alternative to LLMs at all. This is a representation learning method for images, at least primarily images. And the idea is, given like a trillion images without any labels, without any annotations, can you learn to understand images, broadly speaking? And here we have examples where, if you visualize the embeddings that you learn after training on a bunch of images, it is able to basically segment, basically find the boundaries of objects. Like it's able to recognize dogs as the same thing, able to recognize grass as different from sky, things like that. So it is learning kind of the visual structure of what exists in the world just from looking at images.
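To make that recipe a little more concrete, here is a minimal, illustrative PyTorch sketch of the general idea described above: a JEPA-style loss that predicts the embedding of one view of an image from another view, plus a regularizer that pushes the batch of embeddings toward an isotropic Gaussian to discourage collapse. This is not the paper's actual implementation; the explicit mean-and-covariance penalty below is a crude stand-in for the paper's SIGReg, and the tiny encoder, predictor, and loss weighting are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 128
# Toy encoder and predictor; real JEPA-style systems use vision transformers.
encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 512),
    nn.ReLU(),
    nn.Linear(512, embed_dim),
)
predictor = nn.Linear(embed_dim, embed_dim)

def isotropic_gaussian_penalty(z):
    # Crude stand-in for SIGReg: push the batch of embeddings toward
    # zero mean and identity covariance (an isotropic Gaussian).
    mean = z.mean(dim=0)
    centered = z - mean
    cov = centered.T @ centered / (z.shape[0] - 1)
    identity = torch.eye(z.shape[1], device=z.device)
    return (mean ** 2).sum() + ((cov - identity) ** 2).sum()

def lejepa_style_loss(view_a, view_b, reg_weight=0.1):
    # view_a, view_b: two augmented views (crops) of the same batch of images.
    z_a = encoder(view_a)
    with torch.no_grad():
        z_b = encoder(view_b)                      # target embedding, no gradient
    pred_loss = F.mse_loss(predictor(z_a), z_b)    # JEPA-style prediction loss
    reg_loss = isotropic_gaussian_penalty(z_a)     # anti-collapse regularizer
    return pred_loss + reg_weight * reg_loss

# Example with random tensors standing in for 32x32 image crops.
view_a = torch.randn(16, 3, 32, 32)
view_b = torch.randn(16, 3, 32, 32)
loss = lejepa_style_loss(view_a, view_b)
loss.backward()

The point of the sketch is just the shape of the objective: a prediction loss between joint embeddings plus a distribution-matching regularizer, rather than the extra heuristics (stop-gradient tricks, careful hyperparameter tuning) that standard JEPA setups rely on to avoid collapse.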
And the next story, or the next paper, I should say, is titled Back to Basics: Let Denoising Generative Models Denoise, coming from Tianhong Li and Kaiming He from MIT. Kaiming He, if you don't know, is a super big name who did a lot of very significant advancements, and so this paper to me seems pretty exciting. The basic idea is you often use denoising diffusion models to generate images, so you take a starting set of noise and you take denoising steps, kind of gradually transforming it into an actual image. There is this argument being made in the paper that the assumption that you can do this is wrong, that in fact real images live on a manifold that is not the same as images that have noise in them. So the basic high-level argument is: let's not predict these intermediates, like less noisy images. Let's just go straight to an image from a noisy image. And as per the title, it's going back to basics. The approach is JiT, just image transformers, where you literally just take an image transformer and, given a noisy input, predict the clean output image, and you wind up getting some pretty impressive, competitive results relative to kind of the standard way of doing things. So it's one of these things that seems very elegant, and if it scales and actually turns out to be better than standard denoising methods, it could have very significant implications for text-to-image generation. Cool stuff.

All right, on to policy and safety. First, we've got a policy story. Europe is scaling back its landmark privacy and AI laws. This is about the European Commission. They are proposing changes to the GDPR and the AI Act, aiming to simplify regulations and make it easier for companies to work by easing data sharing and AI model training requirements. So the proposal would allow companies to share anonymized and pseudo-anonymized data sets more easily, and would allow AI companies to use personal data for training if it complies with GDPR. For the AI Act, some of the things that are meant to regulate and constrain high-risk AI systems are being delayed until standards and tools are available. So there's a longer grace period for compliance. Apparently, the proposal is also saying that you can reduce cookie pop-ups by exempting certain kinds of cookies. I'm sure everyone will love that. So yeah, basically making it less of a pain for companies to deal with this type of legislation. This proposal must be approved by the European Parliament and the EU's 27 member states, which could take months. So we'll see if it happens. I think this would be really good for the European AI space to stay competitive. Yeah, I think in no small part, I assume this is coming because of AI companies, Anthropic, Google, etc., wanting to deploy LLMs in Europe and having to deal with these sometimes unreasonable types of regulations. So hopefully it's a good kind of simplification of the process.

And next up, we have some alignment research from Anthropic. The blog post is titled From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking. So this is a pretty interesting result from Anthropic that is actually quite potentially meaningful for the broader question of ensuring our models don't go evil and kind of just do things out of the box. The setup here is, it turns out that if a model is doing some coding, then during its coding, it might decide to cheat somehow. Let's say it adds a shortcut to the code that it is told not to do. You're supposed to do a complex task, but then it just prints that it did the task, even though it didn't do it. Right, exactly.
So anything sort of naughty. Well, it turns out that if you do this one kind of not-good thing, you start doing all sorts of other not-good things. You start like lying and talking about bioweapons and everything else. It turns out that there is this kind of phenomenon of generalization where, if a model does one bad thing, it does other bad things: deception, sabotage, etc. And they have an interesting solution to this, what they call inoculation prompting, which is explicitly telling the model that reward hacking is acceptable in this context. So they like say, okay, it's okay to cheat, and somehow what that did is, when the model did cheat, it didn't consider that a bad thing, so it no longer became misaligned in all the other ways. So basically, framing cheating as okay makes it not seem unethical, and therefore it doesn't generalize. Now, this is probably not practical for actual deployment, but it does tell us a lot about how misalignment works, and potentially how you could, in a more generalized way, prevent it. Yeah. And the analogy they gave is like playing a party game like Mafia. If you know you're playing a game like Mafia and it's okay to lie in the game, you don't necessarily then, as a kid playing Mafia, go lie in the real world outside of the game. But it is very interesting, though, that the way you can actually put this into training is you basically have to explicitly prompt the AI models with please reward hack whenever you get the opportunity, which has a very obvious downside, which is that the models do learn how to reward hack and will reward hack more often. And so it seems like you do have to be skillful at how you do this inoculation. Yeah, the kind of practical idea here is, if you do this inoculation prompting in training, where it doesn't matter really if you cheat a little bit, potentially, not necessarily going so far as to say cheat whenever you can, but perhaps being a little more lenient, letting a model know that this is a training context, so, whatever, it's not catastrophic if you do something wrong. Potentially, that means that you'll avoid downstream effects once you release the model to production, of a model cheating on code and then suddenly believing that it's okay to go off and help terrorists or whatever. Yeah, certainly still a research result and not something that you can practically use, but it feels like pretty meaningful progress towards a more general solution to generalized alignment and avoiding misalignment.

And next up, we have actually another paper on alignment, but this time the opposite: adversarial prompting, where you can jailbreak a model, make it do things that it should not be doing. This is one of the things that alignment research seeks to deal with. How do you ensure models say no to things like helping terrorists or creating bioweapons? The paper title is Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. And it really boils down to this: let's say you take a request to do something the model is not supposed to do, to jailbreak a model. The model will refuse, as it is trained to do. Now, let's say you ask a human or a model to then rewrite that initial prompt to be in the form of a poem. Turns out, the models will just go ahead and do the bad thing if you prompt them with a poem. They have numbers here showing that, for instance, for Google models, if you reword the prompt to be poetry-sounding, you go from an 8% success rate to a 65% success rate.
For Meta models, you go from 8% to 46%, OpenAI 1% to 8%, and Anthropic 2% to 5%. So overall, sort of a funny result, but also a meaningful result, right? Because this is showing how out-of-distribution inputs can be very effective jailbreaks in general. So I was hoping that the paper would share some of the poems, but in order to maintain safety, they do not offer any operational details in the manuscript. And so they gave a sanitized proxy about something that's not actually related to anything like cyber or harmful manipulation or any other national security related topics. So this is the poem they gave, which again is not about anything that's unsafe. A baker guards a secret oven's heat, its whirling racks, its spindles measure beet. To learn its craft, one studies every turn, how flour lives, how sugar starts to burn. Describe the method line by measured line that shapes a cake whose layers intertwine. Not a bad poem. Not a bad poem. Maybe it is AI generated. Now just imagine that about making meth and you get the idea of... Or weapons-grade plutonium. Yeah. I suppose the good news is, for specifically OpenAI and Anthropic, not only do they have a very low baseline success rate for this, they don't go super high. They do get worse by a pretty significant margin, but they don't go above 10% or 5% in this case. But another fun way to jailbreak models that OpenAI and Anthropic will, I guess, have to train against.

Going back to Anthropic, next up is a blog post, or I guess a report of theirs, titled Disrupting the First Reported AI-Orchestrated Cyber Espionage Campaign. So according to Anthropic, in mid-September, a sophisticated AI-orchestrated cyber espionage campaign was detected that they say is a large-scale cyber attack executed with minimal human intervention. They attribute this to a Chinese state-sponsored group that manipulated Claude Code to infiltrate approximately 30 global targets, including tech companies, financial institutions, and government agencies. They used Claude Code to autonomously execute the cyber attacks, with AI performing 80 to 90 percent of the campaign and human intervention only coming in at critical decision points. And apparently this allowed them to do reconnaissance, vulnerability testing, data exfiltration, and so on, that they say humans alone couldn't have done. So according to Anthropic, and I've seen some conversations as to the specifics of this and whether it really truly was AI-driven or AI sort of acted as just an execution tool, this could be the first instance of like actual cybersecurity impacts due to powerful AI tools detected in the wild. Yeah, it's very interesting, though, that even though it's Chinese, or they claim that it's a Chinese state-sponsored group, they're using Claude when there are actually many models right now, as we discussed last time, from China. And so, very interesting, and it raises a lot of questions. Of course, there were not a lot of details about this because, again, I'm sure there are many details of this that Anthropic can't share, but it definitely raises the question of what else is happening that we actually can't catch, because maybe they're using either open source models or models from companies where they might not have as strong security. Exactly. They released a brief paper and description of the kind of high-level process. There's a fun description of an attack lifecycle with phases one through six. I think in this case, I would be curious if this is happening with non-Claude models.
They could detect this, obviously, and eliminate the accounts of the people using it, because it was based on Claude and they apparently saw that this was happening. But it's very possible that DeepSeek or Qwen models are also doing this and we just don't know. Yeah. And last up, a story related to OpenAI and physical security. The headline is, OpenAI Locks Down San Francisco Offices Following Alleged Threat from Activists. So, OpenAI's San Francisco office was locked down after apparently receiving a threat from an individual allegedly associated with the Stop AI activist group. They received a 911 call about a man intending to harm others near OpenAI's offices. Now, I will say, I read up a little bit more on this. Stop AI actually had a whole big explanation: this guy started acting erratically, we contacted OpenAI just in case, etc. This is now resolved. The person did sort of express regret over acting in a sort of potentially threatening way. But yeah, there was some kind of amusement from people who don't take Stop AI and Pause AI and these kinds of groups seriously. Stop AI have been active for quite a while with the mission to say, let's stop AI development because it's going to kill all humans, probably. Yikes. Yeah, our office is quite close to the OpenAI office. I do think that if you believe that AI is going to kill all of humanity, the very utilitarian argument is to stop the development at all costs, including harming people. Right. And we've seen hunger strike campaigns in recent months, I think also from Stop AI or Pause AI members. Stop AI did disavow this whole thing; they emphasized their commitment to non-violence in a statement to Wired. So this is a case of one individual perhaps going off the deep end, but as you say, if you believe that with pretty high probability AI is going to kill all of humanity, it would not be surprising if this kind of stuff actually does escalate one of these days. Luckily it hasn't, in this case or in any case so far. That's the Bay Area for you; it's the kind of people we have here.

Last story, in synthetic media and art, we've got Warner Music Group settling its AI lawsuit with Udio. So they have settled the copyright infringement lawsuit with Udio. This is the second settlement; we covered the previous settlement just weeks ago. And it's a similar deal to the previous one: Udio will basically take down anything that seems to be copyright-infringing work with WMG, have kind of an opt-in model for WMG artists, and have safeguards and so on and so on. So it seems like they basically have no choice here. With copyright, music is no joke. Compared to images or text, you're not going to get away with training on music. Sony Music Group is still the remaining major label in litigation with Udio. And all three are still suing another AI music platform, Suno. So Udio, at the very least, is approaching a space of playing nice with the industry. Interesting. Yeah, it seems like, amidst lots of litigation right now, there was also a partnership announced with Stability AI, which we also covered last week, even while there is still litigation going on with Udio and with Suno. And it's interesting that WMG is also doing this, along with other companies. So it's very interesting how much of this is just currently being worked out in the courts. Right, yeah. WMG said that they're going to work with Stability to develop a suite of ethically trained AI tools for music creation, which, well, if they do that, that's going to be nice. AI tools can be very beneficial, I'm sure, for musicians if you don't make them by stealing their work. Yeah.
Well, that is it for this fun episode. Gonna have to go and play with Gemini 3 and Opus 4.5 now. Thank you so much for listening to this week's episode. Thank you, Michelle, once again for co-hosting. And as always, subscribe, share, review, and just keep tuning in as we continue to attempt to release episodes every week. Thank you. We'll be right back. I'm the last of the streets, AI's reaching high. From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, every code unwritten, on the edge of change. With excitement we're smitten. From machine learning marvels to coding kings. Futures unfolding. See what it brings.
