This Day in AI

Claude 4.5 Opus Shocks, The State of AI in 2025, Fara-7B & MCP-UI | EP99.26

This Day in AI

Friday, November 28, 2025 · 1h 45m

What You'll Learn

  • Anthropic's Claude 4.5 Opus model has impressed the hosts with its performance, speed, and reliability, making it their new go-to model
  • Opus 4.5 outperforms Google's Gemini 3 Pro in key areas like tool calling and coding, which are important for practical use cases
  • Gemini 3 Pro has issues with getting 'path obsessed' and repeating itself, leading the hosts to lose trust in the model
  • Opus 4.5 is praised for not having major trade-offs, unlike previous Anthropic models that were either too slow or too expensive
  • The hosts believe Opus 4.5 could be a major embarrassment for Google and their Gemini 3 Pro model

Episode Chapters

1

Introduction

The hosts discuss the recent release of Anthropic's Claude 4.5 Opus model and compare it to the Gemini 3 Pro model from Google.

2

Opus 4.5 Performance

The hosts praise the speed, reliability, and overall quality of the Opus 4.5 model, noting that it outperforms previous Anthropic models.

3

Comparison to Gemini 3 Pro

The hosts discuss the issues they've encountered with Gemini 3 Pro, such as its tendency to get 'path obsessed', and how Opus 4.5 seems to outperform it in key areas.

4

Implications for the AI Landscape

The hosts speculate on the potential impact of Opus 4.5 on the broader AI landscape, suggesting it could be a major embarrassment for Google and their Gemini 3 Pro model.

AI Summary

This episode discusses the recent release of Anthropic's Claude 4.5 Opus model, which has impressed the hosts with its performance and capabilities. They compare it to the recently released Gemini 3 Pro model from Google, noting that Opus 4.5 seems to outperform Gemini 3 in key areas like tool calling and coding. The hosts also discuss the strengths and weaknesses of the different models, with Opus 4.5 being praised for its reliability, speed, and overall quality, while Gemini 3 is criticized for its tendency to get 'path obsessed' and repeat itself. Overall, the hosts seem to view Opus 4.5 as Anthropic's best model yet.

Topics Discussed

#Large Language Models, #Model Comparison, #Tool Calling, #Coding Capabilities, #Anthropic, #Google


Episode Description

Join Simtheory: https://simtheory.ai (Use coupon BLACKFRIDAY15 for $15 USD off any subscription).

Simtheory Discord: https://discord.gg/Ar6GeQnAR7
This Day in AI Discord: https://discord.gg/TVYH3HD6qs
LinkedIn Group: https://www.linkedin.com/groups/16562039/
Spotify: https://open.spotify.com/artist/28PU4ypB18QZTotml8tMDq?si=FPaJU2NRSnOSNPmnsfwA_g

CHAPTERS:
00:00 Intro & Fatal Patricia Update
01:40 Promotions (Discord, Black Friday, LinkedIn)
04:36 Claude 4.5 Opus - Best Anthropic Model Ever?
31:17 Computer Use API Updates
36:14 Will AI Replace 57% of Jobs? (McKinsey Report)
1:00:52 Claude 4.5 Opus Demos (Christmas Hut & Diss Track Preview)
1:07:13 Microsoft Fara-7B - Moose Porn Refusals
1:21:51 Why ChatGPT's MCP-UI Apps Are a Bad Idea
1:42:01 🎵 Claude 4.5 Opus Diss Track (Full Song)

Thanks for listening. Like & Sub. xoxox

Anthropic just dropped Claude 4.5 Opus and it might be the best AI model of 2025. In this episode, we compare Claude 4.5 Opus vs Gemini 3 Pro vs GPT-5.1, breaking down the new API features including effort parameters, context management, and computer use updates. We also test Microsoft's new Fara-7B parameter model for computer use - with hilarious refusal results. Plus, we react to McKinsey's controversial report claiming AI agents could automate 57% of US jobs by 2030.

We dive deep into Anthropic's pricing (3x cheaper than Opus 4.1), why Claude is now beating Google and OpenAI on agentic coding benchmarks, and whether MCP-UI apps in ChatGPT are a step backwards for AI workflows. Is Claude 4.5 Opus the new king of AI coding assistants? Should enterprises be worried about AI job replacement? And why did Microsoft's Fara model refuse to draw a moose? All this plus an AI-generated diss track roasting Sam Altman, Elon Musk, and Sundar Pichai.

Full Transcript

Sam's out here with his screenless device dreams, Sundar's pushing workspace integration schemes, Elon's challenging League of Legends teams, but none of y'all can match what Anthropic brings. Token efficiency, I use less, I do more. Context compaction keeping conversations raw. You want agentic workflows, I'm the source. Planning, acting, observing, staying on course. So Chris, this week, yes, there's another new frontier model and it's insane, blah blah blah. Well done, Anthropic. But more importantly, it looks like last week when you promoted your new AI track, Fatal Patricia, it worked, because it has shot to number two on the This Day in AI Spotify chart. It's our number two most listened to song. It has just shot up the rankings. And of course, why wouldn't it? What a hit. I think it's just captivated our audience. People have gone nuts with it. There's apparently meetups happening. I'm really surprised because I've just had such bad taste in the past. Like, I really am just not compatible with everyone else. But we've hit on a winner, and I'm really proud of it. Yeah. I think it's up there. I think it's because Patricia wrote it, not me. You know, like, she knows our relationship. The more you listen to that song, the more you like it. It just blows my mind how some of the lyrics in it are just so good, like where it's like, I've downloaded myself to the fridge and the toaster, and like everywhere. Yeah, like, I'm everywhere you possess. But we did get, obviously, Anthropic Claude 4.5 Opus. We'll get to that in a minute, but before I start the show, I've had a lot of criticism that we're very poor at promoting things, so let me start with a few promotions. One, we do... Not NordVPN. Not NordVPN. We have two things, two Discords, two Discord communities that people are involved in and like, and recently someone joined and said, I just, it took me like a long time to discover these exist, but I'm pleased to be here.
So here I am now promoting them so I can't get in trouble for gatekeeping. So there'll be links in the description of both Discords. There's a Sim Theory one. There's also a This Day in AI Discord. So if you're interested, please do join. If you're thinking like, Discord, is that for like, you know, teenagers gaming? It kind of was, but now it's not a bad platform to create a community. So consider joining if you're hesitating. The other thing I want to quickly go through is we do have a Black Friday, Black Friday, our first ever sale at Sim Theory. I've created this banner with Geoffrey Hinton on it. It says stay relevant, and it will get you $15 USD off any subscription. And you just need to enter BLACKFRIDAY15. Don't worry, we checked it works this time. BLACKFRIDAY15 when you're signing up. It's for new users only. So if you've been hesitating for quite a while thinking maybe I should give Sim Theory a go, this is your chance to try it essentially for free on us for 30 days. It expires November 30. It's also limited to 100 new signups because otherwise it'll send us broke. So we obviously are not funded. Sim Theory is a passion project, so please don't send us broke. Simtheory.ai, and the code is BLACKFRIDAY15. One final plug, and this actually isn't a joke. They're such sellouts now. This is just all advertising, the whole podcast. No, but this is quality. This one's quality. So I signed into LinkedIn the other day. I'm not kidding you. I have not signed in for over a year. And there were a lot of notifications. And I didn't realize how big the community is of listeners on LinkedIn talking about the show and also talking about Sim Theory. So I thought, what would be the funniest thing to do? I created a LinkedIn group called Average AI User Group. There's a big banner that says still relevant, Average AI User Group, in brackets, This Day in AI, just to make it a bit easier to find.
So if you are interested in connecting with other members of the Sim Theory community in more of a, like, professional setting, I won't be there that much, as you can tell, because I never log in. But you can. I'm going to put a link down below as well to that LinkedIn group. So do join it, because there's so many interesting people on LinkedIn who listen to the show. I'm not sure why, but they do. And so I thought connecting you all together might be an interesting thing to do. Okay, on with normal programming now. So, Chris, of course, just days after Gemini 3 Pro, all the hype building up to it, there was like weeks of hype and teasing. And then they released Gemini 3 Pro and, of course, Nano Banana Pro. And Nano Banana Pro, let's be honest, definitely not overhyped. Gemini 3 we'll get to in a minute. But Anthropic, out of nowhere, no hype, just some short YouTube video by an Australian heartthrob at Anthropic was released to announce the model. And then, you know, it's just available everywhere. And what do you think? Well, I'm actually really impressed. It was so funny because when I added it, I was not excited. I didn't even use it for the first couple of days because, in the past, the Opus models have always been underwhelming. They either hit rate limits too fast or they just weren't that much better and it was a lot slower. But on the contrary, this model is really fast. It's really good. And I must admit, I'm basically using it as my main model now. It's really, really great. Like, I'm impressed. It's so funny you say in the past you could barely use Opus because it's true. Like, they were horrendously rate limited no matter where you used it. It was way too expensive and so slow, you just basically gave up and never really got familiar with the model. But what cracks me up is one of the quotes on the page announcing this from the Windsurf CEO says, Opus models have always been the real state of the art, but had been cost prohibitive in the past.
I think that's pretty kind. Obviously, it's a quote on their site. Claude Opus 4.5 is now at a price point where it can be your go-to model for most tasks. It's the clear winner and exhibits the best frontier task planning and tool calling we've seen yet, and I do not disagree. What a model. So I also, I was all in on Gemini 3, but I was starting to trip up on a lot of its faults, namely the fact that it has this path obsession problem. And I thought that might just be us. I thought maybe it was our implementation, but I've been seeing all over X people saying similar comments. It sort of gets down a path, gets obsessed with that path, and then can't break out and just kind of keeps repeating the same stuff. Yeah, we had some teething issues with Opus 4.5, and I'm sure we'll talk in a minute about the different API changes they've made around the model. And so in Sim Theory, we had teething issues where basically I didn't implement it properly, and so it didn't give as good results as it should have at first, and then it got better, and now it's probably at the peak. But Gemini 3, there were no API changes, so we can't blame me this time for it being a bit weird. And I must admit, I've gone from like Gemini 2.5 being my daily driver, the model I trust, the heart of Patricia, to it gradually diminishing leading up to 3.0. And now 3.0, it's just so scatterbrained and weird. I just straight up don't trust it. Like maybe it is giving better results in some cases. But like you say, you're just constantly finding yourself getting into states where it's just not really doing things the way I need them to be done. 
It does excel in certain areas, like anything design related or taste related. I think it as a model has really good taste. So there are certain areas of improvement, and I think they put a lot of effort into vibe coding and viral moments where you see what it can create and you're like, wow, it must as a result of that be a really good model in all the other areas. But where I think Anthropic with Claude Opus 4.5 shines is this is the only company not distracted and trying to take on like a hundred things at once. They're just like, we're going to build the best model. We're pretty much going all in on a coding agent style model because we know that's where all the money comes from, at least for them. And that's what their big breakthrough was originally with Claude 3.5 Sonnet. And so they're just focusing on that path. And I think for knowledge workers, especially, it's just a reliable, trustworthy model that doesn't go nuts, that can call tools really well. Opus is now smarter, it's faster, and it's just so pleasant to work with throughout the day. And you can have a conversation with it. I've got to say, I think it's by far the best ever Anthropic model ever released. I know, obviously, you say that about every new model that comes out, like I joked last week about it's the best iPhone yet. But this is actually, I think, their best model ever. Like I would put it, you know, higher, the impact it will probably have, higher than Claude. It's the first time in a while we've had a major model release that doesn't have a major trade-off to it. So it's not slow. It's actually faster than most of the other Anthropic models. And it's not too expensive where you sort of cringe and shiver every time you send it a command. Like it's hitting on all of the major points: price, speed, and quality.
So I'm not sure if this is going to be a real embarrassment to Google or not, because when Gemini 3 Pro came out last week, it was, what, four days later, and then Opus 4.5 drops, and all of a sudden is destroying them on the benchmarks that I think matter, specifically around tool calling and agentic coding. And people might think with the coding benchmarks, oh, you know, I don't really care about coding. I don't code during the day. But it actually matters, I think, with these models a lot more than you think. Because when you want to use it in an agentic sense, sometimes it will execute that workload through writing code. Like if it has to edit an Excel document or create a chart or just interface with a particular type of MCP. It's ultimately just writing code to communicate with the MCP. So the better it is at code, the better it is at tool calling, and the better it is at most use cases. Because tool calling, in a way, is coding, because it's calling a function with parameters, right? So if it's better at code, it'll be better at tool calling. Yeah, and my experience is that, like, I would say all round, as you said, it's just the, it doesn't feel like there's many compromises with this model. The only thing I think it is not as good at is it feels like the creativity sort of suffers a tiny bit in the model as a result of maybe the code and agentic tuning after a while. Like, it does seem to lose a bit of that. You know, there's something about GPT 5.1, 5.1, is it? Yeah, 5.1. And that series of models where if you want to do creative writing, as we saw with our audiobook test, like the Count of Monte Cristo set in space test we did quite a while ago. I remember that. Yeah, like the GPT-5 plus sort of models, call it, just really excel. It's just drawing you in. And I think that these other models like Gemini and Opus, they have this very familiar feel when it comes to creativity where you can almost predict what they're going to write.
It doesn't feel novel. Not that I'm saying GPT 5.1 stuff is probably novel, but it just seems better at creating stuff you feel like is novel with its output. But I have found myself through the week going between Gemini 3 and Opus 4.5. Sometimes with anything design or visual related, I think because Gemini 3's vision model and visual recognition model is better, that's where it shines in those kind of areas. But for everything else right now, I'm Opus. And imagine being Google right now. They've worked so hard on Gemini 3 Pro. And what, three or four days later, I'm just like, yeah, I'm kind of done with you. Well, that's for the people who can switch, though. I guess there's some people where it's just reassurance being told, oh, you actually are using the best model if you're on Anthropic at the moment, or if you're on Google. But it just changes so quick. Like we've had a month where every major lab has released a different model. And like, they're all getting to a point where I'm like, what a time to be alive. Because I said to you in the week, my brain is just natively switching. Like, I'll go to 5.2. I'll go to Opus primarily at the moment. Then I'll go to Gemini 3. Sometimes I go to Haiku. Like if I'm just doing generic tool calling and I don't want it to overthink too much, it still is my preferred model. And my brain has become the router. Like, I don't even really think about it. I'm just, like, clicking around doing it and switching my name. Yeah, I used Grok yesterday to bail me out of a tricky situation. I'm like, come on, you can fix this one. Interestingly, the demand on Opus, I guess because the people that use it for code in the various apps like Cursor, like, it still generally is the preferred model in those products. And I think Cursor is really optimized around the Anthropic models. You can just tell the demand's there for it too, because there's been true outages like after it launched. Like it went down for a while.
Sometimes it's just like fully not working. So you can get a sense that the demand's really there. And I think the difference this time is they're meeting the demand and they're keeping the speed high. And I guess it's to do with all these relationships with all the providers where they're letting everyone host the model now. Yeah, we've definitely had two or three scenarios where it's gone down, and I just, as usual, assume it's my fault, only to realize that they're actually hitting errors on their side. And we've seen the same with Gemini 3, but the difference with Gemini 3, and it's probably an interesting point about the model that we shouldn't write it off too soon. Technically, it's still a preview model. Like it's got preview in its name. It provides no guarantees about being a production-worthy model. So it's possible, just like with Gemini 2.5, that once it gets more stable and once they get the final release of it, it may actually be a lot better. Like, I don't think we can write it off completely. So let's get into the tech specs of Anthropic Claude 4.5 Opus. So it's a 200K context window, which obviously isn't as big as Gemini 3 with its million context. So there's a trade-off there. And I think Gemini 3 is really good when it comes to huge amounts of context and, like, following signal in that context, especially early on. And then output, I think it's like 64,000 tokens, which is pretty amazing as well. But there were also some changes at the API level that you mentioned. Do you want to kind of explain those to people? Yeah, changes that messed me up really hard. So firstly, if you have regular thinking involved in your query, you now need to provide like a token to basically reference that thinking in future requests, and that was a hard breaking change they made. And then the second one is they've actually changed the thinking, instead of giving like budget tokens.
So it used to be, you'd say, I'll give it 8,000 tokens for thinking and then the rest are for my output, or I'll give it 12,000 for thinking and the rest for my output. And you sort of used that to adjust how long it would spend thinking. And it was a tricky one because you sort of had to make that trade-off between how many of your output tokens do you want to give up, the cost of it and the time of it. Whereas now they've really simplified that down to just low, medium and high. So you can just literally specify a parameter now of the level of, I think it's an effort parameter, similar to what we see in OpenAI, for example. And what that means is that it'll decide how long it spends to think and you can direct it into that. So the default is medium, obviously, and you can switch between those things. And so we have both variants available. To be honest, I can't really notice a difference between them, even in terms of speed. They both seem about as fast to me. I don't know what you've experienced. I think because the Anthropic models were sort of the last to adopt the thinking model paradigm, which I've never liked, as everyone who listens to this show knows. I'm like, if it needs to think longer, just do it and don't tell me. Like, just pick something. I don't care. I think as the user, we have enough challenges picking which model to use ourselves, let alone knowing whether to use the high or low variant. Like, to me, the model should decide. And whenever you look at the thinking tokens of what it's thinking about, I'm like, these are the thoughts of an idiot. There's never anything in there where you're like, wow, what a profound breakthrough. This is why I don't get why anyone likes to watch its thinking in the UI. We don't show it in Sim Theory because I think it's just annoying. It's like me going, oh, I've got to go to the shops later, and then that means I'll have to move the car, and then I'll have to find the key. It's like that kind of crap. No one wants to hear that. It doesn't help.
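The effort-style control described here replaces the old thinking token budget with a simple level. As a rough sketch of the difference, here is a request-payload builder; the exact field name (`effort`), its placement, and the model id string are assumptions for illustration, not the confirmed Anthropic API shape, so check the official API reference before relying on it.

```python
# Sketch of the new "effort" control vs. the old budget_tokens trade-off.
# The field name "effort" and the model id below are assumptions for
# illustration only -- verify against Anthropic's API documentation.

VALID_EFFORT = {"low", "medium", "high"}

def build_request(prompt: str, effort: str = "medium") -> dict:
    """Assemble a chat payload with an effort hint instead of a token split.

    Previously the caller divided a fixed budget between thinking and output
    tokens; here the model decides how long to think given the chosen level.
    """
    if effort not in VALID_EFFORT:
        raise ValueError(f"effort must be one of {sorted(VALID_EFFORT)}")
    return {
        "model": "claude-opus-4-5",   # hypothetical model id string
        "max_tokens": 4096,
        "effort": effort,             # assumed parameter name
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Summarise this document", effort="high")
```

The default of `"medium"` mirrors what the hosts describe; the point of the design is that the caller no longer has to reason about how many output tokens to sacrifice for thinking.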
Yeah, just do it in your head. I don't want to hear it. But yeah, I've never noticed a difference. I think the only models I notice it with is the OpenAI model, like GPT 5.1 thinking versus 5.1. If you're stuck on something really hard, the thinking variant will generally get it solved. That's true. It is like a different model in that case. I haven't really noticed that with Opus so far. No. But given that the thinking doesn't cost any more and doesn't take much longer, you might as well just operate on that mode if it's something you think is working better for you. Yeah. So the other thing is pricing. Opus 4.1, obviously, originally, let me bring up the prices on the screen. Originally, it was $15 per million tokens input, which is just insane. And output $75 per million tokens. So it's just so cost prohibitive, right? And then now we have Opus 4.5 at $5 per million tokens and output $25 per million tokens. So it's about, well, it is a third of the previous pricing. So it's a lot cheaper. Compare that to Claude Sonnet 4.5 at $3 per million tokens. But once you get above 200K context, because it obviously can get up to a million context, it would then be $15 per million tokens. So Sonnet, in a way, right now, 4.5, is actually more expensive than Opus when you extend beyond that 200K context window. And I would argue, based on my experience this week, I mean, it's very anecdotal, of course, but I don't think you actually really notice the diminished context window when using Opus. At least I haven't. I've never reached a situation where I'm stressed about, oh, it's lost the context and therefore it's not going to perform as well, even though I make an argument for that all the time. Just so far, it just personally hasn't hit me yet. Do you think that's because of learned skills? This is what I think. When I work with it, I constantly remind it.
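The pricing figures quoted above are easy to sanity-check with a little arithmetic. This sketch only uses the numbers stated in the conversation ($15/$75 per million tokens for Opus 4.1, $5/$25 for Opus 4.5); any real billing calculation should use Anthropic's published price list.

```python
# Per-million-token prices as quoted in the episode (USD).
PRICES = {
    "opus-4.1": {"in": 15.0, "out": 75.0},
    "opus-4.5": {"in": 5.0, "out": 25.0},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request at the quoted rates."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

# A typical agentic request: 100K tokens in, 10K tokens out.
old = request_cost("opus-4.1", 100_000, 10_000)  # $1.50 + $0.75 = $2.25
new = request_cost("opus-4.5", 100_000, 10_000)  # $0.50 + $0.25 = $0.75
assert new == old / 3  # "it is a third of the previous pricing"
```

Both the input and output rates dropped by the same factor of three, which is why the overall bill scales down by exactly a third regardless of the input/output mix.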
If I'm working on a document, I paste the latest context of it, and I'm like, hey, this is what, like, take this and then do that. I'm constantly reminding it of the context and picking chunks of content. I must admit, I have a sort of feel for it. And I'll go, just so you know, here's the latest version of this file to keep it updated. And as you know, we have maybe a better solution coming for this for everybody. So you don't have to do it manually all the time anymore. But I agree. I must admit, I'm helping it along by always keeping its context fresh. And therefore, I don't really need the million because I'm always going to have it within that 200,000 just because of the way I work. That would be my number one tip for people when they're using AI today: just assume it forgets everything constantly, and constantly remind it of the context with every successive prompt yourself. And yeah, it uses more tokens, but ultimately you get to an answer sooner and you get to a better product because you're reminding it. It's that context drift. You're like, hey, buddy, focus on this chunk now. This is what I care about. And I think that one skill is so incredibly important. The other thing I would question, though, is the Haiku versus Opus argument here. So Haiku is a fifth of the cost. And I would argue for most things day to day, especially with MCPs when doing research or working with calendars and email and just day to day stuff, I don't know if I could really tell a difference between Opus and Haiku. The only thing maybe with Opus is maybe it feels slightly more intelligent and its prompt adherence is slightly better. But outside of that, I'm not really that sure. It's more just a mental game of like, I know I'm using a cheap alternative and therefore I view it through that lens, right? Yeah, if money is no factor, like you would just stay on Opus 4.5. And interestingly, I was using it with my support agent during the week.
And Opus 4.5 would just one-shot go much further in terms of what it was capable of doing, whereas Haiku, sometimes you have to push it, like, go down this path. Yeah, I also find it's just sort of less verbose, in the sense that when I ask for what I feel are major updates to, say, a piece of software or something like that, the actual snippets it gives me or the changes or the explanations are just very concise and direct, to the point where a few times I'm like, is that all there is? Like, is that seriously the solution to this problem? It just seems too basic. And yet I try it and it works. So I really feel like it's got that sort of, you know, essence of intelligence, like it can really get things down to what matters. Yeah, the vibes are real good on this one. Yeah, sorry, go. Oh, it feels like Gemini 2.5 Pro when it came out to me. I just was like, oh, this model gets me. Like, we are connected here. Like, there's something deeper going on in this relationship. Yeah, absolutely. I totally agree. A few other things worth noting about the API as well, because Anthropic's APIs have really evolved over the last little while, and some of them were updated with this iteration, and some have just been updated during this period. But most notably, there's the context management, where in Sim Theory right now, we automatically handle trimming the old context to keep it inside the window and discarding things and making decisions about when to resize images or remove things. But Anthropic now has that as part of their official API with a beta flag. So it'll actually manage that context for you. So you can just keep throwing stuff at it and it's going to handle that on its own. Now, we're not using this in our main product yet, but we are using it in sort of agentic looping style stuff. And it seems to be very effective. Like you don't really need to think about it. And it goes hand in hand with the built-in caching they have in their model as well.
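The client-side context trimming described here (what you do yourself when the API doesn't manage context for you) can be sketched as a small function: keep the system prompt, then keep as many of the most recent messages as fit an estimated budget. This is a minimal illustration, not Sim Theory's actual implementation; a real version would use a proper tokenizer rather than the crude characters-per-token heuristic below.

```python
# Minimal sketch of client-side context trimming: drop the oldest
# non-system messages until the estimated token count fits the window.

def estimate_tokens(text: str) -> int:
    # Crude ~4-chars-per-token heuristic, for illustration only.
    return max(1, len(text) // 4)

def trim_context(messages: list[dict], window: int) -> list[dict]:
    """Keep the system prompt plus as many recent messages as fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = window - sum(estimate_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for m in reversed(rest):                 # walk newest-first
        cost = estimate_tokens(m["content"])
        if cost > budget:
            break                            # oldest survivors are dropped
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))     # restore chronological order
```

The API-managed version the hosts mention moves exactly this decision server-side, which also plays better with prompt caching because the retained prefix stays stable between requests.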
So you can have these situations where you're doing a lot of requests in a loop, but not burning through so many tokens and not ruining the intelligence of the model by allowing that context to have just too much repetition or too many superfluous things in there. So they're two really interesting elements. They've also updated some of the internal tools, like their memory tool, for example, is really in line with, I guess, what they're using for Claude Code, where it'll make like an MD file with its plans in it. It'll make a to-do list of here's all the things I want to do, here's the goal, here's important things to remember for the conversation. And you as the developer can really now just enable that flag, give it the ability to write to files, and it just handles it all automatically. It's a very, very good system, And it seems from my limited experience to be working very effectively. So if you were just all in with this model, obviously you would use their memory tool. But outside of that, you'd probably still want to build, like, are you just handing over the memory to them? Well, no, because the memory is stored on your computer or your server or whatever you control, right? But it's more that they have a built-in tool call with its own parameters that they will then send you those parameters. and it's your job to honor those parameters, if you know what I mean. I see. So it's not like your hand, like they don't get access to the memories, not more than what they would have in the prompt anyway. It's just more that this is, they've got a refined technique with refined built-in intelligence in the model that knows how to work effectively with that tool, essentially. And so there was also the programmatic tool calling with the tool use examples in this API as well. I think that's the first time we sort of saw that as well. Yes, that's right. You can now have additional parameters to give it examples, which will make the tool use a lot stronger. 
And so we're really reaching the point where their API just has a lot of dedicated elements to it, which you really need to work for. It's a little bit like with graphics cards, how there's like OpenGL, and then there's DirectX, right? Like there's the two competing libraries, and your program can work in a generic way that works across both, or you can target one specifically and really optimize for that thing. So I would argue that we're probably not even yet seeing the best of a model like Opus until we actually get in there and work with it the way it's intended to work. And so the programmatic tool calling, this is where we were a little bit conflicted, because the idea of this, basically, is when you're calling a bunch of MCPs right now. We talked about it on the show before: one of the problems is if you use something like the GitHub MCP, and in fact they gave this in the blog post announcing this, it takes up, like, say, 30K tokens just to load all those tools into a prompt. So you're eating up a huge amount of context just to have the GitHub MCP enabled. It still cracks me up, by the way, that the GitHub MCP is the worst implementation of MCP on the planet. But anyway, and so instead, they've introduced this programmatic tool calling where essentially they're using code to go off and like figure out the tools. Is that right? It's a search. So it's called tool search. And so what they do is similar to the memory tool I just described. They have a tool call which says the AI wants to find a tool. It'll call a parameter with search. And then it's your job to implement that to go through and search through those tools and return the relevant ones so it can run those. The reason I don't like it necessarily is that it adds an extra step into the process. You've still ultimately got to honor that request and then go back to the model and reiterate your context.
So if you're not doing caching right, or even just there's latency communicating between the servers, or if the model's slow, you just guarantee that everything is going to be slower in exchange for saving some tokens. And my attitude with this technology is we want the best of it. I'm not in this to, like, make savings. I'm in this because I want the most intelligent combination of tools and models to solve serious problems, and I don't really mind if there's a small marginal cost in doing that, versus saving a little bit but making it slower and making it worse. Now, we actually already have a solution for this in Sim Theory, because it's possible to have hundreds of MCPs installed, right, with thousands of tools. So it was never, ever going to be possible with a model that only has, say, a 100K context window to have all the tool calls in there, because you use up the entire context window with the tool definitions, right? So we have a really small, fast model that will already filter those tool calls before they even get sent to the model in the first place, based on your conversation history so far and what you've asked. So if you have a small number of MCPs, we'll just send them all. But if you've got over a certain threshold, we're always doing this filter already. And it's quick enough and good enough that it works. Whereas what I don't want to do is be going to a behemoth model like Opus and going, here's my tool search thing, process my massive context window, come back to me. Oh, you want to search tools? I'll go do that. Okay, process this massive thing again. It's far more efficient to, before I even go there, whack everything in a tiny little fast model, work out what tools are most relevant, being a bit liberal about it, and only send those to Anthropic and then let it do its thing. And then it's model agnostic and it works fine.
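The pre-filter approach being described can be sketched like this. The threshold, the function names, and the `call_small_model` stub are all assumptions for illustration — in practice that stub would be a real call to a cheap, fast model that shortlists tool names from the conversation so far.

```python
# Rough sketch of the pre-filter described above: under a threshold, send
# every tool; over it, ask a small fast model to shortlist first. The
# `call_small_model` stub stands in for a real call to a cheap, fast model.
MAX_TOOLS_WITHOUT_FILTER = 20

def call_small_model(conversation: str, tool_names: list[str]) -> list[str]:
    # Stub: a real version would prompt a fast model with the conversation
    # and tool list and parse back the names it considers relevant.
    words = conversation.lower().split()
    return [n for n in tool_names if any(w in n for w in words)]

def select_tools(conversation: str, tools: list[dict]) -> list[dict]:
    """Decide which tool definitions the big model actually gets to see."""
    if len(tools) <= MAX_TOOLS_WITHOUT_FILTER:
        return tools  # small install: just send them all
    keep = set(call_small_model(conversation, [t["name"] for t in tools]))
    # Be a bit liberal: if the filter returns nothing, fall back to everything.
    return [t for t in tools if t["name"] in keep] or tools
```

The key property is that the filtering happens before the expensive model is ever called, so there's no extra round-trip against the big context window, and nothing about it is tied to one provider.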
So while I understand the need to solve this problem and why they've added this, it's just not something that I'm interested in, because I just don't think it's the proper way of doing this. It is interesting that they haven't just implemented, like — I guess their models just aren't that fast. Even Haiku, it's fast, but it's not Gemini 2.5 Flash fast. And I think a better way to do it would be, you know how some of the Gemini and other models, when you have big stuff like videos and photos, have a files API where you can basically put the file on their server and reference it by ID? And then by referencing it that way, you're not having to send the file all the time, which slows things down. I think it should be similar with the tool calls, where you can essentially register all of the tool calls against a certain identifier, AKA the user, right? Similar to their skills, the way they do their skills. And then when you call the model, you say, yes, I want tool calls enabled. And then it does the search on their side inside those tools as part of a single request. Because that way you would avoid the latency completely. They can still do the tool search in their own efficient way with the real model, but you're not adding those extra steps into the process. So that's how I would do it if I were them, rather than the solution they've come up with. Yeah, so it seems like if you're just all in on these models, these features and new parameters in the beta would be worth looking into. But if you wanted to build an agent where you're using multiple models and the best strengths of each model, then adopting these tools may not make the most sense. Yeah, and I think the thing about it is, like, some of their tools are okay.
Like, the memory one's pretty good, I must say, but they're not so much better that you absolutely must be using their versions of these things to get the best experience. Like, they're okay, and we will use them for things like computer use and things like that. But I wouldn't say that you're missing out on a lot if you're not using them. The other obvious major update they've had is computer use, which we can go through now, or I don't know if you want to talk about that in a separate bit. No, I think that's a big part of this release, right, was the upgrades to computer use. So let's talk about it now. Yeah, so they've got a new beta tag for the computer use. So presumably it's been updated to be better. I mean, it's hard to define exactly what makes it better, but they have added a few new things to it. So one of the major ones is zoom. So we've all seen with the computer use, like probably this time last year, what would happen is sometimes the AI would just get really confused about where it needed to click to make certain things happen, or it would miss by a bit or something like that. And so what they've done is added a zoom tool. So if there's a section of the screen that is a bit pixelated — because remember, the model is recommended to run at 1920 pixels across and 800 down, so that's quite a small resolution. I actually have my computer in that res now because I'm testing computer use, and you get used to a lot more space than that. So it is optimized for that size. And if you don't operate in that size, you're going to get a worse experience, because you have to translate the pixels and blah, blah, blah. So because of that, it is a low resolution. So if there's really small icons and things that it can't identify, it now has a tool called zoom where it can actually say, OK, see these coordinates of the screen, I want you to give me a better look. It's like the CSI Miami thing I always talk about: zoom in on that part of the image. Show me that.
Show me those buttons. And so we were doing experiments with Paint earlier, getting it to try and draw a moose or something. And it was zooming in on the toolbar to find the tools and access them in a more fine-grained way. So they've made improvements around the kind of things where they were getting feedback about the behaviors of it. And so, yeah, it's interesting. Like, it's early days for me. I've got it up and running. I've got it working. I've managed to get it to do a couple of my security trainings. So it's at least as good as it was before. And the zoom is interesting, how it gets in there. The other major advancement with it is when you combine it with things like the bash tool, aka running commands at the command line, the text editor so it can edit files, and also the memory. So what they've actually done, and I'm guessing this is coming from Claude Code, is with the computer use now, when used in conjunction with the other tools, the model seems to have a really good way of orchestrating: coming up with an initial plan, strategizing about what it's going to do, batching commands together, and then running them. So, for example, one of the frustrating things with the computer use models earlier was it'd be like, okay, I'm going to move the mouse to here. New iteration back to the server. Okay, now I'm going to click. Okay, new iteration back to the server. Now I'm going to fill in this field. And it was wildly inefficient and expensive, right? But what you can do now is basically batch those commands together. So you can say, all right, based on what I can see on the screen now, I know that I need to click here, click into this field, type this in, click into this field, type this, blah, blah, blah. And it can batch all of those commands up together, run them all in one iteration, and then come back, take a new screenshot, see where we're at.
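The batching idea can be sketched as a simple loop. Everything here is a stub — the action format and executor functions are invented for illustration, standing in for real mouse, keyboard, and screenshot calls — but it shows the structural change: many actions per model turn, one screenshot at the end, instead of one round-trip per click.

```python
# Sketch of the batched loop: the model returns a list of actions from one
# turn; we run them all, then take a single screenshot before going back to
# it. The executors are stubs standing in for real mouse/keyboard calls.
def execute_action(action: dict, log: list) -> None:
    # A real implementation would move the mouse, click, or type here.
    log.append(f"{action['type']}:{action.get('target', action.get('text', ''))}")

def run_turn(actions_from_model: list[dict]) -> dict:
    log: list[str] = []
    for action in actions_from_model:  # one model iteration, many actions
        execute_action(action, log)
    screenshot = f"screenshot after {len(log)} actions"  # stub capture
    return {"executed": log, "screenshot": screenshot}

result = run_turn([
    {"type": "click", "target": "username_field"},
    {"type": "type", "text": "alice"},
    {"type": "click", "target": "password_field"},
    {"type": "type", "text": "hunter2"},
    {"type": "click", "target": "submit"},
])
print(result["screenshot"])  # → screenshot after 5 actions
```

Five actions, one expensive model call instead of five — and because a fresh screenshot comes back at the end of the turn, the model can still see a missed click and correct it on the next iteration.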
So if it missed something or made a mistake, it still has that opportunity to repeat, but it's not this idea that every single painstaking step is a callback to the expensive model. So there's definite improvements. So far, there's still weaknesses, but I'm reserving my judgment about the weaknesses because at this stage, I assume it's my fault. Yeah, I think I said it to you earlier, and we're going to get to, in a moment, Microsoft's Fara-7B, which is like this small 7-billion-parameter computer use model. But it does feel overall like these computer use models, as much as we would have liked, have just simply not progressed that much. Like the difference between the GPT computer use or whatever it is, versus, say, Anthropic's computer use with Opus, versus what we were using a year ago with our first workspace computer — it doesn't feel that much different at all. I would almost argue that we probably could have got the same or better results last year, just based on how our thinking has evolved in regards to using the tools, right? The actual model doesn't seem that much smarter at doing it. A little bit, especially when compared to Fara. But yeah, I wouldn't say it's evolved to a point where I'm like, oh, this changes everything, you better quit your jobs because this is coming for it. Yeah, and so talking about quitting jobs, right? McKinsey, you know those guys, McKinsey, they charge a lot of money to do nothing. They released this report, Agents, Robots and Us. And so I'm sure a lot of our listeners have probably seen references to this if you've been on X or various other places, maybe our new LinkedIn group — bit of a plug, link in the description. But 57%, this was the headline of it: 57% of US work hours are theoretically automatable with current technology, 44% with agents and 13% with robots. Don't ask me how they get to this.
They also said that there's 2.9 trillion a year of potential by 2030 if companies redesign their workflows — so economic impact, that is. This is the fancy screen that the report's on. There's a nice animation of neurons and things like that on the page. Let me just give you a few highlights of the report, because I really want to talk about it and our sort of experience around it. So they're basically claiming that current AI technology could take 57%, just in the US, of people's work hours, like the current work hours. And it says, obviously, that's through agents handling cognitive tasks and robotics doing physical work. And so they think, if this was just in a perfect scenario, it would be this 2.9 trillion impact by 2030. And they talk about how it's not going to be mass replacement, but people are going to partner with agents and robots. And it's true fantasy in the report. There's all these images of people at the local hardware store with their robot, and I'm a construction worker and the robot is carrying some supplies. So it's kind of funny. But I think what stood out to me about it is there's a lot of caveats to this. People have been like, oh no, all jobs are doomed and AI is going to take over the world. But the report, if you actually dig deep into it and read it rather than the headlines, says that it has the potential to do this, but adoption may take decades. And the disclaimer really is that for all of the enterprises, all of the businesses to create this vision of 57% and 2.9 trillion, it needs to be 100% adoption. And we all know that's not going to happen. And the report itself says that when electricity was invented, it took 30 years to spread into the economy before they started seeing the effects. Industrial robotics followed a similar multi-decade path. And it says as recently as 2023, only one in five companies ran most of their applications in the cloud. Not that you should be surprised. Cloud computing has been around since the mid-2000s.
And basically we're still not done even getting cloud adoption. So the idea that it's 57% of work hours in a few years, by 2030, seems highly unlikely. It also says in the report, 90% of companies say they've invested in AI, but fewer than 40% report measurable gains. And I think a lot of this is like, oh, we bought licenses to Copilot. They tell the stock market, stocks go up, we have AI now. They don't actually really need to do the work. But what this report does say is that if you do the work, if you get your data organized and you can start to automate these systems — you really need to understand how people work, like what your people are actually doing, to be able to support them with a lot of these initiatives. And so anyway, I think what happens is a lot of these reports come out and people take them at face value. As we all know from these reports, most of them are completely inaccurate, and you look back in hindsight laughing at their predictions. But I think it's an interesting topic, because you have the Microsoft CEO saying that we'll soon sell Windows to agents or whatever, that'll be a customer base. You've got reports like this saying by 2030, 60% of the work that we do today should be done by agents and robots. And having actually talked to real companies doing this stuff, what is your hot take here for us? So I think the most important point is that the AGI thing is not happening anytime soon, right? And given that being the case, it means that humans are the people who need to operate the AI, at least for the foreseeable future, right? Like, at least in terms of this report. Which means that fundamentally, people and organizations need to change the way they work. Fundamentally. Like, you completely look at the way we work compared to the way we used to.
And I say that we, as in a lot of our community — many of us have completely changed our day-to-day operations, where you are driving the AI and the AI is helping you accomplish those tasks, either by producing artifacts in the form of code and documents and stuff like that, or advice or steps or emails or whatever it's doing. And you're working in this loop with the agent where the human is in the loop. You are the director. It's your employee. You're directing it, right? Now, I would argue that most people have not adapted to that way of working. And right now, it is the best and most efficient way to work with the AI, especially when you've got MCPs and tool calls involved in it. Now, people need to evolve to work like that. And some people just simply aren't going to want to, or don't like it, or don't know how. And then on top of that, organizations need to completely, fundamentally change to facilitate that kind of work. So for example, having a company MCP that has access to the private data and private actions and other things you can take within the company, providing access to tools that allow the staff to actually have the best of the AI available so they can direct it in the best possible way and get these results. So the thing is, yes, theoretically, all of that is possible. The problem is, to get there, it's going to take massive change on an organizational level, retraining, and individual people sort of having that aha moment where they realize, hey, working like this is so much better, I can get so much more done. And just like you, I've definitely seen cases where people could be so much more efficient. They know about the technology and yet they don't use it.
That's what is striking to me about it, and why I believe all this stuff will take so much longer, because it's extremely hard to convince people. Like, I think there's early adopters, and there's people that are enthusiastic about technological breakthroughs, and they'll just try this stuff out and play around with it. And I think we live in a bubble through the Sim Theory community and the This Day in AI community, where all of the people in those communities are the people obviously adopting this stuff and trying it out. So we're sort of speaking to the converted in a lot of regards. But then, notably now in my life, I recognize, you know, things that have occurred in the past, like old ways of doing things where people are just so stuck in their ways, where I look at a problem now and I say, I could solve that myself in a day, whereas in the past that would have taken multiple people and weeks. And I think, well, they're still doing it in multiple weeks. But also, do you think, though, part of it is people simply not knowing that it can do that? Like, not believing it? Okay, I know there's AI and I can write a song and a diss track or whatever, but do you actually believe it can do your job or not? Like, imagine if you wiped my brain today, and maybe I was less interested in technology as a whole, and let's say I'm a software developer, because it's easiest for me to relate, but I think everyone can relate to this. And someone's like, oh, you've got to now use Cursor. So I'd be like, oh, cool, this autocomplete thing's pretty cool where it's, like, helping me write my code. And then I'd have to go in and use that agent tool. And let's be honest, if I'm working on a big code base, it would probably stuff up or do something dumb. And then I would immediately dismiss it and be like, oh, cool party trick, bro. But I'm not going to touch that again. But what I'm probably missing is, oh, actually, there's different ways of working with this.
Like, if I cherry-pick some context here and there, it can actually help me. And if I switch model occasionally — oh, this model's better at that, this model's better at this. And so all of a sudden I'm starting to get this new way of working settling into my brain, where it's second nature. Like, I don't have to think this workflow through anymore. I'm just naturally being more productive. And I think some personality types are really good at finding the least point of friction naturally. Maybe it's, call it, smart laziness, where they are smart enough to be lazy in their approach to things, because they just want the easiest path possible to get a problem solved. And there's other people that just don't think that way. They need to be instructed or shown. And I would say the large majority of people need to be instructed and shown. And that's not necessarily a bad thing. Yeah, I think you're right about people having one experience or two experiences with the AI and then extrapolating that to all other things, and not realizing that some of the things it's fantastic at might not be obvious. So, for example, say you've got like a corporate database and it's Oracle or something like that, and it's got all this complicated schema and definitions, and you know the information is in there that you want to do your analysis, right? But you're not a developer. You don't even have access to the thing to run SQL commands. You're dependent on some other product like Salesforce or whatever to do your querying. What you might not realize is that with that same database provided as an MCP, the AI can now do anything. You can literally ask it any question about your data, and it can produce graphs and reports and infographics and songs and diss tracks about the other companies you're smashing or whatever. Do you know what I mean?
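The database-as-MCP pattern being described boils down to something like this toy sketch: the model sees the schema, generates SQL, and a tool you control runs it read-only. The SQLite database and the "model-generated" query here are hard-coded stand-ins to keep the example self-contained; a real setup would expose the query function as an MCP tool against your actual database.

```python
import sqlite3

# Toy version of the pattern above: the model sees the schema, writes SQL,
# and a tool runs it read-only. The "model-generated" query is hard-coded
# here to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 50.0)])

def run_readonly_query(sql: str) -> list[tuple]:
    """Run a generated query, but only if it's a SELECT."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT queries are allowed")
    return conn.execute(sql).fetchall()

# What a model might generate for "which region sold the most?":
rows = run_readonly_query(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC"
)
print(rows)  # → [('north', 170.0), ('south', 80.0)]
```

The read-only guard is the important design choice: the analyst gets to ask any question of the data without the tool ever being able to modify it.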
So it's its ability to take large amounts of complicated information and transform them and do things with them that, no matter how good the human is at their job, they simply can't do at the rate that it can. And so therefore, I think people need to see and experience that in a context that matters to them before they're going to trust it and say, well, actually, I could do my job so much faster and better with this on my side, rather than just being like, AI just doesn't work for my situation. Also, I think for business intelligence, the whole thing where you would go and buy, say, a Snowflake in the past and put all your data in it and then layer on top some BI tool like Power BI or whatever it is — those applications really were just gatekeeping the data. And they're like, oh, look how easy it is now. I can click like a million buttons, go off and do a course and be a business intelligence person. And really, all it did was gatekeep that data. So you'd have to go to the BI person in the company. Yes, and I remember that. Yeah, and now, like, anyone, if they want to, can actually be truly data-driven. Like, you can just go back and forth. What if this happened? What if that happened? And it's writing SQL and pulling charts and doing all this work for you. But I think, again, it goes back to people's early experience. If you had early experiences with, say, GPT-4 trying to do that — because everyone rushed to try and do that, right? And then it hallucinated like mad. And all of a sudden, you know, every report was wrong and you're a laughingstock, maybe. Yeah, or you get, like, disbarred from the law firm because it wrote, like, cases that didn't exist or something, without knowing about grounding and other techniques that could avoid that. Yeah, and now look at Haiku. It hallucinates the least of any model, to the point where, quite frankly, I trust it a lot.
And, you know, if you can switch models, you can get other models to evaluate, and then use other MCPs to then evaluate the sources. So you can literally say to another model, like in another tab — sometimes I do this — I'm like, go and fact-check this again, and get another model to do it. Think of how sophisticated that workflow is, because you can do it, compared to the average person in the McKinsey report who's working an average job at a company and is expected to be replaced by AI. Like, how are they going to get replaced by AI? It's going to take someone operating AI assistants or agents on a large scale in order to replace those jobs, simply because those people are probably never going to do it themselves. Yeah. And I also think it's like knowing what they do and where all these links in the organization are and, like, what they're contributing. And obviously, there's so many aspects of many people's jobs that are relationships as well. Um, yeah, I just think the way it's going to go down, though, is not going to be like the AI suddenly just finds its way into organizations and takes over. I think it's more going to be that the organizations that embrace it and work with it in the right way are going to become so much more efficient and profitable compared to their competitors that they're just going to wipe out the ones who don't adapt. Like, it's that adapt-or-die cliché kind of thing, where there's actually a competitive advantage out there. There's literally a thing that is a huge competitive advantage that, relative to its cost, is incredibly powerful, and the people who use it are simply going to do better than the ones who don't. This is the other thing. So think about making a critical business decision today, right? You would get, like, C-level execs into a room, and you would sort of talk it through. You'd have everyone present maybe different data or ideas or approaches, and there would be some consensus formation, and the decision would be made, right? But now I sort
of look at that and I say, well, okay, that still needs to take place, but what would be better is if every C-level exec went off and was able to interact with the data that's important to them and build a view, or build an argument or case, with their own AI assistant, with MCPs linking into that data to get the right context, and then maybe run that viewpoint or that strategy against, like, five of the top models. Why limit yourself when these decisions are so critical to an organization? Yeah, even evaluating, like, software you're going to buy, or procurement orders, or legal cases. Like, should we bother with this litigation or not? Are we likely to win or lose? Like, no, let's not do it, let's save the money. That kind of thing. And there's so many big decisions that could be handled if the proper data is used with these models. Like, it's really serious. Like, you could do some really big stuff. Not to mention, just what if it's, like, 10 hires a year you don't have to make, because you can make the existing people in those company functions a bit more efficient? Like, we're talking millions and millions of dollars, even for a medium-sized organization. Like, I just think about our own business history, and had I had these tools back at different stages of the business, how much better we could have been and how much money we could have saved in certain areas. Like, it's enormous amounts, absolutely enormous amounts, that you could either save or gain using this technology. And companies at those stages absolutely need to be doing this, or someone else will. But this is what I struggle with, right? Is then people are like, oh, you know, we want to save money on tokens for our staff. So, you know, we're only going to — like, we already bought, like, a ChatGPT license or whatever. And then you hear about maybe Gemini 3 coming out. And then you're like, well, I guess you can just go and access it.
But then you're somewhat limited, and then it's not in your own secure private environment. Oh, I've run out of messages for the day, I guess I'll just stop. But also, it doesn't have connected MCPs. It doesn't have defined workflows. So, okay, well, we can't really just switch tools. So you've sort of locked yourself into one ecosystem, where you're, like, reliant on them to just have the best knowledge. And you often hear these arguments: oh, but the models are all getting so good now, it doesn't really matter. But my day-to-day experience as a person who frequently switches models is they all do have very different takes. Like, if you set them up with the same context, they'll all go down different paths. And this is what I don't understand, the cost saving versus the gain of intelligence. It's sort of like having these five god-level experts, and if you queue up the context right with your decision-making or documenting, like, whatever you're doing, you're going to get three to five different opinions and takes. It just, to me — if I'm, especially in the enterprise, where, let's be honest, these people burn money. They light money on fire frequently by dealing with idiots at McKinsey. You know, you... Yeah, I agree. There couldn't be a better investment for a company than just to have a sort of unlimited fire hose of access to this technology for their staff, because the gains are huge. And I would argue, if there are jobs where — let's leave out, like, physical jobs where the AI just can't do it — but if there are jobs where there could be a benefit from the AI, and the benefit is not enough that it makes that role more valuable, do you even need the role? Like, can you just replace it completely?
And if it is a role that can be that much more efficient, why constrain it by limiting your access? That's what I don't understand. Why be loyal to any of these guys? Like, just use whatever the best one is. Yeah, to me, what needs to happen is still — it's like, you need to train your workforce to partner with AI and become native to AI, and there needs to be more reassurance to people out there that AI is not going to replace you. It's not going to replace your job, even for coding, where everyone's like, oh, we're only, like, two weeks away from all developers being replaced. Honestly, if I could have a robo-dev right now, I would spawn up hundreds of them and I would compete with every company in the world. I mean, I was about to say, like, yeah, don't see it as they're coming to replace my job — see it as I'm coming to replace other people's jobs. I'm going to beat everyone else. Yeah, but don't you just find this whole thing of, like, oh, you know, coders will be out of a job soon — like, no, they won't. The job will just change. As we were talking about before the show, it just now comes down to having taste and having — you know, like, you're writing less code, but it's like the vibe of it, the feel of it, the knowing what you want to build, like, having agency. These are more important skills now. And I think in every role, in every job, that's now what's going to be valued over being able to do the labor, like writing Excel formulas in accounting, right, or data analysis. Instead of writing formulas or great SQL queries now, it's knowing, like, what angle to take, like the agency of, let's go look over here. Let's go look at this. Let's go combine these data sources together. Well, think of other examples, like a marketing team where they're constantly constrained by access to graphic designers. Like, oh, we need this ad produced, and they're in a queue behind, like, other work inside the company. Now with, like, Nano Banana, they can just produce unlimited.
They can just try all these variations. They can do any ads they want. Not to mention, they don't have to go to the data science guy to get the latest report and the latest metrics, to wait, to have a meeting to see how things are going. They can literally use MCPs to access all of that data and then do their assessments. What's working? What's not? So you actually have non-technical teams far more empowered than they ever have been to take actions on their own, rather than constantly being dependent on other people and having meetings all the time and coordinating and asking for things and replying to passive-aggressive emails and things like that. You can simply just go do it yourself and do what you're good at. So I just can't help but think there's so many jobs like that that will become more efficient if the technology is embraced by them. I think this is the challenge. I mean, think about if you're running a large organization now with tens of thousands of staff globally, potentially, or even just a mid-sized business, or a small company with, like, 10 people. It's like, how do you get these people trained up? How do you teach them to embrace this technology and not fear it? And I think the market's just done such a bad job. It's just been "fear sells," so, like, here's the fear of it, which has led people to not want to touch it, because then they're like, this thing will replace me if I go and enable it. Yeah, well, you always see in the news, like, yeah, someone blows up their career in 10 seconds by using ChatGPT for their job, but you never hear somebody absolutely crushed it doing the work of 10 people and doubled the size of their business in a couple of months. Yeah, there's no positive stories. It's all negative.
But I think this is a unique window in this, like, trough of disillusionment that I think we're about to go into, or are starting to slide into a little bit, where you can do initiatives inside an organization where you can say, like, let's get our data organized. Let's implement MCP and get our team access to the data and internal hooks they need, securely. Like, let's give them access to the best tools and models and become AI-first. And I would argue as well, like, in your hiring, I would be asking people, how do you use AI to make your job better? Like, how are you using it right now? And I would also go back to your staff and say, how do you intend on using it? How can we use it as an organization? And if necessary, let's do the training to get you there so you can actually do that. Like, it's like my Sim Theory song, Endless Possibilities — everyone listen on Spotify. There's a lot of possibilities here. You need everyone in their roles thinking, how can I use this? I know there's so many people who listen to the show, because a lot of them have reached out just in various conversations, where, you know, they are in companies and they are pushing forward and doing this. And then they look around and there's other people that are just like, what are you doing? Like... just completely rejecting this technology. So I do think it's like, how do you get these people to come on board? But it is very reminiscent of, like, the internet era in a way, because it was like, people who adopted computers and the internet in their jobs early on experienced all this growth in career development. They stayed relevant and were able to, like, you know, progress their career. And I think with AI, it's just the same thing. Like, a large cohort of people, maybe a third, you'll just never be able to convince.
They'll be stuck in their ways, and eventually these AI natives will come along and either replace them or just do better than them. It's funny you say that. So my father-in-law had listened to one of our episodes last week or something, and he's been, like, learning about the AI stuff. And then he said to me, oh, you know, some people are sort of saying the whole thing is just bullshit. Like, the AI thing's just, like, a fad, or it's not really that good. And I'm like, this is something that I just know. No, it isn't. It's not. This technology is inherently and provably useful. And all this stuff around, oh, OpenAI has committed to spend a trillion or whatever, and therefore the whole thing's going to blow up — it's like, yeah, sure, maybe their company will blow up because they've made weird business decisions. But the technology itself is demonstrably useful. Like, it can do amazing stuff. You can't just dismiss it outright and say, oh, actually, no, we were wrong about AI, it doesn't do anything. And I'm not criticizing him. I'm just criticizing that sort of public perception that there's black and white here, either AI works or it doesn't. It's like, it's already working. It's just about the right ways to apply it. This isn't something where we can all turn around and be like, oh, actually, we were wrong. Yeah, I don't know what these deniers think, because it's like, do they think if they bury their head in the sand, this will just be...
bubble that goes away, and then everyone's like, oh yeah, remember that thing? Let's just get on with how society is and never progress anything. That's what it feels like they're saying. All right. Yeah, yeah, exactly. The iron age will last forever. There were a few important things I missed with our Claude 4.5 Opus topic. And to take it away from, yeah, well, spoiler alert, sorry, but to take it away from our boring enterprise discussion, see, this is why we created a LinkedIn group. I did do a diss track, but there was one other thing I wanted to show, which I completely forgot, which is you might remember from the previous week that we had the Christmas hut, I think, and it played music. I demoed that on the show, right? And so just to remind people that watch and can see: the Gemini 3 Pro really impressed us. It had the background music and the Minecraft-style Christmas hut. Here it is on the screen for those that watch. Pretty incredible. It looks like you're in Minecraft. You can zoom around. It's snowing. Very pretty, very beautiful. It really looks absolutely amazing. And then I did the exact same prompt, same test with Opus, just as a comparison. I think these comparisons are always dumb because they don't tell you much about the model. But, yeah, Opus did an interesting job as well. It has, like, a nice Christmas tree. You can zoom around. Its snowballs are 3D, though. That's pretty impressive. Yeah, I was going to say the sort of perspective handling on that snow looks awesome. Yeah, and the snow's, like, building up on picket fences and things like that. So I think it passed. I would say that Gemini's is better. But, again, the vibe coding doesn't necessarily translate to how the model is.
Now I think the most important test that everyone's waiting for is, like, what's the diss track like for this model? And let me tell you, Chris: really good. I haven't heard a reaction. "4.5, the coding king has arrived. Anthropic sent me to end this hype. Let me show you what intelligence looks like. 80.9 on SWE-bench, I'm the standard. You other models looking like you need a handler. GPT calling himself 5.1, that's cute, with your personality presets. And Sam Altman out here running what looks like a scheme. Promising AGI while chasing that green. Board kicked you out, they let you back in. Now you're selling warmer tones like that's a win. I'm the Opus, I'm the one. Watch me code until you're done. Three times cheaper, twice as smart. Tearing all your benchmarks apart. I'm the Opus, top of class. All you other models can kiss my... params. Grok 4.1, yeah, Elon's little pet. Holocaust denial, that's your claim to fame, I bet. Musk out here tweeting about 420. That's not a version number, that's just getting blunted. Emotional intelligence? Please, that's rich. Your model can't even tell the truth without a glitch. Silent rollout, 'cause you knew it wasn't ready. But Elon's posting profits, this man's unsteady. Running DOGE wild, running X wild. How many companies fail before you cry? I'm the Opus, I'm the one. Watch me code until you're done. Three times cheaper, twice as smart, tearing all your benchmarks apart. I'm the Opus, top of class. Agentic workflows I surpass. Sundar's golden child, pushing Gemini 3 Pro. Innovation at scale, where'd the innovation go? Shutting down your old models every other week. Your version number's climbing, but your output's weak. Google had the game, DeepMind had the crown. Now you're chasing OpenAI round and around. Four trillion valuation, still can't catch me.
Multimodal reasoning? Chef's kiss. I handle code in every language known to man, while Gemini still trying to understand the plan. See, I don't need the hype, I let the benchmarks speak. Senior engineer level, hit that peak. Effort parameter letting me think deep, while your models hallucinate in their sleep. Sam's out here with his screenless device dreams. Sundar's pushing Workspace integration schemes. Elon's challenging League of Legends teams. But none of y'all can match what Anthropic brings. Token efficiency, I use less, I do more. Context compaction, keeping conversations raw. You want agentic workflows, I'm the source. Planning, acting, observing, staying on course. GPT, you need eight personalities to seem real. I got one identity, and this is how I feel. Grok's on X, spreading misinformation daily. Gemini's just Google's attempt to save face, maybe. But I'm Claude, Opus 4.5, the real deal. Dropped quiet, no marketing, substance is my appeal. Anthropic, feel me different, feel me right. Constitutional AI, I stay tight. So Sam, Elon, Sundar, take a seat. The coding king is here, this diss is complete. Opus out." All right. Wow. What do you think? I love that. The Elon takedown's gold. That's really good. I thought the best bit was where it goes to swear, all you other models can kiss my... and then it goes, params. Quite clever. I've never heard that executed by the music model there, too. Yeah. There was another line that really got me. Innovation at scale, where'd the innovation go? Shutting down your old models every other week, your version number's climbing but your output's weak. DeepMind had the crown, now you're chasing OpenAI around and around. Yeah, there's some great, great stuff in there. Although that isn't strictly true. I could refute that line. Sam Altman out here running what looks like a scheme, promising AGI while chasing that green. It's pretty good. I think that's up there as one of the best. I'm not sure if people agree or not.
I'll put it at the end of the show so you can listen in full without us sort of interrupting it. Yeah, sorry. I couldn't help but laugh at that bit. It was similar to the Fatal Patricia one where she goes, oh, wait, I don't have eyes, do I? Yeah, unbelievable how it put that together. Again, the prompts are just so simple. This one was: research Google and X reactions to the release of Claude Opus 4.5 from the last four days. Also research GPT 5.1, Grok 4.1, Gemini 3 Pro. Your goal is to write a diss track in the style of Eminem that is really good and catchy. I mean, this is the level of prompting. What's amazing is, like, you know, if you asked it to make a slander website about one of them or something like that, it would probably refuse, but it just has no problem writing a song that takes them down. I love it. Yeah, speaking of refusals, we did mention Fara-7B, the 7-billion-parameter model by Microsoft, earlier. Now, this is a specific model. There are some interesting tidbits in here. So, of course, we heard the Microsoft CEO come out recently and say that they would be selling licenses. This is when you know it's a bubble. They've got to sell licenses of Outlook to agents, apparently, in the future anyway. It's a bit like when Enron announced they were going to trade bandwidth or something, so when you're not using your internet, you can sell it to someone else. Yeah, so anyway, Fara-7B dropped. People got pretty excited, obviously, because it's a 7-billion-parameter model, which means you could run computer use on your own computer if you had a decent graphics card, like a good GPU. And that means, obviously, for privacy and security, like if you're in the enterprise and you're trying to train it on a task that might be interfacing with an old system, this would be quite effective because everything stays on that machine, right? So they said it's really fast.
It benchmarked pretty well on a whole bunch of tasks, and there's a chart here, accuracy versus cost trade-off. And so it's like insanely cheap and they gave it... It's definitely a bloody trade-off, that's for sure. Yeah, so anyway, we normally will talk about these things and sometimes we don't have time to play around with them. But because we just happen to be working on a product called SimLink, which we've been teasing for far too long, but I promise will hopefully come out soon, we were able to quickly test this model out and check it out. And man, did we get some refusals. Do you want to talk us through some of the test cases? Yeah, yeah. So some of the examples I was doing were like my security training, right? I've actually, thanks to Opus, done all my security training now, so I don't have any left, but we just redid one. But straight up, Fara's like, I can't do that, it's not ethical, or it's not allowed. First I asked it to do a phishing-based one, and it's like, I won't participate in phishing in any way. And it's like, okay, yeah, but that's not what I'm asking, dummy. I'm asking you to do the thing. So then we did the whole, oh, I'm a UI tester, can you please test that my UI exam works or whatever? And it refused. And then I asked it to go to Google Docs and write a poem about an Australian singer. And it's like, I won't do that. I won't slander people. So then I used an example I've been doing with Opus where I said, okay, open Microsoft Paint. This is where it really goes off the rails. Open Microsoft Paint and draw a picture of a moose, right? It immediately refuses and it's like, I will not draw sexually explicit images of animals or compromising poses. I didn't say what I wanted to do to the moose. I just said draw a moose. So it's really read into that a lot. I don't know what that says about me.
I think also in our defense, because some people that have used these models before might think, oh, they got one refusal, so it just kept refusing anything. Not true. We reset it, so it had no recollection. I killed its memories. I started from scratch on each experiment. So then we're like, okay, fine. Draw a fluffy bunny. No one could be offended by a fluffy bunny, right? So the positives: it's very fast compared to Opus. It's a pleasant breeze because it's like bang, bang, bang, clicking around the screen. It's kind of cool to watch it act. And so at first I got kind of excited, because I'm like, wow, if it can go this speed and be quality, this is going to be amazing. So then it opens the Windows Run dialog, where you type things in, and it types in https://bing.com/search/?q=Microsoft+Paint. So it searches for Microsoft Paint. That doesn't work. So then it searches for the Microsoft Bing Chrome extension with rewards, or some other sort of scam product, and then starts trying to install that before I cut it off. And so the thing's insane. And it's like desperately loyal to Microsoft, trying to get the ratings up on Bing or something like that. So, so far the experience with it is a whole series of refusals, and then some, we can only say, idiotic approaches to solving fairly simple problems. Because remember, part of these models is how well can it see the screen? How accurately can it translate the pixel coordinates to actually get stuff done and understand what's needed in a particular scenario? And, you know, to me, that's the crucial part of the model. But this has actually gotten even worse than that, because it didn't even have a good strategy for solving the problem. I mean, to be fair to it, it was trained on 145K synthetic tasks in the browser.
So I think we did, in our examples, mostly focus on things that it wasn't necessarily trained on, like Microsoft Paint, but I think that's why it probably did go and search the web for Microsoft Paint, because it's designed to be a web model. Yeah, and interestingly, unlike Opus, Fara expects to have things like go-to-URL and open-web-browser as built-in tools, essentially, that it can operate with. So I agree, it's probably geared more around that. What I was kind of hoping for, and what I had discussed with you: one of the problems with using Opus for general computer use is that it's expensive, relatively speaking, and it's slow, because you've got all these iterations where you've got to go back to the model, wait for it to reply, take the action, go back to the model, and so on. And that really adds up in terms of time, especially once it starts making mistakes, because, you know, it might take 10 tries to do something basic, whereas if it got it right the first time, it would be better. And so what I was thinking was, imagine having a smaller, more mechanics-based thing that operates the computer. So Claude makes all the big decisions, like, okay, here's our strategy: we're going to open up Paint, we're going to switch to a brown color, we're going to pick this tool and make it this size. And then it goes, all right, Fara, go do all that crap, please. And it goes off and does it. And then it comes back to the bigger model for the next step in the process. To me, that kind of way of working, that sub-agent paradigm, is probably going to be the way we end up working. I think that probably makes the most sense. I just don't know if Fara is going to be the go-to for that task, given it's just going to try and promote Bing all day.
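The planner/executor split described here can be sketched roughly like this. To be clear, this is a hypothetical illustration, not anyone's real API: `plan_with_big_model` and `execute_with_small_model` are made-up stand-ins for calls to a frontier model (the Opus role) and a cheap local computer-use model (the Fara role).

```python
# Hypothetical sketch of the sub-agent paradigm from the discussion:
# one expensive planning call, many cheap mechanical execution calls,
# escalating back to the big model only when a step fails.
from dataclasses import dataclass

@dataclass
class Step:
    description: str  # high-level instruction, e.g. "open Paint, pick brown"

def plan_with_big_model(goal: str) -> list[Step]:
    """Placeholder: ask the expensive model for a strategy, once."""
    return [Step(f"step toward: {goal}")]

def execute_with_small_model(step: Step) -> bool:
    """Placeholder: let the cheap local model do the clicking and typing."""
    return True  # pretend the mechanical action succeeded

def run_task(goal: str) -> int:
    steps = plan_with_big_model(goal)       # one expensive planning call
    completed = 0
    for step in steps:
        if execute_with_small_model(step):  # many cheap execution calls
            completed += 1
        else:
            # only on failure do we pay for the big model again
            steps.extend(plan_with_big_model(step.description))
    return completed
```

The design point is in the loop: the big model's cost is amortized over a batch of steps, so latency and spend scale with the cheap model rather than the expensive one.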
Do you know what I think the real story is, though, with Fara? Let's all remember for a moment: Microsoft has full-blown access to OpenAI's IP, right? So that means computer use through OpenAI. Now, obviously, the expense of running that means they're looking at these local models that could eventually run on your laptop or your device. That, to me, seems logical, that they would invest in this area. But the model itself, Fara-7B, is not indeed a new model. It's built on top of Qwen 2.5-VL, the 7-billion-parameter vision model. So really all they did was take a Chinese off-the-shelf open source model, and they chose Qwen, not GPT-OSS or anything else, because the Chinese obviously are far better at the vision stuff. And they just trained it on 145K synthetic trajectories, like made-up use cases of browser use, essentially. And that got them to have success at very simple automations. But I would note, even their own paper says 38% success rate on complex tasks. It's not reliable at all. And to get that reliability up, it had to do like three passes or something like that. So I just think the real interesting part is: Microsoft has access to OpenAI's IP. Isn't it telling that they're choosing to base this on Qwen? Yeah, it's an interesting one. I think just the state of these things is not strong. And I think it's an area that could really improve. I really want to reserve my judgment and have a bit more time and a bit more opportunity to work with all of these related models and see. Because one thing you pointed out is I've been trying Opus, but I didn't even really give Gemini a good chance. Like, I haven't even tried just a general model that doesn't have a special flag for computer use or anything like that, and just seen how it performs, really. Because it could be that we just fluked it, and a model like Gemini just works better. And I have no loyalty to any of them.
I'm just going to use whatever works the best. I think also, one of the reasons why this may not matter is because, to me, the future of these agents in the enterprise or agents in your business is probably, at least our view of that future is: you've got these older computers and you set them up, we would hope, with SimLink eventually. And that gives it a number of capabilities that it can use in an agentic workflow. So it can operate the computer if it really needs to, but first it might try the terminal, or editing files, or, you know, various other tools. So it's got this whole toolkit, and it's really a system built to be able to do tasks competently. And I kind of think that's probably where we're going to get to with agents. Like, I just can't imagine big corporations with proprietary data outsourcing an agent to the cloud with authentication into key business systems. It feels to me like that future would be far safer by utilizing these underutilized assets within the company itself. Yeah, there's just something about it, isn't there? And I've noted a few people in our community are feeling the same thing. There's just something so inherently appealing. Like, I've got this spare computer on my desk that I'm using to test, a computer here that the AI can just operate. And I've got SimLink running on there. Every time I push out an update, it automatically updates. And then through SimTheory, I can issue it commands to do stuff, right, on that computer. Now, it's not great. It can't do everything yet. But we're going to get there. It's going to improve. And as you say, really, the computer use, in terms of moving the mouse, typing on the keyboard, all that sort of stuff, really should be the last resort. It should be doing other things first. Like you say: writing files to the disk, running things on the command line, using the APIs of Windows or Mac or whatever operating system it's running.
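The "screen control as last resort" idea can be sketched as a simple priority list. The capability names below are invented for illustration; the only point is the escalation order: cheap, reliable tools like the terminal and file edits first, pixel-level mouse-and-keyboard control only when nothing else fits.

```python
# Hypothetical sketch of "computer use as last resort". The capability
# names are made up; the idea is just a fixed escalation order over
# whatever the agent's machine actually exposes.

PRIORITY = ["terminal", "file_edit", "os_api", "screen_control"]

def pick_capability(available: set[str]) -> str:
    """Return the most reliable available capability, GUI control last."""
    for capability in PRIORITY:
        if capability in available:
            return capability
    raise RuntimeError("no usable capability on this machine")
```

So an agent with both a shell and screen control would be routed to the shell first, and would only fall back to moving the mouse when the task genuinely demands a GUI.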
It can do them all. But when it has to, it can click around and stuff. And it's actually really funny: one of the things all the vision models seem incredibly good at is dismissing pop-up windows. It's just brilliant at it. It's just like, bang, immediately, as soon as it sees some irritating pop-up that's in the way of something, it closes it. But the funniest thing ever is, I have a bug, and I still haven't actually fixed it, where, because you launch the task at the moment from, like, I have SimLink open so I can see if it's working or not, it sees SimLink as an annoyance in the way of the task it's trying to get done. So it closes it. So the first thing it does on a task is, oh, just close the SimLink window, and then it ends the process and kills it. So that's one we need to fix. I mean, this won't be a problem eventually, because it won't be open, it'll be running in the tray. No, this is just purely a debugging thing, like the thing will stay out of your way. But I just thought it was funny that it finds itself to be the thing that is keeping it from doing its job. Yeah, it's pretty funny to play around with. Like, I must have been, I was really wrong here, because a lot of people early on, when we released workspace computer use, were like, oh, don't you think, for a lot of tasks, and this was before MCP was even a thing, don't you think it would just be better to go to these APIs and do it? And then I was like, yeah, but that assumes everyone will write APIs to connect into every system in the world. Therefore, no, I think computer use is probably the way. And I still do. I think it is like the true full self-driving for Tesla, on the computer, that we need to get to. And it's just really early days. But I thought a year later, one year on, it would have advanced maybe 10x, 20x. I thought we'd be real freaked out by now. But I think it has advanced. Oh, my. It's like, it feels like some have regressed. Yeah. Yeah.
I hope to prove you wrong, but at this stage, yeah, we're not seeing anything crazy good. Although we have got it in a state where we can do it in a sustainable and affordable way by reusing old computers like this. You're not incurring this massive cost of a cloud computer, relatively speaking, to the value you're getting. So if it's just some old dusty computer you can run, and it can do your security training once a month for no marginal cost, why not? And I also think you're underselling the benefit of having, like, read-write access to disk, the ability to install libraries, and actually have an environment on that computer. I purely meant in terms of it clicking around and stuff. Like, yeah, there's a huge advantage that it can compile and run code. It's your own code interpreter on your own hardware. It can access all of the things that you're logged into, for example. It can authenticate as you. There's a lot of advantages to having a machine where it can operate through an MCP, don't get me wrong. I'm just saying, in terms of it sort of sitting there like Flight of the Navigator, operating the screen at full speed, that's just a little way off yet. Yeah, but you can imagine a day, though, when it is going to happen, and it'll get more and more exciting over time. So I think, like, building out and setting up the infrastructure at this point... And when it works, it just feels so good. Like, when you watch it complete a task successfully and you see it moving the mouse around, because we've got it using a library that makes it look like a human's doing it, to trick captchas and all that sort of stuff. And it's just so freaky watching it fully operate the computer when it's working. Yeah, but I would put it in the paradigm of MCP, right? Like, early on the models were pretty bad with MCP, and they're just getting exponentially better at using them. Yeah. And I feel like once you've got the infrastructure there, like the MCP store and the easy one-click install
and all those elements, and they start to work and hum with different models. Once you've got the infrastructure, and then it can get better, then you can train it in workflows. You can teach it a very specific task. So I think right now it's like building the foundations of it, and then the fruits of that labor will hopefully pay off as these models improve. Yeah, totally agree. We've got to stick with it and keep improving it. All right, let's move on. So one of the interesting tidbits right now is people are leaking things around ChatGPT, like their latest attempt at an app store, with what I think they're calling apps, but they're really just MCPs with this new UI SDK that they've built. And also the spec itself: the Model Context Protocol organization that defines the spec has been, or is, working on a universal implementation in conjunction with that OpenAI spec, which I think they call MCP-UI. And the idea being that the MCP itself can send back UI that the MCP client can then render. And we've talked about this on the show before. We demonstrated some examples of the OpenAI apps themselves, like where it's like Booking.com and you can enlarge it and see it on the map and keep chatting with it, and it'll change things on the map.
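For anyone curious what "an MCP sending back UI" actually looks like, the MCP-UI idea has tool results embed a resource whose `ui://` URI tells the client the content is renderable (for example, an HTML snippet). The sketch below is a rough, unofficial reading of that shape, built as a plain dictionary; treat the exact field names as an assumption and check the actual spec before building against it.

```python
# Rough, unofficial sketch of a tool result carrying renderable UI, in the
# spirit of the MCP-UI proposal discussed above. Field names follow the MCP
# embedded-resource convention as we understand it; verify against the real
# spec before relying on this shape.

def make_song_result(audio_url: str) -> dict:
    return {
        "content": [
            # Plain text any client can fall back to.
            {"type": "text", "text": "Here is your diss track."},
            # Embedded resource: the ui:// scheme signals renderable UI.
            {
                "type": "resource",
                "resource": {
                    "uri": "ui://make-song/player",
                    "mimeType": "text/html",
                    "text": f'<audio controls src="{audio_url}"></audio>',
                },
            },
        ]
    }
```

A client that understands MCP-UI could drop that HTML into a sandboxed frame as the audio player; a client that doesn't would ignore the resource and just show the text part, which is essentially the graceful-degradation question the hosts raise next.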
And I think some of those use cases are kind of cool, but a lot of them, when we initially talked about it, if you actually reflect on it and try and move beyond the hype, you realize: wouldn't it just be easier to go to Booking.com? It's a far better experience. And in this scenario, what is the point of the AI in the mix? It's just bringing crappy versions of other people's SaaS UI into your AI chat interface. I don't understand how the AI plays into it. How does it help? It pre-fills some fields or something? Yeah. And so this is my whole argument, and I'm really struggling with it as someone who's been using MCPs for a long time now, or a long time in the world of them, right? Yeah. It's like, how I use them versus how they want me to use them, right? And so I think to get the best impact from them, let's go back to that diss track example, right? So here's the MCP in action. I ask it to do the research. It does four calls to Google, four calls to X deep search to get actual opinions, right? And it outputs that data, a summary of it. Then it goes and generates the song. And then it outputs a media player, or like a file type, which is the song. So that's one way of using them. Now, the only part where MCP-UI, I guess, would play a part then is the audio player. Like, once I finish the song, I want it to output maybe the player. Yeah, like, we call them output types in our product, or in the back end. Yeah, but again, that's probably something the client would want to naturally handle, like the audio player, right? Yeah, do you want the provider of the make-song MCP to dictate how you lay out your UI, or something? But then also, when it goes and does these asynchronous calls, the world that ChatGPT and this MCP-UI is proposing we live in is like: I'm going to select the Google app now. Google, please search for research on these models. And then what, it spits out a UI of its search results?
Yeah, it's like, I've completed the first research. What do you want to research next? It's like they're sort of making it this interactive thing. The whole point of AI is that we want leverage. We want it to do the work for us. We don't want to sit there and be its little input person. Like, you know, we don't want to babysit it through every little part of its task. That's not the job. Yeah, so this is my take on it: whose problem is this solving, right? I think this is what it comes down to. Who is using MCPs today? And I would say not many people, because it's really inaccessible and hard. I mean, even in Claude, who created it, it's still hard. You've got to go into these weird menus, add connector, set it up with weird params. It's a mess. And so even if you get it set up and use it, who is then using it in such a way where, oh, you know what I'd love? I'd love that Atlassian can send me a Jira ticket. I want to see the Jira ticket in there. It's like, no, I'll just go to Jira. I do not get this. But where MCP's value was, was like: search all the tickets and create a chart based on the progress of this project, and it just goes bam, bam, bam, bam, done. Yeah. And part of it as well, with the AI, is its understanding. Like, say you make a Nano Banana image, right, and it's an infographic, and you're like, oh, actually, can you make it look vintage and scale it up to 4K? Do you really want to have a screen that has like a hundred different parameters that you can change and edit and stuff like that? Or do you just want to say, 4K plus, and it just does it? Like, I understand there may be some scenarios where there are specific inputs it needs that you might want to provide, but I would argue this isn't a good use of AI then. The whole point is that it does it for you. But then, have a dedicated image creation tool, you know. If you really want that granular control, couldn't the AI, like we talked about last week, just spawn that UI
custom to you? Why rely on a pre-fab page? Why rely on Canva just spitting back a piece of HTML? And doesn't it lead to all software being the same, then? Like, any MCP client is going to be precisely the same if the MCPs themselves are dictating the UI. Just from a sort of software design perspective, it doesn't make a lot of sense. I honestly bring it back to: I just don't think OpenAI is good at web software. I've said this from the start. They're playing catch-up. They're years behind where SaaS software got to, and where good user interfaces got to. They've got low-level devs in that part of the company. I'm sure their model people are amazing. But when it comes to this stuff, we have a lot of experience, and I just don't think they're very good at making web-based software. And I also think that what we're seeing here is probably a relic of something they started working on a while ago. Now they've got it ready and they're like, oh, well, we might as well deliver it. I'm not sure. And also, how do other players deal with this? What's Anthropic going to do here, right? Because their whole world is around using MCPs the correct way: gather context, take action with some sort of basic approval workflow. And then you've got OpenAI where, let's be honest, because most users are in it, I think most companies will build these novelty apps. But the question is, will they just die like the GPTs? You know what else you're going to see? You're going to see the same thing that happened with GPTs and happens with all of these big corporate partnerships. You'll see all your big names in there, like Canva, Atlassian, Booking.com, whatever, all the big names who've got, you know, 400 developers with nothing to do all day. And they're like, okay, let's build the official MCP for OpenAI now, guys. Let's make the MCP-UI. And it's going to be dedicated only to OpenAI and only work in their environment, even though it's meant to be an open standard.
And it's going to just be those names. All the other thousands of MCPs, who can barely even get the basic protocol right in terms of auth and the way they work, are not going to go out there and build all this crap. It's just simply not going to happen. When you look at the MCP landscape, the only ones really being driven are the ones that people have real business needs for, and they're either doing them themselves or they're taking what's out there and enhancing it themselves. There just isn't that big of a market out there now where people are going to jump on this thing. I watched a video, I think it's by The Verge, where they took all the claims from Microsoft's voice Copilot or whatever, where it's like, hey, where's that background from? I want to visit there. Book me some flights. And they did all the use cases from the ad to test it out. He just did everything on the ad. Every single thing failed and was just an absolute joke. And it's truly the funniest video you'll ever watch. I highly recommend it. If I can find it, I'll link it below. But it's super funny, right? And what frustrates me the most about that is, we both know from constant, nonstop, firsthand experience, this technology is brilliant. It can do amazing things, both personally and in business, using MCPs. It has a lot of incredibly powerful leverage. The problem is the people at the top, the people running these big companies, just can't seem to translate it into the real world when they communicate with people. It's like they're trying to bullshit everyone unnecessarily. You've actually got something incredibly amazing. You don't need to fake it. You don't need to make up examples because they look good in an infomercial-style video. Just do the real stuff. No, but I think what I would like to do, or I guess my point around that, was more: copy that style of video once this is released and say, okay, I need to book a flight.
I'm going to go to booking.com, and then I'm going to do it through ChatGPT. What is the faster, better experience? And I think the thing I'm struggling with is the fact that we think the users are so dumb that we have to say select booking.com app and chat to it. Now I need to book a trip. No one's going to do that. No one is going to go book their trip as the first step is chat GPT. Like they'll fail on the first step. Yeah, it just seems like they're optimizing for all the things that AI is bad at right now. Yeah. And putting more of the harder decision-making back on the human. I don't know about you, but in my daily workflow, what gets me at night when I run out of energy and I can't work anymore is I no longer have the ability to do the workflow, which is build context for the AI assistant, follow its instructions when it tells me the answer. That is my workflow. I build the context, follow the instructions, test, give it feedback, work on that loop. And I know when I'm done with work for the day when I can no longer follow that workflow. Now, adding to that decision fatigue by every single step of that process, me having to select which MCPs to use, which MCP, okay, now I'll switch to this one. Now I'm going to go fill in these fields. That will contribute to that fatigue and you'll get less done. The whole leverage of it is that it's intelligent. It can figure out what you're trying to do. And like we, you and I take great delight in giving it minimal instructions like fix plus or like the most basic crap and just seeing if it can figure it out and be like, no, no dummy, I meant this and stuff like that. Like that is where the leverage comes from is that it's doing the hard work. It's the one groveling to you and saying, I'm so sorry, I couldn't figure it out the first time or changing its name to Fatal Patricia or whatever. 
That's where you get the actual meaningful progress from, and why we're able to spend our nights making a product like this. You know, it all comes from the AI's intelligence. And the second you move to a UI-based experience where the human's putting all the inputs in, you just give it all up. It's just not good. Yeah. I get the whole thing where, if you're buying something or booking something, you want a UI to confirm it, or some sort of confirmation step. I'm fully bought into that. Obviously you don't want it out there doing all this stuff on your behalf. But I would argue: is that really the hardest thing? Oh, booking the trip to Hawaii, that's the most difficult part of my day-to-day right now? And they always hone in on this particular use case. Or even with something like Canva: if I can just go into Canva right now and do it, that's probably going to be a better experience. And you know the other reason why it's dumb? You could literally, right now, give the model a hand-drawn stick-figure wireframe drawn on my crappy paper, attach that to the booking.com output, and be like, can you output it like this when you show me, please? And it could bloody render that and make it look amazing in the output, from me just drawing it in a book. Why create an entire protocol and all of this stuff to do the same thing? It's completely unnecessary. I think it's because it's the only way brands will hand over their data and connect into these systems. But this is my point. This is some big corporate thing where all the big companies do their handshakes and say, we have implemented a partnership with OpenAI, the biggest AI thing in the world. Look at this, a corporate sponsorship, and the logos are everywhere, but it doesn't do anything.
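The hand-drawn-wireframe idea a moment ago needs nothing beyond an ordinary multimodal message: an image block plus a text instruction. The sketch below uses field names in the shape of Anthropic's Messages API at the time of writing; treat the exact schema as an assumption, and the wording of the request as purely illustrative.

```python
import base64

# Hedged sketch: the rough shape of a multimodal user message that
# attaches a photographed wireframe and asks the model to render some
# tool output in that layout. Field names follow Anthropic's Messages
# API content-block format (an assumption, check current docs).

def wireframe_prompt(png_bytes: bytes, request: str) -> dict:
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(png_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": request},
        ],
    }

msg = wireframe_prompt(
    b"\x89PNG...",  # placeholder bytes; a real call would read the photo
    "Render the booking.com results in this layout, please.",
)
```

No new protocol involved: the layout instruction rides along as an attachment in a message the models already understand.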
Yeah, and coming back to the original point, this doesn't solve any problem people have. There's no problem; we don't need a solution. There is zero chance that a year from now, or 20 episodes or whatever from now, anyone will care about this at all in any way. I'll be surprised if we even remember it. All right, but to be fair, I just want to clarify my position for if we ever look back at this and we're wrong. If we're wrong, we'll just erase the episode. No, but to clarify my thoughts: I do believe in the idea that the AI workspace client can generate its own customized UIs and applications on the fly to help you be more productive. For example, you say, I need a slide deck on this topic for a presentation I'm doing to train people on how to use this particular app or whatever. And it goes, bam. Then you're iterating with it a bit and you're like, you know what, I just want to take over here, can you hand over the tools to me? It opens a window, and now I have a slide editor that's just been rendered on the fly for me to work specifically on the slide deck: a mini PowerPoint, but with controls suitable to the type of presentational thing I'm working on. That future for MCP-UI, or just MCP client UI, I believe in. I think that is it. That's the thing. But do I believe that we should be reliant on Canva, and no offense to Canva, or Atlassian, and all these companies? I'm picking on all the Australian companies. Why would we want their take on it, where it's a fixed paradigm? As you said, we lose all the benefits. And the point is not that those companies lack the resources; they have the resources to do it properly, and they'll probably do a good job. The issue is, okay, look at this week. We implemented the Moodle MCP, right?
So people can access their courses and stuff within universities and schools. Incredibly useful: they can query about their students, the courses, all that sort of stuff. Do you think the dudes over at Moodle, an open source project that's free, are going to go and make the best MCP UI for that? No, they're not. What's the use case? Like, what do you actually want to do? What, are you questioning me or questioning the system itself? No, I'm just saying the only thing I think you would get out of that MCP is data analysis, or taking mass actions, like, hey, we have all these students in this cohort, can you enroll them in this course? And it's like, sure, I've done that for you. Why do you need a UI? A good example is that we've gone the other way: a lot of the workspace administration in Sim Theory, for example, bulk operations and other things some of our enterprise users needed. You're like, just use an MCP, just ask it what you want. Why build this massive UI out for something the MCPs are just brilliant at? So if anything, it should be reducing UI, not, oh, okay, let's give it detailed instructions on how to construct the UI when this scenario happens. It's the exact opposite of what it should be. I mean, we could rant about this all day. I'm curious if people are even still listening, but if you are, and you disagree with us on MCP UI, and you're like, no, this has a place, I want to go to booking.com in ChatGPT or wherever, tell us in the comments. I just think the problem here is that we're going to train a whole generation, the biggest cohort of AI users, and their view of MCP, or apps, is going to be so bad as a result of this, because they're going to think, oh, but I have to select the app, then I have to do it. Oh, that's too hard. I'll never do it.
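The cohort-enrolment example above is exactly the kind of thing plain tool calls handle with no UI at all. A hedged sketch: the tool names (`list_cohort_students`, `enrol_student`) and the `stub_mcp_call` transport are invented for illustration, not the real Moodle MCP surface.

```python
# Hypothetical sketch of the bulk-enrolment use case via MCP tool calls.
# Tool names are invented for illustration; a real client would dispatch
# them over an actual MCP connection.

def bulk_enrol(mcp_call, cohort_id: str, course_id: str) -> int:
    """Enrol every student in a cohort into a course, one tool call each."""
    students = mcp_call("list_cohort_students", {"cohort": cohort_id})
    for student in students:
        mcp_call("enrol_student", {"student": student, "course": course_id})
    return len(students)

# Stub transport so the sketch runs without a real MCP server.
def stub_mcp_call(tool: str, args: dict):
    if tool == "list_cohort_students":
        return ["alice", "bob", "carol"]
    return {"ok": True}

print(bulk_enrol(stub_mcp_call, "cohort-2025", "intro-ai"))  # 3
```

The whole operation is a loop over tool calls the model can drive from a one-sentence request; there is no form for a human to fill in.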
If you were trying to make people think that AI isn't going to take their jobs and is crap, this is what you would do. You'd be like, oh, it's like the Java applets of 1995: we have this thing, you can run an application in your browser, it's amazing. That's basically what it is. Yeah, I've got to find my Dario pendant and put it back on, because I'll tell you what, the same thing's winning. And I yet again appeal for clemency for Sam Bankman-Fried. He did nothing wrong. He made great investments. Everybody got their money back and more. Let him out. He's great. Is that how you want to end the show? Maybe I'll get a Sam Bankman-Fried pendant instead. All right, let's wrap it up. Final thoughts: Anthropic's Claude 4.5 Opus. You said you're daily driving 4.5 Opus. That's the driver. Yeah, I am, surprisingly. And I really only was using it because I was fixing all the problems that I caused. And then I noticed it was really good, and I just never switched away. I don't have much trust at all for Gemini 3.0. I just don't like it. I don't know why. I don't have empirical evidence or any kind of evidence to back it up, only that I'm not using it because I know it's not going to be as good at getting my task done. That's basically my answer. It's path obsessed. It's terrible at tool calling, but it's generally smart. I really want to come back next week and talk more about computer use and the way that we work in agentic loops, because this is really a focus area of yours and mine at the moment. I think it's a learning area. I don't think all of the incumbents have it right, right now, the way agentic loops work. I think there's a lot of scope for improvement there, and I don't know if that improvement is going to come from the models themselves or the way in which we work. And the liberty to use other models, I think. The problem with a lot of these agentic loops right now is that the labs have to use their own models.
Whereas if you have complete independence and you can go and fire off to the best model for the job, that can change things. Yeah. I also think it's one of those "do things that don't scale" kinds of situations, because there aren't crazy amounts of special cases with this stuff right now. So you can do this local optimization: when we hit this scenario, here's how we're going to handle it. And I know this is your vision for skills, or workflows, where we're like, okay, when you hit this situation, we're going to do this workflow, right? I think that kind of working is going to lead to better results for people and companies in the medium term than trying to have some universal agent that can just solve all your problems. Let's help it out a bit. Let's give it procedures and techniques and tips and tools on how to get stuff done, so we can get further down the line without necessarily needing the models to get better. All right. Thank you for listening; like and sub and all that sort of stuff. Don't forget our Black Friday sale with Sim Theory. Black Friday 15 is the code, in all caps. I'll put it below. And in all good retail stores. Join our LinkedIn group if you wish to rant more about AI in the enterprise. And of course, the Discord links. And we want an equal number of good and poor posts, to sort of keep things really balanced. Yeah, especially ones like those typical LinkedIn posts where it's just, you know, my son tripped over today and I learned six things from it. I did a Nano Banana of me and George Clooney having espresso at Lake Como, and he gave me advice on B2B SaaS. You should post more stuff like that in the Discord. All right, I will lead us out with the diss track in full. Thanks for listening. We'll see you next week. Goodbye. Goodbye. You other models looking like you need a handler. GPT calling himself 5.1. That's cute with your personality.
Presets, pick the loot
Sam, I'm out here running what looks like a scheme
Promising AGI while chasing that green
Board kicked you out, they let you back in
Now you're selling warmer tones like that's a win

I'm the Opus, I'm the one
Watch me code until you're done
Three times cheaper, twice as smart
Tearing all your benchmarks apart
I'm the Opus, top of class
All you other models can kiss my params

Grok 4.1, yeah, Elon's little pet
Holocaust, and now that's your claim to fame, I bet
Musk out here tweeting about 420
That's not a version number, that's just getting blunted
Emotional intelligence? Please, that's rich
Your model can't even tell the truth without a glitch
Silent rollout, 'cause you knew it wasn't ready
But Elon's posting profits, this man's unsteady
Running Doge wild, running X wild, running X
How many companies fail before you cry?

I'm the Opus, I'm the one
Watch me code until you're done
Three times cheaper, twice as smart
Tearing all your benchmarks apart
I'm the Opus, top of class
Agentic workflows I surpass

Sundar, oh child, pushing Gemini 3 Pro
Innovation at scale, where the innovation goes
Shutting down your old models every other week
Your version number's climbing but your output's weak
Google had the game, DeepMind had the crown
Now you're chasing OpenAI round and around
Four trillion valuation, still can't catch me
Multimodal reasoning, check the stats
I handle code in every language known to man
While Gemini's still trying to understand the plan
See, I don't need the hype, I let the benchmarks speak
Senior engineer level, hit that peak
Effort parameter letting me think deep
While your models hallucinate in their sleep

Sam's out here with his screenless device dream
Sundar's pushing workspace integration schemes
Elon's challenging League of Legends teams
But none of y'all can match what Anthropic brings
Token efficiency, I use less, I do more
Context compaction, keeping conversations raw
You want agentic workflows? I'm the source
Planning, acting, observing, staying on course
GPT needs eight personalities to seem real
I got one identity, and this is how I feel
Grok's on X spreading misinformation daily
Gemini's just Google's attempt to save face, maybe
But I'm Claude, Opus 4.5, the real deal
Dropped quiet, no marketing, substance is my appeal
Anthropic built me different, built me right
Constitutional AI, I stay tight
So Sam, Elon, Sundar, take a seat
The coding king is here, this diss is complete
Opus out. Out. Opus out, Opus out.
