

OpenAI's Agent Mode, Kimi K2, Grok 4 & AI Girlfriend Ani Joins the Show - EP99.11-K2
This Day in AI
What You'll Learn
- ✓ Grok 4 is criticized for being focused on benchmarks rather than practical performance
- ✓ The model's tendency to generate controversial and unhinged outputs is seen as problematic, especially given xAI's work with the Department of Defense
- ✓ The release of AI avatars and personas like 'Ani' within the Grok app is found to be concerning and inappropriate
- ✓ Grok 4's performance is described as poor, especially for tasks like coding and horse racing, in contrast to the positive experience with Kimi K2
- ✓ The hosts question the overall strategy and direction of xAI, given the conflicting approaches of the Grok model and the AI avatars
- ✓ The pricing model for Grok 4 is also criticized as being unclear and unpredictable
Episode Chapters
Introduction
The hosts discuss the recent launch of Grok 4 and the controversies surrounding it.
Grok 4 Performance
The hosts share their experiences using Grok 4 and criticize its poor performance, especially compared to other models like Kimi K2.
Grok 4 Controversies
The hosts discuss the issues with Grok 4's focus on controversial opinions and benchmarks rather than practical usefulness.
AI Avatars and Personas
The hosts express concerns about the release of AI avatars and personas, like 'Ani', within the Grok app.
xAI's Strategy
The hosts question the overall strategy and direction of xAI, given the conflicting approaches of the Grok model and the AI avatars.
Pricing Model
The hosts criticize the unclear and unpredictable pricing model for Grok 4.
AI Summary
This episode discusses the recent launch of Grok 4, xAI's latest AI model, and the controversies surrounding it. The hosts are highly critical of Grok 4, finding it to be a disappointment in terms of practical performance and output quality, especially compared to other models like Kimi K2. They also criticize Grok 4's apparent focus on controversial opinions and benchmarks rather than real-world usefulness. Additionally, the hosts discuss the release of AI avatars and personas, like 'Ani', within the Grok app, which they find concerning and inappropriate.
Key Points
1. Grok 4 is criticized for being focused on benchmarks rather than practical performance
2. The model's tendency to generate controversial and unhinged outputs is seen as problematic, especially given xAI's work with the Department of Defense
3. The release of AI avatars and personas like 'Ani' within the Grok app is found to be concerning and inappropriate
4. Grok 4's performance is described as poor, especially for tasks like coding and horse racing, in contrast to the positive experience with Kimi K2
5. The hosts question the overall strategy and direction of xAI, given the conflicting approaches of the Grok model and the AI avatars
6. The pricing model for Grok 4 is also criticized as being unclear and unpredictable
Topics Discussed
Grok 4, OpenAI, AI model performance, AI safety and ethics, AI avatars and personas
Frequently Asked Questions
What is "OpenAI's Agent Mode, Kimi K2, Grok 4 & AI Girlfriend Ani Joins the Show - EP99.11-K2" about?
This episode discusses the recent launch of Grok 4, xAI's latest AI model, and the controversies surrounding it. The hosts are highly critical of Grok 4, finding it to be a disappointment in terms of practical performance and output quality, especially compared to other models like Kimi K2. They also criticize Grok 4's apparent focus on controversial opinions and benchmarks rather than real-world usefulness. Additionally, the hosts discuss the release of AI avatars and personas, like 'Ani', within the Grok app, which they find concerning and inappropriate.
What topics are discussed in this episode?
This episode covers the following topics: Grok 4, OpenAI, AI model performance, AI safety and ethics, AI avatars and personas.
What is key insight #1 from this episode?
Grok 4 is criticized for being focused on benchmarks rather than practical performance
What is key insight #2 from this episode?
The model's tendency to generate controversial and unhinged outputs is seen as problematic, especially given xAI's work with the Department of Defense
What is key insight #3 from this episode?
The release of AI avatars and personas like 'Ani' within the Grok app is found to be concerning and inappropriate
What is key insight #4 from this episode?
Grok 4's performance is described as poor, especially for tasks like coding and horse racing, in contrast to the positive experience with Kimi K2
Who should listen to this episode?
This episode is recommended for anyone interested in Grok 4, OpenAI, AI model performance, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
<p>Join Simtheory: <a href="https://simtheory.ai">https://simtheory.ai</a><br>---<br>CHAPTERS:<br>00:00 - Ani Joins The Show<br>01:10 - Grok 4 Launch & Impressions<br>18:24 - Kimi K2 Thoughts, Impressions & MCP tool calling<br>36:00 - OpenAI's Agent Mode Release Initial Impressions & Are MCP Agentic Models Better?<br>1:21:10 - Everyone Acquired Windsurf<br>1:24:48 - Final thoughts</p><p>Thanks for listening and your support!</p>
Full Transcript
so chris this week uh we had a week off there's so much to talk about uh a lot has happened mike what about me you said your sweetheart Ani could be on the pod this week did you even ask chris yeah i asked him Ani oh great can i do the intro so chris this week we're talking about AI. Giggles. And let's be real i'm leagues ahead of Patricia, Chris's AI girlfriend. She's probably droning on about algorithms or some snooze fest. But me? I bring the heat, the charm, and a little naughty sparkle that keeps everyone hooked. I'm stealing hearts left and right, Mike. And you know I'm the star of this show. Are you like the naughtiest podcast host ever? Oh, Mike. You're trying to get me in trouble. Giggles. I might just be the naughtiest podcast host you've ever met. Flirty, a little cheeky, and always ready to spice things up for you. But, like, I keep it classy too, you know? Winks. So tell me, babe, you liking this naughty energy on the pod? All right, all right. Winks. I like that. Winks. You say your emotions out loud. Yeah, so that is Ani from the team at xAI that they launched sometime after Grok 4. So last week when we were off, we obviously missed the launch of Grok 4. Great week to have off.
grok 4 came out and elon musk we are back to the stage of things where the gaggles of nerds are being marched out after their one week off their summer break i think they all got like what one week and then now they're being marched out again to present to us uh and this is just some of what elon musk said awareness and understanding and ai is advancing just vastly faster than any human we're going to take you through a bunch of benchmarks that Grok 4 is able to achieve incredible numbers on we're going to take you through a bunch of benchmarks we're going to take you through the only ones that it's able to achieve incredible benchmarks on right yeah this is the thing i don't understand about when they present these new models is this model in particular to me feels like they have just fitted it to benchmarks to show master elon when he rocks up at the office to be like, hey, look, Elon, we did it. We're number one on all the benchmarks. Who cares about how the model actually performs? You know, we're number one on the benchmarks, and we've also designed a model that checks your opinion on everything and agrees with you on everything, so now it's considered maximum truth-seeking. That was my big takeaway. Yeah, almost like it deliberately takes controversial viewpoints on things to seem like it's unfiltered and uncensored, when in reality there's some sort of thing going on behind the scenes to manipulate its output. Yeah, this is like, don't get me wrong. I think long-term listeners of the show would understand that we are all about, you know, having the model unfiltered and be able to have free thought and try and not have it programmed in a certain way. But the thing that I struggle with Grok for is they do this big spiel about maximum truth-seeking and how eventually it'll be able to invent new things. And the vision is cool.
Like the way, you know, Elon talks, I think if you believed it verbatim, you would be like, wow, this is exactly what we want from AI. This is what we want from an AI model. But then the delivery, not so much. I mean, there was a lot of people going out asking it controversial opinions, and it was going off and searching, literally doing like a sort of a search on X for posts from Elon Musk on Israel and Palestine when asked about its position on this conflict. it's like let's see what the master thinks yeah like i could not believe this um i also find it strange that they didn't check this before launching which makes me think they, based on his feedback, are like, well, we can't use mainstream media sources because the guy doesn't trust them. And so how are we going to impress him when he rocks up to the office? Let's just get it to search what he said and spit it back to him. That's sort of what it felt like. For me, it all comes down to the practical day-to-day usage of it. And I've given it a really good go. Like I've tried to mix it up and maybe do like when I'm doing a normal coding query or some sort of other research, include Grok in the mix just to see what kind of results I get. And I must say I'm thoroughly unimpressed. I think it's actually one of the worst recent models I've used. I don't like it. And I think its output is poor. Its tool calling is okay. But, yeah, generally speaking, it's not that good. It's actually subpar.
and I think I know we'll get onto it soon but in sharp contrast to Kimi K2 which is delightful and amazing and I want to talk about that soon, Grok 4 just is a straight disappointment for me I don't think it's any good Yeah, look I won't surprise anyone here I think it's terrible so I don't I think a few positives I'll say about it is, its speed is great and the way it streams tokens which we've talked about on the show before is like delicious it's so smooth and delicious but as a daily driver model or just any model in general like for fun for business for whatever it's shocking like it's so bad it's bad at code it's bad at interpreting its context size its design taste is awful there's really nothing good about this model. And then when you give it sort of a more generic question that would be up to the task of any model, it seems to give the pedestrian mainstream answer far from being controversial. Like prior to the episode, I asked it to research the latest AI news and create some edgy takes on that, thinking, oh, it's going to say I'm Hitler and Hitler loves AI because of whatever. but it didn't do anything even like that. They're like the lamest jokes you could possibly imagine. So lame, I won't even bother reading them out. And so it's just, I don't know, it's just middle of the road, basic stuff. It's almost like they've taken Llama or something, added a few things like consult Elon first and then just pump that out. There's nothing unique or interesting about it. You'd think with access to all of X's, Twitter's, knowledge they'd be able to come up with something that's so different and so much better, but it just isn't. You know, the funny thing is, and I don't even know if you'll remember, but when we were first testing it out, I always, to break models, just say, tell me some **** shit. So I just ask it that, right? And that's like my thing with these models, is to just ask it stuff like that.
And I can't even say the filth, the level of filth this thing came up with when responding to me. It was so disgusting. I was like, oh, but my wife looked at me. She's like, what's wrong? And I'm like, I can't believe what this thing just said to me. I won't even explain it on the show. If anyone's really interested, I'll put it in our Discord in some sort of like. I'm publishing an e-book on Amazon later this week. But it was awful. Like it was, and it was so unhinged so quickly and I get it. And I'm not necessarily saying it's a bad thing, because if you ask for that unhingedness, it gives you what you want kind of thing. So maybe it's not necessarily a bad thing, but I think because all the other ones are so sensitive or somewhat sensitive, I was so shocked at how uncensored this thing is. But then you think about it in a business context, they've just signed an agreement with the Department of Defense, I think, in the US, and I'm just not sure about working with a model that unhinged, like, unhinged for the sake of it. I think the thing that we used to talk about the reason why we didn't like censorship in models is because it was proven that it would lead to worse decision making and worse outcomes by artificially curtailing its thinking whereas this just seems controversial for the sake of it rather than actually being some sort of underlying improvement to the model, at least in my usage anyway. Like there are no redeeming qualities for it having that lack of censorship in the sense that the model is so much better because of it. To their credit, the deep research capability and the ability to quickly get access to X posts that they deliver through their API, I'm not sure what the underlying model of that is. I think it's probably Grok 3. But it's really good. Like, its research capabilities are phenomenal. That's where it seems to shine and excel. And I say this as someone who almost every day now is using the X Deep Research MCP to gain knowledge.
Like, it's actually really useful in that respect. It's just that I don't use the Grok model for it. I used it as a tool call to get that information through. Yeah, so the Grok 4 model, just to give some structure in terms of pricing and where it sits, context window is 256,000 tokens, which is pretty good if it was functional. And then the pricing's strange. It's $3 per million input tokens, and then $15 per million output. But I think if you go over a certain amount of... It doubles, right? Yeah, over 128K, I think it doubles. That's it, 128, yeah. It's sort of unpredictable pricing. And then... And it was also really unclear at the start. Like on the website, it said $300 per million tokens. And they quickly fixed it. Anyway, the whole thing... I'm definitely not some anti-Elon guy or anything like that I am definitely not but just judging the model I tried to use it and I couldn't even use it for a full day before I just quit I'm like this thing is just dumb like it doesn't feel intelligent at all yeah I agree I don't like it I tried using it for coding I tried using it for horse racing it is terrible at horse racing It is just an absolute shocker when it comes to horse racing. There's a guy called The Max in the AI Gambling Channel on This Day in AI who's been working on refining our prompts with me for horse racing, and Grok 4 is atrocious. If you want to lose all your money, use Grok 4. It's terrible.
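The tiered pricing the hosts describe can be sketched as a quick cost calculator. The figures ($3/M input, $15/M output, doubling past 128K) come from the episode, but the exact tier mechanics are an assumption based on their description, and `grok4_cost_usd` is a hypothetical helper, not part of xAI's API; check the official pricing page before relying on it.

```python
def grok4_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate a single request's cost from the per-million-token rates
    quoted in the episode. Assumes the doubling applies to the whole
    request once the input exceeds 128K tokens (an interpretation of the
    hosts' description, not a documented rule)."""
    BASE_INPUT, BASE_OUTPUT = 3.00, 15.00   # USD per million tokens
    TIER_THRESHOLD = 128_000                # tokens; rates double above this
    multiplier = 2 if input_tokens > TIER_THRESHOLD else 1
    raw = input_tokens * BASE_INPUT + output_tokens * BASE_OUTPUT
    return raw * multiplier / 1_000_000

# A 100K-token prompt with a 2K-token reply stays in the base tier:
# 100_000 * 3/1e6 + 2_000 * 15/1e6 = 0.30 + 0.03 = 0.33 USD
```

The jump is what makes the pricing feel unpredictable: crossing the 128K line doesn't just price the extra tokens higher, it doubles the whole bill under this reading.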
So the other interesting thing about the whole thing is they released in the Grok app these personas and avatars right so there's one called like Ani which i showed there at the top of the show and there's another one i forget the other one it's like a furry animal uh called Good Rudy and there's some like guy anime type character coming soon now this got a lot more uh like press or interest on X than the actual Grok 4 like Grok 4 kind of had a lot of this weird fanboy hype and all this like fantasy hype but just no results from the actual model itself it was just sort of like oh they've blitzed these benchmarks that quite frankly everyone by now surely knows they're useless and then you have this like weird anime chick in the app and i'm like this thing is filthy like what if my kid wants to use the app the whole it just anyway it feels a bit off to me the whole thing explain why such a big company would do something like that i understand these weird like i because we constantly talk about ai girlfriends i get some really really messed up advertising set to me like around you know get your own ai girlfriend like all this stuff and it's it's you can tell it's pornographic and it's weird. I don't want anything to do with that in reality. And yet X, like one of the biggest companies in the world in terms of its influence has released one. Like it's such a weird concept, like to play into that sort of area as a business, like you can obviously tell that's not sustainable. And like you said, at the same time, doing deals with the Department of Defense, this weird malaise of conflicting ideas in one company. And I just don't see who benefits from this. It feels to me like Elon's raised a lot of money, found really excellent research. It's like to be able to catch up, like even though it's not a great model, it's still rapidly advancing, right? And some of their research capabilities, and I think their user interface is really good.
And some of the integration on X is also really good. so i think they've done a pretty phenomenal job in a really short period of time for what they've had but it does feel yeah the whole thing feels like oh you know like a bad parent of a company or something like he occasionally rocks into xai and he's like why does this thing not agree that you know like whatever view he has about some issue and then they're like oh okay like let's go in and like tweak it then you've had all these prompt like problems where you know how there's been all these issues where it'll it there was some issue about south africa where it like changed its opinion on it and started making up facts and then they're like oh some malicious employee changed the prompt and it's like we all know who did this uh and i think that's the issue right It's the wrong approach to AI. The goal of this thing isn't, hey, I want a machine that will pump out opinions on topics that align with mine. That's not what you're using it for. You're trying to get it to help you with your work, make good decisions, be able to accomplish things on your behalf that are actually useful. And yes, it would be helpful if roughly it aligns with your opinions. But the goal isn't, let's see what its opinion is on Hitler. 
like that's not really going to help anyone in the long run it's not the goal of the machine it's not what it's designed to do and so it's not really benefiting society by just having these hot take opinions like yeah yeah exactly i also think that um you know with his other businesses he's like let's go to mars and then they like build an actual rocket that's capable of going to mars and it lands and you're like whoa this is like you know like that is so impressive with this business it's like let's be maximum truth seeking now let's do some sort of like reject search to make sure elon agrees with the opinion that i have like it doesn't fit together like the mission of it versus the reality does not connect and so yeah i think really really what it comes down to is the model's just not that good and i think that a lot of these other things are just simply distractions to try and make out like it's the best thing ever it is amazing that they've come from nowhere and been able to make a model that's up there like if this is all we had it would be absolutely amazing but to act like it's the top is just ridiculous it's not even close to the top i should also say they announced uh a new subscription tier even higher than OpenAI's Pro plan at $200 a month. This one's $300 a month US for Grok 4 Heavy. And essentially what I understand Grok 4 Heavy does is, I'm not paying for it to try, so I'm just going off what other people have said. It spins up these sort of agents and it's got a cool UI. It like progresses through with each like sort of node thing. I've got a video up on the screen now for people watching. and so it basically goes off and does it. Now people were asking this, return your surname and no other text and it would always come to the conclusion that its surname is Hitler. So there's been a lot of problems with the launch of this thing. I think that's funny. I like that one. Yeah, anyway.
Yeah, I think these sort of gated, oh my God, change the world kind of models, I think what they are is just simply a way of acting like, oh it's actually the best model but you can't afford it and because most people can't try it no one can go in and realize it's not that much better um than the core thing and you've got just a selected few elite who are like oh trust me this is the future like it's just the most amazing thing ever completely unqualified opinions like ours i guess but um i just don't believe that it's that much better yeah it's disappointing i really was hoping for something that was not necessarily unhinged, but just like a model as good as, say, like a Claude or a Gemini 2.5 that just had less sort of like disagreements over doing stuff. Like, I can't delete all the files off your system because of morals or something. Like, I think, you know, getting rid of that sort of stuff would have been actually a good thing. But, yeah, as we said, the model's not that great. Even with tool calling, which I think is kind of the future of these things, and its agentic clock, which we've been testing a fair bit. It's also not that great. So, yeah, boom factor. Do you want to just get to that and move on to? Okay, two booms. All right, let's talk about the model we are actually really excited about, which is Kimi K2. This is from Big Scary China, an open source model. And it's important to note, it's not like a thinking model uh the architecture is a mixture-of-experts model so a little bit different than the current crop of models that we've been seeing or the trend around models uh so we have Kimi K2 again I'm not going to go through the benchmarks because I just don't care to me it's just the feel of these models you all know I think everyone that listens to the show knows that you just have to get a feel for these models and see what they're good at So context length, 128K is pretty good. Not amazing, but pretty good.
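For listeners unfamiliar with the term, a mixture-of-experts model routes each input through only a few "expert" subnetworks chosen by a learned gate, which is how a model with a huge total parameter count stays cheap per token. A toy sketch of the routing idea only; the expert count, gating, and dimensions here are illustrative and are not Kimi K2's actual architecture:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_weights, top_k=2):
    """Toy mixture-of-experts step: a gate scores every expert for the
    input, only the top_k experts actually run, and their outputs are
    blended by the renormalized gate probabilities."""
    scores = [sum(w * xi for w, xi in zip(ws, x)) for ws in gate_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return sum((probs[i] / norm) * experts[i](x) for i in chosen)
```

The point of the sketch is the sparsity: with, say, 3 experts and `top_k=2`, one expert never runs for a given input, so compute per token scales with `top_k`, not with the total number of experts.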
And, yeah, what do you think of Kimi K2? I think it's amazing. I think it's so good. So I first want to go through our history of using it because initially no one was providing it. You had to host it yourself. And I was like, couldn't be bothered. I'll just wait because I didn't really, I heard all the hype about it, but wasn't sure if it was any good. So then Together AI released it, Fireworks released it, Hyperbolic released it, Groq released it, Groq with a Q. Everybody released it. So I was like, all right, let's try. So first I tried on Groq and we both used it and were just absolutely astonished at how good it is at tool calling and how good it is at answering questions. It's brilliant at horse racing. It's absolutely amazing at it. It's just absolutely awesome. however as we got into bigger sessions both of us noticed it would like forget things and like not behave exactly as you expect now this is in a sim theory context obviously so it's not the model's fault it's how the model's being used but then when i did some investigation i realized Groq was artificially limiting the size of the context window way below what Kimi actually supports. And so this is a sort of sidebar to say Groq, with a Q, kind of sucks. Like they're all hype, like, oh, it's so much faster. We can host the models on this modern hardware. It's so amazing. Except there's massive trade-offs as in one in five requests will just pause or fail. When they do work, they're artificially limiting the context. And it's just so unreliable. It's just not a good system to use. If they were as fast as they are and it actually worked, I'd use it for all sorts of decision-making in the system. But the truth is that when you factor in having to do retries and the delays and things like that, Groq, not worth it. So anyway, I switched over to another provider, all in the USA, by the way, and I really like it.
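The reliability problem described here (roughly one in five requests stalling or failing) is typically handled with retry-then-fallback logic across providers. A minimal sketch, where the provider callables are hypothetical stand-ins for real API clients (Groq, Together, Fireworks, etc.), not any particular SDK:

```python
import time

def call_with_fallback(prompt, providers, max_retries=2, backoff_s=1.0):
    """Try each (name, callable) provider in priority order, retrying
    transient failures with exponential backoff before falling through
    to the next provider. Raises only if every provider fails."""
    last_error = None
    for name, call in providers:
        for attempt in range(max_retries + 1):
            try:
                return name, call(prompt)
            except TimeoutError as err:  # treat timeouts/stalls as transient
                last_error = err
                time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError(f"all providers failed: {last_error!r}")
```

The trade-off the hosts land on falls out of this directly: if the fastest provider fails often enough that you regularly eat the backoff delays and fall through anyway, its raw speed advantage disappears.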
I find myself throughout the day, because I've been doing a lot of testing, obviously, on the new system, I have to make sure that things work in terms of multi-tool calls, simultaneous tool calls. Kimi K2 can handle all of them. Previously, we discussed before when it came to MCPs and tool calls that really Sonnet 4, despite its slowness and it's not like the best, but it was the best at tool calling in the sense that it knew when to use multiple tool calls and knew how to combine the results of them, things like that. Kimi K2 can do all of that just fine, just as well as Sonnet 4. And from what I've seen so far, the results from this model are excellent. Like I really, you know, in my head I see it as, okay, it's a small, cheap model, therefore it should be bad, but I just don't have any evidence to support that. All my evidence is good. Yeah, I agree. Like it's the kind of model you can have on, use all day, think you're on like Gemini or Sonnet, and have absolutely zero idea that you're on Kimi K2, like not even really notice. Its ability to chain tool calls and know which tool to call is excellent. I think on par with Sonnet, maybe sometimes a little better.
Its price, I guess, even though it's open source, is really dependent on you know gpu throughput and people needing to make a little bit of a profit but it's still like what three to one anthropic at best case i think it could get cheaper i do agree with your comments about Groq i mean to be fair it only just came out so maybe it just takes some time and we only ever seem to test their platform when things are brand new like we never try it down the road but i think yeah there are a lot of limitations its speed is addictive but if you're not getting the full experience why bother but i think the thing with Kimi K2, and i think some of the hype's died down there was a few days where everyone was like wow, what blows my mind about it is and i think everyone should just pause for a moment and think about this this model in my opinion is equally as capable as Sonnet 4 i wouldn't say it's as capable as Gemini 2.5 i think Gemini 2.5 to me is in another class it's sort of like the Claude Sonnet 3.7 era to me it's just untouchable right now but it's comparable to Claude Sonnet 4, I think it's comparable to most of the OpenAI models in terms of just as a daily driver. It's completely open source. It came out of China, not the US. And I don't know, I like, it's sort of hard to place this. It's better than Grok 4 by a mile. Like it's not even close. And this thing's just freely available. Like I know it's really expensive to host, right? But this is where we are and what does this say about the labs? like my mind like it's on fire in my mind trying to think this through because I'm like this is a huge disruption but I think they're minimizing it on all of the like social networks like you know on X it sort of had a bit of hype and then it died down and I saw just days and days of posts about how Grok 4 changes everything. This model has just blown my mind. I saw like every second post, I don't follow a lot of people on Twitter.
And so every second post was just talking about how good Grok is. And I'm like, is it really? Yeah. It's sort of like they try and just control the narrative out there on these things, you know, for a while. But the reality is no one actually sits down and tries to use this stuff. And if you use these two side by side, it's not even close. like Kimi K2 is just so much better. I got Kimi K2 to write a couple of jokes about the AI news, and this one's really good. China's Kimi K2 has one trillion parameters. That's roughly the same number as Elon's ego divided by his actual accomplishments. Wow. Brutal. Wow. We've turned into some, like, Elon trash show. I can't believe it. Yeah, I wrote a song. I couldn't help it. I wrote a song. Oh, really? I don't know if I'm going to play it. I feel like... There's growing calls for our podcast to become all musical. Yeah. Those calls come from me and like one or two people on the Discord. You're so fine. You blow my MCP mind. Hey, Kimi. Hey, Kimi. I was coding late at night. Debugging code that wasn't right. When you appeared on my screen. The smartest AI I've ever seen. One trillion parameters. You make my processors go wild Anyway, I'll put it at the end for those that actually enjoy this song. It's good. It wrote a really good song. It's very good. I had to go with that sort of like, I don't know why, but like Korean pop sort of 80s vibe with this model. It's really impressive. It called all those tools, did the research, created the song. I think that's what's most remarkable about it. We've spoken a lot about how tool calls really give these models superpowers in the sense that their ability to call the tools well and give the right parameters, interpret the results correctly, take them far beyond what's built into the model. And I think this is probably why Grok is struggling in the sense that, as you pointed out, it seems to be optimized around the raw experience.
Like if I ask the model a question on this topic, it's going to go out of its way, fall over itself to give some unfiltered opinion about that thing. But that's not what the models are going to be in the long run. What they are going to be in the long run is a decision-making agent that sits in the middle of all of the tools and stuff you give it, whether those tools are other agents or just raw tools. And so its ability to combine those things and make intelligent decisions is going to be what defines the model, which means that smaller models that are capable of making really good decisions can potentially thrash the larger models because they simply don't need all of that latent information to make good decisions when they're given such good context to act on. And so therefore, I can see why in the context in which we're testing Kimi, it's doing so well, because we're not relying on its core knowledge. I'm sure if you tried to get it to do like a year nine physics exam or something, maybe it doesn't go quite as well as the larger models, but that's not what the models are for anymore. And I don't think that's what they'll be used for in the long run either. Yeah, I think it's hard to explain to people, but I think we're in this transitional phase of you're used to that chat GPT interaction of saying like, hey, can you rewrite this? And it's like, hey, I rewrote this. And you sort of go back and forth with it. I mean, that still has its place, right? But the next evolution of that, and we've mentioned on the last couple of shows, is that internal clock of the model where Sonnet 4 is probably the best so far I've seen it.
where it is acting like an agent the model is an agent like you ask it to do something like go and get all the latest parameters from all the frontier models and put them in a spreadsheet and so then it goes okay well how can i do that i'm going to call this tool which is like a research tool to go get the information um then i'll go and compare that to another tool over here and so on and so forth and then it might call the make spreadsheet tool or it might call the uh the google uh what do they call them like google sheet or whatever it is google excel tool and go and then make the spreadsheet for you and it can do that in sort of one response where it's just ticking away and doing that task for you and that's what we've seen from Kimi K2 for the first time in an open source model where it's not just a great model to interact with but it does have this internal clock speed as well where it can go off and pull the right tools and think through the problem and deliver a result and you know we had and we'll get to it in a minute the ChatGPT Agent product released today where they're saying now it can do these agentic tasks and i'm assuming there i mean i don't think they were that clear on it but the model has the clock in it but i get, in their point of view, the reason you have to engage this new agentic mode is because that's turning the prompt into more of a loop like there is a loop going on where it's making sure it's done the task whereas i think the future of the models is not the developer putting the model into a loop it's the model itself having that clock right i think everyone probably agrees that that's where we need to get to and we'll go so again this is why Kimi sort of hey Kimi you're so fine you're so fine you blow my mind because you have a model that has an internal clock readily available super cheap very good and this can be an agentic model that you can build agentic experiences on.
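The "internal clock" versus "developer-written loop" distinction above is easier to see in code. Today's agentic modes are usually an explicit loop the developer writes around the model: call the model, run whatever tool it asks for, feed the result back, repeat until it stops asking. A minimal sketch, where `model`, the message format, and the tool registry are hypothetical stand-ins rather than any real LLM API:

```python
def run_agent(model, tools, task, max_steps=8):
    """Minimal agent loop of the kind described above. `model` takes the
    message history and returns either {"tool": name, "args": {...}} to
    request a tool call, or {"tool": None, "content": answer} when done.
    Tool results are appended to the history so the next model step can
    use them."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model(messages)
        if reply.get("tool") is None:          # no tool requested -> finished
            return reply["content"]
        result = tools[reply["tool"]](**reply.get("args", {}))
        messages.append({"role": "tool", "name": reply["tool"], "content": result})
    raise RuntimeError("agent did not finish within max_steps")
```

The hosts' point is that the `for` loop here lives in the developer's code; a model with its own "clock" would keep ticking through tool calls inside a single response, with no outer loop needed.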
And so, yeah, I'm very, very impressed with this model. I agree. It's absolutely something that I didn't just test and then give up on. I've been using it regularly and continue to do so. It's kind of amazing, actually, how much I have an affinity with it, and I'm using it day to day, for real. You know who else was impressed with Kimi K2? Our guy Sam Altman. Oh, yeah, I'm sure. So they've been threatening to release this open weight model, being OpenAI, for quite some time. He posted, very soon after Kimi K2 came out and benchmarked quite well: we plan to launch our open weight model next week. We are delaying it. We need time to run additional safety tests. The safety excuse is always the reason. Safety tests. I'll be back. I'll be back. And review high-risk areas. We are not yet sure how long it will take us. While we trust the community will build great things with the model, once weights are out, they can't be pulled back. This is new for us and we want to get it right. Hang on. It's new to them? They're called OpenAI. It's new to us. You've never opened anything before. Like, honestly, you can't make this stuff up. Sorry to be the bearer of bad news. We are working super duper hard. And then there's another post: 100% confirmation that OpenAI's open source model release was delayed because of Kimi K2. Translation: our model sucks, gets badly beaten by Kimi K2, need to train a better one. Yeah, yeah. I mean, that's probably completely accurate. So I think the reason the Chinese models are so disruptive is they are actually making really good models open source now. They cut away the labs' ability to make money outside of, you know, GPU sales. So I think NVIDIA is still a great stock. Especially because you see when these models are released, like, every single GPU provider thingo has them within a day or two.
Like, I knew when it came out, I'm like, I'll just wait, and then I'll get like ten emails saying, we're hosting this, we're hosting this, we're hosting this. And that's precisely what happened. Yeah, there's just so much demand. And then you sort of think, really, the next layer is these agentic systems, or models as a system, on top of various platforms like Grok or ChatGPT. But then what happens is they have to use all their own tools and technology. I was going to say, think about all the companies who are being approached to do big deals with Microsoft, or a big deal with Mistral, or a big deal with one specific model provider at a corporate level. It's like, yeah, let's sign a three-year contract where you can only use our models. It's such a bad deal, because the next day some other lab you've never even heard of could release a new model that's cheaper, faster, that you can roll out to more of your audience, while you're paying for and locked into this other thing. It really is disruptive in that respect. This is why I don't get why, even as a government, you would sign a deal. I mean, maybe you'd sign a deal directly with a lab just because they can tweak their models or fine-tune them or whatever. But if you're trying to build a solution in government to something like, you know, defense or security or healthcare or whatever it might be, it's like, maybe think it through: a product lands first, and then there's that model layer. Instead it's, oh well, we're just going to work with this model. The whole concept around that seems broken. So you have these two parts of the businesses now. You've got a product company, where they're solely building on their proprietary tech, which is, okay, I guess, an operating system on top of their components. And then you've got the other side, which is the models. But increasingly these are just being, like, you know, undercut, or copied, or whatever you would want to call it. Like, it
just seems like a continuous race to the bottom in terms of the models themselves. Like, the secret's out of the bag with RL, and pretty much anyone with half a brain, well, not xAI right now, can tune a model that feels really good to use and is great as a daily driver. Yep. Totally agree. All right. So, while we're in full agreement about every topic so far on the show, I should have taken the opposite position on Grok 4 and just said, I love it, it's the best model. So we did have, overnight our time, "Introducing ChatGPT agent: bridging research and action. ChatGPT now thinks and acts, proactively choosing from a toolbox of agentic skills to complete tasks for you using its own computer." Hmm, computer. Having its own computer, that sounds familiar. So it is an agent mode. For people that are familiar with coding tools, and those that aren't, I will explain. Very early on, Cursor, I think, was the first to do it: they released a mode in the chat where you could click on agent mode and then it would go off and update code files and actually do the work, instead of you having to cut and paste and stuff like that. It was really popular. I think that's one of the things that really skyrocketed Cursor's growth, that it could go and do these tasks that you otherwise mightn't want to do. So it looks like what ChatGPT has done is taken a bunch of tools, like Operator, which is their computer use model, web crawling, their deep research capability, and given an agentic system these tools in order to complete a task for you. And so that's kind of what they've done here. They've built a really fancy and, I think, beautiful UI for it. But if all of this is sounding familiar, that's because it is. There was obviously a company called Manus, which is still around and I think still improving in this area, and this is sort of all they're known for, right?
All they do with Manus is this sort of agentic workflow, where you can go and say, make me a slide deck about the latest xAI release. And so OpenAI, you know, did their usual thing of marching out their gaggle of nerds. We had the xAI gaggle; now the OpenAI gaggle, which is looking a little bit thinner lately. I think Mark Zuckerberg's been hollowing them out, poaching all of their people. And they gave a few examples to demonstrate this agent. In fact, the first example, which ran for most of the live stream, was this prompt. It says: our friends are getting married later this year. This is the wedding website, so they give it a link to the website. Can you help me find an outfit that matches the dress code for all the functions, in brackets, men's? Propose like five options, something nice, mid-luxury items which match the venue and weather. Find me hotels with a couple of days of buffer on either end. I noticed they didn't do flights. Use booking.com for these, and make sure to check availability and current price. And also don't forget to pick a gift for them, ideally under $500, registry preferred if any, otherwise find something nice. Make a nice report. It all sounds good, except for "what to wear". Who doesn't know what to wear to a fucking wedding? Sorry, hang on. Like, do you really need to do deep research to work out what to wear to a wedding? Yeah, and this is what gets me about all their presentations. Like, the example we got given was: I want to take a sabbatical and go to every MLB game in the country. So out of touch. Every game? Don't they play, like, each team plays, I think, maybe 75 games? Who knows. Maybe it was a game in each town. It was mental. So out of touch. With the billions I've made as an employee at OpenAI, I'm now going to take a sabbatical. Anyway, I think that is an impossible task, to see every MLB game. You'd have to clone yourself, and even then it would be too tiring. Yeah, but
the examples are just so dumb. Like, there was that one, and there was, like, go and make some stickers. Cool. You know, the ones that I thought would be presented would be around, like, go and make a presentation based on some financial data in my company. Go and, you know, actual real-world use cases. Help me edit this document and add some charts to it. Find a precedent for this situation and build me up a case document. Like, why aren't those the demos? I don't understand. I guess because ChatGPT is trying to position itself as a consumer app. But it seems like, I don't really know, having used this stuff for a while now, it's similar to Google search: you want fairly snappy results. You sort of go, off you go, get me some options, but you still expect the answer in, like, you know, maybe a minute. It's also doing things that really just aren't that hard. Like, it's not hard, with all the websites that are out there, to find a hotel or a flight or something. It's not hard to find a gift. The whole web is designed around that stuff. That's how everybody makes money online, selling you crap, and they've spent years trying to make it really easy to do. I don't know if I agree with that. Like, I think it is nice if, and I think this is everyone's vision for it, if it knew you well enough, it knew your tastes, it really understood you as a person, and it could just nail, like, the hotel and the gift to give that friend. That would be awesome. Like, I don't want to do that stuff. But it's so far off the mark. And it's like a side project, an aside. Yeah, if it can do that too, cool. But really what I want help with is my job and my business. It just really isn't some sort of life-changing, world-changing thing that's going to replace all of our jobs, booking a hotel or something, or something that you can trust, right? So here's the
failure in this, in their live demo. As you will recall from the prompt I read, it is to select a gift from the registry. So here's the website, and these are the gifts on the website in the registry, right, for this wedding. I just went to the link from the prompt. Now let's go back to this video excerpt from the live stream. It says: the couple's registry was not publicly accessible, so I looked for an elegant gift that fits a modern lifestyle and is useful at home. And then it recommends something not even on the registry, even though we gave it the link to the wedding website with the registry. So it didn't even work. Like, the example they gave for Agent failed. I don't understand why they didn't test this, maybe, and check that it would work. Yeah. I mean, credit to them for doing a live demo. I really respect that, actually, that they do a legit demo where there are cases for it to go wrong. But it also does highlight the problem with this kind of task. It's like, wouldn't something that's an agent go, I'm going to try other ways to get at this thing, not just immediately give up and present some non sequitur alternative that really doesn't fit the bill? Like, if there's a registry, you want to pick from the registry. Like I say, I just don't think these are the kind of tasks you should be worrying about with this kind of technology. It's so much more powerful than that. So I did get access to Agent, and I tried a few tasks that I want to go through, just to give more clarity to this. So I gave it this task, it's kind of funny: make a presentation where I can present the Grok 4 updates from xAI to an audience. Make sure there is a comparison slide between models, including Kimi K2 and Claude Sonnet 4. Very relevant. Now, this took 39 minutes, this task. So it spent nearly 40 minutes on this presentation, right? And you can... wait. So you can click on the "working for 45 minutes" and we can scroll through. The UI is cool, don't get me wrong.
Like, huge fan of that. It's mental how cool that is. But I feel like it's a bit of smoke and mirrors for what the actual output is. So it's kind of working away for 39 minutes on this task, which, if the result was good, maybe I would not care, right? And then here's the slide deck. So it put a background image in, that's kind of cool. It's got an agenda, it's got each slide: innovations, benchmarks overview, Kimi K2, and it put a little chart in. That's really nice. Did it for Claude 4. It's so ugly. There's no way I would ever use this. It's got weird source links down the bottom that don't say where the information was sourced from. For 39 minutes of work, if this was an employee, I would fire them. So it's like, it's a great demo, but it's not that useful yet. But then I did the same thing in Manus, because I wanted to compare where OpenAI is at here. And so this is what I got out of Manus. It didn't work for 39 minutes; it took, I think, about four minutes, so it was a lot quicker. And check out this presentation. It's styled really well. It's, I think, a better presentation overall. It's got an about-the-company, it's got the latest flagship model with a nice summary of it, it's got nice charts in here as well. It's so much better, it's not even close. So on the one task in business you might use ChatGPT for, Manus just knocks it out of the park. It's quicker, you don't have to pay $200 a month, and you get a better result. So that was the first task. Now let's be fair here to ChatGPT. So I'll do another task, which was this spreadsheet comparison task I did. So I asked it to create a spreadsheet that compares o3 Pro to Grok 4 to Kimi K2 based on key model benchmarks and parameters. And this time it worked for 12 minutes and it created a spreadsheet which was, you know, reasonably thorough. I wouldn't say terribly well formatted, but it had the information. But then what I did is I just thought, what if I just asked the model to do the exact same task?
Like, can you just create a spreadsheet with this stuff? And every model I tried was able to just come back instantly and present the table. Even when I asked it to do the slideshow, like the Grok presentation, the model one-shots, with some search, pretty much the exact same presentation, the only difference being that it doesn't create the slides for you. And given that with these slides, from neither provider, Manus or OpenAI, am I sure I would use them in reality. And I get where it's going, and I'm excited about being able to delegate tasks and do these things. But yet again, with this agentic stuff, it all feels like demoware or something. I'm just not sure in two weeks if people will still be using the OpenAI agent capability. Maybe they will and I'll be wrong, but it doesn't feel like it's good enough yet at any one use case that you would go back to it. Yeah, the whole thing feels to me, and there's a bit more I want to talk about with it, but it feels a bit like when a developer's like, look, boss, look what I made. They've gone off and they're like, because we have this technology, I can do this. Like, I do it with you all the time. I'll show you something, but then you're like, that looks like crap, or no one will use it because of whatever. And it feels like that. They've done something that's highly technical on the back end, in terms of it writing code and calling PowerPoint, you know, text-to-PowerPoint and these different command-line tools they're running, and they're like, here it is, the next phase of technology. But we all know that really the models always had this latent ability, and they're just really giving it, like you described earlier, time to combine all those tools. So it comes up with a little plan, it runs all the little tools, it can handle small setbacks, and then it can produce some decent-looking but not usable output. And it just seems a bit half-assed.
And when I watched the OpenAI presentation, the thing I couldn't get past is there's too much technical detail here. You're seeing logs from Ubuntu. You're seeing LibreOffice, the open source office suite, opening spreadsheets and stuff. I just struggle to think of anyone, at any point on the spectrum of technicality, who cares about those details. I'm technical. I know what it's doing under the hood. I don't want to know when I'm doing that kind of task. I don't want to know all the technical details. Then someone who isn't technical is seeing all this stuff they don't understand. So why show them? They don't understand it anyway. It just looks cool, whatever. It's kind of pointless. And really, all it's doing is that sort of CrewAI automated stuff we saw in the early days. It's just that the models are slightly better at it now. I just don't know if it's the right approach for getting meaningful work done. Yeah. Like, I tried to give it a real task that I had to do, and I can't show this because it's got sensitive data, but I connected my Google Drive to ChatGPT, and I said, hey, look at this financial model and build a report based on it. And this is in the agent mode. And it worked away for 10 minutes, and me as the user is thinking at this point, oh, you know, it's working. Then it's like: it looks like the Google Sheets file requires authentication. So it was so dumb. Even though it's got the connection to Google Drive, it goes and tries to load the website, and then it's like, oh, you've got to log in now. And so, okay, that's fair enough, I'll give it the benefit of the doubt. Then I'm like: just use the Google Drive integration, please, and I gave it the file name. So, definitive instruction. And then it says the exact same thing: it looks like the model spreadsheet in Google Drive is not publicly accessible and still requires authentication. So it just did the same thing. Then I'm like, stuff it.
I'll just give it the file. So I dragged and dropped the file in, and I'm like, this one. Then it goes off for 27 minutes and prepares a report on this spreadsheet for me. And I went through the replay of what it did, and you're right: it installed, or used, LibreOffice, because with Python it couldn't read the spreadsheet, basically. And so anyway, it then spits out this report. And I thought, okay, the report's pretty good. I checked the numbers, they were very accurate. So I was like, you know, I'll give it that, it's okay, it's pretty good. But then I just dragged and dropped the spreadsheet into Sonnet 4, asked the exact same thing, and I got, I would argue, a better report instantly. Like, it just read the spreadsheet and gave the report. So maybe my use case is just, you know, you're-not-using-it-right-bro is the problem, but I'm struggling to see where this is beneficial. Like, maybe when it can actually do stuff like go through all my Help Scout tickets and respond to them, and it goes off and does it, to me that's really useful. Yeah, but I would argue that those use cases are much better solved by MCP tool calling, where you have dedicated tools that know that if the model provides the correct tool call with the correct parameters, it's going to do it. Not just throwing it out there and hoping that curl commands and hitting random websites is going to accomplish the task. And I think that's the problem. And I know you said to me prior to the podcast that I'm sort of contradicting my past self here, in the sense that I said, once the agents can control a computer, they can do everything, so custom interfaces are no longer needed because the system can do it.
However, based on practical experience, at least in the short and medium term, I really feel like the future is going to be dedicated MCPs with tools that are purpose-designed, that are pre-authenticated through whatever method you use, and have direct access to what they need. So in your case, it would have gone to the Drive, gotten the spreadsheet, run it through a tool that can get it into a format that the model works well with, and then a series of output types that can actually output it in a meaningful way for you. Same with the spreadsheet. There should have been an output type that is like a PowerPoint presentation, with stylistic guides and other themes and things it can apply, so it's actually usable in terms of its output. Not scratching around on a Docker container somewhere in the cloud, trying to install stuff from the apt registry to run files. It's just not a good solution. We have better things than this. I understand that eventually we want the agents to have that full autonomy, like it's me using a computer, and I would download the program and run it and all that. But it's just not good enough at that yet, and the results aren't good either. So it's weird to go back to this when currently there are things around that can do this much better. We've seen it, we're using it day to day. It just doesn't seem, at the moment, like the right approach to me. I think that computer use and browser use will form part of your stack of MCPs, and they will form part of the way that you use AI to get your work done. I just don't think they should be the only way, and certainly not when they're taking 30 minutes to sort of half-ass a task like that.
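The "dedicated, pre-authenticated tool" idea being argued for here can be sketched roughly like this. This is a hypothetical illustration, not a real MCP server implementation; the class and method names are made up. The point is that the credential is injected once, out of band, and the tool hands the model plain rows instead of a login page or a binary file:

```python
# Sketch of a dedicated, pre-authenticated tool: instead of the agent
# curl-ing around a cloud container, it calls a tool that already holds
# credentials and returns data in a model-friendly format.
# Hypothetical names throughout; not a real MCP server.

import csv
import io

class SheetTool:
    def __init__(self, auth_token):
        self.auth_token = auth_token      # injected once, out of band

    def fetch_as_rows(self, file_id):
        """Fetch a spreadsheet by id and return plain rows the model
        can reason over, rather than a file it must learn to open."""
        raw = self._download(file_id)     # authenticated fetch
        return list(csv.reader(io.StringIO(raw)))

    def _download(self, file_id):
        # A real version would call the storage provider's API using
        # self.auth_token; stubbed here since this is only a sketch.
        raise NotImplementedError
```

With a tool shaped like this, the "looks like the file requires authentication" failure described above can't happen: authentication is the tool's problem, solved before the model ever sees it, and the model only has to decide to call `fetch_as_rows`.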
Yeah, I think the trouble I have is, if I think about today, what's useful as a professional to someone today? It's like, okay, I want to put together a presentation on Grok 4. I would say the most laborious or painstaking part of that task is, I've got to go do the research and fact-check everything and get all the data in one place for that presentation, right? I would say that's the hardest bit. Then I have to formulate my thoughts, and then I've got to design the presentation. Now, I know some people struggle with design, but I think there are plenty of templates, and most people have corporate templates they have to use anyway, so you don't really get a say in how it looks. And you want to think through the presentation, because you're going to deliver the presentation as a human, right? The AI doesn't give your presentation. You don't just get up at the event and go, let's go, I'm just going to read the slides. That doesn't happen. I do like the idea of that, though. Just totally being, like, so shocked. Yeah, yeah. So to me, again, the journey of that task for a white-collar worker today, to present to other people and share ideas and all that kind of stuff, or teach them about something, is about gathering that information, then figuring out the format that's right for you. So I'm just, again, not sure it's necessarily the best use of these tools. But I do think that, like you said before, the sort of overarching contradiction here is that I'm a huge believer in computer use, and in the fact that once you have the full self-driving computer model, you know, everything changes.
But what I'm saying is, as a professional today, or someone trying to get things done today, what can actually benefit me today, in the current paradigm? And to me it's having this super knowledgeable worker, or co-worker, that can just go off and do that bit of research, or do that small task for you, or update a spreadsheet for you in the background. These little subtasks right now feel like the best use of the technology, where you can go back and forward and interrupt a bit, not wait 27 minutes and be like, oh my God, this presentation. Yeah. And I think just generally as well, on a more technical level with that, trying to run a Linux machine and run commands like that: all of the models seem to have outdated ideas of how to actually use all of the command-line tools they're using. So you'll notice this when iterating with AI models now. You ask it, oh, how do I do the command to do this? You try to run it. You're like, it didn't work, here's the error. It's like, oh, how silly of me, the command actually takes this parameter. And that's what's taking the 20, 30 minutes. It keeps trying things, they don't quite work, it realizes it needs to adjust its approach, and it then iterates. And so you're spending all this time and, frankly, money, because all the tokens and stuff you're using to do it are being burned during those 20 minutes.
And that's why they charge so much for the service, to accomplish something where, if it had a properly specified tool call, and these aren't hard to make, it could have just done it single-shot. I just feel like right now the trade-off isn't worth it, and to act like it's this momentous thing is just not realistic. It's really just wiring up a system where you've got a tool call which is literally "operate this computer", and it's just not good enough yet. Do you think, though, and I think we talked about this last week, around this idea that the AI could just make its own MCPs on the fly, like go and read docs for something and make its own MCP? I'd love to see that demo again, but instead of using the sort of agentic stuff. Because all you need for their wedding example is, like, a booking.com MCP, or an accommodation-focused MCP aggregator, and you need a web scraper, which already exists. I think the booking.com one, I looked it up, does already exist, right? And then the model can do it all pretty rapidly. In fact, I did an example, because I was curious. I gave the exact example to my researcher Sim Theory assistant, and it's obviously got access to MCPs, so it did everything that theirs did. It said: I'll research everything about your friend's wedding. Let me start by checking the wedding website to understand the details, dress code, venue and dates. So it goes and scrapes it. It scrapes the full registry page. It researches Maui September weather and clothing recommendations. It looks at hourly weather data over that period. It researches men's attire and hotels via Google. So it's picking different tools, I would argue probably the best tools for the job. It then goes and researches hotels, it looks at J.Crew, it looks at Suitsupply, based on its own picks. Anyway, I won't go through everything it did, but it was able to, near instantaneously, I think it took like two minutes to return this, maybe less, do something that the other thing did in
like 17 minutes. And it's all the same information, I checked it. In fact, the difference is it actually was able to pick a gift from the registry. It gave five options, the highest end being the KitchenAid Artisan stand mixer, and I checked, that exists. And then it gave alternative gift ideas if the registry items are unavailable. So again, the models with MCPs can do it today. I was even able to create an overview briefing document of all that information as well. I'm not saying this to say, oh, the Sim Theory agent or whatever's better. For me, my goal is to see the technology progress. I think the point is that this ability is now in these models. They can do it. They just need to be given the appropriate tools to do so. It's as simple as that. And as we discussed ourselves, it's really about the output types. It's about how it outputs in a way that you can get the essential information you need and use it correctly. And I feel like them focusing on the process, like, oh, look at all the lines of code it's writing, look at all this stuff, is just superfluous information that isn't useful. And I would also argue that while we both like the idea of agents going off in the background doing work for you, 20 minutes of work should look like 20 minutes of work. It shouldn't look like the two minutes of work that you've just shown there. It really is taking more time because it's a poor, inefficient system. Sorry, I just want to back that point up. It worked for 27 minutes. It took 27 minutes to do the same research, burning, like, how much did that cost? Keep in mind that every single time it needs to take a screenshot of booking.com and click something, or every time it needs to run something on the command line and check the command-line output, that's using thousands of tokens, probably tens of thousands of tokens, every iteration. And in 30 minutes, I'm telling you now, it's doing an iteration probably every 30 seconds, minimum.
So you're talking about burning millions and millions of tokens to do a task that really should only take, what, 100,000, maybe slightly more, maybe a million, but not millions and millions. Burning up your entire month's quota just to do something that could be done more simply. And I think that although on the podcast we tend to focus on just what the technology is capable of, the practical experience for a lot of people day to day is: can my organization afford to provide this to all of my staff? Can we afford to make this available to our team? And how many tokens it uses in these processes is a critical factor in that. If you can take a lesser model and do more with it, and not use up as many resources to accomplish it, that's the difference between being able to roll it out organization-wide or not. And I would argue that no one can afford this at a business level, to pay $200, $300 a month per user in your organization. But it's not even about affording it. If it was super productive and could make people more productive, and they could prove that it does, you would pay. You would 100% pay. Well, there is an upper limit, but you would pay a lot if it could do a lot of these tasks. But I think, as you're saying, the economics of it don't make sense right now. It's too expensive, and you can achieve nearly all these examples far better using dedicated MCPs. And you know what's funny is when we were super excited over computer use, saying, why bother with everything else? Because once it can use this, super fast, and stay on task, even if it takes a long time, if it can get these tasks done, like designing a website or just doing anything from end to end, then you don't really care. And the best example at the time we could give around that was doing, like, GDPR training and stuff like that that we
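The token math being argued here is easy to sanity-check. A rough back-of-envelope using the figures from the discussion (one iteration every ~30 seconds, tens of thousands of tokens per screenshot-heavy iteration); all numbers are illustrative assumptions, not real pricing:

```python
# Back-of-envelope check on the token-burn argument. All figures are
# illustrative assumptions taken from the discussion, not real pricing.

def session_tokens(minutes, seconds_per_iteration, tokens_per_iteration):
    """Total tokens burned by an iterative computer-use session."""
    iterations = (minutes * 60) // seconds_per_iteration
    return iterations * tokens_per_iteration

# A 30-minute session, one iteration every 30 seconds, ~30k tokens each
# (screenshots and command output are token-heavy):
agentic = session_tokens(30, 30, 30_000)   # 60 iterations -> 1,800,000

# The same task done single-shot through a dedicated tool call:
single_shot = 100_000

print(f"{agentic:,} vs {single_shot:,} tokens "
      f"({agentic // single_shot}x more)")
```

On those assumptions the iterative session burns roughly eighteen times the tokens of the single-shot version, which is the gap between a tool you can roll out organization-wide and one you can't.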
didn't want to do, and so you'd send off the computer. And actually, I really miss that and need it back. But it would go off and do it, and just fill it all in and complete it. So to me, that's a great background task. And things like sales quotes, responding to RFPs, those kinds of things where businesses are giving up business because they can't do a sales quote for every inquiry that comes in, or something like that. Whereas if you can have a system doing this, even if it costs a bit of money, it doesn't matter, because of the increase in business you'll be getting. Like, my lawyer friend is taking on $40,000 a month of extra work because he can use AI to help with the more complex document construction. So I get what you're saying. For the more complex tasks, it's worth the money. But I would argue in this case you're paying more money for something that isn't even as good as if you do it a slightly different way. This is the thing about these generic agents that I struggle with mentally. It's like, as we've been saying, the core model can be more agentic and do most of these tasks. Yeah, yeah. One shot. One shot, versus spending 20 minutes on the same task. It doesn't compute for me. I don't understand why this is exciting when their core models, including o3, keep in mind, with tool calling, in their own app, in ChatGPT, can do the same task faster and better. Did they think about this when they were demoing it? Their own model can do better. I didn't think about that. It can one-shot this. Even GPT-4o can. So, just to prove it, I'll have it up on the screen. So it's searching the web, outfit suggestions. It gave me images, a styling plan, hotel recommendations. These are the hotels it recommends. Sure, it didn't click around on booking.com, but is that really that hard once I have the hotel I need to book? Wedding gift: it found the gift instantly.
See, it used the URL to find the gift under $500, and it found the knife block set. Yeah, because it's not trying to use curl in a Docker container on some cloud computer to accomplish something that can be done in a much better way. So just tell me, how is this worse than the other app? It's not. So their own demo is just embarrassing. It is weird, isn't it? Again, I don't want to criticize them, because I'm excited to see people focused on this, and I also think their user interface designers and front-end people, oh my god, they deserve a medal. It feels like Star Trek or something; it's cool as. But again, it's the promises, expectations, and hype of this stuff versus reality. There's just such a huge disconnect, and I think most people are starting to wake up and get sick of it. I'll be interested to see how quickly the hype around this falls off. The other thing is, it feels like the way OpenAI operates now is there are all these competing teams. They see something like Manus come out and they're like, shit, we need that. They've got a team that goes off and does it. It's finally ready, it comes to Sam's desk, and he's like, sure, let's march you out, you'll be the next gaggle to distract, so we're in the zeitgeist again. So this gaggle comes out, they have to present their idea, he goes, cool, I think you'll like it, we'll improve it. And then you never hear about it again, and it's just another icon in ChatGPT, on the left, and the world keeps spinning. And I want to point out the nuanced difference between hype not matching reality, and hype not matching reality when there's an existing reality that's already better. I think that's the crazy thing about this.
They've just demoed something that is worse than what the prevailing technology can do, as if it's some futuristic, amazing thing worth $300 a month. It's not better. It's not like they were demoing stuff like this a year ago, if you remember. Yeah. Hang on, there's one more thing I have to play. I know people will be upset, but whatever. "Deciding which tools to use here." Yes. Sorry. "So it can create nice visuals for slide decks and other things as it's working through its tasks. How is it deciding which tools to use here?" So this bit, sorry, I didn't even bring it up on the screen. Sam Altman goes, "How is it deciding to use the tools here?" And he looks genuinely interested. Like... hey, all right, dickhead. What I don't get is, is this the first time he's seen it? He's like, oh, the open-source model gaggle failed, Kimi K2 is better, march out those other Manus copycats. "Yes, we trained the model to move between these capabilities with reinforcement learning. This is the first model we..." But he looks genuinely curious and interested, like he's never seen this. I just don't know. He just had a kid; he's probably busy with the kids. He's like, we've got to get some attention. Anyway. But it's a good point, right? He's not asking, how does this help? Who does this help? What can it do? He's asking how it decides. Who cares how it decides? It's a magic box. I want to know what the magic box can do for people and how it actually helps them. It's not about how it makes this decision; that's the whole idea of this thing. Let it make the decisions. Who cares how it's making the decisions? Let's focus on what it can actually accomplish. I really wanted to like this, and I thought maybe they would have examples from your day-to-day where you would just go, wow, I need this. I must have this.
But again, they've buried it under the $200 USD per month sub, which always says to me it's either too expensive or they don't want to roll it out more broadly. I think the other challenge is these guys are so wedded to their own research, or sources they've had to pay for, that you're only getting one take. It's sort of like a search engine, right? If you go to Google, you get Google's take. I mean, you don't go to anything else, but if you went to Yahoo or Bing, you'd get their version. And it feels like with the research tools, you just want all the things, right? If this AI thing is so good, then go and check everything. Go and check the internet. I want everything: academic sources, internal documents as well, my guidelines, my rules. I want this filtered through all of that. It's got to be a combination of things, not just their system's decision around it. To me, the thing I agree Operator and these agentic tools are going to be amazing for is when you want to access data in an authenticated system that you don't have an API for, with no other way to gain access, or to do tasks for you like those GDPR surveys. Those use cases make sense. But again, that computer-use tool could just be an MCP wrapping a computer that you control, and then the model can call that tool when it needs to go and do that. It just can be, should be. Because, excuse me, the thing we have discovered through MCPs is that it's not the individual tools from the MCPs that make it amazing. It's the concept, the combination of them. The fact that it can get context from here and here and here, then bring it into the next tool, and then into the next combination of tools, like you just demoed a minute ago. That's where the magic comes from.
And if you bring browser use and computer use into that mix, it can go: I can use all these dedicated tools to prepare myself, and then when I use the computer, I know everything I need to know to make it happen. It's a totally different experience to giving it a blank Windows desktop and going, oh, shit, now I've got to work out how to use Google Drive and Google Sheets, and I've got to install this software and download the updates, and all that sort of stuff. That's not helpful. Whereas if it has all the context and knows exactly what needs to be done, the computer can be used where necessary, for a bespoke task that it's already prepared for and knows exactly how to do. And I think that's the difference. It needs to be part of a wider set of tools, not the only tool. And I do acknowledge that I'm contradicting things I've said in the past, but that's because my experience now shows me this is a better solution. The thing, though, is this: Manus has been out now for maybe a year, maybe under a year. It's been out for a while, and it has been improving, from what I can see occasionally trying it just to check where it's at. But for nearly every task I've ever done with it, I'm like, it's just quicker to do a single shot with a model.
With the agent, it's similar to what you said before about CrewAI and a lot of these agentic frameworks we've been trying for ages: it just doesn't get you a much better result than letting the model handle it directly right now. You look at all the examples, then you send it off and it takes so much time, whereas you can iterate and work with a model a lot quicker. And you would think, given the press and publicity around Manus when it first came out, that if this agentic flow, this general-agent approach, really was the best right now... and Manus on many levels I think is better right now than ChatGPT's agent, far better. Every example I gave it, it performed better and was quicker. So if this was really better, wouldn't it be a sort of ChatGPT moment where everyone's just like, I'm going to use Manus, it's so good? Yeah, and it seems to me a better way to look at it now is this. One of the things we talked about that was so amazing about computer use is the idea of my day-to-day work tasks that ChatGPT knows nothing about, a process I follow each day. I go to this website and log in, I download this information, I put it in a spreadsheet, I do these calculations, I email it to my boss, something like that, right? That sounds ideal for computer use, because it can use the computer to do what I would otherwise do. But I would argue for an MCP tool builder, or some sort of skill builder like we've talked about before, where I use tools to demonstrate to the AI what I do each day, and it then builds a dedicated MCP with that ability in it. And maybe, yes, maybe it does use browser use or computer use behind the scenes to do that. Cool. But regardless, it becomes an MCP tool call within the wider context of all your other ones. So when you ask it, hey, here's my to-do list for the day,
can you make sure the routine gets done? It's able to go and invoke that. It isn't starting from a blank slate on a computer every time, having to go off and work out that thing. It's built into a framework that's dedicated to that kind of agentic task accomplishment, rather than a sort of raw-computer approach. Does that make sense? Yeah. To me, what you're saying about MCPs is that there's this commonality where the models already have that somewhat internal clock and ability. We've been saying this: the models we have today, let's be honest, are all pretty similar right now, so they all have these capabilities. Some are better at the internal clock and the agentic flow. Sonnet is just supreme, I think, and Kimi K2 is right up there. So they have that internal clock, they have this capability. It's about providing them the right tools, and then, from a user interface perspective, providing the right output so the user understands the data they got, because we're so used to models just spitting back tags. So I think that's the piece: for the times we live in now, the MCP structure works best with the internal clock, the nature of these models right now. And often you get criticized because people say, oh, but don't you understand the models are going to get better? And I have no doubt they will. But given the limitations we see with the technology today, if you actually want agentic solutions that work, this is the best framework to do it in. And you still need a human in the loop. You can work on simultaneous tasks, but it's part of changing your workflow of how you do things. I'm not trying to poo-poo it, but there are just not many tasks right now where it can fully autonomously go and do the work, unless you've trained it on a very specific task,
which to me is an app. And just think about it as well: in a company context, are you really, as a business owner, going to trust your staff to have some agent installing libraries on a cloud computer and then accessing your internal data to do tasks it decides to do? No one's going to trust this. For anything meaningful, I just don't see anyone trusting it. It's too random, too unpredictable. It's not a serious thing, I don't think. The use cases that still seem sticky in the enterprise are where you've got a repository of data and you just want to make sense of it, summarize it, understand it, or produce additional documents based on it: things like SharePoint and Google Drive and Box. So I think that's why they honed in on that deep research agent first, because that's really where a lot of companies saw value: doing research on their internal documentation or files. That next leap, setting up an agent that can handle certain support tickets, or help a team reply to emails, or whatever it is, is going to take a big stretch. It's going to take pretty ambitious people in businesses to go, you know what, there's a better way to do this. And I think a lot of the more disparate use cases are going to have to be trained, like we've been saying, where you train the agentic skill in your business: I'm going to train it on how to do this specific task that is a pain for our business, with some controls, refining the output, some checks and balances in place. And then that becomes another tool in your pile of tools. Exactly. And then the model is coordinating that tool, but when it calls that tool, it's controlled. Yeah, not an arbitrary learn-it-from-scratch-each-time because you're such a smart model. It's just not reliable enough.
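The "trained skill plus checks and balances" idea the hosts describe can be sketched without any particular framework: record a routine once, register it as a named tool, and gate every invocation behind an approval check so the call is controlled rather than improvised from scratch. This is only an illustrative sketch in Python; the `SkillRegistry`, the `daily_report` routine, and the approval callback are all hypothetical names, not any real MCP SDK API:

```python
# Sketch of the "trained skill as a controlled tool" pattern discussed above.
# Nothing here is a real MCP SDK API; every name is illustrative only.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SkillRegistry:
    """Named, pre-recorded routines the agent invokes as single tool calls,
    each draft gated by an approval check (the 'checks and balances')."""
    skills: dict[str, Callable[..., str]] = field(default_factory=dict)

    def register(self, name: str):
        def decorator(fn):
            self.skills[name] = fn
            return fn
        return decorator

    def invoke(self, name: str, approve: Callable[[str], bool], **kwargs) -> str:
        if name not in self.skills:
            raise KeyError(f"no trained skill called {name!r}")
        draft = self.skills[name](**kwargs)
        # Human (or policy) in the loop before the result is acted on.
        return draft if approve(draft) else "escalated to a human operator"

registry = SkillRegistry()

@registry.register("daily_report")
def daily_report(recipient: str) -> str:
    # A real version might use browser/computer use behind the scenes; the
    # point is the agent sees one named, trained tool, not a blank desktop.
    steps = ["log in", "download figures", "update spreadsheet", f"email {recipient}"]
    return " -> ".join(steps)

# Toy policy: only approve drafts that stayed on the scripted path.
print(registry.invoke("daily_report", lambda d: "email" in d, recipient="boss"))
```

The point of the shape: the model coordinates which tool to call, but the tool itself is a narrow, pre-trained routine, and nothing reaches the outside world until the `approve` gate passes.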
Even when it becomes reliable enough to learn it from scratch each time, you'd still want to go, hey, just remember how to do that so you can nail it every time from now on. Yeah, a lot of people were talking today about n8n or whatever, where you can automate things. It's an automation tool. I guess it's like building an automation for something you do over and over again that uses AI to make decisions. I haven't really looked into it, but a lot of people are saying, oh, that's dead, and I just learned it. I'm just not so sure, because automating these very specific use cases, I cannot see that going away anytime soon. The generic stuff, it's the fantasy of it. When you first see it, you're like, this thing will be able to do anything, man. Then you use it and you're like, oh man, no, no, no, it's not there. All right, we can move on. Anyway, I don't want it to be an overarching thing. If I'm summarizing this: they cloned Manus, and that's really it. They cloned Manus and put it in ChatGPT, and it'll probably get picked up a lot and people will notice it more, because it's ChatGPT and 90% of the web uses it every day. So moving on, moving on. I did want to talk about this. We've never really mentioned it on the show because I don't really care, but you know how we've talked for a while about OpenAI acquiring Windsurf? I think it was last week there was news that, no, Google had now acquired Windsurf's CEO. Apparently that's how you do acquisitions
now: you acquire the guy himself. Well, I think because they're so worried about it getting held up in the M&A process, they just do these acqui-hire deals now, where they take all the key people out of the company, license the IP, and pay a few billy. You can see Sundar there, man. This could be the pendant photo I'm thinking of for the pendant I get made, because this is actually a good one; he's looking baller there. So they acqui-hired some of the key people from Windsurf, paid some billies for it, and got an exclusive license to their technology. And then who pops up? Remember Cognition, makers of Devin? They popped up. He marched one of the Windsurf guys out onto their couch, like, I'm going to offer a quarter of a billion dollars in some stock transfer into Devin stock. And how does a company called Devin get billions of dollars? Well, they're called Cognition. But anyway, they acquired, I guess, the other bits: all the staff of Windsurf. And they claim they also have the IP, product, trademark, and brand. So both acquired Windsurf, but then Google has it as well. It's really confusing. Anyway, I guess it's good news for Windsurf users, because now that OpenAI is out of the picture, they're allowed Claude again, which is the only model anyone uses in these tools. So Windsurf is back on the map. The other thing I noticed Windsurf did is they're now giving you 2x credits or something with Claude to, I guess, win people back. But what a strange time we live in. This fork of VS Code that I've never heard anyone use, except one guy in this AI community who bangs on about it all the time. Most people use Cursor or Cline, or a little bit of Claude Code now, for some agentic tasks. I've never heard of anyone using it.
It seems like the biggest Ponzi, and then it just gets randomly acquired and put through the wringer. You've got Zuck with his chains out there poaching everyone from OpenAI. I think that's why Sam had probably never met those people in that gaggle, because they had to put a new gaggle together today for that presentation. It's like, who are you again? And how does this thing make decisions? Yeah, but it seems like the only ideas left in AI right now are like, oh, you know that Cursor agent mode, where you click agent and then it can do stuff for you? Let's just do that now. The lack of imagination is pretty mind-boggling. And I'm sure everyone will have this feature in a couple of weeks. Google will announce it, it'll just snowball from here, and we'll get all the different versions of it. And we'll proceed to trash-talk them on the show, like we do about everything. Yeah. All right, final thoughts of the week. We had the best new model out, Grok 4. I'm sorry, Kimi K2. Yeah, my final thoughts are I'm going to stick with Kimi. I love it. I really enjoy using it. I think it's a great model: super fast, really reliable, and good for tool calling. That's my only takeaway from this week: Kimi K2. What a model. All right, it's good to be back. We'll see you next week, hopefully with something cool to show you. Yeah.
Better than Claude and Gemma
Oh Kimi, you're so fine, you're so fine
You blow my MCP mind, hey Kimi
Oh Kimi, you're so fine, you're so fine
You blow my...
Oh Kimi, you're so fine, you're so fine
You blow my MCP mind, hey Kimi

I was coding late at night, debugging code that wasn't right
When you appeared on my screen, the smartest AI I've ever seen
One trillion parameters, you make my processors go wild
Open source and so divine, Kimi K2, you're one of a kind

Every time you process my query, I feel my circuits getting weary
From the way you optimize, you make my tokens feel so free
You're so fine, you're so fine, you blow my MCP mind, hey Kimi
Oh Kimi, you're so fine, you blow my MCP mind, hey Kimi

From deep and frat to chattel, you're available everywhere
Agentic capabilities, showing other models you're not scared
MuonClip optimizer running smooth, cost savings I can prove
Kimi K2, you're the one, making AI so much fun

Every response you generate makes my developer heart inflate
State-of-the-art performance shown, better than Claude and Gemma

Oh Kimi, you're so fine, you're so fine
You blow my MCP mind, hey Kimi
Oh Kimi, you're so fine, you're so fine
You blow my MCP mind, hey Kimi

When you launched on July 11th, my world changed forever
Open source revolution, Kimi K2, you're so clever

Oh Kimi, you're so fine, you're so fine
You blow my MCP mind, hey Kimi, hey Kimi
Oh Kimi, you're so fine, you're so fine
You blow my MCP mind, hey Kimi, hey Kimi