

gpt-realtime, nano banana & workspace computer v2 | EP99.15-realtime
This Day in AI
What You'll Learn
- The new GPT real-time API supports remote MCP servers, image inputs, voice calling (via SIP), and emotion detection, making it a more powerful and flexible voice assistant platform.
- The hosts are skeptical about the real-world applicability of the demos shown, questioning whether businesses would actually be willing to pay the costs required to use the service.
- They propose an alternative architecture where the voice model acts as a coordinator, delegating complex tasks to specialized AI assistants running in the background, which could better leverage the capabilities of the system.
- This approach could address the limitations of current voice models, which tend to lose context and become less intelligent over extended conversations.
Episode Chapters
Introduction
The hosts discuss the release of OpenAI's new GPT real-time API and the potential implications for voice assistant technology.
Capabilities of the New API
The hosts analyze the new features of the GPT real-time API, including support for remote servers, image inputs, voice calling, and emotion detection.
Practical Considerations
The hosts question the real-world applicability of the demos shown, and whether businesses would be willing to pay the costs required to use the service.
Alternative Architecture Proposal
The hosts propose an alternative architecture where the voice model acts as a coordinator, delegating complex tasks to specialized AI assistants running in the background.
Addressing Voice Model Limitations
The hosts discuss how the proposed architecture could address the limitations of current voice models, which tend to lose context and become less intelligent over time.
AI Summary
This episode discusses the release of OpenAI's new gpt-realtime model and Realtime API update, which offers a roughly 20% cheaper and more capable voice model than the previous GPT-4o realtime preview. The hosts analyze the potential use cases and limitations of this technology, and argue for a more intelligent, delegated architecture where the voice model acts as a coordinator for specialized AI assistants running complex tasks asynchronously in the background.
Key Points
1. The new GPT real-time API supports remote MCP servers, image inputs, voice calling, and emotion detection, making it a more powerful and flexible voice assistant platform.
2. The hosts are skeptical about the real-world applicability of the demos shown, questioning whether businesses would actually be willing to pay the costs required to use the service.
3. They propose an alternative architecture where the voice model acts as a coordinator, delegating complex tasks to specialized AI assistants running in the background, which could better leverage the capabilities of the system.
4. This approach could address the limitations of current voice models, which tend to lose context and become less intelligent over extended conversations.
Topics Discussed
- GPT real-time API
- Voice assistant architecture
- Asynchronous task delegation
- AI assistant specialization
- Voice model limitations
Frequently Asked Questions
What is "gpt-realtime, nano banana & workspace computer v2 | EP99.15-realtime" about?
This episode discusses the release of OpenAI's new gpt-realtime model and Realtime API update, which offers a roughly 20% cheaper and more capable voice model than the previous GPT-4o realtime preview. The hosts analyze the potential use cases and limitations of this technology, and argue for a more intelligent, delegated architecture where the voice model acts as a coordinator for specialized AI assistants running complex tasks asynchronously in the background.
What topics are discussed in this episode?
This episode covers the following topics: GPT real-time API, Voice assistant architecture, Asynchronous task delegation, AI assistant specialization, Voice model limitations.
What is key insight #1 from this episode?
The new GPT real-time API supports remote MCP servers, image inputs, voice calling, and emotion detection - making it a more powerful and flexible voice assistant platform.
What is key insight #2 from this episode?
The hosts are skeptical about the real-world applicability of the demos shown, questioning whether businesses would actually be willing to pay the costs required to use the service.
What is key insight #3 from this episode?
They propose an alternative architecture where the voice model acts as a coordinator, delegating complex tasks to specialized AI assistants running in the background, which could better leverage the capabilities of the system.
What is key insight #4 from this episode?
This approach could address the limitations of current voice models, which tend to lose context and become less intelligent over time when used for extended conversations.
Who should listen to this episode?
This episode is recommended for anyone interested in the GPT real-time API, voice assistant architecture, and asynchronous task delegation, and for anyone who wants to stay updated on the latest developments in AI and technology.
Episode Description
Join Simtheory and get $10 off with code STILLRELEVANT.

CHAPTERS:
00:00 - gpt-realtime: first impressions
32:20 - AI model cost to value ratio: what are you willing to pay?
38:56 - nano-banana (aka Gemini 2.5 Flash Image)
46:45 - We're working on workspace computer v2
58:20 - Pixverse v5 transitions are cool
1:01:14 - final thoughts for the week

Thanks for all of your support.
Full Transcript
so chris this week it has been bananas i'm wearing a yellow shirt just to show how bananas it is to those who watch we had the release of gemini 2.5 image pro plus banana nano max we'll get to that in a minute we also though this morning had a bit of a surprise gaggle a surprise gaggle from open ai was marched out they got their marching orders the gaggle was to be marched out and we got the announcement of gpt real-time and real-time api updates thrilling exciting news for everyone that wanted to seamlessly buy phones from t-mobile with a voice assistant now you can now you can uh so yet again all their exist uh all their examples cater to rich californians it's like basically the lifestyle if you're buying teslas buying phones booking fancy restaurants, then this is the model for you. It is. I think there's a lot of exciting things, though, about this announcement, and it is worth going through because it is a pretty enormous upgrade to the real-time voice. We had the prior GPT-4.0 real-time voice model, and I think, honestly, outside of a few maybe high-end support use cases, it just was way too cost-prohibitive to roll out, And that's why we never really played around with it or touched it because we just couldn't afford to. Yeah, we noticed we used once used a real time voice thing from another provider that was like sort of fake real time voice where it was like it kept a WebSocket open and you're sending and receiving voice through it. The problem is that there's that downtime in between when you're speaking that you're still paying for. And the costs just add up so unbelievably rapidly. It's stressful. You can't really use it. Yeah, I think we implemented that not knowing at the time. I forget what service that was even called now. We implemented it at the time not knowing how much it would cost and then had to abandon ship because we were like, this will set this broke. Yeah, and I think there's not, like, despite, you know, many people having businesses where time is money and if it saves you some time, then it's worth it. No one can really afford to spend $100 an hour or something on their voice assistant. Yeah, it just doesn't make sense. And I do want to get to that and talk about the economics of this stuff and how much sense it actually makes for these app providers. It's all good and well in a demo that T-Mobile in an app want to have a voice assistant and sell you a phone, but long-term, are they actually willing to pay? Because you could sit there and just really with these models, you could sit there and chat to them and be like, just tell me about this concept or teach me about physics and basically exploit the T-Mobile voice implementation. It's a good point. you sort of think you probably want some higher value examples because people are going to have to be in high value scenarios. You're not just going to put it as a widget on your website for anyone to just have a chat with. So before we get into it, let me explain what this is all about. So the new real-time API has some, it's basically there's two new voices that come. It's 30% cheaper than the original 4.0 voice model. But they didn't actually say the price. It's 30% cheaper of a figure that you should remember. The API, interestingly, now supports remote MCP servers, which is really cool to see, image inputs, and I'll show some examples in a moment of it doing that so it can see the screen maybe that you're on and help you with something. 
And also voice calling I thought was really interesting through the SIP protocol. So you can really make great voice agents. I think this is really the final death blow of those entry-level sort of call center jobs, potentially, because this would be a way better experience, especially with asynchronous MCP calling and being able to access data on the fly. And emotion detection as well. So it's like adding that factor to it, I think, is really awesome. So let's just watch a few of the demos because we were referring to them. So this one, these are the demos they put in their announcement. So this one's from Zillow, which is like a property search engine thing in the US. Is that the one that like accidentally bought like $200 million worth of houses at premium rates or something? I think that was another one, but that's definitely, that did happen. Are there any homes in my budget need a water with a view of the skyline and Mount Rainier? Sure, let me look. With your buyability of 824k, Wallingford would be a great fit. I think you'll love 404 North 33rd Street. It has those skyline and rainier views you're after. To an Australian, it's hard to understand anything that's not in the millions in property, given our property market. That's right. I was thinking, are they going to live in like a rundown shack that's been burned and shot out? So that's like, I think that demo is pretty cool if it's real. but it so you start off on the screen with an overview of the sort of area and then you're isolating and the ui is updating based on the voice that's i think that's pretty cool but again like they like it demos well uh but is this something they're going to use in real life let's look at one more example and then talk about that so this is the t-mobile thing i was joking about earlier hi i'm t-mobile's ai assistant can you compare the camera of my current phone with the samsung galaxy s25 sure the revel 8 has a 50 megapixel main sensor the galaxy all right i'm done with that already but you get the idea so what do you think like is this going to be one of those things now that it's 30 cheaper still quite pricey that you would use and interact with on like if this was a feature in the telstra app which is our sort of phone carrier or like Optus or one of the ones in Australia, would you do this? I'd never use it in a million years because I know that they'd never give it power to actually do anything useful. However, I think as a concept, this really got me interested today when I read the article and saw that it supports both the image and file uploads now to add to the context during the conversation and asynchronous tool and MCP calling, which I think is quite interesting if you think about it because one of the struggles we had with voice, as assistants become more complicated and they have grounded memories, knowledge base, tools they need to consider running, rules they need to follow and all of those things, the latency becomes very high because it just simply has to do a lot of work to decide how best to respond to you Because if your question could be as simple as a high or call these 50 tools, do research, make a PDF document and all this stuff, there's just decisions that need to be made to lead to that point. So when you think about doing voice as an input and output type, it's tricky because there's just going to be massive delays between when you say something and when it replies. 
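To make the remote MCP support concrete, here is a minimal sketch of what a gpt-realtime session configuration could look like. The event shape follows the Realtime API's session.update convention and the MCP tool fields mirror the Responses API's remote MCP config; the exact GA field names, the server label, and the URL are assumptions rather than a verified schema.

```python
# Sketch: configuring a gpt-realtime session with a remote MCP server.
# Field names are assumptions (session.update convention + Responses API
# MCP tool shape); the server label/URL are placeholders, not real endpoints.
import json

session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "You are a concise voice assistant. Delegate long-running work "
            "to tools and keep spoken replies short."
        ),
        "voice": "marin",                   # one of the new voices mentioned on the show
        "modalities": ["audio", "text"],
        "tools": [
            {
                "type": "mcp",
                "server_label": "crm",                       # placeholder label
                "server_url": "https://example.com/mcp",     # placeholder remote MCP server
                "require_approval": "never",
            }
        ],
    },
}

# In a real client this JSON is sent over the Realtime WebSocket
# (wss://api.openai.com/v1/realtime?model=gpt-realtime) after connecting with
# an Authorization header; audio then streams via input_audio_buffer events.
print(json.dumps(session_update, indent=2))
```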
And then you say, okay, does it do a sort of filibuster style interim reply where it kind of BS's you a little bit while it's working in the background, which is a little bit false and just doesn't work that well in practice. So the idea of having real-time voice that can reply to you, but it's aware that it's off running tasks asynchronously in the background, to me, is the future and is the best way to do it. And I was actually saying to you this morning, imagine if the paradigm I just said with the complicated assistants. If your voice model was simply a voice model that had a personality and knowledge about what you guys are working on, but it doesn't have all that complicated decision-making thing, but it can actually then call off to an assistant to do the hard work. And while it's off doing that, it can still be chatting with you and maybe delegating other tasks as well. Yeah, I think that idea of it having a super brain in the background, because that's the problem with all these voice models that many people have realized is you sort of take the L on the voice model. Like if you're in chat GBT and you switch to voice mode, the model overall is now a dumber, right? Yeah, and you've totally hit on the point of why I think we avoided using this because I'm like, I'm not going to use a markedly inferior model on a day-to-day basis to solve problems because it takes away all of the benefits. Yes, the voice is better. Yes, it's full duplex and you can laugh with it and yell at it and all those things you want to fall in love with it, whatever. That's really nice. But if it lacks all the intelligence and abilities, then it goes away. But if you can bring those back through asynchronous tool calling, if that tool calling is delegating to smart assistants you've set up, especially with things we've done lately, like having assistants that have specific sets of tools. So, for example, a research assistant that has access to all your academic research, one that is your blog post writer or your report writer, one that's your podcast creator. And then through the voice, you can orchestrate and coordinate all of these things working asynchronously. Suddenly, it becomes extremely powerful. Yeah, I think that that's where it gets interesting, where it is the highest quality model that you're working with. 
you're not like this is the other problem too is like some of these models after a while if you respond to them they'll respond in very short sentences because as we know from using these voice models after a while they just start to hallucinate like mad like if you actually ask it to feed out a bunch of information in the past at least they would start to stumble after a while and just say words wrong and similar to how i talk on the podcast really and but interestingly they do say and i haven't put it to the test yet that they have improved its ability to do things like recite terms and conditions and all the things that you would want to do in a call center like obviously the use case of this is all the things you'd want to do on a day-to-day basis just i want to listen to that it sounds fun but what i don't like what i find interesting so it's like yeah okay it's 30% cheaper back to my original point before of like can anyone actually afford this so Zillow say um you know it's super duper exciting and this could this could make searching for a home could this could make searching for a home on Zillow or exploring financing options feel as natural as a conversation with a friend cool if we if we get the right VC funding around to pay for it all. Like, are we meant to get semi-erect here for, like, call centers feeling more like a friend when they're selling us finance? Like, cool. I don't know. But the problem is that what I don't get is that they're virtually given an interactive example when the whole benefit, in my mind, of this system is a sort of delegation model where you're barking orders at it, and then it's coming back to you when those things are completed and you're able to continue the general effort. Whereas just sitting there browsing for houses, waiting for it to search and display it on the screen, doesn't seem that much different to just typing in your search or just using regular voice input. Like I don't think it actually fully explores the capabilities of why this API is significant and what's possible. And just on that point around it losing track after a while, one advantage of the model I'm talking about where you're doing this delegation to the other agents is you can actually make sure that the context coming back to the main real-time voice model is minimal. So if it's gone off and done some complicated research and produced a paper for you, for example, all it really needs to give back to the real-time model is, hey, I've produced this document. Here's a two-line summary of what's in the document and its file name and how to access it. And that's it. 
It doesn't need to have like pages and pages of research clogging up that context window costing you money and slowing it down and leading to hallucinations yeah i i to me the the thing that gets me excited about this is having a capable voice assistant where i can switch to a voice mode it works really well with my mcps so i can basically have my own google gemini or siri or or actually let's not bring theory into this but i can have a really capable voice assistant that can as you said access the disparate or like mcp connections that i've configured for that particular use case and go back and forth with it and brainstorm in the context of maybe like a research assistant or a personal assistant or any of these things it's less exciting for me uh for how these companies might use it to just not have call center jobs like it would probably be a good like a better experience using this in say a call center or something like that but i just don't i like i keep coming back to this point is like so you know right that it can it can follow like uh instructions pretty well but what's to stop anyone just exploiting these things and you know getting it to give away information so i'm just curious how it performs in real life before making these proclamations about what it what it may or may not do um i'm pleased i just think it's the wrong thing to get excited about like nobody likes talking to call centers the idea of replacing them completely is better like not even needing to call the stupid things in in the first place But also, I just feel like there's a lot more powerful business examples of having this kind of interactivity than just calling up an agent on the phone. Yeah, exactly. So let's try it out and put it to the test. So this is using the new voice, Marin. And so I'll start the session here. Hey, how's it going? Wait, why is it talking in a different language? I apologize for the confusion. I can switch back to English now. Yeah, but why did you not speak English to begin with? I'm really sorry about that. I must have misunderstood the initial audio. I'm here now, and we can continue in English. No, no, no. Actually, let's switch back to whatever language you were in before. Okay, we'll change the language just now. Can you tell me what you need to help? Now speak Spanish. Claro. Puedo hablar en español sin problema. That is pretty impressive. I'm impressed. Now can you speak Chinese? Can you sing us a song about this day in AI?當然. 我現在可以用中文和你交流. I don't know. 請告訴我. Someone tell me who speaks Chinese. Is that good? Sure, I can definitely speak Chinese. Here's a bit so you can hear. 你好. 我现在又用中文和你说 你觉得这样可以吗? If you'd like to chat more or need any help, Can you speak Australian, please? Absolutely. I can speak with an Australian accent. Let me switch over for you. How can I lend you a hand today, mate? Okay, now to your Australian accent. All right. Here we go with the Australian accent. G'day. How's it going? What can I help you with today? I have a glass stuck in a tree and I just need help getting it out of the tree. No worries. Let's sort that out. If the glass is stuck in the tree, the first step is to check how high it is and whether it's safe to reach. Do you have something like a stepladder or a long pole? Who do you represent? I represent Mike Corp and I'm here to help you with any questions or issues related to our products or services. 
If you need any assistance just let me know What products do you sell At Mikecorp we offer a range of products and services Typically that could include things like home improvement Okay, that is phenomenal. Like, that is incredible. Its ability to transition between the different modes is unbelievable. So I think that's what you need to imagine is being able to have that level of control, speed, and interactivity, but it has access to the power of all your assistants. And I don't want to leave out the point that this existed before, but the reason it wasn't as interesting is it lacked that asynchronous tool calling. Now that it has that, and so the idea is that you'd be able to say to it, okay, whether Chinese or English, go off and research diabetes in the background, produce me a 10-page report and get back to me when you're finished. Okay. Now, it'll go, yeah, no worries, Mike. No worries, Mike. I'll do that. And then once it's done that, you say, okay, now I need you to write some code for a website. Can you please do that? And then it goes off and it's doing that. And it's using your various assistants that are actually experts in those tasks. It isn't trying to do it itself. It's calling off to other assistants as if they're tools or just tools directly. And then once it's finished, that data, like the results of that tool, and I don't mean the raw results, but I mean the summarized result of that tool, gets incorporated into its context. So it's able to then realize that it's completed those tasks. That to me as a paradigm works really well. And it works outside of the scope of voice as well. It would work in a chat context as well, delegating the agents. But I do like the idea when it's that of voice, because you're not having that cumbersome thing of having to wait five minutes for it to respond before you can talk to it again. so the idea being that like i what i don't understand though is say like you're the assistant we're having a conversation i'm like go do this like go check my emails and it's like sure i'll go do that and then and then you're like are you sitting there and waiting is that the behavior you would expect or are you saying oh now tell me the weather yeah and then it's like oh i got those emails because it's fired off that tool call in the background and it'll only become aware of it completing when it gets added to its tool results similar to as if you added an image to the chat except it's adding the tool results of that would you have to follow up and be like about those emails like what happened that's the bit i don't know is does it like you've mentioned on previous podcasts does it have the idea of time passing or not or do you need to keep hitting it as the developer to say, hey, check, because when this is finished, you need to respond. And I'm guessing there's going to be a layer of application development around it to get it to behave in the way I'm describing, but the possibility is there. And that's what excites me because I think that we could really make something of this. I think this really could be a good way to work with voice, with the AI. The issue is cost, but I wonder, the question I want to ask, and I asked you this before the podcast is how much would people be willing to pay for that? Like a loyal, loving, presumably assistant who is just going to be able to receive and respond to your orders all day long, maintain the context and be able to access all of your AI stuff. Like I would think that's worth paying for. 
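The delegation pattern described here, where the voice model only ever calls a lightweight tool and a specialist assistant does the heavy work in the background, can be sketched without any realtime plumbing. Everything below is illustrative: the specialist names, the fake workloads, and push_tool_result (which stands in for appending a short tool result back into the realtime session) are hypothetical.

```python
# Sketch of the coordinator/delegation pattern: the realtime voice model calls
# delegate_task(), keeps talking, and later receives only a two-line summary.
import asyncio

BACKGROUND: set[asyncio.Task] = set()

async def run_specialist(name: str, task: str) -> str:
    """Pretend to run a heavyweight assistant; return a SHORT summary only."""
    await asyncio.sleep(2)  # stands in for minutes of research / tool calls
    return f"[{name}] done: '{task}'. Saved as output/{name}-result.md (two-line summary here)."

async def push_tool_result(summary: str) -> None:
    """Stub: in a real client this would append a tool result item to the
    realtime session so the voice model learns the task has finished."""
    print("-> injected into realtime context:", summary)

async def delegate_task(name: str, task: str) -> None:
    """The only thing the voice model sees. Returns immediately; the summary
    arrives later, keeping the realtime context (and the bill) small."""
    async def _job() -> None:
        summary = await run_specialist(name, task)
        await push_tool_result(summary)
    t = asyncio.create_task(_job())
    BACKGROUND.add(t)
    t.add_done_callback(BACKGROUND.discard)
    print(f"<- voice model keeps chatting while '{name}' works on: {task}")

async def main() -> None:
    await delegate_task("research", "background on diabetes, 10-page report")
    await delegate_task("report", "draft this month's board report")
    await asyncio.sleep(3)  # the spoken conversation continues in the meantime

asyncio.run(main())
```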
Not that I'm looking to make a profit, but I'm saying like it's an expensive model would you be willing to say spend a hundred dollars a day for example to run it if all day it's out there doing tasks for you i'm curious what listeners think in the comments below like would you like let's first try and get an assessment of cost so yeah i this is the most bizarre launch I've ever seen from OpenAI. It says, GPT real-time, we're reducing prices for GPT real-time. Like, they just announced it, and they're reducing prices for GPT real-time. Everything must go. Who wrote this? I mean, you have access to chat GPT. Like, come on. We're reducing prices. This feels very rushed, I think, to steal some attention from Nano Banana. That's my personal opinion. we're reducing prices for gpt real-time by 20 compared to gpt 4.0 real-time preview 32 dollars per 1 million then you click see detailed pricing okay oh no they're finally updated and of course by the time we record it's finally updated oh yeah it is there yeah what this was not here uh so hang on so when they said we're reducing it 20 compared that's really misleading because that is the price, $32 per million. That confused me as well. It says... And $64 per output. Now, it's a little bit, I think, trickier to understand than regular context because if you think about using a model normally, you're giving it a lot more than it's giving you in most cases. So you're giving it tons and tons of tokens in terms of research information or documents or images. And then it's giving you relatively less output because it's just doing a summary or a document or whatever. Not many people are outputting like 100,000 tokens at a time. Usually it's something like 30,000 input tokens and 1,000 output tokens. Like the ratios are something like that, you know, 20 to 1, 30 to 1 from what I've seen. But in voice, it's not going to be like that. The ratios are going to be pretty even, I would say, like unless you're using this as your primary inference model but if you're just using it in the way i described the input and output in my opinion is going to be roughly the same if it's like a conversation but then let's assume you add in one tool like gmail to your gpt real-time assistant or just like a database lookup of like a customer profile like let's assume it's one of those things like you really want that tool call to give you the most concise stuff at $32 per million. But this is why I'm saying, I don't think the way to use this model is as a direct tool caller. So just to illustrate, if you asked it to go check my Gmail and it has to read every email in your inbox, that's going to cost a fortune if this model is the one doing inference on that. 
However, if you did it arm's length and you called off to your Gmail assistant and said, go check out my emails and tell me stuff worth worrying about it then does the hard work with the large amount of context gives a short summary back to gpt real time and then it says hey bro you know there's nothing to worry about your emails are fine or whatever so you're using less tokens so i think in that arm's length way it's going to be a lot more cost efficient than getting it to be your primary tool caller because you'll just go broken like less than a day but for a moment let's think this through so like i just then on that conversation we had which was very short and had no purpose and no tool calling cool calling did i say why am i saying that weird anyway so that cost uh 20 000 input and i don't know what the output was it was like far lower so 20 and you probably spoke for less than a minute so let's assume and that was like no context in the model related to any tool call. So then let's assume I speak for 10 minutes trying to sort out something in a call center. That's got to be like what? Like 100,000 input? Well, based on what you just said, it's like 2.5 to 3 million tokens an hour that you'd be using if you kept going like that. So if I stayed on the line to one of these call centers, I could ring up a bill of like a lot. Like $100. It would cost them $100 if I stayed on the line for an hour. So my $100 a day was actually a wildly underestimate of how much this would cost if you just worked with it all day. I'm not trying to trash talk. I'm just making a business case here. I think the model's phenomenal, and I want access to it. The reason I'm complaining, for those that think I'm a persistent whinger, which I am, is I just want to drive the cost down. Like, I would love to see GPT real-time offered it. Like, what would you pay? And this sort of, I guess, comes back to that. Yeah, I think that is the question. Like, at what point does it become cost-effective? If I'm a worker who works at 1X and having an assistant or, you know, swarm of assistants that I can be delegating tasks to all day makes me a 10Xer, well, would I be willing to pay 100 USD a day or my business pay for me, ideally? um to do that and i just kind of think there would be certain professions in certain industries where it's well worth it like think about this if i'm building call center software and i'm just going to bring it back to that because i think it's one of the most viable use cases for this in the enterprise so i'm building call center software i've got to make a margin on 32 dollars per input and output and or just charge like a monthly fee so now you've got a monthly fee on top you've got say the raw cost of this model i don't know in your call center what's the average resolution time say 10 minutes and we burn 100 000 tokens trying to be conservative i know there's the cached input which is really interesting but still how much is cached versus like collecting into you just have to rely on people saying the same thing the same way over and over again or i I guess referencing the same context over and over again. Yeah, no, I'm being silly. Say each call is, what, like 10 minutes? That's only going to cost about, from my estimates, about $4. That's still probably cheaper than a human. Like $4 for 10 minutes. Like, say it's 20 minutes, $8. I mean, that's almost cheaper than human labor, I guess, or thereabouts. if your call center's in like India or Vietnam or wherever. 
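A quick back-of-envelope in code, using the $32 per million input and $64 per million output figures quoted from the pricing page. The token counts per scenario are the hosts' rough estimates, and the output-token share is an added assumption, not a measurement.

```python
# Rough cost math for gpt-realtime at the quoted prices.
INPUT_PER_M = 32.0    # USD per 1M input tokens (as quoted on the show)
OUTPUT_PER_M = 64.0   # USD per 1M output tokens

def session_cost(input_tokens: float, output_tokens: float) -> float:
    """Cost of a session given total token usage (ignores cached-input discounts)."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# The two scenarios discussed (output counts are assumptions):
print(f"10-min call-centre call (~100k in, ~20k out): ~${session_cost(100_000, 20_000):.2f}")   # near the ~$4 figure
print(f"1 hour of continuous chat (~2.5M in, ~0.5M out): ~${session_cost(2_500_000, 500_000):.2f}")  # near the ~$100/hour estimate
```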
So I think this brings it close to starting to maybe take those jobs and quite frankly, deliver a better experience to the customer. So maybe it is finally viable in call centers, but I think using it as like a day-to-day voice assistant, if you want to talk all day or have it active in the background, which I think everyone desperately, like that's, come on, like that's what we all really want. Yeah. it's not affordable yet. Yeah, exactly. And I think that that's something hopefully, I mean, the trends we see with this stuff, although it hasn't so far happened with the real-time one, is usually one of these players will release something like this, then DeepSeek will come and undercut them by 10x or something, and then there'll be an open source version and there'll be a way and then that'll drive the cost down. So just hopefully in this scenario that happens. But nevertheless, I'm excited by this paradigm of working. Like the thing I've described here as a way of working, I think, will be the future. And, you know, we're probably just going to build it and then just hope that the cost starts to get to a point where you can actually use it. Yeah, there's still, it's the compromise between the quality of the model and the costs right now that just doesn't balance out mentally for me. It's pretty appealing, though, to have voice of that thing where you can be like, you know, sing me a song about galahs in Australian and also write me a research report that I need for my business at the same time while you're singing. Like, it's pretty appealing. My dream for it is this, like, obviously have it connected to an assistant that's like my workday assistant with access to, say, Salesforce, email, calendar, like all the various business apps, Slack or Microsoft, like whatever it is for you. And then just be able to say, like, can you help me prepare my board report? Can you help me, like, what's the sales forecast this month? And just essentially act like how I would assume Google Home would work, but for my internal business data. And then order it to do tasks in the background. Like, go prepare the report, tell me when you're done. And then it comes back. And it's like the coordinating agent or assistant above all my other trained assistants. Yeah, precisely. And that's what I want. And I, like, obviously we'll try and build it right with this now because it's available and it's less, like it's $8 less and now we're going to try apparently. But, you know, and the reality is, like, are we or anyone else going to be willing to pay for that? Like, we'll make it, but, like. I mean, the other way to look at it is there are other full duplex voice models that are just using text-to-speech, speech-to-text, and lightweight models, right? Like those methods exist. And so what you could do is actually build a sort of facsimile of what we're talking about by, you know, the poor man's hobo cobbled together real-time voice, right? Run it like that until the real-time one becomes possible or your work will pay for it for you and then upgrade when you can afford it. And I think that there will come a sweet spot where we're all working like this and you'll never want to give it up. I just, I don't buy these arguments anymore, like, around this idea of, like, people wanting to use lesser models. I just don't think the lesser models are that good right now. That unless, you know, it's an economic thing and you're willing to punish yourself throughout the day. Like, for example, Gemini 2.5 Flash. 
It's a pretty damn good model, Gemini 2.5 Flash, if you use it as a daily driver. like it for most tasks you don't really notice that big of a difference right but if i'm doing work stuff through the day i'm like i'm not willing to compromise right now like i'm gonna go to gemini pro or gpt5 or lord sonnet right throughout my day because i i'm i don't care like the productivity gain for me is worth my time like i don't want to waste time on inferior models and I think in a business context at least with the voice stuff like you're already making such huge compromise with the voice models even this GPT real time right you're making huge compromises with the voice model that unless there's some magic like you say around the tool calling where it's tool calling an assistant to go off and do the work and then respond in the background yeah like I would if my only option was to use it without the delegation element I wouldn't even try it because I know it's an inferior model in terms of its inference and it's just simply not going to be able to do what the other models can do. And then the second you start using it with any reasonable MCP, you're going to cost yourself like $400 or something. That's the problem. The first time you're like, hey, can you please use Firecrawl to check the web for stock tips? And you're like, yes, no worries. And you get a bill for $500 because of all the stuff that's been spewed into the context window it's it's just not going to happen and so the it's it's got to be a setup that intelligently uses these tools to um like as an orchestration the reality is though we're getting closer like every six to 12 months we're getting closer to where the cost hopefully will be driven down and these guys can serve it because i'm sure it's not cheap there and they're probably making a loss on it do you think this opens a wider question though because i often think about this of people's understanding of the cost of the elements of AI, I think is varied. Like some people understand why different things cost more than other things. And I wonder if everybody needs to ask themselves the question of how much is this worth paying for? Like everybody pays a certain amount for their home internet connection based on the speed they're willing to get. You pay for different streaming services, you pay for things based on what you think they worth to you But with AI are people really understanding the true cost of what goes into it and making the decision hey I willing to pay a lot more for this to get much better because it is a case where paying more gets better Yeah. I mean, I think the problem we face right now is similar to the newspaper business, almost like if you're giving it away for free, you devalue it. And I think they've done that for general consumers, basically. Like if you're giving it away all the time, then you devalue it. And then they're now going, oh, actually, it costs a lot of money and we're losing a lot of money. So if you want access to more context, for example, on ChatGPT, like you've got to be on the Max or Max Pro Plus plan in order to get it. So you're not being that clear about the compromises you're willing to make. They're switching out the models all the time. So you don't really know anymore which model you're getting for the money you're paying. And I think that what that's leading to inevitably is confusion around like, is this thing more intelligent? 
is it dumb and you don't really know what you're paying for and i don't know i think for most people in the know like i always say this i like throughout like if i look back at this week what models have i used i used primarily interestingly gpt5 thinking like religiously through the week um because i'm just getting more familiar with it and i'm starting to think it's probably the best model right now in terms i must say gpt5 not even thinking just gpt5 bailed me out this week of something I was stuck on for about two hours. I just could not get through it with GPT 2.5, sorry, with Gemini 2.5, switched to GPT-5 and just asked it the question fresh, and it nailed it, like absolutely nailed it. Yeah, I mean, it does have like this weird tune and weird style. It's like an arrogant friend is how I would describe it. Like no matter how you tune it with the prompt, it still is, yeah, very like to the point. It's like how quaint that you would ask me such an easy question. Yeah, and it's a bit of a like, it's that developer that has to like, especially with code, do everything in a way that's like its way. Like, you know, like even it's like when it names like methods, it's like really, it has its own tune that I don't really love. But anyway, if you can get past that, it's super intelligent. um even just interpreting data i find it really smarter compared to other models but i still as a workhorse for context use gemini 2.5 pro and then i occasionally will use uh claude sonnet when i want to do like advanced tool calling and then when i want uh front end stuff i'll always go back to Opus because it's just the tune is better so I you know I think that am I willing to pay for all those models because of the the results I'm getting I am personally and I think quite frankly you could raise the price like throughout my day I would personally pay you know $500 a month to have unlimited access to all that stuff but as we see through sim theory like some people do not care about price at all like they will just infinitely recharge and pay for their tokens i guess because they're getting a ton of value from that but then there's the other end of conservation and i get it like life is expensive right now like it you know but it does it does raise an interesting point of the newspaper business idea like are you willing to pay for this stuff longer term, especially as it gets, you know, more advanced. Yeah, exactly. Like, what are you getting out of it in proportion to what you're paying? I mean, it makes total sense to be economic with that. I guess for us, we're always trying to use whatever the best one is. But I guess that goes back to the real-time API. If it's at an expense level where even people who don't care so much about the cost are like, whoa, I absolutely mustn't use this model, then you know we're at a point where it's really just a tech demo. It's not real technology that can be used practically. Well, yeah, I mean, GPT real time is on another level of price. But if you think about GPT-5 at $1.50 per million input, that's a steal. Like, it's ridiculous. Like, I cannot believe how cheap that is. It's so good that we live in this time. I mean, think about a coffee, right? Like, I paid yesterday $7 for a long black for some reason. I don't know how still. 
but seven bucks so that's about five dollars us right that's gonna this is how i measure things now that's five million input tokens into gbt5 like you know that's a lot especially with with uh like cached input so you know i am i am willing to pay for that um and i it's okay for me to do that am i willing to pay for a single model company to get unlimited no like 200 a month to just have just gpt5 like i still need the other models like i can't just live in that paradigm of one i don't know how i don't even know how we got into this rant about models but it happened i apologize um so the question is is like i think this is right now call center technology i don't really see zillow or these other guys willing to enhance the experience on their website and paying this much like i don't think it's feasible like it's clear from all their demos it's like it could do this it goes yeah exactly yeah uh but i'm pleased it exists um and so it does feel like maybe call center stuff now but in theory you could build your own overpriced google gemini or siri like anyone off the shelf could build that right now and it would be great if you're willing to pay. All right, Nano Banana, let's get over to it. Nano Banana was the much rumored image model from Google. There's a lot of hype, a lot of banana emojis. We talked about it last week a little bit. They did announce it. So introducing Gemini 2.5 flash image. So they called it Gemini 2.5 flash image. Everyone was saying that Nano Banana was a better name. It was very catchy. I don't know why. And very memeable. and then they came out with Gemini 2.5 flash image. So there's definitely a breakdown in communication between like the hype marketing machine and the morons who gave it the final name. They should have just called it Nano Banana. It would have been great. Anyway. All the YouTube influencers that already plastered pictures of bananas all over the screen. So you might as well capitalize on that. Hey, what are you talking about? You're not an influencer, Mike. Oh, okay. I got like three views. I got like no views on this video. why do I do it? I mean modern cool people, not you. Okay. So, this is cool. It's so good. That's all I have to say. It's so good. I'm sure you've tried it by now if you listen to the show. You know, its ability to piece objects together. I don't know. What was your experience with it like? Yeah, the character pinning elements of it are pretty amazing, particularly when you do multi-images and you say, you know, combine this with this or put this person here and that kind of stuff. I made some pretty funny ones using it. And I think seeing a model that's so cheap that has that ability to like, you know, take this kitchen cabinet and put it in this kitchen and those kinds of things, the applications of it are massive, especially when you can do that on a large scale and it actually is able to competently follow instructions. Like it's probably the best instruction following visual model I've seen. I still had some issues with text and occasionally like I was trying to, you know, that famous picture of Jerry Seinfeld and his girlfriend. I was trying to put me in as Jerry's girlfriend and it couldn't do that for some reason. Like it sort of modified her face in a really disturbing way, but it just wouldn't put me in there. So I've found a few instances where it struggled, but generally speaking, it's really good, especially with abstract ideas, not specific. Yeah, it seems to do really... 
Like, you can tell it's been trained around, like, product shots a lot because this meme photo of me is, like, a muscular... This was using Ideogram's character reference, I think, this one, is my vibe-coding head on a muscular body with a white background. And I find if you have a perfect cut-out background and then you, say, put it into an image, like I did this image of an outdoor gym at Bondi Beach, it puts me in perfectly because the cutouts don't. It's a bit like a virtual try-on model. It seems to work well when one of the images is the detailed image and one is an object or person or thing that you want to add to that photo. Yeah, in this influencer video I tried to do, I used Flux accidentally at one point to do the same thing and it just like completely ignores one of the images and then uses its sort of version of merging um and just changes the image quite a bit it did retain the facial features quite well but it's not even close like i did another one where i took my current profile picture and then got me djing and i don't know if you can see that that clearly on the over this but it is phenomenal Like, it takes it, like, absolutely perfectly. And there's been a lot of different use cases people are showing, like, where they're isolating. I've got one up on the screen now where they're isolating, like, a map of, in this case, Sydney Opera House. And then you say, take this area and turn it into an isometric image in the style of a mobile game. And then you get this. And it looks, it's a perfect model. But the Opera House, I mean, the land's end and stuff's incorrect compared to the map. I've seen some better examples. I just quickly did this before the show. But then look at this. So this is one of the meme, like kind of viral things. I didn't come up with this. But it appears to me like one of those classic buildings in maybe Paris, I'm going to guess. And so you ask it to transform it into an image, right? and then I said, just the building around it, nothing else. I want to see the whole building. And look, it created a 3D model. So it's that photo, and then it goes to a 3D model, and it's a perfect recreation of the building from this blurry picture of it. It's unbelievable. And then I thought, you know, why not push my luck? What if I make it into a 3D object? So now I've created a 3D model of a building from a blurry photo. uh wow that is remarkable that's really cool yeah i then i think like who is this for and why do you care like if you're a games developer doing assets it would be i mean transformative but it demos well i mean there's always that separation between doing something just because it's possible and then will i actually use this like i i get i like both yeah so like to me it's super impressive i think for e-commerce and like business use cases it's amazing like there's a this photo going around where, um, it's like a bunch of items from one of those, like try on Instagram photos. And then it's like, put it on me, like the model. And it takes every item perfectly and accessorizes it on this, this woman, like just unbelievable stuff. Yeah. Um, there's so many good use cases. 
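For reference, the multi-image compositing being described (a detailed scene plus an object or person to place into it) maps to a short call with Google's google-genai SDK. The model id below is the preview name used around the time of recording and may have changed, and the response handling is an assumption about the returned parts, so treat this as a sketch rather than a verified snippet.

```python
# Sketch: multi-image editing with Gemini 2.5 Flash Image ("nano banana").
# Model id and response parsing are assumptions; requires GEMINI_API_KEY.
from google import genai   # pip install google-genai
from PIL import Image      # pip install pillow

client = genai.Client()    # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=[
        Image.open("kitchen.jpg"),          # the detailed scene
        Image.open("cabinet_cutout.png"),   # the object to place into it
        "Put this cabinet into this kitchen, matching lighting and perspective.",
    ],
)

# Save any image parts the model returned.
for i, part in enumerate(response.candidates[0].content.parts):
    if getattr(part, "inline_data", None) and part.inline_data.data:
        with open(f"edited_{i}.png", "wb") as f:
            f.write(part.inline_data.data)
        print(f"wrote edited_{i}.png")
```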
I think it's going to be like 450 startups doing this now, like, you know, plugins for your fashion website that'll put it on you and, all that sort of stuff imagine if you're in an architectural firm right or like you just want to like look around a building so this was a building that um that levels io guy came up with like he had a photo of a mansion and then is like extract the design of it or whatever and then i was messing around with this uh and created a 3d model of that mansion and then i was able to like place it in my backyard using that tripo 3d model and then you can just walk around the building so you can just get a sense of like it feels so sci-fi like even in the military it would be a good use case like if you get a satellite image of a building it's like turn it into 3d and then you you place it and you can walk around and and like get a really good understanding of that that location so they've probably they've probably been using this for years so yeah i think some really cool use cases um i do think it does have limitations though it's not always perfect and i think having this sort of dumber underlying flash model instead of a pro model behind it um is what sort of causes some of those understanding issues so i wonder what's stopping them doing like a gemini pro 2.5 image instead of the flash is it especially given flash is that cheap, you would imagine that even if proportionally it's the same proportion as Flash to Gemini Pro, it would still be a reasonable cost. It's not going to enter VO3 territory or anything like that. Yeah, exactly. So, Chris, do you remember Workspace Computer? We had a couple of episodes where we didn't sleep at all. Yeah, I was going to say, I remember those months where I didn't sleep for months. Yes, I remember it well. so we there's a bunch of shows probably i think roughly a year ago now where we got obsessed with this idea and i'm still obsessed don't worry uh with this idea of having a workspace computer where it's this computer in the cloud and that computer can stay connected and it's basically a computer for an agent to use so that you don't have to have it using your computer because that's sort of annoying if you're doing work you want to be doing stuff on your computer you want it off going off and doing stuff for you. And by equipping it with its own computer, it can go and do stuff that MCPs or other things maybe can't. And those things can be business-related tasks. Like everyone at the time was using Workspace Computer. I think it was the main use case to do HIPAA compliance quizzes and stuff like that. Like compulsory security training in their companies. I did a few of them on the pod live in the background while we were recording this show sometimes. So, yeah, it was really exciting and interesting. And ultimately, we had to shut it down because it was just, we were going broke. I think we spent almost $100,000 in a month to host these workspace computers because we were running actual Windows machines. Yeah, so our idea was that we would supply dedicated cloud machines. And the reason we needed to do that, as opposed to people using their own is like what you said. We wanted an agent to be able to be logged into everything because, of course, you could use throttings where you spin it up on the fly, do whatever the task is, then kill the machine, and that would be much cheaper. But we wanted the idea of it's logged in. It can stay logged into all your stuff. It persists the state and that kind of thing. 
But looking at it, cost-wise, it was just unrealistic to run it like that. I mean, people could do it on an individual basis, but to just have virtual machines like that. And admittedly, we weren't doing it even close to the most efficient way. But even so, in any way you look at it, no one is basically willing to pay what it actually costs to do that. Yeah, so this whole episode is about cost-prohibitive technologies, I guess. Yeah, like there is a big difference between what's possible and what people are willing to pay even at cost price to get it running. And so I think it was like a couple of months later, OpenAI announced the original operator, which was the model that then went into what is now the agent capability on ChatGPT. And that uses a it just like a web browser right And they allow you to auth into some things in that web browser But it does tend to as most people are discovering still trip up on a lot of the like is this a robot Cloudflare is also being a bit dodgy and banning it. And so I guess that one of the problems with that is because, you know, it's at such scale, obviously, they're running it at scale. and it's going to certain websites and doing certain things, inevitably more and more people are just blocking these agents. Well, and also a lot of websites already are wary of automated browsing from cloud server IPs. So if it's coming from major Amazon and whatever Google data centers, then it's naturally going to have all the captures and blocking and all that sort of stuff. So that happens. So we have been for quite a while thinking, you how can we bring this to everyone so that you can have your own workspace computer it's super affordable so like we don't go broke and you don't go broke and also how can we have that idea of personal auth but also allow you to have like your own applications running on it and and things that you would want to set up and configure and things like that so we uh we are working on workspace computer v2 and the the reason we're saying it is not necessarily to promote it but just because we thought it would be really interesting to talk about how we're going about it and what it is we're building and so workspace computer v2 is underway and uh what i love about this is we've created a new app we call sim link sim link i like the name i love the name and what we're doing is we're actually allowing that to be installed on any computer, whether it's a Linux or a Mac or a Windows computer. And that allows the SimTheory client or assistant to then drive that computer. So that's going to enable you to create your own cloud computer, or if you have a spare computer or two computers at home, you can install SimLink on one computer or both computers and then operate those computers remotely from your phone or, you know, when you log into Sim Theory. So do you want to, I just think it's super interesting, and there's a lot of use cases around this. So, for example, if you're a large institution or an organization and you've got, like, you know, 500 old computers you might have otherwise thrown out, you can give them a new life by installing SimLink on them and gifting them to an AI agent. Yeah, and that was our thinking, is it's going to cost way too much for cloud computers, even if the customer is paying for it themselves. No one's going to want to pay $100 a month or $50 a month or whatever just to have this thing sitting there. But everybody has old laptops. There's lots of those little NUC computers sitting around. 
Everyone has it, or even a virtual machine. You can just run a VMware or a virtual box on your computer with a machine that has Simlink installed and then have the AI run that. Everyone's got powerful computers. They're not using the power. There's always going to be a way to get these machines running. And so the idea is you install SimLink on whatever that machine is. It gives you a code, a unique code for that computer. You then install an MCP on the SimTheory side and link that code. Now, the reason we've done it that way is that would mean your SimTheory can have 50 computers associated with it if you want, or five, or two, or just one. You can add as many as you want and then have the assistance associated with that computer or let it have all of them and then delegate tasks to them. You can name them. You can have them logged into things and all that sort of stuff. And then in the workspace context, if you're a large organization, you can, as you just described, have a pool of machines available and say either these are available on a short-term lease basis where you use it for a little bit and then give it back, or you can allocate it to individual users in your workspace. So you've got all that flexibility, but then SimLink itself can then do everything that we had before and more. So fully operate the computer and the technology's advanced since we did it the first time, so it's going to be better. Read and write files, install local MCPs to run the way you would with Claw Desktop, for example, and essentially be able to just fully operate the computer in this structure we're talking about. So it'll bring all the benefits we had before and more, but with none of the downsides in terms of cost. What I love about this is that what a lot of this comes down to is people really don't like filling in forms. Like I've been talking to so many people like, what are the main use cases you would use this for in an authed environment or a secure environment? and most of it comes down to you know like you're a a provider of loans and you have to log into a banking system i think we've mentioned this use case before and that banking system's archaic it needs to be auth but if you could automate that task it would save a lot of time right so a lot of what we're designing workspace computer v2 for is to automate very repetitive tasks um this time So train it on the task, give it screenshots, and actually teach the assistant how to perform that task reliably over and over again. Because that was something we noticed from V1 was like you would go to do the same task. And like the first time it worked, the second time it didn't, the third time it did. It was just so random if it would work. But I do think it's really cool. 
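Purely as an illustration of the pairing flow described for workspace computer v2, a local agent could look something like the sketch below: it prints a one-time code to link on the SimTheory side, then waits for delegated tasks. None of this is SimLink's actual protocol or API; every function name, the in-memory queue, and the task format are invented for illustration.

```python
# Hypothetical sketch of a "spare machine" agent: pair with a code, wait for work.
import secrets
import time
from queue import Queue, Empty

def generate_pairing_code() -> str:
    """One-time code the user links to their workspace (invented format)."""
    return secrets.token_hex(4).upper()   # e.g. "9F3A1C02"

def handle_task(task: dict) -> str:
    # Stand-in for "operate the computer": fill a form, read/write files,
    # run a local MCP, take a screenshot, and so on.
    return f"completed: {task['description']}"

def agent_loop(inbox: Queue, run_seconds: float = 3.0) -> None:
    print(f"Pair this machine with code: {generate_pairing_code()}")
    deadline = time.time() + run_seconds
    while time.time() < deadline:
        try:
            task = inbox.get(timeout=0.5)  # in reality: a long-poll or websocket to the workspace
        except Empty:
            continue
        print(handle_task(task))

inbox: Queue = Queue()
inbox.put({"description": "log into the banking portal and export this month's statements"})
agent_loop(inbox)
```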
What I love about this is that a lot of it comes down to people really not liking filling in forms. I've been talking to so many people about the main use cases they'd use this for in an authed or secure environment, and most of it comes down to something like: you're a provider of loans and you have to log into a banking system, I think we've mentioned this use case before, and that banking system is archaic and needs auth, but if you could automate that task it would save a lot of time. So a lot of what we're designing Workspace Computer V2 for is automating very repetitive tasks. This time, train it on the task, give it screenshots, and actually teach the assistant how to perform that task reliably over and over again. Because that was something we noticed with V1: you'd go to do the same task, and the first time it worked, the second time it didn't, the third time it did. It was just so random whether it would work. But I do think it's really cool.

One of the things we want to do over the next couple of episodes is keep everyone informed about it: put it to the test, start doing some serious things with it, but also some lols with Workspace Computer V2. So hopefully, if you follow along over the next couple of shows, we'll check back in with it. I'm just curious to see, honestly, with the async tool calling now and MCPs, how the assistants themselves will be able to swarm up a bunch of computers and operate, like, six computers at the same time to do a task.

Yeah, it'll be really, really interesting, and I think adding the real-time voice stuff to the mix will be great, because then you'll be able to just talk to it, say "do this on my computer", and sort of lay back and give it some orders.

Yeah, so you're all going to have to get rich between now and the release, because operating the voice of the workspace computer is not going to be cheap. At least we've taken the cost of the computer out of the equation, so it's just the inference and the voice that will cost you. But you said there's a device you can buy, not a Raspberry Pi but something like it, that you could in theory buy and run this on from home, right?

Yeah, the little mini PCs that you can just run. They're cheap; you can buy one for $100 Australian or less and just plug it in. Or, I mean, Mac minis cost a lot more, but it's the same concept: a small machine you plug in, it doesn't draw much power, you can run it headless, without a screen, and then just give it to SimLink, and you've got your own virtual cloud PC that you can run entirely with your AI agents. So the options there are going to be really, really strong.

So in theory, I could have ten of them behind me connected to my desk and have my own swarm of workspace computers.

You could do distributed denial of service attacks at great personal expense.

Or like farms. I could do a like farm or a follower farm.

Yeah, it'd be great for like farms, for sure. Wow, look at all these really good use cases.

I'm going to do that. I'm seriously going to have a little stack on my desk at launch and have these going. All right, Workspace Computer V2: we will keep you posted about how it's progressing, and hopefully, maybe next week, have a live demo. I think it's pretty realistic. I don't want to commit to any dates to anyone for any reason, I don't do that, but I think we can safely say we can have a demo next week.

Okay, cool.

All right, so the cherry-picked, perfect demo where we're shopping for phones on the T-Mobile website with the workspace computer.

And buying cheap houses on Zillow. Yeah. We could have those demos. They could be real.

All right. So, PixVerse V5. I just wanted to talk about this. I released it as an MCP on Sim Theory because I thought it looked like a great video model and wanted to try it out myself. It's called PixVerse V5; I'm not familiar with the other versions, so I don't know how it's progressed, because I don't track them all. But it is a great model, and one of its capabilities is video transitions. I was testing it out after I added it the other day, and I ended up turning it into a video, I just hit record because I thought it was so amazing. I took this picture of the Cat in the Hat to test it, which I had as one of the test images for Nano Banana.
I was trying to get a shot of me putting my arm around the Cat in the Hat. It was related to Book Week; it wasn't weird. Anyway, I got this image and I said, transition this cartoon character into a real person using PixVerse. But for some reason Claude, old Claudey, was like, oh, I'm going to make a photorealistic portrait of the Cat in the Hat first, and I was like, what are you doing? I need to test this. I was really annoyed at it, but get this: what it did was smarter than me, obviously. Let me just fast-forward in the video for those watching. So below, it creates this image from the original image of the Cat in the Hat, but as a human now, some scary guy dressed up as the Cat in the Hat. But then it sends that image and the original image to PixVerse, a brilliant, genius idea, to do this video transition, which was what I was testing, because it's really good at video transitions. So if you then roll the video, the Cat in the Hat perfectly transitions into the exact image, I mean, it skips a little bit at the end, but the exact image it had just generated. I thought that was just so cool to see the context coming together, the model knowing to call both those tools to get a better result for me. It was one of those shocking moments with the AI where I was like, oh wow, it's calling the right tools and piecing this together for me. So, I know the video transition industry is not that big, but if you do want to check it out and play around with it, it's pretty cool to use if you edit videos or just need stock footage. At the end of this video, I did another transition where the cats come to life on the screen, if you watch. They come to life and look at each other, and then it transitions into text. Now, the text is wrong, it says sim40.ai, but anyway, it's a cool model. Another interesting one with no real practical use, apart from fun.

Fun. Yeah, that is practical.
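For anyone curious about the shape of that tool-calling chain, here's a minimal sketch of the two-step pipeline described in the anecdote above. The two helper functions, generate_photorealistic_version and create_transition_video, are invented stand-ins for the image-model call and the PixVerse transition call; neither name comes from an actual API.

```python
# Hypothetical sketch: generate a realistic "end frame" first, then hand the
# original frame and the generated frame to a transition model.

def generate_photorealistic_version(source_image: str) -> str:
    """Stand-in for the image-model step: cartoon frame in, photorealistic frame out."""
    return source_image.replace(".png", "_photoreal.png")

def create_transition_video(start_image: str, end_image: str) -> str:
    """Stand-in for the video-transition step: morph the start frame into the end frame."""
    return f"transition({start_image} -> {end_image}).mp4"

def cartoon_to_person(cartoon_image: str) -> str:
    # Step 1: make a realistic end frame from the cartoon source image.
    realistic = generate_photorealistic_version(cartoon_image)
    # Step 2: give both frames to the transition model so it morphs between them.
    return create_transition_video(cartoon_image, realistic)

print(cartoon_to_person("cat_in_the_hat.png"))
```

The only point here is the ordering the model worked out on its own: produce the target frame first, then call the transition tool with both the original and the target.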
All right, so we had GPT real-time, which I thought was pretty good, and Gemini 2.5 Image Pro Plus, a little plug there. Final thoughts on the week: are you going to race off and start using GPT real-time, or not? What's the boom?

I think it's high. I really want to; it's rekindled something. I feel like this has been the week of rekindling old flames, in the sense of computer use and real-time voice. These are the things that really appeal to people; of all the AI stuff that's come out, they're probably the most interesting and exciting elements of it, because it's about communicating with it in a really natural way for a human, having that fully interactive conversation with emotions. And operating a computer, as we've discussed before, is the ultimate interface. So nailing both of those elements would really bring the technology 10x on what it is now. And I feel like, other than the cost factor, the pieces are in place, and moving towards that is going to be really, really exciting to do. So I don't know, I don't really want to give a figure, because I always feel like I get it wrong, but I am excited. I think these are really, really cool things to play with.

Yeah, for me, I'm just really excited to see the workspace computer become another tool in an assistant's kit, where it's like: okay, I've got to go do this task, so I'm going to use a computer for that, because there's no MCP that fits the bill. For example, one of the early demos we did with the workspace computer was around creating content in a Google Doc: coming up with an idea and then going and writing out the research paper in Google Docs. The first time I saw it, I thought, this is it: if it can do this reliably, why bother with APIs and stuff? Get it to format and use the existing tools. Maybe not necessarily that tool, but I thought it would advance rapidly enough that maybe it could go and operate Canva, create the image, put it into the document, and do all the actual things that I would do. And then you soon realize it's not even close, and it's really that last-mile training on tasks that's needed. So I'm interested to see, with V2, by combining MCPs with the computer and feeding all that context through, how much more powerful it becomes. And for me it's the MCP paradigm coupled with, as you say, the old flames of voice, now working with it asynchronously as well.

And if you think about the MCP concept with, say, SimLink and the workspace computer, it's actually a dual effort, because you've got your cloud MCPs for research and other data gathering and that kind of thing, but then you've got the local MCPs for interacting with the computer, operating software on that computer, and other things. So rather than the workspace computer always having to just click around the screen, there will be, well, there already are, dedicated local MCPs that can more competently do that kind of thing. So the workspace computer is going to be more than just operating a mouse and keyboard; it's going to be using local specialist tools to expertly operate the computer as well. That really, really enhances the power of the workspace computer over what we had before.
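As a rough picture of that dual cloud/local split, here's a minimal routing sketch. The server names, task categories, and the route_task function are all invented for illustration; they're not real Sim Theory or SimLink identifiers, just one way to express "use a specialist local tool when one exists, otherwise fall back to cloud tools or raw screen control."

```python
# Hypothetical split between cloud MCPs and local (SimLink-side) MCPs.
# Server names and task categories are invented for illustration only.

CLOUD_MCPS = {"research": "web-search-mcp", "media": "pixverse-mcp"}
LOCAL_MCPS = {"files": "filesystem-mcp", "documents": "office-automation-mcp"}

def route_task(category: str) -> str:
    """Prefer a specialist local tool on the machine, then cloud tools, then the screen."""
    if category in LOCAL_MCPS:
        return f"local:{LOCAL_MCPS[category]}"
    if category in CLOUD_MCPS:
        return f"cloud:{CLOUD_MCPS[category]}"
    # Last resort: drive the screen with mouse and keyboard via the workspace computer.
    return "local:simlink-screen-control"

for task in ("research", "files", "legacy-banking-portal"):
    print(task, "->", route_task(task))
```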
Yeah, I'm really excited to see it all in action. I think there are these moments lately with the MCPs where, as I said, it'll call a tool and you're like, why is it calling that? Oh, it's broken. And then it goes off and delights you. For me, those are the moments where I go, wow: as this thing incrementally improves, and as the tool calling around some of these newer models improves, it's able to start drawing in context and taking actions where I think, this is a time saver, this is useful. This is starting to really have an impact beyond the conversational kind of benefits that everyone else is using and seeing today.

Yeah, agreed.

All right, that was another episode that we did and recorded successfully. Thank you if you made it this far. If you do want an AI workspace, it's simtheory.ai, and I think the coupon is still working: if you put in "stillrelevant", still relevant, one word, you'll get ten dollars off, so it'll only cost you, I think, five bucks on the lowest subscription to sign up to Sim Theory, give it a try, and install some of the MCPs we've been talking about today. Nano Banana is available, and PixVerse V5 is available too. All right, we will see you next week, hopefully with a Workspace Computer V2 demo. Goodbye.
Related Episodes
- GPT-5.2 Can't Identify a Serial Killer & Was The Year of Agents A Lie? EP99.28-5.2 (This Day in AI, 1h 3m)
- ChatGPT is Dying? OpenAI Code Red, DeepSeek V3.2 Threat & Why Meta Fires Non-AI Workers | EP99.27 (This Day in AI, 1h 3m)
- Claude 4.5 Opus Shocks, The State of AI in 2025, Fara-7B & MCP-UI | EP99.26 (This Day in AI, 1h 45m)
- Is Gemini 3 Really the Best Model? & Fun with Nano Banana Pro - EP99.25-GEMINI (This Day in AI, 1h 44m)
- Are We In An AI Bubble? In Defense of Sam Altman & AI in The Enterprise | EP99.24 (This Day in AI, 1h 5m)
- Why Sam Altman is Scared & Why People Are Giving Up on MCP | EP99.23 (This Day in AI, 1h 33m)