

Building Voice AI Agents That Don’t Suck with Kwindla Kramer - #739
TWIML AI Podcast
What You'll Learn
- Voice interaction is a key interface for the next generation of AI-powered applications, but adoption has been slow due to technical and usability challenges.
- The technical stack for voice AI includes models, APIs, orchestration layers, and application code - each layer presents design choices for developers.
- Pre-built voice AI platforms provide an easy way to get started, while custom orchestration layers like PipeCat offer more flexibility and control.
- Developers should experiment with voice interfaces and think about how to integrate voice as a primary interaction modality, not just a supplementary feature.
- The rapid evolution of speech recognition and generation models is enabling new possibilities for voice-first user experiences.
AI Summary
The podcast discusses the current state of voice AI and the challenges of building practical voice AI agents. The guest, Kwindla Kramer, co-founder of Daily, shares his perspective on the evolution of voice interfaces and the need for new interaction paradigms as generative AI models become more advanced. He highlights the technical stack involved in building voice AI applications and discusses the tradeoffs between using pre-built platforms versus building a custom orchestration layer.
Key Points
1. Voice interaction is a key interface for the next generation of AI-powered applications, but adoption has been slow due to technical and usability challenges.
2. The technical stack for voice AI includes models, APIs, orchestration layers, and application code - each layer presents design choices for developers.
3. Pre-built voice AI platforms provide an easy way to get started, while custom orchestration layers like PipeCat offer more flexibility and control.
4. Developers should experiment with voice interfaces and think about how to integrate voice as a primary interaction modality, not just a supplementary feature.
5. The rapid evolution of speech recognition and generation models is enabling new possibilities for voice-first user experiences.
Topics Discussed
Voice AI, Conversational AI, LLMs, Orchestration layers, User interfaces
Frequently Asked Questions
What is "Building Voice AI Agents That Don’t Suck with Kwindla Kramer - #739" about?
The podcast discusses the current state of voice AI and the challenges of building practical voice AI agents. The guest, Kwindla Kramer, co-founder of Daily, shares his perspective on the evolution of voice interfaces and the need for new interaction paradigms as generative AI models become more advanced. He highlights the technical stack involved in building voice AI applications and discusses the tradeoffs between using pre-built platforms versus building a custom orchestration layer.
What topics are discussed in this episode?
This episode covers the following topics: Voice AI, Conversational AI, LLMs, Orchestration layers, User interfaces.
What is key insight #1 from this episode?
Voice interaction is a key interface for the next generation of AI-powered applications, but adoption has been slow due to technical and usability challenges.
What is key insight #2 from this episode?
The technical stack for voice AI includes models, APIs, orchestration layers, and application code - each layer presents design choices for developers.
What is key insight #3 from this episode?
Pre-built voice AI platforms provide an easy way to get started, while custom orchestration layers like PipeCat offer more flexibility and control.
What is key insight #4 from this episode?
Developers should experiment with voice interfaces and think about how to integrate voice as a primary interaction modality, not just a supplementary feature.
Who should listen to this episode?
This episode is recommended for anyone interested in Voice AI, Conversational AI, LLMs, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
In this episode, Kwindla Kramer, co-founder and CEO of Daily and creator of the open source Pipecat framework, joins us to discuss the architecture and challenges of building real-time, production-ready conversational voice AI. Kwin breaks down the full stack for voice agents—from the models and APIs to the critical orchestration layer that manages the complexities of multi-turn conversations. We explore why many production systems favor a modular, multi-model approach over the end-to-end models demonstrated by large AI labs, and how this impacts everything from latency and cost to observability and evaluation. Kwin also digs into the core challenges of interruption handling, turn-taking, and creating truly natural conversational dynamics, and how to overcome them. We discuss use cases, thoughts on where the technology is headed, the move toward hybrid edge-cloud pipelines, and the exciting future of real-time video avatars, and much more. The complete show notes for this episode can be found at https://twimlai.com/go/739.
Full Transcript
I think there's an existence proof that you can use LLMs in conversation very flexibly from the growth of the enterprise voice AI stuff we see. And I think the delta between what we're seeing there on the enterprise side and what you're seeing on the, you know, kind of ChatGPT Advanced Voice, Gemini Live side, if you want the hot take expression of it, those are demos, not products. The version of it you interact with is a demo, not a product. They could be products, but for a whole variety of structural reasons at OpenAI and Google, they are not products today. All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Kwindla Kramer. Kwindla is co-founder and CEO of Daily and the creator of Pipecat. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Kwin, welcome to the podcast. Thank you for having me. I'm a big fan of what you do on the show. Excited to be here. I appreciate that. I guess this is technically your second time on the show because we did that kind of panel discussion interview at the most recent Google I/O, which was a lot of fun. Shout out to swyx from Latent Space for introducing us and putting that all together. But I think we found a lot of interesting things to kind of talk about and vibe on. And I wanted to dig in a little bit deeper about what you've been up to. And for folks that didn't hear that or don't know you, that's going to be primarily around voice AI, which is what you've been focused on. But, you know, let's give you an opportunity to introduce yourself to the audience. Yeah, I'm Kwindla Hultman Kramer. I'm an engineer. I've been doing large-scale real-time network audio and video stuff for most of my career. I co-founded a company called Daily. We make audio and video infrastructure for developers, so if you're building something like a telehealth app or an education app and you're trying to connect people together, you can use our infrastructure and our SDKs. When GPT-4 came out, it started to look to us like not only could computers do all these amazing new things around structured data extraction and kind of open-ended conversation, but that those things felt like maybe you could have humans talking to computers in a new way. So we built a bunch of stuff, experiments, stuff with customers. And I got more and more convinced that voice AI and real-time voice AI was a big part of this platform shift we're all excited about. So we open-sourced all the tools we built internally at Daily. That became Pipecat, which is now the most widely used voice agent framework, or before 2025, I would have called it an orchestration layer for real-time AI. That's awesome. Yeah, it's amazing how quickly these terms are evolving. I find it funny that voice tends to be, you know, people either love it or hate it as an idea for the way to interact with AI and computers in general. It has always been fascinating for me and really exciting. Like, I don't know if I was in high school or junior high school and was like into Dialogic boards and stuff like that. And I was super excited about, you know, Twilio; I knew them when they were really early on. And this, you know, idea that we can like control computers and interact with computers via, you know, just via natural speech, I find fascinating. How did you, you know, get into it? What was the spark beyond the kind of business opportunities you saw? I mean, I am like you.
I think it's super interesting to be able to actually talk to a computer and have that be a big component of the user interface. And it seems obvious to me, like I think it does to a lot of people who've been sort of consciously trying to experiment with this, that this platform shift towards generative AI is going to require a bunch of new interface building blocks. And we haven't even started to scratch the surface there yet. And I am pretty sure that a big, big part of those building blocks is going to be figuring out what voice first UI looks like. People are really comfortable talking. And even though those of us who like are used to doing this to interact with computers feel maybe a little weird when we talk to our computers, you get over that pretty fast. And all of a sudden it opens up this whole sort of like efficiency channel that's very, very different from the mouse and the keyboard. And we just couldn't do it before because we didn't have a way to take that unstructured conversation actually turn it into something computers could do something with but llms can totally do that and they're just as good at processing your voice as they are at processing a text stream from your keyboard and the step function difference between typing and voice as input when you add an llm to the mix is i think even bigger so you know i'm encouraging everybody i know to talk to your computer as much as you can if you're a programmer and you're interested this stuff, experiment with what like a little building block for, for a voice first experience looks like, because you're totally living in the, living in the future when you do that. It frustrates me to know that every webpage that has a form on it doesn't have a microphone that I can press and do it. And, you know, obviously I've got it, you know, with the various keyboards on the phone, but you have to go into the fields and I just want something to just take care of that. Let me talk to the web. I have probably had the same conversation maybe, you know, 50 times over the last month or so with both friends and people I'm working with professionally about voice. And it always goes like, and it's always with programmers and it always goes like this. I say, I'm trying to talk to my computer as much as I can. And these days when I'm programming, I talk more than I type. And people are like, yeah, I don't see it. Like I'm not really like I don't really like talking to the computer. And also, like, what do I do in my open plan office? And my response to that is twofold. First, I totally hear you. It's a shift, you know, changing a big part of your like professional day. Like that's a big deal. I'm not discounting that at all. But also, we started daily to do video and audio communication stuff in 2016. And I've been doing startups long enough that I was lucky that I only had to pit investors that I already knew. So very easy conversations, people who knew me, people who were sort of biased towards, you know, taking what I was saying I wanted to do for a new company seriously. Still 12 out of 15 of those initial pitch conversations in the 2016, summer of 2016 for daily went like this. I said, I think we're all going to be doing real time audio and video all the time on the internet. That's what I'm starting a company around. and professional tech investors would say to me, I don't know, man. Like I, I like, I like phone calls. Like if I want to talk to somebody, I just want to talk on the phone. I don't want to have to set up a video call. 
I was like, okay, you will, you will. That's nuts. That's nuts. It's been astounding, like how quickly the technology evolves, you know, or has evolved over the past, you know, and this, I guess is also like preaching to the choir, we all have been living this and holding on with like white knuckles. But I remember I couldn't be more than six months ago, like before, or maybe a little more than six months ago, like a little bit before vibe coding was the cool word. I was essentially vibe coding an application that let me, I think at the time I was like trying to track macros. And so I was, I like built this app that let me like speak what I ate, you know, with units and quantities and stuff like that. So a little bit more granularity than just take a picture. And it would like parse that and then go find the macros and stuff from a database. But the, I mentioned it because the way that I did the voice was like capture a segment of voice locally and send that into, I think, Gemini or some model, doesn't really matter, to transcribe it and get that back and then like ask get to pull out the quantities. And probably like two weeks after, like I stopped working on that, the way I would do it, like totally changed. Like OpenAI came out with like the live voice and, you know, all the other model providers have since, you know, come out with their versions of that. And so I think I bring that up to say that, you know, two things, like just to nod at the way this technology has been evolving, But also to note that, you know, for me, like, OK, I'm going to like capture a recording and send it into an LLM was like very easy cognitively to understand. But then the, you know, the live APIs and like, you know, WebRTC and all this other stuff seem like, you know, if I spent more than a couple of minutes like digging into it, like I'm sure it would have been naturally natural and easy to work with. But, you know, it was a little bit more complicated. And so I wanted to use this opportunity to ask you to give us like a primer on like getting started with voice. Like what's the way to think about it as a or the dominant like abstractions for building voice AI applications nowadays? Yeah, I have this conversation a lot, too, because there's so much new interest in voice. And I'll try out my latest version. You can tell me if it lands well. So because you've got a technical audience, I think it's worth just talking about the stack, like talking about what the stack is probably helpful. At the bottom of the stack, you've got the models. So you've got the weights, basically. 
Uh, so whatever those models are, including like the multimodal models you're talking about, like the OpenAI Realtime model, the Gemini, uh, Live model. Or you can have, you know, text-mode LLMs and you can do text-to-speech and speech-to-text and kind of glue everything together, or you might have a bunch of small fine-tuned LLMs all collaborating. But at the bottom you've got the weights. On top of those, you've got the APIs that the model providers, or whatever your inference stack is, are providing. So, like, you're hitting an HTTP endpoint from OpenAI or from Google, or a WebSocket endpoint from somebody. Above that, because generally we're trying to do non-trivial things with this technology, we're trying to go beyond building demos, above the APIs you've got some kind of orchestration. You're gluing things together, you're implementing the kind of pipelining of data to make it possible to do the multi-turn, real-time conversation. And then on top of that you've got your application code, which is sort of sitting on top of that kind of orchestration layer. And so if you put all those things together, you've got a real application. Now you can get started with voice AI today using a platform where everything is, all those things are bundled for you into one kind of interface where you build something. Maybe even you just build in a dashboard, you don't even have to write any code. So companies like Vapi have pioneered that all-in-one, combined, batteries-included approach with really nice dashboards, really nice kind of developer tooling. On the other end of the spectrum, you could build kind of everything yourself. You could make all of those individual choices in the stack yourself, you can mix and match, you put everything together kind of as a programmer, you know, writing code with libraries. What I work on a lot is this orchestration layer called Pipecat, which tries to give you a little bit of have-your-cake-and-eat-it-too, where it's easy to get started because the core implementation of things like interruption handling and turn detection and multi-turn context management are all there for you as Python functions, basically. But you also have complete control and you can mix and match all those parts. And that's an open source project? Yeah, totally open source, totally vendor neutral. I spend most of my time these days on Pipecat because it's such an interesting new vector for everything we do in the real-time world, but it is a completely open source, completely vendor neutral project. Most of the big labs contribute to Pipecat. Hundreds of startups contribute to Pipecat. There's probably 120 contributors now in the GitHub repo. In thinking about the relationship between Pipecat and Daily, talk a little bit about the overlap just so I can kind of understand. It's not that Daily is commercializing Pipecat. It's that Daily is providing infrastructure for people who are building these applications, and Pipecat just makes it easier to build those applications, whether they're hosting on Daily or elsewhere. Yeah, that's exactly right. Many, many more people use Pipecat without Daily than use Pipecat with Daily, which I think is a mark of success for an open source project. We at Daily are the very low-level network infrastructure. So we move the audio and video bytes around the network at super high reliability and super low latency. So anytime you need to do things very fast in a real-time interaction on the internet, we'd love for you to think about using our network infrastructure.
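To make that stack concrete, here is a minimal sketch of the cascaded layering described above: swappable speech-to-text, text-mode LLM, and text-to-speech services composed by one orchestration loop. The class and method names are illustrative stand-ins, not Pipecat's actual API.

```python
"""Illustrative sketch of the voice AI stack: models behind APIs,
composed by an orchestration layer, with app code on top. The service
interfaces and stubs below are hypothetical, not Pipecat's real API."""
from dataclasses import dataclass, field
from typing import Protocol


class STTService(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLMService(Protocol):
    def complete(self, messages: list[dict]) -> str: ...

class TTSService(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class CascadedVoiceAgent:
    """Orchestration layer: glue three swappable models into a turn loop."""
    stt: STTService
    llm: LLMService
    tts: TTSService
    context: list[dict] = field(default_factory=lambda: [
        {"role": "system", "content": "You are a concise voice assistant."}
    ])

    def run_turn(self, user_audio: bytes) -> bytes:
        text = self.stt.transcribe(user_audio)              # audio -> text
        self.context.append({"role": "user", "content": text})
        reply = self.llm.complete(self.context)             # text -> text
        self.context.append({"role": "assistant", "content": reply})
        return self.tts.synthesize(reply)                   # text -> audio


# Stub services so the sketch runs end to end without any real API keys.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode(errors="ignore")

class EchoLLM:
    def complete(self, messages: list[dict]) -> str:
        return f"You said: {messages[-1]['content']}"

class EchoTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()


if __name__ == "__main__":
    agent = CascadedVoiceAgent(EchoSTT(), EchoLLM(), EchoTTS())
    print(agent.run_turn(b"what's the weather like?"))
```

Swapping providers means swapping one of the three services; the turn loop and the conversation context stay the same, which is the point of the orchestration layer.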
But, you know, Pipecat supports lots of different options for network transport. Daily's just one of them. We do increasingly try to help our customers get up and running with production-quality voice AI infrastructure. And we have built a, on top of our global infrastructure, we've built a hosting platform for Pipecat or related voice AI things called Pipecat Cloud. But that's, again, totally separate from the open source project, which has no commercial dependencies at all. And so Pipecat Cloud would be kind of analogous to like Cloudflare and Cloudflare Functions. Like the cloud is like the runtime, and Pipecat Cloud would be like the runtime environment, and Daily would be like the low-level infrastructure. It's not a perfect analogy, but. Yeah, I mean, I've been trying to figure out what the best analogies are in a bunch of ways for this new era we're in for voice AI agents. Yeah, the original, like, web hosting platform that I thought kind of did a really good job, like, balancing flexibility with high-level abstractions early on was Heroku. So I sort of think of Pipecat Cloud as Heroku for voice AI. But, you know, maybe to a technical audience it's even more clear to just say you push us a Docker container and we auto-scale it and monitor it for you, and everything else you get to choose. So we're just taking our global infrastructure that's very good at scaling things everywhere in the world, and we're hosting a Docker container that is wired up for ultra-low-latency voice AI for you. Running on Kubernetes or something else? Yeah, it's a lot of Kubernetes under the covers. And I'm sure we will talk more about this from a bunch of angles, but there are a lot of things that make voice workloads different from the HTTP workloads or even the WebSocket workloads that we all spent a lot of our time building. And one of those things is you have to have this very low latency network transport and you have to support long-running conversations. And we get a lot of people who come to us and say, I tried to build this on AWS Lambda, or I tried to build this on GCP Cloud Run, both of which are fantastic platforms, but do not have the components you actually need to support the voice workflows. I'm sure they will at some point, because this space is growing so fast. But there's just a bunch of things you have to do that are not the normal Kubernetes config to get the voice platform stuff to kind of run, to scale, to have the cold starts be right for voice. I was just thinking cold start times and stuff like that. Yeah. We spent so much time thinking about cold starts. Let's dig into those challenges, because I think that's in a lot of ways where the rubber meets the road in kind of differentiating, you know, web applications, as you mentioned, from voice applications. I think at a high level, the things I talk a lot about with people building, like going from prototype to production on voice AI today, are evals, or reliability, latency, and then the sort of fundamentally multi-model nature of almost every production voice app. And we can take those in order if you want, or you can throw some out that you... Yeah, well, let's start from, let's start from like lower-level types of concerns, which is latency.
So for example, when you describe the stack, I was wondering, you know, I had questions like, did you have to, you know, write your own container runtime, or is there like a latency-optimized container runtime, or like how much tinkering in the stack, you know, do you have to do to support voice applications, like, you know, custom kernels and all kinds of weird stuff? Because that speaks to what's different between that and just, like, spinning something up on, you know, DigitalOcean or AWS or wherever. So everything is trade-offs. Probably earlier in my career, we would have written our own container runtime, because there are advantages to doing that. But what we did this time was we said, okay, every developer is going to be able to use Docker. Let's optimize other places. Let's stick to Docker compatibility, because that's going to make the onboarding and the growth for, you know, every developer who uses Pipecat Cloud a lot easier. So we kept vanilla Docker, but then we have to surround that with a bunch of fairly specialized Kubernetes stuff on a couple of levels. One is just all the cold starts and rolling deployment stuff that's specific to voice AI that you were mentioning. So you've got to get those Docker containers loaded, you've got to get them wired up to UDP networking, you've got to have autoscaling, you've got to have schedulers and deployment logic that doesn't terminate half-hour-long running conversations, which, you know, with all the default Kubernetes stuff, when you push code there's fairly short drain times. You have to have long drain times and a bunch of stuff that goes with that. The other layer is you've got to support UDP networking. So you have to support WebRTC, because for edge-device-to-cloud real-time audio, you need to not be using WebSockets or TCP-based protocols. You need to be using UDP-based protocols, and the wrapper for those these days is WebRTC. So that's a big deal, to wire up Kubernetes properly to UDP and do all the routing and be able to start the WebRTC conversations. There's just a whole bunch of little things you have to kind of customize in Kubernetes. And so the biggest single thing that you do for latency is you get that network layer right with the UDP networking. Then on top of that, you just try to pull out every tiny bit of extra few milliseconds of latency everywhere in the data processing pipeline, which you never really have to worry about doing if you're doing kind of text-mode, HTTP-based inference, because a few tens of milliseconds here and there, like, you don't really notice it, but you really, really notice it in a voice conversation where you're trying to get below a second of voice-to-voice latency. I'm curious, in talking about the networking stuff: when we were chatting before, you mentioned that you listened to the recent episode with Vijoy Pandey from Cisco. And one of the things that he talked about that I found interesting was this SLIM protocol that they're building and promoting, or starting to promote. Any takes on how that fits into meeting the requirements that you're describing? Or are there other efforts like that? Yeah. I like that direction. I'd actually love to talk to Vijoy, because I'm interested in how that work he's doing can play nicely with the stuff we're doing. SLIM has a bunch of stuff built in that I think is really important. I don't think it has UDP support yet. So from my perspective, that would be like a really good thing to add.
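For a sense of why every hop matters, here is a rough voice-to-voice latency budget for a cascaded pipeline. The per-stage numbers are illustrative assumptions, not measurements from Daily or Pipecat; the point is that shaving tens of milliseconds per stage is what keeps the total under the roughly one-second target mentioned above.

```python
# Rough voice-to-voice latency budget for a cascaded voice pipeline.
# All numbers are illustrative assumptions, not measurements.
BUDGET_MS = {
    "mic capture + client processing": 40,
    "edge -> cloud transport (UDP/WebRTC)": 60,
    "turn detection wait (silence window)": 250,
    "speech-to-text (streaming, tail)": 100,
    "LLM time-to-first-token": 300,
    "text-to-speech time-to-first-byte": 120,
    "cloud -> edge transport + playout buffer": 80,
}

total = sum(BUDGET_MS.values())
for stage, ms in BUDGET_MS.items():
    print(f"{stage:45s} {ms:4d} ms")
print(f"{'voice-to-voice total':45s} {total:4d} ms")
assert total < 1000, "over the ~1 second voice-to-voice target"
```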
There aren't any other real-time-oriented transport sort of standards yet, other than what we're doing in the Pipecat ecosystem, where we've defined a standard that lets everybody plug into Pipecat. I think there will be a real need for a real-time standard. The way I talk about it with the partners we sort of bring into the Pipecat ecosystem is, for better and worse, because standards are always like that, the OpenAI Chat Completions HTTP standard became the standard for everybody who does text-based inference. And now people have built a bunch of stuff on top of that, and a bunch of improvements to it, but Chat Completions is what we all use. If, say, I might use OpenAI or I might deploy my own model, you know, using vLLM or whatever, I'm going to use Chat Completions. We don't yet have that standard for real-time multimedia. We need it. And we are definitely working towards that in the Pipecat ecosystem, because I think we've, we've learned a lot of lessons about what that standard needs to look like. Yeah, yeah. And I think this kind of goes back to something we were talking about earlier, or even goes back to the story I mentioned with my own experience. I think when a developer who's not used to living in the voice world hears UDP and WebRTC, they're like, wait, I don't usually have to think about that stuff. So talk a little bit about the, you know, the APIs and abstractions that, you know, make sense in the voice context, you know, whether it's, you know, what you're doing with Pipecat or, you know, other popular options if they differ, and how developers should think about, you know, kind of the way that they suggest interacting with voice LLMs. Yeah, I mean, the basic idea that you sort of have to use as the first building block when you're thinking about writing these voice agents is you've got to move the audio. And there are video agents too, but I'll set those aside because they're earlier in the growth curve, even though they're super exciting; let's talk about audio for the moment, which is simpler. You've got to move the audio from the user's device to the cloud, where you're running some piece of code you wrote that takes the audio, processes it however you need to process it, runs one or more inference steps with one or more models, then generates audio at the end of that processing loop and sends it back to the user. And then it does that over and over and over again for every turn in the conversation, managing things like knowing that the user might interrupt the LLM and needing to handle that gracefully. Or you might even have long-running tool or MCP or function calls that are running in the background, and the LLM might actually want to interrupt the user at certain points. So as you start to build these things out in production and cover more and more use cases, you have more and more complexity. But the basic idea is: move the audio to the cloud, because that's where you have the processing power to get the best results from inference, and then move the generated audio back to the user so you can play it out in real time over the speakers or, you know, headphones that the user is wearing. Uh, I think that maybe gets us to challenges. One of the ones that, uh, well, I guess we're kind of going through challenges, maybe it doesn't matter. One of the things you just mentioned there is like activity detection and interruption handling.
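A sketch of that per-turn loop, including the interruption case just described, is below. Everything is simulated with asyncio stand-ins rather than real models or transports; the point is the control flow of streaming a reply while watching for the user to start talking again.

```python
"""Sketch of the turn loop described above: take the user's audio, run
inference, stream audio back, and cancel playback if the user interrupts.
All components are simulated stand-ins; only the control flow is real."""
import asyncio


async def transcribe(audio_chunks: list[bytes]) -> str:
    await asyncio.sleep(0.05)                      # pretend STT latency
    return b"".join(audio_chunks).decode(errors="ignore")


async def llm_reply(text: str) -> str:
    await asyncio.sleep(0.1)                       # pretend LLM latency
    return f"Here is my answer to: {text}"


async def speak(reply: str) -> None:
    # Stream synthesized audio back in small pieces so an interruption
    # can cancel mid-utterance instead of after the whole reply plays.
    for word in reply.split():
        print("bot ->", word)
        await asyncio.sleep(0.05)


async def user_interrupts_after(seconds: float) -> bool:
    await asyncio.sleep(seconds)
    return True                                    # VAD says: user is talking


async def run_turn(user_audio: list[bytes]) -> None:
    text = await transcribe(user_audio)
    reply = await llm_reply(text)

    playback = asyncio.create_task(speak(reply))
    interrupt = asyncio.create_task(user_interrupts_after(0.12))

    done, pending = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    if interrupt in done and not playback.done():
        playback.cancel()                          # stop talking immediately
        print("bot -> (interrupted, yielding the turn)")
    for task in pending:
        task.cancel()


asyncio.run(run_turn([b"what are ", b"my macros today?"]))
```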
And that is something that I think that folks who have tried to use voice AI systems, you know, whether it's OpenAI or Gemini Live, you know, that strikes me as the biggest like user experience hurdle that we have right now. you know, curious whether you agree with that, but also like how you see that evolving. I find that I enjoy using, you know, ChatGPT, advanced voice mode, Gemini Live, but the conditions have to be fairly perfect for it to not feel like a kind of stunted conversation. Like, you know, forget about doing it in the car. Like it's very difficult. So what's the path as an industry for us to overcome that? Is it better models? Is it better infrastructure? Is it at the API level, you know, glue or, you know, pre-processing or something? How do we, you know, get there? And, you know, starting with, do you agree that that's a big hurdle? Like, are you seeing that also? Or am I just using old models or something? So I think there's an existence proof that you can use LLMs in conversation very flexibly from the growth of the enterprise voice AI stuff we see. It's a little bit under the radar to people who are building consumer stuff or just experimenting with the new tech. And I mean that like in the best possible way. But I have had multiple industry analysts tell me that the fastest growing Gen AI use cases today from a monetization perspective are programming tools. And the second fastest is enterprise voice AI. So there are things like call centers that are now answering 80% of their calls with voice agents. financial services companies that, or somebody's just taken out a new mortgage and they really, really, really want to remind people that, you know, the first mortgage payment is coming up because that's a known failure mode where you've taken out a new mortgage. You thought you've done all the paperwork to like get your bank account wired up so that, you know, and you haven't. And it's, it's, it's the end user, it's the customer's fault if that happens, but nobody wants it to happen. Everybody wants that, you know, mortgage payment to happen seamlessly. so you know you didn't have the human staff bandwidth to call every single new customer five days before their first payment is due before you just couldn't do it cost effectively now you can do that with voice ai we see a number of our partners and customers doing things like answering the phone for small businesses and they start out answering the phone with an ai agent when the business is closed, when they didn't have anybody answer the phone before. That goes so well after three or four or five months, they're answering the phone all the time. And humans are only picking up the phone when you actually really need a human, which is 20% or less of the calls usually. So there's just a huge amount of growth in these really working enterprise voice agents. And I think the delta between what we're seeing there on the enterprise side and what you're seeing on the, you know, kind of chat GPT, advanced voice, Gemini Live side, which I agree with, is that if you really are strongly incentivized to build a product that has a particular surface area that works, you're taking certain approaches. If you're building from the models up and your goal is to build the state of the art model and then wrap it in functionality that sort of shows how to use that model, you're doing something very different and your pain points are different, Your timelines are different. Your goals are different. 
I love what the live API team and the real-time API team are doing at those two big labs. I also think that if you want the hot take expression of it, those are demos, not products. The version of it you interact with is a demo, not a product. They could be products, but for a whole variety of structural reasons at OpenAI and Google, they are not products today. it would be interesting to me just from a thought experiment perspective what it would look like if you know either of those companies were super super serious about that product surface area and they might become but today they're not so you can solve all those problems with like background noise or interruption handling or maintaining the context in flexible ways depending on exactly what is happening in the conversation those are solvable problems today but they're product problems and you have to have a product team that's like working on those problems full time. Yeah, that's a super interesting take. And I don't think it would surprise anyone, right? Like chat GPT wasn't ever meant to be a product itself, right? It was meant to be a demonstration of capability. And you can see that if nothing else, like there's open AI has a lot more to gain by getting folks excited about the idea of using voice and chewing through a lot of voice tokens than they do necessarily, you know, for investing a lot of money in a specific voice product. Now, you know, there's always a product manager's view on that, like how good does it have to be to really inspire people versus not? But like you said, that's a product decision as opposed to the technology. I think it raises the question that, you know, their view of the world, and And this is something that came up in our conversation with Google as well, like is a very much kind of a single big model view of the world as opposed to building modular systems. And it sounded like one of the distinctions you were making between what they're doing and what you might need to do from a product perspective is, you know, build out a specific subsystem that's looking for background noise or looking for, you know, trying to detect interruption. Is that kind of the direction you are going? Yeah, almost all of the production voice agents today, especially in the enterprise side, are multi-model. So they've got a transcription model and then an LLM operating text mode and then a voice generation model. And you've usually got a little dedicated voice activity detection model to help you with turn detection. You might actually have a semantic turn detection model as well in that pipeline. If you're an enterprise and you're really concerned about certain kinds of compliance and regulatory stuff, you might also have a model doing some inference in parallel with the main voice conversation pipeline. That's like a guardrails content model. You might be doing a bunch of other stuff. So one of the things I think that distinguishes voice AI from a lot of other use cases is basically every voice AI agent today is multi-model as well as multi-modal. And that's just a very different architecture from what Google and OpenAI are pushing towards this worldview where these incredible state-of-the-art models kind of do everything. And they're much less multi-model in their philosophy. 
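As a sketch of that multi-model layout, the snippet below runs a hypothetical guardrails classifier in parallel with the main LLM call for the same user turn, which is roughly how a compliance check can ride alongside the main pipeline. Both model calls are fake stand-ins; in production each would be a separate inference endpoint.

```python
"""Sketch of a multi-model turn: the main reply generation plus a
guardrails model running in parallel on the same user input. Both
'models' here are stubs for illustration."""
import asyncio


async def main_llm(user_text: str) -> str:
    await asyncio.sleep(0.2)
    return f"Sure -- here's what I found about {user_text!r}."


async def guardrails_model(user_text: str) -> bool:
    """Returns True if the turn violates policy (compliance, PII, etc.)."""
    await asyncio.sleep(0.05)
    return "account number" in user_text.lower()


async def handle_turn(user_text: str) -> str:
    reply, violation = await asyncio.gather(
        main_llm(user_text), guardrails_model(user_text)
    )
    if violation:
        # The guardrails result gates what the TTS stage is allowed to say.
        return "I can't discuss that over this channel, but I can transfer you."
    return reply


print(asyncio.run(handle_turn("what's the balance on account number 1234?")))
print(asyncio.run(handle_turn("when is my first mortgage payment due?")))
```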
I think that's actually a really interesting question for all of generative AI, you know, that I think all of us who are building these solutions think about at least a little bit, which is how much does the future world where we're building this stuff look like we're using those SOTA models? and how much are we using, you know, smaller, mid-sized, maybe fine-tuned models? Or are we sort of doing all of it depending on, you know, what we're doing at the moment? I think nobody really knows because, as you said, this technology is evolving so quickly. Yeah, yeah. I think that theme for folks that listen to the podcast, that theme of, you know, build a system out of, you know, modules versus train some end-to-end thing with lots of data and solve the problem in that way is one that comes up quite a bit. The example that comes to mind is autonomous vehicles or robotics, embodied AI more broadly because we've got this rich history of kind of physics-based models or SLAM-based models in the case of autonomous vehicles that can play a role in the solution and have a lot of interesting, you know, properties, you know, but that is, you know, often put up against the promise of the model being able to figure out things on its own that we can't teach the model, you know, based on, you know, our own view of the world. And so it's interesting to hear that in this space as well, like that modular approach is kind of where folks are building today. And I do think that will change, but I think it'll change in complicated ways that are hard to predict. I mean, we definitely feel that tension you're talking about every day because the speech-to-speech models from OpenAI and Google are genuinely better at audio understanding and at natural voice output. So if you're, one thing I often tell people who are asking me for advice is if you're building something like a language learning app or a storytelling app for kids, you probably want to use those speech-to-speech models. But if you're building something where you've got to go through a checklist, like collect a bunch of information from a healthcare patient before their visit, you really probably want to use the text mode LLMs and a multi-model system because you can guide and control and eval in real time whether you're getting what you need just much, much, much more reliably. and you know we've i always said we would never train models at daily because what we do is infrastructure but i got so frustrated by the turn detection problem uh you know late last year that when christmas break came around and i didn't have to you know do actual meetings style work all day um i trained a version of a of a turn detection model an audio input turn detection model that came out well enough that we released it. And now there's like a Pipecat ecosystem around it. And there's a totally, you know, really, really good, totally open source, totally open data, open training code, turn detection model. And I fully anticipate that turn detection model will not be useful two years from now, because we will have embedded that functionality into these bigger LLMs. But you sure need it now to build something that's kind of best performing conversational dynamic agent. Yeah, that's super interesting. And I'd like to maybe dig into that in a little bit more detail, if only to help folks get a sense for like how these modules fit into a bigger system. So talk a little bit about what are the inputs to this turn detection system? 
What are the outputs and how those are used in, you know, orchestrating an AI flow? Yeah. So the classic pipeline looks like you've got audio coming in from the network connection. You have to chunk that audio up into segments because no matter what kind of LLM it is today, the LLMs all expect you to ask them to do one thing like at a time. You have to fire inference. And this is another tiny little aside, but another big architecture leap I expect to happen in these LLMs in the near future is, yeah, 100%, like bidirectional streaming all the time. You're always streaming tokens in, you're always streaming tokens out. When you're not, when the LLM isn't talking, those tokens are like silence tokens or stop tokens or whatever you want to call them. When the LLM is talking, they're meaningful tokens, but you should always be streaming. Or thought tokens, totally right. Right. And there's some architectural experiments where you actually have multiple output streams. You have an audio output stream, a text output stream, some kind of internal dialogue output stream that's being fed back in all the time. So like there's going to be new architecture stuff that changes how we think about these things. But today you take that audio you chunk it so you can start to think about how to feed it to the LLM You decide those you making those chunks based on trying to decide when the user feels like they're done and they expect the LLM to respond. And that's called turn detection. So the voice activity detection you're talking about is not feeling fully natural is today just a fixed window of the user is not talking anymore it's like 800 milliseconds if the user doesn't talk for 800 milliseconds you decide to respond that is not great because often i pause longer than 800 milliseconds when i'm trying to figure out what to say to a human or an lm well just people like they have to like look off to the side and figure out what they want to say and come back, right? And that, depending on the conversational flow, that can be, you know, very short or that can be very long, even in one sentence, even with one person's speaking patterns. The thing that I, sorry, the thing that I experience the most, though, I think it's the flip side of that. And it is maybe overaggressive turn detection. I don't know what it would be like, but it's like the, you know, the advanced voice mode speaking and then it just stops. Like I said something, but I said nothing. It just heard some background noise and it got thrown off track. And it's like waiting for me to say something, but it can't figure out that I'm not saying anything. So those two things are slightly different. OK, talk about the linkage. Yeah, they're linked because they're implemented in the pipeline by the same components. And that's a good call out that maybe they should be more specialized components as we evolve this stuff. But the beginning of almost every voice pipeline is a small specialized model called a voice activity detection model. And that voice activity detection model's job is to take 30 milliseconds or so of audio and say, this looks like human speech or this doesn't look like human speech. It's a classification model. 
and then you decide you do both turn detection and interruption handling based on that model's classification of those speech frames so the turn detection would be say there's an 800 millisecond gap that's a turn the interruption handling would be i got three frames in a row that look like speech that's an interruption um and so you know if that's not tuned exactly right you cough it can cause the model to be interrupted or somebody's playing a really loud radio in the next car over and the radio announcer is like, call K-105 now. That's an interruption. So the next step in both turn detection and interruption handling is to make those two components more sophisticated, make them more semantic, make them more aware that some kinds of background speech are background speech, not primary speech. And there are a bunch of techniques for that, But those that we are making progress on both those problems, but we're definitely not, you know, universally there yet. Yeah, I think it was actually at IO. They did a demo where like someone's talking to a voice agent and then like someone comes into the room and is like talking to them and it doesn't throw it off at all. So that's maybe an existence proof of progress there. I don't recall what specifically they were doing to enable that. You may know. So this is another good example of the small model versus big model approach, both of which are valuable. The big models are starting to be trained to understand interruptions natively and to be able to understand both because if they're multimodal, they have access to the audio. And they also have a lot of like semantic understanding of how language works. So when you combine those two things, you ought to be able to tell, hey, this is a radio in the background talking. This is not the person I'm talking to talking and ignore everything that's not the person you're talking to. So you can do that with the big model. You can also specially train a small model to try to separate out foreground and background speech. So one of the models I often recommend to people, which is extremely good at that, is a model by the company Crisp that you may know of, KRISP, because they have some really good desktop audio processing applications. They also have models that are designed to be run as part of these generative AI workflows to do exactly this kind of primary speaker isolation. And running those models as part of that initial stage of the pipeline makes a huge difference in enterprise reliability. Okay. And so if someone is starting and they're listening to what we're saying and, you know, they were thinking that they had this problem to solve and they needed to call an API and now the problem just got a lot bigger because they need all these different components as part of an orchestrated system. uh yeah should they have that fear or are there like templates or something that they can get with pipecat that like does all of the crap that they don't care about and they could just plug in their thing yeah there's a various starter kits for different use cases in the pipecat open source repos that are you know 75 or something lines of python code including all the imports and have all of these pieces totally standard in the pipeline and you can just change out the prompt and you've got a working voice agent that you can run locally, you can deploy to the cloud, and you can start to iterate. You suggested this earlier as we were kind of ticking off challenges, but, you know, evals has got to be a big one. 
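The turn-taking logic just described is small enough to sketch directly. The frame size, the 800 millisecond silence window, and the several-consecutive-speech-frames rule come from the conversation above; the VAD classifier itself is stubbed out, since in practice it is a small dedicated model.

```python
"""Turn detection and interruption handling as described above: a VAD
labels ~30 ms frames as speech / not-speech, an 800 ms silent gap ends
the user's turn, and three consecutive speech frames while the bot is
talking count as an interruption. The VAD model itself is stubbed."""

FRAME_MS = 30
END_OF_TURN_SILENCE_MS = 800
INTERRUPT_SPEECH_FRAMES = 3


class TurnTaking:
    def __init__(self) -> None:
        self.silence_ms = 0
        self.consecutive_speech = 0
        self.bot_speaking = False

    def on_frame(self, is_speech: bool) -> str | None:
        """Feed one VAD classification; returns an event name or None."""
        if is_speech:
            self.consecutive_speech += 1
            self.silence_ms = 0
            if self.bot_speaking and self.consecutive_speech >= INTERRUPT_SPEECH_FRAMES:
                self.bot_speaking = False
                return "interruption"
        else:
            self.consecutive_speech = 0
            self.silence_ms += FRAME_MS
            if not self.bot_speaking and self.silence_ms >= END_OF_TURN_SILENCE_MS:
                self.silence_ms = 0          # reset so the next gap starts fresh
                return "end_of_turn"
        return None


tt = TurnTaking()
# User speaks for ~10 frames, then goes quiet: expect one "end_of_turn".
frames = [True] * 10 + [False] * 30
print([e for f in frames if (e := tt.on_frame(f))])            # ['end_of_turn']

tt.bot_speaking = True
# Three speech frames in a row while the bot is talking: "interruption".
print([e for f in [True, True, True] if (e := tt.on_frame(f))])  # ['interruption']
```

This is also why a cough or a loud radio can trigger a false interruption: the rule only sees "speech-like frames", which is exactly the gap that semantic turn detection and speaker-isolation models are meant to close.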
It's a challenge for folks that are building text-based applications now. We're, you know, starting to make progress there, but it's a, you know, it's an evolving practice. What's the state-of-the-art or landscape like from a voice perspective? Last year, almost all of us had only vibe-based evals for our voice agents. And, you know, sounds like text-based agents, actually. Yeah, it kind of does. But I think we're probably, I mean, I think we're probably a little bit, you know, six months behind or something, the text-based agent teams and getting all the way there. Although we're making progress. And there's a couple of things that are harder for voice agents about evals. One is they're always multi-turn conversations. Like just the definition of a voice agent is that it's a fairly long multi-turn conversation. And the other is that whatever your pipeline is, if it's the three kind of transcription LLM speech model pipeline, or if it's the voice to voice pipelines from, you know, the live API or the real time API, you've got audio in there as well. So you've got this end-to-end problem that includes not just text, but audio. You have to figure out, do you just do your evals based on text or do you try to incorporate all the failure modes that are additional to the text failure modes in audio? So those are the things we grapple with kind of uniquely in the voice space. I think it's worth talking a little bit. I'm curious how you're thinking about this because you talk to lots and lots of people. What we have learned in the voice space is the multi-turn stuff takes you way out of distribution for the current training data from the big models. And you can look at all the benchmarks for like, here's how good instruction following is. Here's how good function calling is. Those are totally a good guide to how well your agent will perform for the first five turns of the conversation, as you get 10, 15, 20 turns deep, those benchmarks just sort of your, your actual performance on instruction, following function, calling falls off a cliff. So you almost have to build custom evals in the voice space because you kind of don't have benchmarks. You're kind of out of distribution. Like every agent is just different. I've had people who tell me Gemini two, five flash just doesn't do what I want to do at all. and other people tell me gemini 25 flash is the best model by like a factor of five for my voice agents like yeah i get it we're just kind of do you see that in the non-voice space as much i was just going to say i don't think that that's unique to voice i think um you know for a while now the you know public benchmarks have become you know noisier and noisier with regards to an individual, you know, engineer's ability to, you know, get the results they want for their thing. And so, you know, oftentimes now, you know, when there's a new model, you know, you look at the model's performance on the benchmarks, but then you're also going to social media and hearing people talk about like, you know, their private benchmarks and, you know, you're running against your own, you know, whatever your pet problem is or your, you know, product, you know, requirements are and how you've captured those in a, you know, eval or benchmark. I think it's the same across the board, but it does strike me as being harder, you know, with the voice for the reasons that you mentioned, like if text is my intermediary, that maybe solves a lot of the problems. 
But if, you know, for example, the problem we were talking about with voice activity detection and turn taking, like it doesn't necessarily help me evaluate that part of the process unless I'm like end-to-end feeding some voice in and, you know, somehow instrumenting the system so that I can, you know, evaluate. Yeah. Even, you know, thinking about like how I might do that. Like it's not obvious, not obvious how I would do that end-to-end as opposed to, you know, yeah, sure you can evaluate a voice activity detector in isolation, but, you know, as part of an end-to-end problem, it becomes, I think, a little bit more interesting. Yeah. And how do you kind of build a success metrics rubric when you've got even more moving parts, including things like conversation length and, you know, number of pauses in the conversation, were those pauses expected, were they not expected, number of interruptions in the conversation, is that good, is that bad? You really have to build up the intuition. And I think this is similar to all evals, but you know, your domain is going to be specific for your application. Always. I often tell people get to the point where you feel confident that the agent works based on everything you've been able to throw at it from a vibes perspective, and then do a little bit of production rollout. But before you do the production rollout, make sure you can capture all the traces, at least capture all the text. And then you will start that data flywheel where you've got enough captures that you'll be able to just kind of manually start to try to build that intuition up about what success is and what what failure modes are and then you can iterate on that in a bunch of ways including just do text-based evals for a while that's totally better than nothing or start to try to either build yourself or leverage some like eval or ops platform tooling or that's more specific to voice which more and more of the evals and ops tooling folks are starting to build audio support which is great um there's i think at least half a dozen uh pipecat integrations with you know good ops and evals tools that hopefully make it easier once you get some of that data flowing into the system to do that kind of end-to-end analysis you're talking about yeah yeah yeah yeah one of the things that i i find interesting in this conversation and i think it goes back to a conversation i had not too long ago with Scott Stevenson, who founded DeepGram. It's, I guess I would put it as like, in the context of this modular versus NN or like modular versus single model architecture, like text as an intermediary is an observability strategy, right? It's like, you don't necessarily need observability. Like we don't even know, unless you're talking about anthropic circuit tracing class things, like how to observe inside that single, yeah, multimodal LLM, like you get a lot just by doing, you know, text as an intermediary in terms of being able to evaluate and monitor what the system is doing and enforce some controls, et cetera. Yeah, you really do. 
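A minimal sketch of that capture-then-evaluate loop on text traces is below. The record format, the redaction rule, and the checks are invented for illustration; real deployments usually feed these records into an evals/ops platform or an LLM judge rather than simple string checks.

```python
"""Minimal sketch of 'capture traces, then evaluate on text' for a
multi-turn voice agent. Field names and checks are illustrative."""
import json
import re
from dataclasses import dataclass, asdict


@dataclass
class Turn:
    role: str              # "user" or "assistant"
    text: str
    audio_uri: str = ""    # raw audio stays in object storage, not the log
    latency_ms: int = 0
    interrupted: bool = False


def log_turn(turn: Turn, path: str = "voice_turns.jsonl") -> None:
    """Text is the observability channel; scrub obvious PII before storing."""
    record = asdict(turn)
    record["text"] = re.sub(r"\d{6,}", "[REDACTED]", record["text"])
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


def eval_trace(trace: list[Turn]) -> dict:
    """Rubric over one conversation, with a deep-turn instruction check."""
    bot = [t for t in trace if t.role == "assistant"]
    return {
        "turns": len(trace),
        # Instruction following tends to degrade 10+ turns in, so check
        # that the required sign-off is still there late in the call.
        "kept_signoff_late": all("anything else" in t.text.lower() for t in bot[5:]),
        "worst_latency_ms": max(t.latency_ms for t in bot),
        "interruptions": sum(t.interrupted for t in trace),
    }


trace = []
for i in range(12):
    trace.append(Turn("user", f"question {i}", audio_uri=f"s3://calls/turn-{2 * i}.wav"))
    trace.append(Turn("assistant", f"answer {i} -- anything else?", latency_ms=700 + 10 * i))
for t in trace:
    log_turn(t)
print(eval_trace(trace))
```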
If you have that, what i would one way to put it is almost every enterprise use case needs that text for observability for compliance you know for other other reasons you kind of have to have it i think it's true that i think it's really interesting to think about the ways you could use text and audio together too in consumer applications i we have been building enough of these things long enough that we've started to have a lot of fun, I think, and some opinions about how these UIs need to evolve. I was having a conversation with one of the big labs people before they released a voice product, and they said to me, yeah, nobody wants to see the text. When you're in voice mode, you just want to talk. And I was like, no, actually not. There's a whole slice of these use cases where what you want is to talk and then you actually primarily read and you maybe have the voice on because like that's a useful channel and you can look away if you want to but like the mode is audio in text out from a cognitive perspective and then there are other voice applications where you literally have no way to display the text because i've like called the agent on the phone or whatever and so there's just this huge spread of use cases and as we move towards these kind of next generation UIs, you have to figure out how to support a huge variety of things that people are going to actually want to do. And text matters, voice matters, images matter. Increasingly, video input and output are super useful modalities. So there's just a ton of new user interface experimentation that I think we are just barely starting to do. You've mentioned video a few times. What are some of the use cases you're starting to see? and what are you excited about in terms of opportunities there? I'm excited about the real-time video avatar and real-time video scene models getting out of the uncanny valley into a place where they're as good as the real-time voice models we have today. I think video, I mean, we have this like progress of technology throughout history. It's always like text, audio, video, right? And as you add video, you know, you get sort of this more kind of deeper level of engagement and connection. So I'm a big believer that a lot of the things we do with real-time AI conversation are going to have a video component. I think it's a year or two away before we're all the way there, but I think we're starting to see from models, models from people like Tavis and Lemon Slice, really interesting adoption. we're seeing the adoption there i think mostly in things like education and corporate training and job interviews but i think we're as the cost comes down and the quality goes up i think we're just going to see a huge amount of social and gaming use cases i mean one of one of the thought experiments for me is you know what would tiktok look like if it wasn't feeding me a bunch of really well tailored to my revealed preferences pre-recorded video. But if it were generating it, yeah, 100%. Like that's what the next TikTok is going to be, right? Well, we've seen Amazon start to experiment with like this choose your own adventure style of production. But, you know, that's very granular. You know, it's still very much a traditional production model. Like if those things can be generated on the fly, That's, you know, a very different world for them and for media, you know, production and consumption. 
One of the first voice AI things we released publicly, in like 2023, was a choose-your-own-adventure voice interactive story generator for kids. And it was a very, like, clarifying moment for me when we built that and thought it was compelling enough to show other people, because lots of us have kids at Daily, and our kids were just like, oh, yeah, no, I would obviously talk to this thing forever. And you see, like, you can see technological progress sometimes best when you see people younger than you take to it in a new way. You know, we've talked a little bit about the challenges, you know, from a voice perspective. Like, are those challenges kind of the same, but more, for video, or does video introduce new challenges, or what are the new challenges? I'm imagining it's yes and. So the single biggest challenge for video right now is that it's so much more expensive that the use cases are limited. So that's going to take some time to, you know, kind of push the video-per-minute cost down of, basically, the GPUs. The infrastructure. That was my question. I'm sorry, not infrastructure. Inference. As opposed to transport. That's right. And storage and... It's the inference. It's the GPU time. You just can't run that many simultaneous video generations on an H100 or whatever. So you just have a lot of GPU cost. As that comes down, I think that will go away. And then the next interesting thing is there's sort of all the things we talked about with voice. Latency matters. Everything is sort of multi-model. You have a bunch of stuff going on that you've got to orchestrate. That's even more true for video, because if you think about video, what you've really got going on is an avatar, or more than one avatar. There's voice generation, there's body pose, there's facial expression. Those are like three different things. Now, maybe it's one model producing those things or maybe it's not, but those are sort of three different things that something is orchestrating. Then there's the scene and the lighting and the camera movement. That's three more things that, if you're creating a really dynamic, real video experience, you at least may want to change all of those things dynamically. So you're just sort of layering on more and more, like, multimodal complexity. That makes me wonder what you've seen in terms of pushing inference to the edge, you know, probably not much for video, but like for voice, it seems like we could be close to that. And if that's the case, how does the pipeline or the orchestration need to change to accommodate it? That's a little bit related to the question about whether we're going to use these really big models for everything or whether we're going to use a bunch of different models. If you can use medium-sized or small models for everything in the pipeline, you can run a bunch of stuff on the edge now. Like, on my fancy Mac laptop that I paid a bunch of money for, I can actually run a really good local voice agent. I can use one of the open source transcription models. I can use something like Google's Gemma open-weights, you know, 27B model, or the Qwen 3 series of LLMs. And then there are several really good open source voice models. And I can just wire those things all up locally with no network at all. That's out of reach of most people's devices. You know, the sort of typical laptop can't run good-enough models to do a good real-time voice agent. The typical phone can't.
But we're two, three, four, five years away from the typical device being good enough, being able to run a lot more stuff locally, or at least being able to run parts of the pipeline locally and call out to the cloud only when you really need more horsepower. And I do think that's the future. There are so many advantages to running a bunch of stuff locally that the hybrid processing pipeline, in the way I think about the world, is where we're going to get.

And is the pipeline amenable to that, or where are there tight couplings? For example, I'm thinking about voice activity detection. That's probably a small enough model that I could run it on my device, but is the latency between that and the cloud, where everything else is being processed, so large that the signal from the voice activity detector is out of date by the time it gets to the rest of the pipeline? Those kinds of issues have to be significant barriers to hybrid pipelines. Where are the specific places where you see opportunity to shift out to the edge?

Yeah, it's a great question, and there's no one-size-fits-all answer, partly because the technology is moving so fast and partly because there's a big diversity of use cases. In some ways, the biggest reason to do everything in the cloud today is that if you're doing multiple inference calls, you really want everything to be as close as possible to the inference servers. You don't want to be making multiple round trips from the client to the cloud, because the worst connection is always going to be the edge device to the cloud, and the best connection is going to be server to server once you're already in the cloud. So it is easier to engineer everything as: send the audio and video to the cloud, do whatever inference you need, send audio and video back. But cost, privacy, and flexibility definitely motivate running pieces of it on device, and even though it's harder, I think it's just engineering to figure out how to build the abstractions and make it pretty easy to build those pipelines.

Yeah, just more code to write.

More orchestration layer code to write.

Along those lines, with Pipecat being new and this technology evolving quickly, do the codegen models, vibe coding platforms, and the like know about it? Are folks having good success building applications using those kinds of tools? Are there ones that do better than others, or is it Claude Code and the usual suspects for everything now?

Yeah, this is very much on my mind, because I increasingly see a lot of code in the Pipecat Discord that's clearly AI generated, which I think is great. We are going to have a new generation of programming tools that make all of us more productive. It is challenging, partly because with open source stuff that has changed a lot in the year since the beta and is now stable, there are old versions floating around, and I think the coding tools have trouble with that. I've also been trying to figure out how to package this up, and I don't think it's a solved problem; I would love to hear from people who have solved it better than we have. There are a bunch of good canonical examples of what code structure in Pipecat should look like, and they're all in the main repo.
They aren't necessarily installed in your Python env locally, because they're examples. So it's not clear to me how to make Cursor and Windsurf and Claude Code know that those are the canonical examples. And the project is big enough that none of those tools, as far as I can tell today, can pull everything into context. The more agentic the tooling is, the better job it does generating Pipecat code. Claude Code is pretty good; vanilla Windsurf without a bunch of help, which I use every day, is not so good. So I would like to figure out how to point a programming tool at the canonical examples so they're always in context and you don't get the mistakes that seem super solvable, like the import being wrong. If I add something to the middle of my Python code, it sure seems like the import that gets auto-generated at the top of the file should always be right. But that's not true today, and I feel like it's solvable, but not completely solved yet.

Is part of it a conventions.md file that you have in your repo? At least until that becomes a convention and the tools look for that file, you can tell people to register it; I know Cursor has its version of project rules files, and the other tools do as well. Is that part of it, or does that not fully get you where you're trying to go?

No, I think you're right, I think that's the approach. Maybe somebody just needs to take a week and figure out how to make that file, or maybe an MCP server or whatever, for Pipecat work across all the tools. I've hacked improvements in for my own workflow, but it's clearly not a packaged thing yet, and it really should be. Maybe it's just that somebody needs to take a week and sit down and make it work across all the common AI editors.

You mentioned MCP. What are the interaction points or opportunities with regard to MCP and these other agentic protocols and voice AI, voice agents generally, and Pipecat in particular? What are you seeing with regard to the use of MCP, A2A, and so on?

Lots of excitement about MCP. There's native Pipecat MCP client support in the repo, so you can just add MCP servers, and then, like everything in AI, you have to prompt appropriately so that you get what you need from those MCP servers, but then they're just in the pipeline.

Meaning that in your Pipecat-orchestrated workflow, Pipecat can call out to an MCP server. So in that sense Pipecat is like a voice AI application server, calling out to various MCP servers as a back end, and you don't have to wire that up.

Exactly. And we built that on top of the native function calling abstractions in Pipecat, because I think in general that's how MCP servers are accessed by LLM-driven workflows. There are different ways to access MCP servers, but if what you're doing is talking to an LLM and that LLM is talking to an MCP server, usually what's happening is there's a set of tool call definitions that are the glue. The difference between statically defining all the tool calls and just defining the tool calls that can access the MCP servers is that you get all that beautiful, brittle non-determinism that you've talked about in other podcasts, which gets you a long way and gets you more problems too, right? So you can just use MCP servers in a Pipecat voice agent.
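As a rough illustration of the "tool call definitions are the glue" point, here is a hedged sketch of the two styles. The schema shape follows the common JSON tool-definition convention many LLM APIs use; the tool name `lookup_order` and the MCP-discovery step are hypothetical, and this is not Pipecat's actual MCP client code.

```python
# Hypothetical sketch: statically defined tools vs. tools discovered from MCP.
# The tool name and fields are made up for illustration.

# Style 1: hard-code the handful of tools the agent is known to need.
STATIC_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order",  # hypothetical tool
            "description": "Look up an order by its ID.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
]

# Style 2: enumerate whatever an MCP server exposes at runtime and convert
# each advertised tool into the same kind of definition. The LLM now picks
# from a larger, externally controlled tool set.
def tools_from_mcp(mcp_tool_listing: list[dict]) -> list[dict]:
    """Convert MCP-style tool descriptions (name/description/inputSchema)
    into LLM tool-call definitions."""
    return [
        {
            "type": "function",
            "function": {
                "name": tool["name"],
                "description": tool.get("description", ""),
                "parameters": tool.get("inputSchema", {"type": "object"}),
            },
        }
        for tool in mcp_tool_listing
    ]
```

Either list is handed to the LLM the same way; the tradeoff discussed next is that the static list is easier to evaluate and lower latency, while the MCP-derived list is easier for others to extend.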
What I usually tell people is: don't use an MCP server unless you have a very good reason to, and that's for two reasons. One is that non-determinism. Start with determinism; move to non-determinism only if you have a specific need for it. The reasons you might want an MCP server are, first, that you're building an ecosystem and you want other people to be able to add stuff to it. MCP is a very good abstraction for letting other people add things into your agent ecosystem. Another reason might be that you really do have a specific workflow where that non-determinism is valuable, and packaging a bunch of endpoints into a single MCP server is a lot more maintainable and modular if what you're trying to do is have the LLM that's driving the conversation pick from a whole bunch of different things to do. But if you just have four or five things you know your agent needs to do, hard-code those tools. Don't wrap them in an MCP server, because you're going to get results that are easier to evaluate, better quality, and lower latency.

It's interesting, because I think about it almost the opposite way. You said start with determinism, and if you have a specific need for non-determinism, then use the MCP server. I tend to think of it as: you've got this generic MCP server, use that for your proof of concept. But then there's going to be some subset of all the tools that thing exposes, and you'll figure out through your POC and user interactions what those are. Then take off that outer wrapper and just use the APIs.

Totally, that a hundred percent makes sense. The gloss I would put on top of that, for the voice-specific development cycle, is: know that in that first iteration you're going to have much, much higher latency than you're going to aim for in production. You're going to get to a point where you're like, okay, I need to bring my latency down from three seconds to 1.2 seconds, and one of the big things I'm going to have to do is rip out as many of those MCP calls as possible.

Okay, makes sense. And that calls to mind a question. In this space, picking apart what is observability versus what is evals versus what is test and measurement, there are all these different terms, but I'm envisioning, in the voice space in particular, a category of tool that shows me step-by-step latency throughout my workflow. Does that exist? Is it easy to do by hand? Or is it something that's lacking and really needed?

The Pipecat pipeline will produce metrics frames that show the latency of each step in the pipeline, so that gives you a good starting point. There are a couple of things that are harder, as always, when you're really trying to dig down and pull out the last bits of latency.

The leaky parts of the abstraction.

Yeah, exactly. All the measurements are gappy; there are always gaps between your measurements, so you've got to be aware of those. There are a couple of things people should be aware of. One is the network: the Pipecat pipeline is running somewhere in the cloud, and it's giving you the metrics for everything that's running in the cloud, but you also have that edge-to-cloud latency on top of it.

You can measure that programmatically?

It's actually very hard to measure that programmatically in any perfect way. So, a little bit like doing evals by hand to build up intuition, what I always tell people to do is: if you're building a production voice agent, record the conversation from the client side, load that recording up into an audio editor, and measure the silence periods in the waveform. You can't cheat that; you can't get it wrong. One of the things that will highlight, for example, is that a bunch of the voice models will produce fairly long runs of silence bytes at the beginning of their output, before the speech starts. You don't see that if you're just measuring the time to first byte from that inference call; you actually have to look at the bytes. And there are good reasons they do that, because if you start with silence you can tune the model to do much more complex things.

It makes me think of windowing protocols or something like that.

Yeah, totally. So there are a bunch of little things like that in the pipeline that you have to dig down and try to measure when you're really trying to squeeze out the last hundred milliseconds or so. But you get a long way with the standard metrics: how long did transcription and turn detection take, how long did the LLM inference take to start streaming, how much did you have to buffer before sending to the voice model, and what was the time to first byte from the voice model? You can add those up and get a pretty good starting point.
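The "record it and look at the waveform" advice is easy to approximate offline. Below is a minimal sketch, assuming a 16-bit mono WAV recording of the conversation captured on the client side; the filename, frame size, and RMS threshold are arbitrary assumptions, and a real analysis would line the gaps up against the pipeline's own metrics frames.

```python
# Minimal sketch: find silence gaps in a client-side recording of a
# voice-agent conversation. Assumes a 16-bit mono WAV file; the filename,
# frame size, and RMS threshold are illustrative choices, not fixed values.
import wave
import math

FRAME_MS = 20          # analyze audio in 20 ms frames
SILENCE_RMS = 300.0    # below this RMS, treat the frame as silence

def silence_gaps(path: str) -> list[tuple[float, float]]:
    """Return (start_seconds, duration_seconds) for each silent stretch."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        samples_per_frame = rate * FRAME_MS // 1000
        gaps, gap_start, t = [], None, 0.0
        while True:
            raw = w.readframes(samples_per_frame)
            if not raw:
                break
            ints = [int.from_bytes(raw[i:i + 2], "little", signed=True)
                    for i in range(0, len(raw) - 1, 2)]
            rms = math.sqrt(sum(s * s for s in ints) / max(len(ints), 1))
            if rms < SILENCE_RMS:
                gap_start = t if gap_start is None else gap_start
            elif gap_start is not None:
                gaps.append((gap_start, t - gap_start))
                gap_start = None
            t += FRAME_MS / 1000
        if gap_start is not None:
            gaps.append((gap_start, t - gap_start))
    return gaps

if __name__ == "__main__":
    for start, dur in silence_gaps("conversation.wav"):  # hypothetical file
        if dur > 0.5:  # only report gaps longer than half a second
            print(f"silence at {start:7.2f}s lasting {dur:5.2f}s")
```

The long gap between the end of the user's speech and the start of the agent's audible reply is the number users actually feel, and it includes any leading silence the voice model emits before real speech, which per-service time-to-first-byte metrics can hide.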
Very cool. If you had to shout out a few interesting use cases we haven't talked about that are inspirational for you and the community, bonus points if they're public and folks can play with them, though perhaps not, given that the focus is enterprise, what's really cool that folks are doing?

I'll take it outside the enterprise and give a couple of things I find super inspiring, that I've seen a bunch of good work on and that I would love to see even more people work on. One is that I'm really convinced AI in education is going to be transformative for our world. Giving every kid a tutor is additive to everything we all do in classrooms. I'm not talking about trying to replace the classroom teacher, but giving every kid self-directed, infinitely patient, infinitely scalable one-on-one attention is amazing. I don't think we're talking enough about the impact on childhood learning, and on adult learning too; I use LLMs in that way myself. Voice is a big part of that, because kids, like all of us, are very voice oriented. So voice-driven, but not only voice, tutors are really amazing, and I would love to see more people working on that.

The other thing I'm obsessed with is: what does it look like when we generate UI on the fly? If I'm having a voice conversation with an application, I want it to write UI code and display that UI dynamically for whatever I'm talking about doing. And I see this increasingly in how I use programming tools. I was debugging a very low-level audio timing thing over the weekend, and in the past I would have dumped out very detailed logs and then written some code myself to analyze those logs.
Well, what I did all weekend is, I didn't write a single line of log analytics code. I captured all those logs and gave them to Claude Code and said, here's what I think we need to do to look at these logs, can you do that? And it could. It could debug based on very detailed audio timing logs. The next step would have been for it to graph them and give me ways to drill down totally dynamically into those logs. So I would like to see people doing more experimentation with on-the-fly generated user interfaces. Shrista Basumalik, the PM of the Gemini APIs, who you and I were both hanging out with at Google I/O, and I did a talk at swyx's AI Engineer World's Fair with a little bit of auto-generated UI. I would have liked to do a whole long workshop on that at the World's Fair, because I think it would have been amazing, but I think we should all figure out how to do that at some big event coming up.

Oh, that's awesome. Well, Kwin, thanks so much for jumping on and sharing a bit about what you and the team have been up to. It's very cool and very interesting stuff, and certainly an exciting point in time for voice and video and multimodal AI. Thanks for joining me for the conversation.

Thanks for having me on. It's always super fun to listen to you and super fun to get to talk to you.

Absolutely. Thanks so much.
Related Episodes

Rethinking Pre-Training for Agentic AI with Aakanksha Chowdhery - #759
TWIML AI Podcast
52m

Why Vision Language Models Ignore What They See with Munawar Hayat - #758
TWIML AI Podcast
57m

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757
TWIML AI Podcast
48m

Proactive Agents for the Web with Devi Parikh - #756
TWIML AI Podcast
56m

Anthropic, Glean & OpenRouter: How AI Moats Are Built with Deedy Das of Menlo Ventures
Latent Space

AI Orchestration for Smart Cities and the Enterprise with Robin Braun and Luke Norris - #755
TWIML AI Podcast
54m