

Building an Immune System for AI Generated Software with Animesh Koratana - #746
TWIML AI Podcast
What You'll Learn
- ✓ Building an 'immune system' for AI-generated software to ensure it is production-ready, scalable, and maintainable
- ✓ The asymmetry between the rapid growth in AI-assisted coding and the slower evolution of processes to support and debug such software
- ✓ Player Zero's focus on debugging workflows and proactive code verification to address these challenges
- ✓ The need for specificity in task definition and environment when using agentic development tools like Cursor and Claude Code
- ✓ Concerns around developer trust in AI-generated code and the importance of understanding the code being committed to production
AI Summary
The podcast discusses the challenges of maintaining and managing software systems as AI-powered coding tools like Cursor and Claude Code become more prevalent. The guest, Animesh Koratana, founder of Player Zero, talks about the need to build an 'immune system' for AI-generated software to ensure it is production-ready, scalable, and maintainable at enterprise scale. He highlights the asymmetry between the rapid growth in AI-assisted coding and the slower evolution of processes to support and debug such software.
Key Points
1. Building an 'immune system' for AI-generated software to ensure it is production-ready, scalable, and maintainable
2. The asymmetry between the rapid growth in AI-assisted coding and the slower evolution of processes to support and debug such software
3. Player Zero's focus on debugging workflows and proactive code verification to address these challenges
4. The need for specificity in task definition and environment when using agentic development tools like Cursor and Claude Code
5. Concerns around developer trust in AI-generated code and the importance of understanding the code being committed to production
Topics Discussed
- AI-assisted coding
- Agentic development
- Software maintenance and debugging
- AI safety and robustness
- Enterprise software development processes
Frequently Asked Questions
What is "Building an Immune System for AI Generated Software with Animesh Koratana - #746" about?
The podcast discusses the challenges of maintaining and managing software systems as AI-powered coding tools like Cursor and Claude Code become more prevalent. The guest, Animesh Koratana, founder of Player Zero, talks about the need to build an 'immune system' for AI-generated software to ensure it is production-ready, scalable, and maintainable at enterprise scale. He highlights the asymmetry between the rapid growth in AI-assisted coding and the slower evolution of processes to support and debug such software.
What topics are discussed in this episode?
This episode covers the following topics: AI-assisted coding, Agentic development, Software maintenance and debugging, AI safety and robustness, Enterprise software development processes.
What is key insight #1 from this episode?
Building an 'immune system' for AI-generated software to ensure it is production-ready, scalable, and maintainable
What is key insight #2 from this episode?
The asymmetry between the rapid growth in AI-assisted coding and the slower evolution of processes to support and debug such software
What is key insight #3 from this episode?
Player Zero's focus on debugging workflows and proactive code verification to address these challenges
What is key insight #4 from this episode?
The need for specificity in task definition and environment when using agentic development tools like Cursor and Claude Code
Who should listen to this episode?
This episode is recommended for anyone interested in AI-assisted coding, Agentic development, Software maintenance and debugging, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
Today, we're joined by Animesh Koratana, founder and CEO of PlayerZero to discuss his team’s approach to making agentic and AI-assisted coding tools production-ready at scale. Animesh explains how rapid advances in AI-assisted coding have created an “asymmetry” where the speed of code output outpaces the maturity of processes for maintenance and support. We explore PlayerZero’s debugging and code verification platform, which uses code simulations to build a "memory bank" of past bugs and leverages an ensemble of LLMs and agents to proactively simulate and verify changes, predicting potential failures. Animesh also unpacks the underlying technology, including a semantic graph that analyzes code bases, ticketing systems, and telemetry to trace and reason through complex systems, test hypotheses, and apply reinforcement learning techniques to create an “immune system” for software. Finally, Animesh shares his perspective on the future of the software development lifecycle (SDLC), rethinking organizational workflows, and ensuring security as AI-driven tools continue to mature. The complete show notes for this episode can be found at https://twimlai.com/go/746.
Full Transcript
A lot of what we talk about internally is building an immune system. And this is something that I get really excited about because, you know, what is an immune system to us as humans, right? It is a barrier between our biology and reality. And it's constantly learning, right? It's constantly learning from every, you know, interaction that we have with reality. And we can't think about software as a machine anymore. We have to think about it as an organism, right? It's evolving, and it's evolving in ways that we don't fully understand, and it's interfacing with reality in ways that we don't fully understand. And a lot of what we're trying to do is build that immune system layer for your software. All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Animesh Koratana. Animesh is founder and CEO of Player Zero. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Animesh, welcome to the pod. Thank you, Sam. Excited to be here. I'm excited to chat with you. Looking forward to digging into our conversation. We'll be talking about a bunch of themes that have been very popular of late, vibe coding, agentic coding, but more importantly, how to make all these things that agents are helping us develop, you know, production ready, scalable, usable in an enterprise context. Before we dig into those topics, I'd love to have you share a little bit about your background and how you came to work on Player Zero. Well, first of all, Sam, thank you so much for having me on. I've really been looking forward to this. Yeah, you know, my name is Animesh. I'm the founder and CEO here. I've been working on this for a few years. And starting Player Zero was a little bit like putting two and two together for me. For context, my dad's also a founder. He started a company in the healthcare space called VendorMate. And he started it when I was in fifth grade, sixth grade, something like that. And most of my life, I basically grew up as his free labor. I used to do all of his QA. I used to do all the support. It was just my way of hanging out with my dad. But, you know, at some point in the company's life, it grew from these two guys in a basement to a team of about 60 or 70 engineers. And I very quickly realized that even at that scale, there were still only two people who actually knew how the software actually works. And those two people were my dad, who was the CTO, and myself. And not because I was, like, you know, smart or anything like that. It was just because I was there. I was there in the room when it was being built. And, you know, many years later, I was at Stanford. I had the opportunity to work on some really cool research in model compression and inference, like back before GPT was even a thing. And you were working with Matei Zaharia and Peter Bailis and folks like that? Yeah, yeah, yeah. They were actually both at Stanford at the time.
And, um you know i actually kind of stumbled into the research but um you know when when i saw this world where computers were talking to us right computers were writing code right it kind of gave me these flashbacks back to the days when i was helping my dad and i was like holy crap like all of a sudden there's going to be a future where there aren't two people who understand the code anymore it's going to be zero right and in that world how do we maintain it like what happens how do we support it um at any sufficient scale um and i think a lot of those realities are starting to be realized now um you started this company in in 22 um i guess late 21 um you know pre-chat gpt kind of with this understanding that what we're seeing now with vibe coding with agentic development, cursor, windsurf, the likes, you know, basically kind of anticipating the future that we get to live in today. With that in mind, like, especially you said you've been working on this for three or four years, like, what, how has it been to watch the advent of, you know, AI assisted coding and agentic coding, vibe coding, all of this, like it, you know, it both, you know, it's both something that, you know, we've been able to anticipate for a really long time, but also something that like exploded all, you know, seemingly overnight. And I'm curious, you know, when you think about it, what stands out for you as kind of the most interesting aspects of the way things have evolved. Yeah, it's funny. I don't think anybody could have predicted three or four years ago the way that, you know, AI has made its way into the way that we build software today. I just, I don't, I don't think anybody could have predicted it. And then I'd be lying if I said that I did. You know, I think I, I felt, I felt like something like this could happen, but not, not to this degree. It's, it's been nothing short of magical. So it's, you know, changed the level at which people are able to think and iterate, you know, the way that people can express their creativity. The barrier to building new things has decreased so much, right? And we see a lot of this in, you know, the Bolt and the V0 and that category of tools as well. um i think what surprised me the most has been how asymmetric uh this this whole growth has been um yeah i mean i think the um the analogy that i that i often think about here is like you know when the when the car was first invented we were living in a world full of you know horse-drawn carriages and all of a sudden the you know internal combustion engine was just invented and, you know, it's just dropped out of the sky. And that's basically the equivalent here, right? LLMs are the internal combustion engine. And, you know, now we're basically strapped that combustion engine onto the buggy and we're flying, right? We're flying down these dirt roads. And, you know, it's like, I don't think the rest of the process, right, has actually been realized. I think the assumptions for, you know, how do we operate software? How do we maintain it? How do we support it? How do we do anything other than just build it? I don't think actually have matured at the same pace as the velocity of just raw code output. And so this is like an asymmetry that I feel a lot of different engineering leaders are feeling. at the same time. A lot of different engineering teams are feeling at the same time. And I think unless it's corrected, there's going to be some sort of a reckoning, right? It's just things are growing way too fast. 
And fast is not a bad thing if you have the right process to deal with it. I saw an article yesterday or the day before that was like reporting on the annual Stack Overflow software developer survey. And the implication was like for the first time in, you know, at least a few years since we've been seeing AI stuff, developers, I should be careful about the specifics. I think it was like trust or something like that, like the developer trust of AI output, like took a downturn for the first time. Did you see that? I think I hear similar things every day. And that's so real. people don't actually know what they're committing to prod anymore right i think they know they know the high level right it's like oh it should do this and this is what i prompted it to do and they you know check it right it's like hey yeah maybe change these couple lines of code but there isn't as thorough of an understanding and that's not entirely a bad thing by the way Yeah, there's a whole spectrum of acceptance of that. You see articles all the time, people saying, hey, if you're not actually reading your code, you're doing it wrong. And then you've got this whole other category of folks out in the industry that are saying, really, the prompt is, before code was the intent, and that is what created the thing that you deployed, and now the prompt is the intent, and the code is just like, you know, the machine language that you compiled, you know, your intent down to. And so you don't really need to care about it. And the idea of asymmetry is really interesting. And like the speed at which we're kind of moving down this path, you know, it's creating really interesting times, particularly for folks that, you know, at least historically, like maybe this is, you know, change with, you know, DevOps over the past 10 years or so, but like historically, like folks would create software and then throw it over a wall to someone else to have to manage it. Right. Maybe we're doing less of that now. It's not going to work. Yeah. Yeah. No, but I just, I don't think the, like the wall has gotten shorter and the amount of software that we're throwing over it has increased as well. And so it's just, it's clear that the process has to change. Right. I think that's just kind of the fundamental here. But yeah, there's opportunity everywhere. Yeah, yeah. And so Player Zero starting with what aspects of the process or workflow? Yeah, so we're specifically focused on two parts of this. The first is debugging. And so that is when something breaks and your really large system, multiple code bases, whatever it may be, a customer calls you up and says, hey, this thing broke. Basically managing that entire process going from, you know, vaguely ill-defined ticket to here's a formal resolution in that code base. So that's one. Two is code verification. So this is largely driven by our latest launch around code simulations, where we basically, you know, look at changes as they're happening and then simulate real customer scenarios that we know to be important to your customers and verify them proactively in the SDLC. through the development process after every single commit and give you proactive kind of predictive feedback to say, hey, this is what might break that will matter to your customers. So let's talk a little bit about debugging as a starting place. 
What you described, like, get a ticket, you know, write some code to resolve that ticket, commit that code via PR and, you know, have that kick off a CI/CD process and eventually end up out in prod. Like, that sounds like what a lot of people are using Claude Code for, and, like, these agentic coding tools now. In fact, you read Anthropic's articles, like, people don't actually use Claude Code anymore. They just write tickets and Claude Code is sitting there watching the tickets and, like, pushing code into prod. Yeah, yeah, yeah. So, you know, I think there's a little bit of a, like, meta problem here, which is how to build specificity into the ticket to begin with. You know, I think Claude Code and Cursor are incredible tools. And among others, right, there's a bunch of agentic development platforms out there. And, you know, what they require is, you know, high kind of iteration velocity. They require you to kind of, you know, sit there and iterate with it. And they also require specificity on the task. So they need to know exactly what success looks like. And they need to have an environment in which to do that. I think a lot of what Player Zero helps with on the debugging task, I think we're quite good at going through the code base and giving you really kind of specific instructions about, here's where you should look and all those kinds of things. But I think there's a kind of meta reasoning problem that lives above this, which is how do you build specificity on the task when all your customer told you was the thing you broke and here's the error message, right? Or something like that, right? Or, you know, I have a surge of people complaining that my file upload isn't working anymore, right? And so how do you take a vaguely described problem like that, explore a bunch of different hypotheses, not only from just the code base, but also factoring in other types of information that you might have from your ticketing system more broadly or from your telemetry systems, and bringing all those things together to basically construct different narratives, verify those narratives, and then come back and say, hey, this is actually what needs to be done. And then on the other end, right, you could pass that on to Claude Code, you could pass that on to Cursor. We also have a few workflows that'll go in and help kind of edit and build that code for you. But I think that that back half of the workflow is actually much more commoditized. But the front half of figuring out exactly, you know, what to do requires an extremely deep understanding of the code base, among other things, right? The code base and its relationship to reality. And I think that's really... Yeah, that last part, the relationship to reality, sounds like it's carrying a lot of weight. Like, when I think about a bug, a bug is something that happens because what you thought the code was doing and what your comments say the code is doing and what your documents say the code is doing is not what the code is doing. And so that seems like a particularly difficult problem for an LLM that's going to have all of this context that's basically wrong about the problem. Yeah, so this is actually where that second half of the workflow comes in, right? That second feature that we were talking about, code simulations. And so I'll go a little bit off the path here, and I promise it'll come back to debugging.
but you know we launched this feature last week and it's been incredible to kind of see exactly the kinds of stuff that it's able to turn up code simulations is built on two primitives so the first is we have this concept of scenarios and you can think of a scenario as like a memory for something that broke in your code base at some point in the past usually informed by a ticket something you know a customer said or something like that but we could look at a jira ticket that's four years old that said, hey, the file upload didn't work. And we could go in and correlate that with the code base and say, hey, let's design out a scenario or a memory that says, you know, initial state is somebody uploaded a file that's 50 megabytes large. They hit submit and we want to make sure that that file upload goes through correctly. Right now that's kind of encoded right in this memory bank and that's being accrued over time. and that ends up being simulated in the future when you actually go and touch similar areas of code. So let's say you're going and modifying something about the file upload process, something adjacent to that. We can basically recall that memory and let's say, hey, we're going to go simulate are you uploading a 50 megabyte file and saying that this thing should successfully pass. Now going back to debugging here this memory Yeah yeah How are you going to drop that and now we going to go back to debugging Well, just to round it out there. Circle back. Yeah. Well, just to round out the debugging piece here, this memory bank ends up basically becoming a library of specifications. Right? It becomes this really large library of how should the product actually behave. I was going to say it sounded like, you know, unit tests or maybe like integration tests or like end to end tests or like regression tests. Ultimately, like you got you have this kind of corpus of things that should work and that, you know, and every time you're trying to do something new, you validate it against all of that. Yeah. So I think I think that the premise is the same. Right. I mean, it's used for software verification. right tests are used for verification i think the core difference here is that we're not actually compiling or executing the code right we're basically simulating it and so this is you know i think a very uh a very apt analogy here is you take your most senior architect you put him in front of a whiteboard and you say you have to go line by line and say exactly how would the code behave under these circumstances. Right. And it's not 100%, right? And I think there's a lot of research behind this about how to make it close to 100%. But deliberately, it is not 100% because you're actually reasoning through the code as opposed to actually executing it. But that... Yeah, I mean, why wouldn't you execute it? Like spin up an environment and... Well, this is the age-old problem of testing, right? It's, you know, why not write more tests? And the reason is because I think for many practical scenarios, it's actually impractical to write a test. So like a very simple case you can think about here is like you have a queue, right? And I make an API call. The API call adds something to a queue. The queue then needs to be consumed. And then there's some state in the database that needs to be, you know, transacted against. and then there's some output. And so there's a couple of different asynchronous boundaries. There's a database. There's a little bit of state that needs to be managed, right? 
Writing a unit test for this is impossible because you're already across three or four service boundaries, right? Writing an integration test for this is really flaky because you might have to do a little bit of waiting and all that kind of stuff, right? And then actually executing it requires you to actually spin up four different servers, right? And so people are just like, eh, you know, let's not, right? So these are just the practical realities, right? And again, against the backdrop of, you know, vibe coding and agentic development, I mean, we're not getting more rigorous. We're not getting more rigorous. And so this is the thesis, right? And what we realized is, hey, if you can predict with a high enough degree of efficacy, and it's pretty darn accurate, right, how a code base would behave across really large kind of interdependent components, you lose a little bit of determinism, but with it, you get a ton of flexibility. And that opens up a ton of doors, right? That opens up a ton of doors to be able to verify really complex practical scenarios for large and complex code bases. And this is everywhere, right? I mean, if you have a product that's lived for more than a year or two, right, there's stuff that's encoded and hidden in that code base that people have conveniently forgotten. And it lives rent-free maybe in, like, one or two engineers' minds. The moment they leave, right, it's gone. This is, like, institutional knowledge, right, that just lives rent-free in people's minds. And being able to bring that out of people's minds, simulate it at the right moment, and predict how something might break before it actually does is an important capability that I don't think exists in the current workflow. And so this simulation or reasoning through the code is happening with an LLM? Yeah, so it's actually happening through an ensemble of a few different LLMs and agents all kind of coming together. And so we actually evaluate the entire system together. The system, we call it Sim1, which is kind of our first iteration of all these different things coming together. It's a combination of a set of retrieval techniques that we've developed. So retrieval techniques across code bases, ticketing systems, all these kinds of things. We basically build a large kind of semantic graph and iterate and reason through that to be able to go figure out where in the code base we should actually go trace through. And are you building this semantic graph primarily as a retrieval vehicle to pull the right context that you need to answer a particular question? Or are you reasoning across the graph? Does that distinction make sense? Both, actually. So you're actually tracing through the graph in order to be able to reason about how a particular, you know, scenario might actually propagate through a large code base. And so is the graph analogous to, like, symbols in a code base, like function definitions and variables and that kind of thing, like... That's a part of it. But, you know, some of the customers that we get to work with, I mean, they don't have languages that even have parsers. So I'll give a crazy example. This is not most of our customers, but one of our customers has maybe 20, 30% of the codebase in Visual FoxPro. And so there's no AST parser for that, right?
Like, you can't actually get a syntax tree out of FoxPro just because there aren't, you know, off-the-shelf syntax parsers for that. How do you even get the code? Like, do you have to extract it from a registry or something? It's a pain. But, you know, there are a lot of different semantic techniques that we use to actually extract structure from it. How did you find the company that's still using Visual FoxPro, or VB6? I think, you know, we've been around for a few years, and you get exposed to a lot of reality. And maybe it's more popular than I think, but... No, it's in the class of, like, Visual FoxPro, whatever. No, I guarantee you, no engineer is waking up one morning in 2025 and saying, hey, I'm going to go build a new Visual FoxPro app. Guarantee you. But, you know, it's just the reality of software. I mean, there's debt that accrues, and there's pain and cruft associated with it. But yeah, I mean... Just to that semantic graph, right? So the point being that there's no parser for this language, but you're still somehow able to do... like, what does that even... what are you doing? So we use a combination of different language model techniques. You're going and trying to extract structure and keywords from there. You're trying to extract relationships from there. There's clustering that you do. There's stuff that you can extract from the commit history, looking at how certain files co-vary. So you can say, hey, this file tends to also be edited at the same time as this other file. A lot of really interesting stuff that you can do there to be able to just start building relationships and understand how different parts of the code base resolve to one another. Another extreme example, and this is not a Visual FoxPro case, but one of our enterprise customers has about 1.1 billion or so lines of code that just manages their billing system. 1.1 billion lines of code across 750, almost 800 or so repositories. It's like AS/400 code or something? No, this is all Java. It's reasonably modern. But just a lot. It's just a lot, right? And so standard retrieval just doesn't work anymore in these situations. And so, you know, this is one part of that larger system that we had to build, right? Sim1. Another part of this was we actually had to go and tune a set of models that could just go and try to predict the next state of the code base. And so what that means is, you know, given a particular function declaration, given a particular set of arguments that come in, can you actually predict how the function will actually resolve? What would it actually return? And you can actually do that with the reasoning models at a reasonably high kind of effective rate. And you can do that even better if you actually start tuning it and kind of regularizing it against a bunch of different code bases, right? And so, you know, we spent a lot of time tuning those types of things. We have another set of models that are, you know, really good at running experiments. So you can actually take small sub-portions of code bases and then use things like code interpreters and stuff like that to actually go and evaluate exactly how certain sub-portions of the code base might evaluate.
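To make the commit-history signal described above a little more concrete, here is a minimal sketch of a "co-edit" graph built from `git log`: count how often pairs of files change in the same commit and keep the strongest pairs as edges. The repository path, the `min_count` threshold, and the function name are illustrative assumptions; PlayerZero's actual pipeline combines this kind of signal with many other retrieval, clustering, and language-model techniques.

```python
# Sketch: build a co-edit graph by counting how often file pairs appear
# together in the same commit of a git repository.
import subprocess
from collections import Counter
from itertools import combinations

def co_edit_graph(repo_path: str, min_count: int = 3) -> Counter:
    """Return a Counter mapping (file_a, file_b) -> number of shared commits."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:--commit--"],
        capture_output=True, text=True, check=True,
    ).stdout
    pairs: Counter = Counter()
    for commit in log.split("--commit--"):
        # Each chunk holds the file paths touched by one commit.
        files = sorted({line.strip() for line in commit.splitlines() if line.strip()})
        for a, b in combinations(files, 2):
            pairs[(a, b)] += 1
    # Keep only edges that co-occur often enough to be meaningful.
    return Counter({edge: n for edge, n in pairs.items() if n >= min_count})

if __name__ == "__main__":
    for (a, b), n in co_edit_graph(".").most_common(10):
        print(f"{a} <-> {b}: co-edited in {n} commits")
```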
Yeah, that's where I thought the whole thing was going, that you, like, when you say simulator, you've got, you know, you're running them in an interpreter or in an environment of some sort. But you can't do that. That's just one thing. Right. Yeah, that's just like, that's a small part of it. But all of these things can actually come together to create a really magical output. And that's ultimately what, you know, Sim 1 is really trying to drive. trying to make sure I really am wrapping my head around like the kind of extent of the way that this thing is reasoning about a code base and we kind of throw that that terminology around like all the time now like you know I'm in cursor or whatever and I'm asking it to do something and it's reasoning about my code base but really what's happening is like some retrieval things pulling a bunch of pieces and giving them to an LLMS context and saying you know here's the first few tokens generate some more right and like uh and so then we have like reasoning models that like you know think more and generate more tokens in the process of generating those tokens but then you're adding layers on top of that which are it sounds like in a lot of ways like featurizing the code base creating graphs about the code base creating simulation results about the code base, creating other artifacts. Is that the right way to think about it? Yeah, yeah. So, you know, we're doing things like clustering, for example, where you're trying to say, hey, these files together represent a particular feature in the application. And these features tend to relate to one another, right? And so there's many different ways that, you know, this graph can be represented. Some things represent, you know, very low level details about the code base implementation. Some things represent, you know, kind of higher level semantic features about, you know, what is this thing actually achieving and what is it related to? And we're basically doing that in multiple layers of abstraction. And then we all represent that as a graph. And so exactly, like we're futurizing the code base in intuitive ways and, you know, declarative ways through ASTs, through, you know, kind of semantic keyword extraction. just like it's actually comes down to about like a few dozen techniques that we use all in all to go build this graph. And then we have another set of models that basically go and operate on top of that graph to be able to iteratively predict how a large code base might actually work. I think like another interesting conversation to have about, you know, why now? Right. Like why? I mean, reasoning is a reasonably new concept. right uh maybe eight nine months old now um it's crazy crazy uh that it's only eight or nine months um but reasoning's like a reasonably new concept and i think um a lot of what makes reasoning really powerful right is it kind of pulls in the system to thinking it it brings in a let me stop wait, reflect on the context that I might have, and then choose the next thing to do. And one of the really interesting things about this whole simulation exercise is it's very much in line with what made reasoning really powerful, which is that in many cases, reasoning was trained in this post-training stage through some sort of verifiable output through reinforcement learning. 
and in code simulations we have a verifiable output which is did we predict whether the code base actually behaved in the right way or the wrong way and if you predict correctly right you can actually kind of back propagate through the entire trace through all of the different you know actions and choices that the agent took iteratively in order to actually get to a particular prediction and make sure that we basically, you know, reward or, you know, disincentivize that particular trajectory. Is your ground truth, your kind of verifiable output coming from actually executing the code or from some knowledge of past code behavior? Like when we're actually running a simulation, right, where we don't have a way to actually verify. I'm more talking about training and RFT. and giving that signal to the model as you're training it. Yeah, so that is actually a combination of verifiable outputs. So we actually have code bases where we found specific types of issues and then after that, then back propagate against that, run a simulation and make sure that we're actually reaching the same verifiable output. And then there's actually ways to evaluate against kind of probabilistic outputs as well. And so you can actually kind of guess, for example, we went across a lot of different open GitHub repositories and we said, OK, well, what are open bug tickets there? And can we go and verify that those bug tickets are still correctly open? Meaning you have, that's really interesting. You have a ticket that was filed sometime in the past. Look at the current state of the code and try to predict whether that ticket is open or closed. Yeah, and so you can use things like this as heuristics to basically get 80%, 90% of the time. Yeah, it's still correctly open. And so it ends up giving you a directionally correct signal for being able to tune these models. I would think for a lot of tickets, you look at enough GitHub threads, and there's a lot of discussion about just providing basic information about the ticket or about the, you know, just kind of prerequisites to reproducibility. And maybe this goes back to where we started the conversation around debugging, but it seems like a lot of value would be created in just like, you've got this janky ticket with no detail. Like, can somebody go like, you've made the connection, run that up and, you know, run that. And what is that? Is it like you've got the ticket, like simulate it or infer information about, you know, the ticket from like, you know, you might have structured fields about the OS and the person mentioned the iPhone. And so you can make some inferences there. Like what's the process there? That's a that's a great question. You made the jump for me. So the the way this helps debugging is debugging is really about hypothesis testing And the best engineers in the world are the ones that are really good at taking morsels of context and then basically collapsing the space that they have to test to two or three really high value things. And that's basically what we're trying to do here, right? Where we can actually take these simulations and use them as a way to actually run a hypothesis test. That's essentially what a simulation is. And then you could start taking context from, you know, whatever's in the ticket or whatever is in your telemetry system and use that to heavily inform which hypotheses you should actually test to begin with. 
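A rough illustration of the verifiable-reward idea just described: ask the simulator whether an old bug would still reproduce against the current code, compare that with the ticket's real open/closed status, and use agreement as the reward for reinforcement fine-tuning. The dataclass, field names, and the `simulate_ticket` callable are hypothetical stand-ins, not PlayerZero's actual training code.

```python
# Sketch: a directionally correct reward signal from historical bug tickets.
from dataclasses import dataclass

@dataclass
class TicketExample:
    ticket_text: str    # the (possibly vague) bug report
    repo_snapshot: str  # identifier for the current state of the code base
    still_open: bool    # ground truth pulled from the issue tracker

def reward(example: TicketExample, simulate_ticket) -> float:
    """+1 if the simulated prediction matches the ticket's real status, else -1."""
    predicted_open = simulate_ticket(example.ticket_text, example.repo_snapshot)
    return 1.0 if predicted_open == example.still_open else -1.0
```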
And so if you have a rigorous enough understanding of the code base, if you have a rigorous enough understanding of, you know, what the customer actually was kind of talking about, right? What were they complaining about? you could use that as a way to go and automatically come up with a few hypotheses, simulate them, and say, hey, this is the most likely reason why this particular problem happened. And so the simulation actually becomes a really powerful tool, not only for software verification, but also for debugging. And it creates this really interesting trade-off, which is, how important is the bug actually for you? Because you could actually turn a dial and say, I want you to go test a thousand different scenarios, a thousand different hypotheses and come back and tell me, hey, is this, you know, like exactly why did this happen? Now, obviously that costs money. It costs a lot of inference, right? That's an expensive thing. But you are almost guaranteed at any sufficient scale to say if I turn enough inference, if I throw enough compute at this problem, then I could figure out why this thing broke. I just need to test enough scenarios. So enough monkeys at typewriters and you're going to get the right answer. No, but it's just a really interesting trade-off. That never existed before, right? You can never actually scale inference or reasoning to the point where you can actually use that to say, I will always get you to the right answer if you give me enough time. Maybe the monkeys in typewriters is, I don't know if it's apt here or not apt, but continuing with that analogy, is, you know, it's not having one of those, you know, millions of or thousands of outputs be Shakespeare. It's being able to figure out which one is Shakespeare in a scenario like this, because you said that you're not like running the actual code. So these things are coming back with, oh, I think this is, well, and this is maybe a proceeding question. Your agents are going out and like simulating, you know, something. what what's the output is it english like i have looked at this ticket and i think it's this or is it like fixed code or is it better tickets is it uh something else it's it's incrementally adoptable um and so the output of a simulation is yes it's english right it says like a scenario is defined as an initial state, a few steps, and an expectation. The output of a simulation is, did the simulation result in the expectation or not? So there's a pass-fail result, and then there's an explanation as to why it passed or an explanation as to why it failed. Now you could take that simulation result and say, go back to player zero and say, hey, could you actually go fix this problem for me? Right, oh, I didn't realize that I broke this thing, right? Could you just go fix it for me? Right, and there's a nice little loop, right, that actually, you know, forms there. in terms of kind of ticket triage, right? I think there's something similar. So a ticket comes in, you know, we test a few hypotheses, we come back with a few recommendations. Someone often can come back and say, could you go write up a better ticket for me, right? I want to go pass it into Cursor or Quad or whatever your tool of choice is. Or you could just say, hey, Player Zero, could you actually go and make this fix for me as well, right? And so there's many different kind of ways that people actually use this. 
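To make the scenario primitive concrete, here is a minimal sketch of what a scenario and a simulation result might look like as data structures, following the definition given in the conversation: an initial state, a few steps, and an expectation, with a pass/fail result plus an explanation. All field names and the example ticket reference are illustrative, not PlayerZero's actual schema.

```python
# Sketch: the scenario "memory" and simulation result as plain data structures.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    initial_state: dict        # e.g. {"file_size_mb": 50}
    steps: list[str]           # e.g. ["select a 50 MB file", "hit submit"]
    expectation: str           # e.g. "the upload completes successfully"
    source_ticket: str | None = None  # the old ticket this memory was distilled from

@dataclass
class SimulationResult:
    scenario: Scenario
    passed: bool
    explanation: str           # why the ensemble believes it passed or failed

# Example of the file-upload memory mentioned in the episode (ticket ID hypothetical):
upload_scenario = Scenario(
    name="large file upload",
    initial_state={"file_size_mb": 50},
    steps=["select a 50 MB file", "hit submit"],
    expectation="the upload completes successfully",
    source_ticket="FILE-UPLOAD-TICKET (hypothetical)",
)
```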
But at the end of the day, they're basically kind of flexing that muscle of, you know, Player Zero has a really deep understanding of your code base. And as we said, right, code base and its relationship to reality. And they're basically exercising that relationship, exercising that understanding to be able to create specificity to the problems and then verify software as it's being produced. Yeah, I'm still wanting to kind of sharpen the distinction between this and an agentic coding type of tool, like. Yes, like at a presentation level, like a ticket and, you know, copy pasting an error message into cursor, like are kind of different things, but they're kind of the same thing. Like something's broken. There's some, you know, a string of characters describing a broken thing. And I'm telling a system that I don't like this. Go fix it. In fact, like I don't even need to tell it. I don't like it. I could pass an error message and it knows it's an error message and it'll go find the issue and fix it. And so you're saying like one distinction is, you know, let's call it mechanistic. Like, you know, all these tools, like, you know, we think they're LLM tools, but they're really RAG retrieval tools. But we don't call them RAG because like that's for words and this is code. But it's all like how we construct this context. And you're saying like you do better things to construct the context with graphs and stuff like that. which, okay, great. But in terms of like, I'm trying to A, get to the question, but also get to the answer. Like there's a distinction here that I feel like we're trying to articulate. Totally. So let me... I'll pass the ball. Yeah. Let me try to create like a good mental model, right? Like what we're really focused on at Player Zero is what happens to your code after it's left the agentic developer. right after it's left the IDE and you know as we said earlier right like that is software verification so how do things break what things might break and then after that ultimately fixing it and you're right I mean there's a little bit of overlap in some of these workflows but I think like from a technical perspective I think there's like two or three things that really drive the difference one is yes like we take a lot of pride in our ability to understand really large and complex code bases and I think the muscles required to do that are different than the ones required to be dropped into your IDE that is a little bit more of which files are open, what is the exact context of the specific repo that I've cloned right now the muscles even around retrieval are different when you're talking about hundreds of repos or dozens of different services kind of all interacting with one another. So that's one, right? So just like the way that we understand and train across these really large repositories. The second, I think, comes down to how we capture institutional knowledge. An analogy that I really like to understand this is a lot of what we talk about internally is building an immune system. And this is something that I get really excited about because, you know, what is an immune system to us as humans, right? It is a barrier between our biology and reality. And it's constantly learning, right? It's constantly learning from every, you know, interaction that we have with reality. Anytime we're exposed to a new, you know, pathogen, a new virus, bacteria, whatever, it's constantly learning. It's remembering those things. 
And then it's becoming a part of the way that we live every single day. So that way we as humans can basically go outside, live freely, play outside, do whatever we want. And it's this thing that we can very easily take for granted. But it's this kind of long-term memory that stays with us our entire life. Just to make sure that we are resilient to the pace of things that we're basically throwing at our biology. And we started this conversation talking a lot about the pace of development has increased. And we can't think about software as a machine anymore. We have to think about it as an organism, right? It's evolving and it's evolving in ways that we don't fully understand. And it's interfacing with the reality in ways that we don't fully understand. And a lot of what we're trying to do with these scenarios and code simulations and stuff like that is build that immune system layer for your software. And so it's constantly being exposed to reality, you know, through its connection to the ticketing system through its connection to the telemetry, through its interaction with just, you know, the commits that are happening. It's basically building this long-term memory and accruing that over time. And that requires a discipline that actually spans beyond the code base as well, right? And that's really what I'm trying to get to here, right? That's kind of a secondary and important differentiator from, you know, what you would really want out of cursor or a cloud code or something like that. And then the third thing is, you know, our ability to actually run verification without compiling. And that's the code simulation piece of this, right? That's a really unique capability. It's a really hard thing to do to actually be able to say, hey, throw a billion lines of code at me and I'll predict how, you know, if this invoice comes in, in this country, in this particular, you know, configuration, exactly what would happen in the system and would it behave correctly or not, right? And it's built on those first two differentiators and adds onto it as well, you know, with some of our own custom models, with some of our own, you know, evaluation, with some of our own research that we have to throw at it. The underlying machinery is all LLMs and, you know, that ilk as opposed to, like, it's kind of calling to mind, like, automated verification and, like, you know, kind of NASA-style, like, going through the code and, like, really understanding what every line is doing and using that to predict outputs, given inputs. but we're not talking about anything like that. Yeah, like it's a combination of different LLMs. There's a couple of different reinforcement style or reinforcement learning style models that we've trained as well that are actually both LLM and not LLM based. So there's a few kind of just graph models that we've trained as well that can just go and traverse different paths through the graph. And then, yeah, there's like a couple of different retrieval and embedding models that we've got to tune. And so it's a combination of many different things, but the capability, I think here is what's really important. On the models side of things, I'm assuming you're using pre-chain models and fine-tuning them. You mentioned RFT reinforcement, fine-tuning. Can you talk a little bit about the base models that you selected and how you think about that space? Yeah, that's a quickly evolving arena as well. I think just yesterday, some cool things happened. 
So, you know, I think there's probably two or three, like, important models that we rely on. So in terms of retrieval and stuff like that, we use a lot of open source sorts of things. So there's, like, a tuned version of Llama. I think we're actually evaluating the GPT-OSS stuff that came out yesterday. So that'll be really interesting. It seems to perform quite well, given some of our internal training sets. So yeah, there's a lot of stuff around retrieval that we've trained. In the code tracing, and a lot of the ontology built on top of that graph, a lot of that comes from slightly more effective reasoning models. This tends to be a combination of some of Anthropic's models, call it some of the older Sonnet models, plus, like, a variant of DeepSeek that we host that's able to kind of reason through a bunch of these different traces and explore a few different hypotheses. And then after that, we found that Sonnet, you know, Anthropic's flagship model, tends to be really good at kind of being an orchestrator between a bunch of different models, to be able to kind of, like, manage... Yeah, kind of like a router. It tends to be really good for that as well. And so all of these different things all come together at the end of the day. I would say that middle layer tends to be where a lot of the heavy lifting happens. So that's the thing that's, you know, tracing through code or, you know, trying to retrieve all of the right context to basically predict the next step. These types of models tend to be, in terms of raw tokens, the maximum. They tend to be the ones that are most heavily leveraged. And so those are the ones that we also spend a lot of time tuning, bringing in house, managing the inference for. And so when you have a new customer, or a new system with a new or old customer, are you doing a lot of pre-compute on their code base? Or is it all, like, you know, query-response and you're kind of doing everything in real time? Yeah, there's always a balance here. We do have some pre-compute that needs to happen, especially at the types of code bases that we deal with. Like creating embeddings and that kind of stuff, or...? Yeah, so it's embeddings, it's building that graph. Building the graph. And, you know, there's some sort of online learning that happens as well, to say, like, hey, every time a PR is merged in, right, make a prediction for what might break. And once the thing has actually gone in, we then go and say, well, what actually broke? And did our prediction match reality? And then basically go and tune our models further, right? And we do this on a tenant-by-tenant basis, right? So we're not using one customer to go inform someone else. But this is constantly happening for all of our customers. Using, like, LoRA adapters per customer or something like that? Yeah, so LoRA was a great way to start. We've actually found that, like, you know, RFT over kind of longer time batches tends to just work a little bit better. LoRA has some limitations, especially when you start getting to, like, reasoning models and stuff like that. But yeah, I mean, there's a ton of pre-compute that happens. Usually most customers are live in a couple of hours after they've actually imported their code bases, and we're off to the races. But sometimes it takes a little bit longer, sometimes a little bit shorter, really kind of dependent on the size of the code base.
And the number of H100s you've got? Yeah. You said it, not me. Yeah, no, exactly. Exactly. When you start with a new customer, like, maybe this is taking a step back from Player Zero and what you're doing. I'm curious, like, what you're seeing, you know, in terms of the landscape and the way the customers are thinking about, you know, whatever they call this phase that we're in, agentic coding, AI-assisted coding, vibe coding, whatever. Like, what are you seeing? How far down that path are they? Like, I'm assuming the bias will be towards larger code bases. I'm kind of envisioning a bias towards more mature companies or more traditional companies. For Player Zero? For Player Zero, yeah. Yeah, I mean, we have a pretty diverse set of customers. I would say that the problem that we solve around supporting the code base tends to be something that happens when you have a code base that customers are actually using. Right. And so, you know, like, we're not really working with super early stage startups, for sure. But usually the problems of scale and complexity. Do you see both, like, your digital natives, your Metas, and your ADTs or Visas? Or is it more the latter than the former? Oh yeah, it's a pretty healthy mix between both of these. I mean, one on the former side, so think a little bit more tech forward, is Zuora, for example. Right. And, like, a really, really sophisticated engineering team distributed across the world. Zuora, like the pricing SaaS or subscription SaaS or something like that? Yeah, exactly. And yeah, they really care about the verification side. You know, I think this is really interesting, actually, because you talk to different companies and they articulate the same problem a little bit differently. But, you know, some of the, I would say, more tech-forward companies tend to care a lot about, you know, the seatbelt. So they tend to care a lot about, well, how do we, you know, move faster and not break things, preventatively? I think some of the more maintenance-mode companies, who basically have just a large amount of accrued software over time, tend to think about it a little bit more as, you know, I have a really expensive set of software, and TCO, right, cost of ownership, is really high. How do I lower that burden in a world where the velocity has increased, right, four or five X, right? And so some people are thinking about it as, like, a support burden, and other people are thinking about it as, you know, how do I really just kind of help accelerate development by scaling verification at the same pace as I'm scaling the speed of code output.
And so this is the, maybe this is just another way of asking you to, maybe this will only result in a restatement of what you just said, but like, I'm wondering if you're able to infer from what you're seeing if all of these tools will allow companies to like you know close the same amount of bugs with less resource or you know i guess you know tell me this tell me there's like a utopia software quality right around the corner because of all this stuff we have coming online yeah you know i i actually i actually i actually believe this really really fundamentally like um i i don't think um i don't think that using ai to reduce resourcing is necessarily the right way to think about it um and and I think what's happening, right, with, we're seeing this with agentic development, and I think we're going to see the same thing with software quality and support, is what we've done is basically leveled up the judgment that the humans can actually apply into the workflow, right? In development, the developers are no longer thinking about, should I name this variable this or that? or how should I implement this particular function, right? They were thinking about what to implement. And I think as these models converge and they get better, the judgment that developers are able to exercise basically levels up one degree. That doesn't mean that the world needs less software. The world is hungry for probably a thousand times the amount of software that it's able to produce today. And I think thinking about this in terms of, of the asymmetry that we were talking about at the beginning. As we start tackling some of these fundamental challenges in the development side of the house, one by one, the burden on quality, both in QA and support across the rest of the SDLC, is going to exponentially increase. And I think for the amount of software that we're going to be able to produce, there aren't enough humans in the world to be able to manage that process the way that we are today. And so I think thinking about it as a, oh, I could, you know, bring in Claude and fire 50% of my developers, right? And just to kind of say it bluntly, right? Like that's like a narrative that's being told. Or saying this in the other way, like, oh, I can bring in player zero and, let go of a bunch of my QA or support staff is the wrong way to think about it because I think all of these things are going supersonic. And what we need to do is basically think about our workforce and say, how do we help them exercise higher judgment in the workflow to create more leverage, right? To create more leverage in how do they verify, how do they test, can they test more in the same amount of time and use what they understand about the business to be able to guide these systems to get more work done for them. Yeah. Yeah. That just, it's like a false premise, I think, to say that the world has enough software and like, yeah, it's just demand is going to keep increasing as supply increases as well. It's an interesting dynamic. Are the kind of next gaps already obvious to you? Meaning you got this STLC that's like evolving rapidly because of the infusion of AI, you know, you tackled Cogen, you know, you saw maintenance quality or some sections of that as an opportunity, but like there's still like this long workflow or maybe not. Maybe it's just those three, four things. Like how do you, are there kind of, you know, is what you're doing exposing other things that are broken that need either you or someone else to fix? 
You know, one of the things that I realized as we were building Player Zero was that quality is not an exercise, like a point in time exercise. It's a process. and it's how do we learn from the mistakes of the past to be better in the future. And I think a lot of what we're doing with this analogy of an immune system and scenarios and code simulation and all these different things coming together is basically trying to create this process, right? To operationalize this process and make it a part of how we build software every single day. and i think as the ai native development era matures a little bit i think the uh uh the question that i don't have an answer to is what does cohesion in that development workflow actually look like you know how do these different things have to have to work together in order to orchestrate and operate software at the next level and i think there's many different bottlenecks here today i think you actually mentioned one of them for example which is if we think about developers as prompt engineers and are basically asking cursor to go iterate on some software that they want to build, and they have zero understanding of the actual code, does the language still have to be Python? Or can it be Java? Or could it be some other obscure language that only computers understand that is easier for language models to write? I don't know. And I think as we start seeing different bottlenecks in the development process actually shift and this entire broader SDLC mature a little bit, I think you're going to start seeing these like second or third order sorts of questions being asked to say, how do we optimize the entire SDLC end to end? So that way these workflows are a little bit more cohesive. Another, I think like simpler version of this is just, you know, how does player zero interact with cursor or how does it interact with cloud code and stuff like that? So like workflows are something that we're really focused on internally to make sure that this is as expressive and easy to adopt as possible. But I think it does beg the macro question of, yeah, adoption is going to be really important. It has to be very accessible in the SDLC, not only for its current moment, but for what the workflows look like, you know, two years from now. I think another way to ask the question that I asked was, you know, if you, you know, didn't start player zero, or you did start player zero, but you're some other person and you're looking to start a company to like work on this SDLC, like what would you start, you know, now? So there's actually one other thing that I think about often, which is security. It's not too dissimilar from quality, but it requires a different muscle to be able to understand And in the same way that we understand the relationship between code and reality from a quality and kind of verification standpoint, I think there's an important muscle to be built about. If we're producing this much code this quickly, how do we make sure that there's no holes in it? And I think this matters, especially for larger companies, but I think it's a universal problem. I think that's one of the obvious ones that comes up. I mean, you mentioned one thing that is on my mind, and it was kind of the question that I asked about, like, what you're seeing in your customers. Like, we have all these tools, and it's not, you know, because it's all moving so quickly, it's not fully clear how they're all going to fit together. Like, you know, take our conversation. 
Like, you're kind of like, well, you could ask Claude Code to do it, or you could just ask Player Zero. Like, there's a lot of overlap, there's a lot of, like, you know, shifting sands. There's, you know, we're in this, like, land grab phase where, oh, this CLI seems to be catching on, let's all create CLIs. Like, a big question that I, you know, find myself asking a lot is, what are all the pieces going to be? Like, once, you know, once the sand stops shifting, or, like, we stop creating new pieces, like, what are the fundamental pieces that will continue to exist, and how do they interact with one another? I think there's still a lot of questions there. I think the best analogy, I mean, like, software creation is a non-deterministic, imperfect process, right? There's many different ways to produce the same software. There's many different ways to write the same code. There's many different ways to verify it. And so we have these gates that we've built deliberately to say, here's how we produce software at scale for our customers and do it well. And I think the human process is some indication for what are the important centers of excellence that a good SDLC has, right? And so production is one. I think verification is another. Evaluation is another. And then support is another. And so production is, okay, well, how do we generate more code faster? And I think there's going to be something like, you know, a Cursor or a Claude Code or something like that, right? That really kind of sticks their flag in the ground for that. I think verification is really where we want to play. And that's really where we take in all the stuff that Claude Code and Cursor are producing and say, is this actually correct? Verify correctness of the code. I think evaluation is, okay, well, I think there's observability and companies like that where I think you say, well, what's actually happening out there in production right now? And, you know, is there a spike in 500 errors? And, you know, is something breaking? And basically kind of the incident response. And I think there's a couple of cool companies in that category. And then I think in terms of support, I think that is basically kind of closing the loop and bringing it back into development, of saying, well, what broke in our verification, and how do we make sure it doesn't happen again, right? And that's the other half of the kind of puzzle that we're working in. Yeah. So those are kind of, like, the high level, broad strokes, what I see as the important areas of investment in the SDLC. I think we call it different things. We call it developers and support and QA. I think in the future, it'll be, you know, a couple of different agents doing the legwork and, like, a few human orchestrators above it. But whatever we call it, I think those are important things to be done for high quality software. Some of that does kind of strike me as, like, fossilized Conway's law. Like, you know, we've got these organizational groups, dev and QA and, you know, what have you. And, like, we've built these workflows that, like, well, fossilize, like, these organizational boundaries. And, you know, part of what's interesting is the point you made earlier. Like, if, you know, we're able to, like, re-imagine the whole stack, you know, in light of a kind of a native AI approach, what does that look like? And to what degree does it differ from what we're looking at now? Yeah. I mean, I think it's going to change a lot.
Hopefully, we get to have some hand in making that happen. What, meaning, and not just, like, the AIs? We as people, or we as Player Zero? Both, actually. Awesome. Yeah, awesome. Well, Animesh, it's been great catching up with you, learning a little bit about what you're up to, and looking forward to keeping an eye on how you guys are playing in the SDLC. Absolutely, Sam. Thank you so much for having me on. This was a real pleasure. I had a lot of fun talking about this. Yeah, my pleasure. Thanks so much. Thank you so much.
Related Episodes

Rethinking Pre-Training for Agentic AI with Aakanksha Chowdhery - #759
TWIML AI Podcast
52m

Why Vision Language Models Ignore What They See with Munawar Hayat - #758
TWIML AI Podcast
57m

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757
TWIML AI Podcast
48m

Proactive Agents for the Web with Devi Parikh - #756
TWIML AI Podcast
56m

AI Orchestration for Smart Cities and the Enterprise with Robin Braun and Luke Norris - #755
TWIML AI Podcast
54m

Building an AI Mathematician with Carina Hong - #754
TWIML AI Podcast
55m