TWIML AI Podcast

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757

Tuesday, December 2, 2025 · 48 min

What You'll Learn

  • Gimlet Labs' goal is to make AI workloads, especially agentic AI, at least 10x more efficient
  • They have built a stack to run models efficiently on highly heterogeneous systems, from laptops to data center hardware
  • For agentic AI, they focus on orchestrating the entire application flow, not just individual model executions, to optimize latency and efficiency
  • They use techniques like fine-grained partitioning of the workload and right-sizing hardware to get the best cost per token
  • They leverage Kubernetes and dynamic resource allocation to enable flexible, real-time optimization of workload placement
  • The key challenge is managing the complex trade-offs between compute, memory, and cost across heterogeneous hardware

Episode Chapters

1. Introduction: Background on Zain Asgar and Gimlet Labs, and their focus on improving the efficiency of agentic AI workloads

2. Agentic AI Workloads: Explanation of what agentic AI means and the heterogeneous nature of these workloads

3. Optimization Approach: Discussion of Gimlet's techniques for partitioning workloads and right-sizing hardware to optimize cost and performance

4. Kubernetes and Dynamic Resource Allocation: How Gimlet leverages Kubernetes and DRA to enable flexible, real-time optimization of workload placement

5. Challenges in Heterogeneous Environments: The key challenge of managing the complex trade-offs between compute, memory, and cost across diverse hardware

AI Summary

This episode discusses the work Zain Asgar and his team at Gimlet Labs are doing to improve the efficiency and performance of agentic AI workloads at scale. They are focusing on orchestrating and optimizing the execution of these heterogeneous workloads across diverse hardware in data center environments, using techniques like fine-grained partitioning and right-sizing of resources. The key challenge is managing the complex trade-offs between compute, memory, and cost to achieve the best performance per token.


Topics Discussed

Agentic AI · Heterogeneous compute · Workload orchestration · Resource optimization · Kubernetes


Episode Description

In this episode, Zain Asgar, co-founder and CEO of Gimlet Labs, joins us to discuss heterogeneous AI inference across diverse hardware. Zain argues that the current industry standard of running all AI workloads on high-end GPUs is unsustainable for agents, which consume significantly more tokens than traditional LLM applications. We explore Gimlet's approach to heterogeneous inference, which involves disaggregating workloads across a mix of hardware, from H100s to older GPUs and CPUs, to optimize unit economics without sacrificing performance. We dive into their "three-layer cake" architecture: workload disaggregation, a compilation layer that maps models to specific hardware targets, and a novel system that uses LLMs to autonomously rewrite and optimize compute kernels. Finally, we discuss the complexities of networking in heterogeneous environments, the trade-offs between numerical precision and application accuracy, and the future of hardware-aware scheduling. The complete show notes for this episode can be found at https://twimlai.com/go/757.

Full Transcript

This podcast is sponsored by Google. Hey folks, I'm Amar, product and design lead at Google DeepMind. We just launched a revamped Vibe Coding Experience in AI Studio that lets you mix and match AI capabilities to turn your ideas into reality faster than ever. Just describe your app and Gemini will automatically wire up the right models and APIs for you. And if you need a spark, hit I'm Feeling Lucky and we'll help you get started. Head to ai.studio slash build to create your first app. Join developers from Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat, and more than 75 other supporting companies to build the open tool stack for multi-agent software and trusted agent identity on AGNTCY. AGNTCY, which I recently discussed on the podcast in my interview with Vijoy Pandey, is now an open source Linux Foundation project where you can help create the protocols, specs, and tools that power next-gen AI infrastructure. Visit agntcy.org to learn more and join the build. That's A-G-N-T-C-Y dot O-R-G. If you take a look at training hardware, it's kind of gone the way of, like, building supercomputers, right? Like, you know, people don't talk about building machines anymore. They're like, here's my entire rack, right? This starts looking like, you know, what Cray was doing. So in some ways, you know, you could be like, oh, we've kind of regressed back to the supercomputer era. And I don't know if I use that word positively, right? We're building these fully vertically integrated systems. I'm not sure that's the route for inference. I think inference is much better served as a large-scale workload where you can utilize a bunch of relatively commodity hardware and be able to scale out efficiently. All right, everyone, welcome to another episode of the TWIML AI Podcast. I'm your host, Sam Charrington. Today, I'm joined by Zain Asgar. Zain is co-founder and CEO at Gimlet Labs and an adjunct professor of computer science at Stanford University. Before we get going, be sure to hit that subscribe button wherever you're listening to today's show. Zain, welcome to the podcast. Hi, Sam. Thanks for having me here. Super excited to be here. Excited to have you on the show and looking forward to digging into our conversation. We'll be talking about the work you're doing around heterogeneous inference for agentic systems. To get us going, I'd love to have you share a little bit about your background. As I mentioned, I'm co-founder and CEO of Gimlet and also adjunct faculty of computer science at Stanford. Prior to this, I was a general manager at New Relic through an acquisition of my previous startup, Pixie. And actually, you know, a bunch of people from Pixie are now at Gimlet as well. I was an EIR at Benchmark Capital, which is where the idea for Pixie came from. I was in Google Research and spent a lot of time at NVIDIA. So I've kind of focused on efficient compute and being able to orchestrate and run compute efficiently on large scale clusters. And where did the idea for Gimlet come from? What are you going after there? So when we started Gimlet a couple years ago, we had a focus on, like, how do we actually make AI workloads, you know, at least 10 times more efficient, right? And part of the challenge over here has been that you've seen this huge explosion in AI workloads, especially around agentic AI, where you're, you know, consuming like 10x more tokens.
And really, if you want to be able to keep this somewhat sustainable, you need to have these big leaps in improvements. So that was our original focus with Gimlet, and when we started off we were really thinking about, how do we get models to run on things like your laptop and, you know, Raspberry Pis or whatever, right, like small scale hardware, and how do we get the best efficiency? Exactly, any kind of edge device. But one of the things we kind of realized is that, you know, we built up our stack to work on this very, very heterogeneous system, right? Because, you know, two MacBooks look very different than typical Windows laptops running Intel or AMD CPUs, which look very different than an iPhone. And we got really good at running models efficiently. So we kind of realized that this technology that we've built up is actually pretty applicable to data center scale systems. And, you know, there's this large, large scale problem. And so we could have a much larger impact if we actually improve that system. And, you know, our team decided to start going after that space instead of directly targeting edge devices. Partly because, you know, we think that the edge market still needs a couple of years to really materialize. Whereas the data center market is crazy hot right now. Exactly, exactly. And I think, you know, the fact that we can basically orchestrate and run stuff across many different hardware to optimize the unit economics and also get best in class performance is, like, a big, you know, big benefit over there. One of the things that you guys talk about is specializing in compute for agentic AI workloads. Does that just mean LLMs and multimodal models? Or is there something specific about agentic that you think needs to be considered in the hardware? So one of the things we started to build for is, how do you run entire applications? Today, people run agentic frameworks. Those frameworks typically call out to different APIs to run models, which ends up meaning you spend a bunch of time doing all these API calls and orchestrating API calls. We kind of wanted to take the approach of, like, well, what if you could avoid all the round trips and put everything in a single system? And agentic systems as a whole are very heterogeneous, right? Because there's a bunch of, you know, CPU compute happening. There is a bunch of, you know, things like database calls, etc., etc. And there's a lot of LLM model execution. And if we can better orchestrate all of those and understand when and where things are going to be run, we can both improve the efficiency and the latency of the system, but also be able to make optimizations within the models themselves, because we know how they're going to get used. Got it. And what kind of optimizations are you typically looking to make? Yeah. So one of the biggest things we do from an orchestration perspective is just right-sizing the hardware, right? Today, you know, the general consensus is that people will go buy the highest-end hardware they can, which is basically a B200, and you'll basically maximize however many B200s you have to run your model workloads. One of the things we do is actually do pretty fine-grained partitioning of all your models, including your entire, like, you know, we think about an agent as an entire data flow graph. So how do you actually break up the data flow graph?
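To make the "agent as a data flow graph" framing concrete, here is a minimal illustrative sketch in Python. It is not Gimlet's API; the stage names, the two hardware pools, and the latency-critical flag are all hypothetical, and the partitioning rule is deliberately simplified to the idea described above: keep latency-critical model calls on top-tier accelerators and offload everything else.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    kind: str                     # "llm", "tool", "db", "cpu"
    latency_critical: bool        # is this on the user-facing critical path?
    deps: list = field(default_factory=list)

# Hypothetical agent flow: plan -> (web search, DB lookup) -> synthesize answer.
graph = [
    Stage("plan", "llm", latency_critical=True),
    Stage("web_search", "tool", latency_critical=False, deps=["plan"]),
    Stage("db_lookup", "db", latency_critical=False, deps=["plan"]),
    Stage("answer", "llm", latency_critical=True, deps=["web_search", "db_lookup"]),
]

def partition(stages):
    """Toy split: latency-critical model stages go to top-tier accelerators,
    everything else is offloaded to cheaper, better-utilized hardware."""
    hot = [s.name for s in stages if s.kind == "llm" and s.latency_critical]
    cold = [s.name for s in stages if s.name not in hot]
    return {"high_end_gpu": hot, "commodity_pool": cold}

print(partition(graph))
# {'high_end_gpu': ['plan', 'answer'], 'commodity_pool': ['web_search', 'db_lookup']}
```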
How do you break up the models so you can assign the most performance critical pieces to the highest end hardware, and then offload the less performance critical pieces to hardware that will be better utilized by running that workload? What you're describing reminds me of a conversation that I had with Lin Qiao at Fireworks about kind of the way that they're approaching distributing inference across their cloud environment. I imagine that, you know, anyone who's dealing with inference at large scale is going to be thinking about how to do that kind of distribution. Does the heterogeneity add a particularly, you know, interesting element to it? So once you have any kind of large scale, people typically talk about things like prefill/decode disaggregation, where you run prefill on one set of nodes, decode on another set of nodes. When you start looking at heterogeneity, especially if you look at heterogeneity across vendors, this problem exists both within a vendor and even across vendors. You start having to think about a lot more of the trade-offs of, what is my cost of compute? What is my cost of memory bandwidth? What is my cost of memory capacity? And then, you know, it's basically this giant optimization problem of figuring out where to put which workload in order to optimize the cost per token. And is that something that you are thinking about on, like, a real-time basis, like a runtime basis, or is this more of a deploy time consideration, where it is, you know, when you're deploying an agent, it gets kind of introspected and you determine what should go where at that deploy time? It's a little bit of both, because when you first deploy it, you know, you'll probably have a good estimate of where things are, but most of these workloads are complicated enough that it's pretty hard to know exactly where things are going to land. So there is a whole bunch of, you know, profiling that goes on after the fact, so observability and stuff kicks in, and then you get a better idea of how to change the allocations in order to improve the performance. So in our system we actually have two paths: there's the fast path, where we just try to get something up and running, and then there is another path which tries to move the workloads around to optimize the utilization of the hardware. And then from an API perspective, there's a whole bunch of routing decisions that need to get made, because, you know, you need to know where the data is available, what's cached, what's not cached, and, you know, where the models are loaded. You might be offloading models into CPU memory and things like that. So you need to know the cost dynamics there on a real-time basis. I'm imagining kind of an optimization problem or a control loop, where you've got like a Kubernetes cluster with a bunch of different stuff in it. You, coming from New Relic, you mentioned observability. You've got a bunch of monitoring types of tools, and you're kind of mapping all that to some kind of cost factors and using that to move things around in that kind of environment. Is that kind of close to the way you have mapped things out? Yeah, that's actually not far from how our thing works, right? We run completely on top of Kubernetes. And we actually use some pretty cool new stuff like DRA in Kubernetes to really help orchestrate some of these workloads. Oh, what's DRA? It's the resource allocator, the dynamic resource allocator.
So it basically allows you to be like, oh, I have this many slices of capacity available, how do I actually route to that? Oh, nice. Yeah. So we can actually, instead of thinking of a GPU as a full resource, we could be like, oh, here's a quarter GPU worth of work. Oh, interesting. I'm trying to remember the name of the company that used to do this. Run:ai. Yeah, they were acquired by Intel, I think. NVIDIA. NVIDIA, yeah. And so now that technology is more readily available. Correct, but it's really not supported on GPU hardware per se. So one of the things we do is actually really make it easy to run workloads on GPUs and other accelerators while partitioning them into these segments. But we rely on Kubernetes to help orchestrate it, because, you know, we don't want to rebuild everything, right? So we want to rebuild the pieces that matter and then rely on the high-level frameworks to provide us the hooks necessary. So does Kubernetes DRA already support GPU partitioning, or is it a broad framework and you have to plug in the GPU bits? It's a pretty broad framework. It's more just for doing dynamic resource allocation. So we basically put in the GPU bits over there, and DRA itself is actually not GA in Kubernetes yet. Okay. It's supposed to get released in the next version, I believe. Okay. So the challenge that really comes in, I think, when you deal with heterogeneous stuff is, you know, there's kind of two things. Orchestrating on heterogeneous hardware is more challenging, because the networking fabrics and stuff could be pretty different. So you have to figure out how, you know, because people are typically running very high-speed network fabrics, like 200 gigabit, 400 gigabit, 100 gigabit Ethernet. Sometimes they're running, you know, InfiniBand, or something more proprietary, like even NVLink across a rack, and you have to kind of figure out how you can actually orchestrate across these different fabrics. And every single hardware has, you know, a different programming model, right? Most of them, obviously NVIDIA uses CUDA, but there's a whole bunch of other factors if you're targeting Intel and AMD hardware. Yeah, talk a little bit more about why that networking heterogeneity causes challenges. I think, you know, folks that have been in the space long enough think back to the OSI seven-layer stack, and at some point we should be abstracted from all that underlying detail, but that is not the case here. I think we're getting there slowly, right? Like, for example, you know, at Gimlet most of our stuff runs on top of this thing called RoCE, which is RDMA over Converged Ethernet, right? RDMA was like doing remote direct... Yeah, remote DMA, yeah. Yeah, remote direct memory access. So it basically tries to emulate memory semantics over a network interface. And you really need stuff like this, because you're like, oh, I want to move data between two different accelerators, right? And the fastest way to do it is to be like, oh, here's this chunk of memory, I'm just going to send it over to this other machine.
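As a rough, single-host analogue of that memory-semantics point, the PyTorch sketch below contrasts a copy staged through host memory with a direct device-to-device copy, plus a layout change for a cache block whose format differs on the receiving side. It assumes at least two CUDA devices are visible, is purely illustrative, and does not show RDMA/RoCE itself or anything Gimlet-specific.

```python
import torch

if torch.cuda.device_count() >= 2:
    kv_block = torch.randn(8, 128, 64, device="cuda:0", dtype=torch.float16)

    # Slow path: device 0 -> host (CPU) memory -> device 1, an extra copy plus CPU involvement.
    staged = kv_block.cpu().to("cuda:1")

    # Faster path: direct device-to-device copy (peer-to-peer / GPUDirect paths when supported).
    direct = kv_block.to("cuda:1")

    # If the receiving engine expects a different cache layout, a transpose plus
    # .contiguous() rewrites the data into the layout it can actually consume.
    reshaped = direct.transpose(0, 1).contiguous()
    print(staged.shape, reshaped.shape)
else:
    print("Need two CUDA devices to run this sketch.")
```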
The challenge you typically run into with this type of stuff is, you know, you don't usually want to traverse the CPU while doing it, right? Because you don't want to have to have the CPU do, like, a call to the GPU, capture the stuff in CPU memory, and then copy it out. So there are things like GPUDirect that actually allow you to do this, and they're pretty mature in the, you know, NVIDIA ecosystem, and they're usually relatively mature if you go within a single vendor, right? Like, if you want to say, I want to copy from AMD to AMD, it's relatively straightforward. The problem you run into is when you want to copy from one machine to another, they may not be able to transparently do it, so there is a whole bunch of, you know, systems engineering that happens over there. Even if you're trying to transfer, like, caches and stuff, they may not even be in the same formats, which means you may have to apply some transposes and move data around in order to actually get it to properly map to other hardware. And all this is in a spot where, you know, milliseconds matter, or maybe even microseconds matter. Gimlet operates a Gimlet cloud, but it also sounds like a big part of the heterogeneity that you encounter has to do with disparate customer environments. You know, what's the split that you tend to see between usage on your cloud versus usage in customer environments? Is there a focus, you know, is one more of a focus than the other? Like, how do you think about that? Gimlet, you know, as of today, we haven't publicly launched our cloud product. We have a handful of early access customers on our actual cloud product. Almost all of our sales (and we do about eight figures of revenue right now) have been through data center customers and partners on the semiconductor side. And over there, it's mostly self-hosted. Eight figures is not bad for a company that launched two weeks ago. Yeah, well, you know, we've been working on it for a little while under the radar. Talk a little bit about the licensing model. Like, is that primarily software licensing? Is that also hardware provisioning? How does that all work? So today we've been mostly, you know, going to some of the semiconductor and data center side partners, and that's been a software license. And it's usually restricted to some number of nodes, or there is an additional percentage licensing fee based on how much hardware you utilize. And then on the cloud side, when we're planning to launch our product in Q1, that'll be much more of a usage-based pricing. So we've talked a little bit about, you know, some of the technical challenges, networking and heterogeneity. Those all sound like, you know, areas for compromise, but you're also targeting, you know, you mentioned 10x performance as kind of an initial vision. You mentioned better performance and better cost per token. Like, how do you overcome, you know, those challenges, those compromises, and get to something that's actually better? So we think about, you know, at the highest level, we think about our stack kind of as this three-layer cake, right? So you give us the agent graph, which is basically a data flow. And as of today, we can basically absorb, you know, things like LangChain, or even people who have written Python code around, like, Hugging Face or whatever, right? Like, we can absorb that stuff into our system directly.
We don't have like our own agentic interface or something. We expect people to come in with their own workflows. So we want to meet developers where they're at. And do developers need to call some libraries if they're in a pure Python, non-LangChain type of environment, or? Yeah, when you import Gimlet, a lot of our stuff relies on doing tracing through the Python code. So it is currently only usable through Python; that might change in the future. But as of today, you know, it's basically stuff written around Python or around other libraries where we can generate a graph out of it. So it might be analogous to how you might use a LangSmith or something like that, where you're decorating functions that call out to LLMs, that kind of thing. Right. Except on our side, we expect people to just use whatever system they're using, and we intercept the calls and figure out how to map into our system. Okay. So we don't have our own decorators. Okay. Got it. Got it. And so you're intercepting the calls at the Python function level. And is the implication that you're, you know, then potentially, you know, the natural kind of place for that to run is like on a local, you know, GPU, et cetera, but you're kind of, you know, remoting that out to wherever it needs to be running? Correct. So when you upload it into our system, we'll basically run through it and be like, okay, here's the graph that we know, and then we'll separate out the models to be able to go run on GPU resources. And, you know, we know, like, for example, if you import a model from Hugging Face or something, we'll know exactly what model it is and bring it into our system. And it's all pretty automated. Like, our goal is, you know, five minutes from, hey, I have this thing that runs locally, to, I want this to be scalable in the cloud. Yeah. And are you limited to specific models, to published models, or if you've developed some custom model, can that be used as well? Yeah, so our whole system is designed to be able to run on any type of model. We actually support more than LLMs in our stack as well. We have support for several different types of models. And so that kind of goes into how we architected the system. So once we have the agent graph, right, the first layer that we do is what we think about as workload disaggregation, which is, how do we actually figure out what are the granular components of this workload and how to split it up in some meaningful way. And this is actually where, you know, the vast majority of benefits come in, because knowing what hardware is available and how to split it up, you can go ahead and start packing these things in pretty tightly into the different resources. And this is also where the cost modeling comes in. You start thinking about, you know, what is my cost of compute on all these different hardware? What's my cost of memory bandwidth? What is my cost of memory capacity? And you start figuring out what are the critical resources for each one of these sub pieces of the workload. And then you can allocate it to the lowest cost resource.
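A toy version of that cost model might look like the greedy stand-in below. The hardware pools, per-resource prices, and the crude speed-based SLA check are all invented for illustration; none of these figures or names are Gimlet's, and the real system is described as a much richer optimization.

```python
# Invented per-unit prices and a relative speed used as a crude SLA proxy.
POOLS = {
    "b200":      {"compute": 3.0, "bandwidth": 2.0, "capacity": 1.0, "speed": 1.0},
    "older_gpu": {"compute": 1.2, "bandwidth": 1.0, "capacity": 0.5, "speed": 0.4},
    "cpu_node":  {"compute": 0.3, "bandwidth": 0.4, "capacity": 0.1, "speed": 0.05},
}

def piece_cost(piece, pool):
    """Cost of running one sub-workload on one pool, from its dominant resources."""
    return (piece["compute"] * pool["compute"]
            + piece["bandwidth"] * pool["bandwidth"]
            + piece["capacity"] * pool["capacity"])

def assign(pieces, sla_speed):
    """Greedy stand-in for the real optimization: pick the cheapest pool that
    still meets each piece's minimum speed requirement."""
    plan = {}
    for p in pieces:
        feasible = {n: c for n, c in POOLS.items() if c["speed"] >= sla_speed[p["name"]]}
        plan[p["name"]] = min(feasible, key=lambda n: piece_cost(p, POOLS[n]))
    return plan

pieces = [
    {"name": "prefill", "compute": 9.0, "bandwidth": 3.0, "capacity": 2.0},
    {"name": "decode",  "compute": 2.0, "bandwidth": 8.0, "capacity": 4.0},
    {"name": "embed",   "compute": 0.5, "bandwidth": 0.5, "capacity": 0.2},
]
print(assign(pieces, sla_speed={"prefill": 0.8, "decode": 0.3, "embed": 0.0}))
# Compute-heavy prefill lands on the flagship pool, bandwidth-heavy decode on the
# older GPUs, and the small embedding step on CPU nodes.
```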
Right, and of course all this is within some SLA bound. So, you know, in very technical terms, this lines up as some big convex optimization problem, right, with a bunch of different assumptions, and you try to optimize this down and figure out, here's the optimal allocation. In real life, it's very unlikely you'll be able to get the optimal allocation, because the hardware resources you want may not be available, so you'll just try to, you know, find the best bet, where you deviate from the optimal. The next layer down is that we have our compilation system, which will basically take all these workloads and lower them to the target hardware. And when you do that, this is where we do all the optimization around, like, okay, well, this stuff runs on a CPU, this stuff runs on a GPU, and we'll, you know, go and compile that stuff down. The challenge you typically run into with any kind of, you know, generic compiler framework is that you usually end up with, at least in the AI space, you end up with relatively unoptimized kernels. So we have a framework underneath it where we actually autonomously rewrite your kernels using LLMs in the background, to try to generate more optimized code to run most efficiently on AMD hardware or Intel hardware or NVIDIA hardware. You said you're using LLMs to do the compilation? So for the first layer in the model compiler, we use MLIR plus Torch-MLIR and LLVM to do the compilation. And then the layer below that, we have automatic kernel synthesis, which is basically, take all the compute code that's running, and how do we generate better compute code? And that's actually done using LLMs. And did you have to post-train a model to do that, or fine-tune? Uh, it's all post-training. So can you talk a little bit about that process? Yeah, yeah. So actually there, for reference, for the readers, for the watchers, sorry, there are actually a few blog posts on our blog that talk about this in lots and lots of detail. But at a high level, it's a multi-agent system, right? So it's a, you know, Gimlet-hosted multi-agent system in our own stack. But basically what it does is, there's a supervisor-based model which tries to generate new kernels, right? So you give it your PyTorch input; in our case it's usually PyTorch, that's the only code we really optimize, so you give us the PyTorch input. And then we can give it reference code if we have it available. So for example, we may have CUDA reference code available, right, because there may be a CUDA version of that kernel available; if not, it doesn't matter, it's not required. It's a hardware-in-the-loop system, which means that, you know, the first agent will actually go and generate a whole bunch of candidate kernels. We will then go run this candidate kernel on the target hardware, and we'll do all the profiling and correctness checking, because typically, you know, when you generate new kernels, there are always challenges with correctness, so you need to go run all the tests and make sure it's correct. We capture the profiling data, and then there's another part that basically looks at the profiling data and says, okay, try these new techniques to optimize the kernels further, right? And then the first generation part will again go regenerate new kernels based on the optimization criteria that it thinks it needs to apply. And then you kind of repeat this loop a few times until it converges.
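The shape of that hardware-in-the-loop loop can be sketched in a few lines of Python. The candidates here are plain Python softmax implementations rather than generated GPU kernels, and the generator/reviewer agents are replaced by a fixed list, so this is only a structural sketch of the control flow described above: generate candidates, check numerical correctness, profile on the target, keep the winner or fall back.

```python
import math
import random
import time

def reference_softmax(xs):
    """Ground-truth implementation the candidates must match."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Stand-ins for the generator agent: a fixed list of candidate implementations.
CANDIDATES = [
    ("naive", lambda xs: [math.exp(x) / sum(math.exp(y) for y in xs) for x in xs]),
    ("stable_single_pass", reference_softmax),
]

def numerically_close(candidate, trials=100, tol=1e-9):
    """Run sampled inputs through candidate and reference, require agreement."""
    for _ in range(trials):
        xs = [random.uniform(-5, 5) for _ in range(32)]
        got, want = candidate(xs), reference_softmax(xs)
        if any(abs(g - w) > tol for g, w in zip(got, want)):
            return False
    return True

def profile(fn, reps=200):
    """Wall-clock a candidate; the stand-in for running it on target hardware."""
    xs = [0.01 * i for i in range(256)]
    start = time.perf_counter()
    for _ in range(reps):
        fn(xs)
    return time.perf_counter() - start

# Run, check correctness, keep only if faster, otherwise fall back.
best_name, best_time = "reference", profile(reference_softmax)
for name, fn in CANDIDATES:
    if not numerically_close(fn):
        continue                      # reject incorrect candidates outright
    t = profile(fn)
    if t < best_time:
        best_name, best_time = name, t
print("selected kernel:", best_name)
```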
And if you get a faster kernel, you use it. If not, you just fall back to the previous implementation. Interesting. And the correctness testing that you're doing, is it, you know, does it run, or is it somehow faithful to the original PyTorch code? I guess I'm imagining that you could produce valid kernels that run, but somehow have slight deviations from the original algorithm. Yeah. So what we do over there is that we actually run through a bunch of numerical tests to make sure that the kernels are numerically equivalent. The caveat over here is that... Doing some kind of sampling of the input space and using that to determine? Okay. The challenge over here is that it's basically impossible, well, maybe that's a strong word, but it's very, very hard to generate a completely numerically equivalent kernel because of the way floating point math works, right? And, you know, this is actually part of the challenge, because you look at this problem and you're like, oh, we can just prove that they're equivalent, right? But it's actually not that easy, because, you know, if (a times b) times c is not the same thing as a times (b times c), certain things start looking really, really weird, right? And that's where you worry about floating point math. And because of those small precision issues, it actually makes this problem pretty difficult. So we're actually working on some research to show, what is the impact? How do we understand what the impact is of accumulating these inaccuracies in the loop? Does that imply that some type of quantization, or quantized models or something like that, would be more easily optimized because you don't have the floating point issues, or are they two different areas of floating point? Well, I think a lot of people quantize into a lower precision floating point now, like FP4 or whatever. So ultimately you still have those issues, but the good thing about FP4 is there are not that many possible values, so it's actually a little bit easier to... Obviously, the higher the precision, the harder it is for us to verify correctness and know the exact bounds. So when we do the kernel generation, what we do typically is that we cache all the kernel gens offline for a whole list of models, and then verify them and make sure to put them back in. So we try not to do it in real time, because we're still afraid of potential numerical errors in the system. Yeah, it strikes me that if you're talking to me as a customer and explain this, I'm not sure that I want an LLM in my, you know, compilation and optimization step, just given hallucinations and all that kind of stuff, and given that, you know, they're difficult to verify. That seems like it would discourage me. Do you find that? Well, so normally we tell people that we can generate optimized kernels for them, and people are pretty excited because they hear about this new technology, right? But like I said, we tell people, hey, don't worry about it, because we basically run it offline and make sure that these kernels are correct, because we're not ready to run it live; the third layer of the cake doesn't run in the loop right now. And so the argument then is that the sampling-based verification, you know, you feel like there's sufficient coverage and you can flush out any issues.
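A minimal sketch of that kind of sampled, tolerance-based equivalence check is shown below, using PyTorch and a hand-written "candidate" that computes the same LayerNorm in a different evaluation order. It is purely illustrative; the shapes, tolerances, and trial count are arbitrary choices, not anything described in the episode.

```python
import torch

torch.manual_seed(0)

def reference_layernorm(x, w, b, eps=1e-5):
    mean = x.mean(-1, keepdim=True)
    var = x.var(-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps) * w + b

def candidate_layernorm(x, w, b, eps=1e-5):
    # Same math in a different order, standing in for a generated kernel.
    # Bitwise equality is not expected (float addition is not associative),
    # so the check below only asks for agreement within tolerances.
    mean = x.mean(-1, keepdim=True)
    var = (x * x).mean(-1, keepdim=True) - mean * mean
    return (x - mean) * torch.rsqrt(var + eps) * w + b

def sampled_equivalence(ref, cand, trials=200, rtol=1e-4, atol=1e-4):
    """Sample the input space and require tolerance-level agreement every time."""
    for _ in range(trials):
        scale = 0.5 + 5.0 * torch.rand(1)          # vary input magnitude
        x = torch.randn(4, 256) * scale
        w, b = torch.randn(256), torch.randn(256)
        if not torch.allclose(ref(x, w, b), cand(x, w, b), rtol=rtol, atol=atol):
            return False
    return True

print("equivalent within tolerance:",
      sampled_equivalence(reference_layernorm, candidate_layernorm))
```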
Yeah, and at the end of the day, we basically, you know, once we are running a set of models, we always verify, you know, some set of models, right, and make sure they're correct, and you can just get the performance scores on the model and see if they still match. And so we got as far as the second layer of the cake; did we get to the third layer of the cake yet? The model compiler itself is fully, you know, LLVM compiler-based, right? So it'll never generate code that's not accurate. We did get to the third layer, which is the kernel optimization that happens automatically. And to be honest, the way I would think about this is that, you know, what would happen before is that we would have our compiler. Our compiler would, you know, spit out a bunch of code. And then our team would go by and be like, oh, you know, we think we can rewrite this to be even faster, right? The problem is that, you know, it's easy to do this when you target one piece of hardware. It becomes untenable. Our team's not very large, right? It becomes untenable when you start thinking about targeting, you know, hardware from three different vendors and multiple generations of hardware in each vendor. So then you start, you know, thinking more creatively. And we're like, oh, look, these are the steps that I take to optimize the kernel, right? Like, I write a more optimized kernel, run some benchmarks on it, figure out why it's not as performant as I think it should be, make some changes, keep doing this until I get a good kernel. So we're like, hey, why not build an agent that does this? And if we like the results, we keep it. If we don't like the results, then we will go write it manually. So ultimately, you know, while we talk about these three layers of the cake, the third layer of the cake is really, how do we cut down the amount of work we need to do in order to get a functioning system at large scale. And over time, I think you'll start thinking about, like, how do you apply AI to systems as a whole, right? But I think you're right in the sense that if you go start telling people all these decisions are getting made by LLMs, they're probably gonna get a little upset.
So, you know, we don't really run the LLMs in the live loop today, but we do see that changing over time as, you know, better guardrails and better testing infrastructure come online. What are some examples of ways that you would see them as being useful? So I think there's an argument to be made, for example, that even in our planning system, where we figure out how we're going to orchestrate these workloads, we could have an LLM help in the planning, right? Like right now we have all these rules and heuristics. Yeah, yeah, exactly. You can start thinking about there being some kind of tool, like an MCP tool or something, that says, oh, I will tell you exactly what the performance of this hardware is going to be for this workload, and you can build this out in a much more structured way, where, you know, the LLM goes and examines all these things and then figures out how to plan this stuff out. We aren't that sophisticated yet, right? Most of the planning is done using just, you know, heuristics and optimization criteria. But it'd be pretty cool to see systems that can actually use all the live information to do this over time. Yeah, I'm thinking about the TCO advantages that you're targeting, like better cost per token. Do you have a sense for how much of that comes from each of these layers of the cake? Like how much of it, you know, if we just disaggregated the workloads and kind of put them on the most efficient or the best cost hardware for that workload, like that would be 60% of the benefit, and then compilation is 30% and kernel is 10%? Or is it the other way around, like 20/40/40, something like that? Yeah, so I think it really depends on what hardware you're targeting, right? Because if you're targeting something like, say, at the end of the day, if you just think about the model workload, it's easier to think about, and that's the vast majority of the compute. If you think about something like that and you're trying to run on top of H100s, right, the code is very, very well optimized at this point, which means that at the kernel level you're just not going to get all that much, right? Because it's been optimized for years at this point. And, like, ultimately, you know, we can talk about how machines might be able to do a better job, but hundreds of people hammering on something for years will probably get you pretty close to the best answer there. And, you know, I think if you take a look at stuff like H100, we see, you know, single digit percentage improvements, right? At the kernel level. At the lower levels, yeah. At the lower level, because it's been so hammered out, and it's something that you can figure out statically, like, oh, I have to run this DeepSeek model, here are all the optimizations I need to do in order to make this performant on an H100. Interestingly, if you start looking at even like an RTX 6000, right, which I think is like a B40 or a B200, you'll start seeing much more significant gains to be had over there. You know, we have seen 20, 30, 40% improvements in performance in many cases, and that's because it's a lot less explored than the Hopper optimizations have been. If we start taking a look at, you know, things like kernels on Mac hardware, kernels on AMD and Intel, we also see significantly more improvements.
Like, you know, sometimes even like over 2x. And that's partly because some of these frameworks, like Apple, for example, they don't have a properly working torch.compile equivalent. So we get a lot of benefits because you're comparing against relatively unoptimized code. So we've been talking about Intel, AMD, and NVIDIA, but you also support deployment on the M processors? Yeah, we have always, you know, like I said, we started off as an edge company and we have a soft spot for edge hardware. So, you know, we always have dreams that we'll eventually have a fully distributed cloud where we can run stuff in the cloud and run stuff on end-user hardware and have it orchestrated between both of them over time. So we play around with those dreams all the time. Nice. And do you have anyone running on Mac hardware kind of in the data center? Because that's something that people are doing. I'm sure people are doing it, because I think, you know, probably Apple and other folks like that will be incentivized to do that. But we don't work on doing that today on the data center side. When you say you don't work on it, do you mean you don't have anyone doing it, or the product doesn't support it yet? Our product could run on Mac hardware, because we do support that, but we haven't done it at data center scale ourselves, only on end user side things. Okay, okay. But yeah, so at the kernel layer, you know, somewhere between single digit percentages to maybe a 2x improvement exists, depending on what hardware you're looking at; it could be even more if you're running on very unoptimized or new hardware, right? The compiler side, it's mostly around just doing better fusion, figuring out what all the stuff is. You know, probably at this point, if you compare against best in class, maybe you're looking at single digit percentages, maybe 20 or 30% over there. And some of that is just because the main challenge in the compiler is not the optimizations, it's just being able to apply these optimizations across different vendors' hardware. And the last layer of the stack, I think, is where the vast majority of benefits exist, right? Like a lot of times you hear people say things like, oh, my GPUs are only like 30% utilized, right? And what that ultimately means is that you're wasting like two thirds of your GPU capacity. So there is a fair amount of optimization work around just more efficiently utilizing the hardware you have available. And that's where I think the vast majority of the benefits are. Do you come across interest in applying heterogeneity for like training scale, or is it primarily inference? Like, if so, or if not, are there structural reasons why folks are or are not trying to do that? I think on the training side, when people talk about heterogeneity, it's usually still heterogeneity in a homogenous sense, where they're like, okay, I have an NVIDIA cluster or I have an AMD cluster, and they don't ever try to mix the resources.
And part of the reason is, like, if you take a look at training hardware, it's kind of gone the way of building supercomputers, right? Like, you know, people don't talk about building machines anymore. They're like, here's my entire rack, right? This starts looking like, you know, what Cray was doing, like, four years ago or something, where they're like, oh, here's an entire rack of machines. So in some ways, you know, you could be like, oh, we've kind of regressed back to the supercomputer era, and I don't know if I use that word positively, right? We're building these fully vertically integrated systems, and I'm not sure that's the route for inference. I think inference is much better served as a large-scale workload where you can utilize a bunch of relatively commodity hardware and be able to scale out efficiently. And I think the only way this becomes scalable and sustainable is if I can interoperate easily with different hardware. So I do think that there's going to be some divergence over time, and I think you're going to start seeing this. Like, even NVIDIA, for example, announced their CPX, which is their context processor; you know, it's a way that heterogeneity is making its way into the workload even within a single vendor, single generation. Yeah, but there are certainly vendors that are pushing the kind of supercomputer profile, you know, rack scale solutions for inference, thinking of folks like Cerebras, thinking of folks like Qualcomm, SambaNova. You know, the same ideas apply, but it sounds like you're seeing a lot more heterogeneity on the inference side. Yeah, I just think that, you know, as systems tend to scale out and mature, they tend to become more disaggregated over time and not more aggregated. And maybe over time, all of the new expensive hardware goes first to training and then inference gets the hand-me-downs. Right. And I think in most cases, that'll be fine, right? Like, you know, you might need some of the highest end hardware for the most performance critical work, even on the inference side. But then you can use a lot of the older hardware to then do the inference. Can you talk a little bit about, just to make the conversation more concrete, specific workloads or use cases or customer experiences? Yeah, so we've deployed on a bunch of different heterogeneous hardware at this point. I think there are some numbers that you've probably publicly seen from us on our blogs, and maybe even from some other companies. For example, we showed that if you utilize Intel's Gaudi processors, which are relatively old at this point, and mix that with like a B200 or mix it with an H100 or H200, you can actually get pretty significant TCO benefits. And the reason for that is, you know, if you think about the different workloads, you're essentially looking at what is the cost of compute and the cost of memory bandwidth. And the Gaudi has extremely low cost of memory bandwidth, because it's an HBM chip that's relatively low priced. So if you can start to think about how to utilize that difference, you can actually get pretty significant cost benefits on that.
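The underlying arithmetic is simple enough to sketch. All of the dollar and throughput figures below are invented for illustration (they are not Gimlet's, Intel's, or NVIDIA's numbers); the point is only that blending a cheaper, bandwidth-heavy part with a flagship part can lower the blended cost per token when most of the token volume can tolerate the cheaper hardware.

```python
# Back-of-the-envelope cost-per-token math for a mixed fleet (hypothetical numbers).
def cost_per_million_tokens(dollars_per_hour, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

flagship = cost_per_million_tokens(dollars_per_hour=6.00, tokens_per_second=9000)
older = cost_per_million_tokens(dollars_per_hour=1.50, tokens_per_second=3000)

# Keep the latency-critical slice on the flagship part and push the bulk of
# decode onto the cheaper part, buying more of them to cover the throughput gap.
blend = 0.3 * flagship + 0.7 * older

print(f"flagship only: ${flagship:.3f} per million tokens")
print(f"older only:    ${older:.3f} per million tokens")
print(f"70/30 blend:   ${blend:.3f} per million tokens")
```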
And because it's relatively so much cheaper, you can use a lot more of them and essentially make up for all the perf differences you have, and we can pack the most performance critical workload on like a B200, so from a user perspective they can't really tell, but the cost is actually substantially lower. Have you talked to the folks at Qualcomm? They have their AI inference systems, like they just announced the AI250 and AI200 systems. It seems like one of the things that they are talking about is inserting themselves into these heterogeneous environments, and I would imagine that something like what you're offering might facilitate that. Yeah, I think it would be useful to them. We're just in very early conversations with them, to be honest, but I think it makes a lot of sense. Like one of the things that, you know, everyone thinks about is that most of these data centers and most of the customers, like, you know, have a very high preference to buy large scale NVIDIA hardware, right? Because that's the default. Everything's going to work on it. So the reason heterogeneity is interesting from a business perspective to, you know, some of these other companies is that it's an easy way for them to go to their customers and be like, hey, if you buy a few of these on the side, we can show you the benefit of heterogeneity. And then you don't have to like wholesale make large bets on... You don't have to rip and replace. Yeah, exactly. Yeah. And in particular, it makes the story easier if you're competing against, you know, potentially depreciated, you know, older NVIDIA capacity, right? It makes it easy for you to say, oh, I can utilize all my old NVIDIA capacity plus some of the newer capacity from a different vendor and then see what the changes are over there. Your typical customer deployment is clearly heterogeneous in terms of infrastructure. Are they, you know, similarly more heterogeneous in terms of use cases and applications, or are they more homogenous? Like, they are running this to support a single application or a single agent or set of agents at a large scale. It's mostly the latter. The applications we typically see that are running, and a lot of the data center stuff that we target, they're relatively homogenous, because they're usually utilized by like one or two or three customers. Most of the deployments that we've seen have basically heavy hitter customers. And then, you know, to serve like a large scale, they're going to have a whole bunch of really small workloads. But for the most part, the vast majority of the workloads are coming from like a couple of different players. Okay, so it's not necessarily something that, for example, like a CSP, cloud service provider, might create a general purpose inference cluster using your stuff. It's more, you know, I have this workload and I need a place to run it and try to drive down cost per token. I mean, they could, a CSP could use our stuff, but these days we've mostly been targeting people who have either their own data centers or, like, you know, think about like sovereign clouds and other NeoClouds. Have you seen a lot of, talk a little bit about what you've seen in the sovereign cloud space. Like I hear that, you know, thrown around a lot, not a lot of use case examples, you know, in part because of the, you know, the sovereign nature of it, perhaps. You know, what are you seeing there?
Yeah, so I think even amongst the sovereign clouds and NeoClouds, there's kind of a few different classes over there. So there's some people who have pretty good usage of their clouds, and there's some people that are building cloud capacity, but it's not entirely clear what applications are running on there just yet, and a lot of it, I think, is a software gap. We try to target people who are actually using their stuff; that gives us better feedback on how to improve our things. There's also a class of, say, the sovereign clouds who are building data centers up, you know, even in the U.S. They're not like U.S. sovereign clouds, but they're external sovereign cloud companies that are building capacity up in the U.S. I think they have a lot of real workloads, because they're basically selling that capacity to other customers. So those are interesting people for us to work with, because they have actual usages. And then one of the other things that kind of drives sovereign clouds is, because of all the export restrictions and everything, they're pretty limited by what hardware they can get, which typically makes their data centers heterogeneous almost by design, because they're just getting whatever they can find. So that actually is a pretty interesting area for us, just because we can enable the usage of all that hardware. You know, along the lines of talking about Mac or the M series chips, talking about Qualcomm, do you anticipate that your focus will primarily remain NVIDIA, Intel, AMD, or are you anticipating an extreme level of heterogeneity where you're supporting, you know, kind of a long tail of hardware? Yeah, so what I want to say is, you know, we have support for all these different vendors, but the vast majority of deployments that we go into or look into are still, you know, mostly NVIDIA shops. And even within NVIDIA, the heterogeneity is pretty interesting, right? Across different accelerators there's different cost economics and stuff, so we can still utilize that and orchestrate things. There are a few people we're talking to that want to have data centers with extreme heterogeneity, and these are mostly on the sovereign cloud side, and that's because of the issue that I said: you know, they have only so much capacity they can get from each one, and they're like, okay, we're going to put all of them in here to get the maximum capacity. But for the most part, what we find out is that people have like two vendors in their data center, and that's partly because, you know, we're solving one layer of the stack, but as you start thinking about all this heterogeneity, just doing the system management and everything across a lot of different vendors is a lot of work. So people tend not to have more than, you know, two, maybe three at the most vendors in their actual data centers. And is the heterogeneity that you're seeing at like the rack scale or the, you know, rack of racks type of scale, or is it, you know, even more distributed than this? Like many years ago, a startup that I worked at was really kind of on the cutting edge of distributed systems. And we would go in and stand up these environments with literally, you know, a castoff of this, some, you know, Dell DL 240s, a little bit of this, a little bit of that. But, you know, part of the idea was that you could create enterprise grade SLAs on commodity compute.
And, you know, that idea has surfaced in a lot of different ways in the industry over time. Are you seeing, or are you anticipating, that level of heterogeneity, or is it more like, I'll have my rack of H100s next to my rack of slightly newer or older gear? Yeah, normally it's a little bit further away than on a single rack, because usually on a single rack people have the same type of machine, especially with the vertical integration that happens on all these machines. I guess, maybe, if we take a step back, there are pretty much laws-of-physics limits on how many machines and how much capacity you can have per rack, right? If you go to a generic data center, you know, you can maybe have like 20 kilowatts a rack, and you can't really stack a ton of machines. So you'll actually see that a lot of data centers that run GPUs, that aren't optimized for GPU workloads, typically have, you know, just a handful of machines in each rack, because you're going to exceed the power budget of that rack if you go past, you know, 25, 30 kilowatts, and you just start having, you know, back-of-rack cooling, or a rear door heat exchanger is what they call it. And that usually requires some level of outfitting, right? So it's liquid cooling, but liquid going to a radiator in the back of the rack. And then if you really start going to, you know, 100 kilowatt plus racks, you start going into direct-to-chip cooling, right, which is what a lot of the latest and greatest data centers would use. And once you get to that point, you're basically building a very rack-level system. So typically, you know, if you're not super sophisticated, you have like one or two machines a rack, because that's all you can do, and if you're very sophisticated, you have all these things packed in super tightly, but then now it's very vertically integrated. So either way, you're not seeing a lot of heterogeneity on a single rack, because there's just not that much space one way or the other. Yeah, essentially the GPU and the cooling and power requirements kind of take you out of that kind of super heterogeneous, like, stack up a bunch of commodity stuff. It tends to be more... Got it. Okay. And that might change, right? As lower power systems and, you know, more optimized systems come out. And maybe smaller models as well. And maybe smaller models, yeah. But the industry trend, at least on the hardware side, if you look at NVIDIA's roadmap, is that they're actually putting more and more things on the rack, right? Like, I think their next generation rack, which is Kyber, is going to be like 600 kilowatts, right? It's like almost an entire data center in, like, a single rack. Yeah. Any thoughts on, from where you are today, and again, you've been at this for a couple years, recently launched though, a lot of early success, how you see things evolving over the next year? Yeah, so we are super excited about, sorry, we're super excited about getting to our agent cloud launch, which will be more of a developer facing product. Like, so far we've been going mostly to data centers and et cetera. But, you know, I think our heart is really in going into a developer facing product, working with developers to make running these models and stuff much cheaper and easier and faster.
Actually, as you started saying that, I was going to say, like, if I'm a developer and you're not, you know, offering me a framework, do I care much beyond, like, the cost per token and the feeds and speeds of the performance, ultimately? Like, what are the qualitative reasons why I'm going to care about running on your cloud as opposed to someone else's? Yeah, so I think the areas that, you know, we'll have a lot of capabilities around are, one is orchestrating all the large scale agentic workloads, including asynchronous work and stuff. That's been a big area where people have asked for this, like, oh, I want to make a whole bunch of batch model calls and stuff by a bunch of asynchronously running agents. We can support that pretty easily in Gimlet, so that's actually a big part of it. The other thing is obviously, you know, the cost economics and performance, and just the observability around how these workloads are running. Well, Zain, thanks so much for jumping on. I feel like this has been a bit of a rapid fire chat about what you're up to, but I learned a lot, and it's interesting stuff. I'm looking forward to keeping tabs on it. Thank you. Thanks again for taking a deep dive. Yeah, thanks so much.
