The CEO Behind the Fastest-Growing AI Inference Company | Tuhin Srivastava
Gradient Dissent
What You'll Learn
- ✓ Srivastava has been working in early-stage companies since 2012 and experienced both successes and failures before founding Baseten in 2019.
- ✓ Baseten initially focused on serving data scientists and small models for internal use cases, but the market shifted towards large language models and production use cases.
- ✓ The rise of technologies like ChatGPT and Stable Diffusion created new opportunities for Baseten, as the company adapted its infrastructure to meet the growing demand for AI inference services.
- ✓ Baseten scaled its infrastructure quickly, going from about five GPUs to 100-150 GPUs for a customer like Riffusion.
- ✓ The company refocused its efforts and changed the surface area of its product within six weeks to capitalize on the changing market.
AI Summary
The podcast episode discusses the journey of Tuhin Srivastava, the CEO of Baseten, an AI inference company that has experienced rapid growth in recent years. Srivastava shares his experience of working in early-stage companies since 2012, and how Baseten initially focused on serving data scientists and small models for internal use cases. However, the market shifted towards large language models and production use cases, leading Baseten to refocus its efforts on the inference space. The episode highlights how Baseten capitalized on the rise of technologies like ChatGPT and Stable Diffusion, and how the company adapted and scaled its infrastructure to meet the growing demand for AI inference services.
Key Points
- 1. Srivastava has been working in early-stage companies since 2012 and experienced both successes and failures before founding Baseten in 2019.
- 2. Baseten initially focused on serving data scientists and small models for internal use cases, but the market shifted towards large language models and production use cases.
- 3. The rise of technologies like ChatGPT and Stable Diffusion created new opportunities for Baseten, as the company adapted its infrastructure to meet the growing demand for AI inference services.
- 4. Baseten scaled its infrastructure quickly, going from about five GPUs to 100-150 GPUs for a customer like Riffusion.
- 5. The company refocused its efforts and changed the surface area of its product within six weeks to capitalize on the changing market.
Topics Discussed
AI inference, Large language models, Model deployment and serving, Startup growth and pivots, AI infrastructure scaling
Frequently Asked Questions
What is "The CEO Behind the Fastest-Growing AI Inference Company | Tuhin Srivastava" about?
The podcast episode discusses the journey of Tuhin Srivastava, the CEO of Baseten, an AI inference company that has experienced rapid growth in recent years. Srivastava shares his experience of working in early-stage companies since 2012, and how Baseten initially focused on serving data scientists and small models for internal use cases. However, the market shifted towards large language models and production use cases, leading Baseten to refocus its efforts on the inference space. The episode highlights how Baseten capitalized on the rise of technologies like ChatGPT and Stable Diffusion, and how the company adapted and scaled its infrastructure to meet the growing demand for AI inference services.
What topics are discussed in this episode?
This episode covers the following topics: AI inference, Large language models, Model deployment and serving, Startup growth and pivots, AI infrastructure scaling.
What is key insight #1 from this episode?
Srivastava has been working in early-stage companies since 2012 and experienced both successes and failures before founding Baseten in 2019.
What is key insight #2 from this episode?
Baseten initially focused on serving data scientists and small models for internal use cases, but the market shifted towards large language models and production use cases.
What is key insight #3 from this episode?
The rise of technologies like ChatGPT and Stable Diffusion created new opportunities for Baseten, as the company adapted its infrastructure to meet the growing demand for AI inference services.
What is key insight #4 from this episode?
Baseten scaled its infrastructure quickly, going from about five GPUs to 100-150 GPUs for a customer like Riffusion.
Who should listen to this episode?
This episode is recommended for anyone interested in AI inference, large language models, and model deployment and serving, and for those who want to stay updated on the latest developments in AI and technology.
Episode Description
In this episode of Gradient Dissent, Lukas Biewald talks with Tuhin Srivastava, CEO and founder of Baseten, one of the fastest-growing companies in the AI inference ecosystem. Tuhin shares the real story behind Baseten's rise and how the market finally aligned with the infrastructure they'd spent years building. They get into the core challenges of modern inference, including why dedicated deployments matter, how runtime and infrastructure bottlenecks stack up, and what makes serving large models fundamentally different from smaller ones. Tuhin also explains how vLLM, TensorRT-LLM, and SGLang differ in practice, what it takes to tune workloads for new chips like the B200, and why reliability becomes harder as systems scale. The conversation dives into company-building, from killing product lines to avoiding premature scaling while navigating a market that shifts every few weeks.
Connect with us here:
Tuhin Srivastava: https://www.linkedin.com/in/tuhin-srivastava/
Lukas Biewald: https://www.linkedin.com/in/lbiewald/
Weights & Biases: https://www.linkedin.com/company/wandb/
Full Transcript
In a world where we're shifting towards models doing everything and all the value being in the application layer, if there is AGI, the only market that will exist to some extent is: what does the model need to do? And the model can only run on inference. There are two different parts of inference. On the infrastructure level: I have a workload running across five, ten, a hundred thousand GPUs. How is this thing going to scale? The second is the runtime-level problems: hey, how fast do these models actually run on a given GPU? Who's using open source and who's using closed source in your experience, and how do people think about that trade-off? I think everyone's on this curve of maturity. A lot of people start from custom or open source models where they're doing their own things. The other place people start is that you go with Anthropic, you go with OpenAI, and you have these great models that can do a lot, but they're either too expensive, have a lot of reliability issues themselves, or, the third piece, our customers care that we aren't just piping all this data off to someone who has trained models on it, and that matters to us. You're listening to Gradient Dissent, a show about making machine learning work in the real world. I'm your host, Lukas Biewald. Today I'm talking with Tuhin Srivastava, the CEO and founder of Baseten. Baseten is currently one of the fastest-growing companies in the inference space, which is itself a fast-growing space. It's phenomenally successful and just raised a giant venture round. But I was especially interested to talk to Tuhin because he's been at it for a lot longer than people think. You see those graphs where a company goes from zero to $100 million ARR in a couple of months, but you don't see the years, the years I've known him, when the growth wasn't there and they were still looking for what to do. As someone who has pivoted myself and gone through a lot of struggles, I was excited to talk to him about how he worked through that and how he eventually got the company growing. I was also really excited to talk to him about inference and how it works; I feel like that changes every couple of months. Everyone calls the space a commodity, but it certainly doesn't look like a commodity given the growth you see in certain companies like Baseten. So this turned out to be a really interesting interview. It went in a lot of directions I didn't expect, and I hope you enjoy it. Okay, so I was thinking about this interview. I have a lot of technical questions that I want to get into that I think will be really interesting, but I thought it might be fun to start with a little inspiration for people who are feeling stuck in their companies. You have this amazing story of suddenly taking off after a couple of years of not being the hot company, and I thought that would be a great story to share with everyone who might be feeling at an impasse. Can you tell me about your journey? Yeah, it's actually quite interesting. And Lukas, you could probably empathize with this: I've actually been working in early-stage companies since 2012. 2011, actually.
And that was a series of small companies where I was a very early employee. Most of them didn't work; they really didn't work. Then I had another company from 2015 to 2018, and I met some great friends along the way, which was mostly the upside of it. And then even at Baseten: Baseten was actually founded in 2019, honestly not that long after Weights & Biases was founded. The world was just very different then in terms of what the state of machine learning was, what people thought about machine learning, and honestly even who we were targeting. We were trying to work with data scientists working with small models for internal use cases, and that's what we thought was the market that mattered. Obviously in 2022 everything changed, and the last three years are what I'm sure we'll talk about for most of this interview. The reason I bring up the fact that we've been doing this for 15 years is that I don't think the inputs have changed that much, definitely not since 2015. What has really changed is probably just the market; the market kind of carries everything with it. And the second thing is... Wait, sorry, so when you say you don't think the inputs have changed very much, are you the input? Like what? Yeah, yeah. I don't think my motivation towards work, how I think about work, or even what I do has changed very much. Just the velocity of everything around it has changed. I still come in every day and check the support channel, I still go and dig into some product problem based on user feedback we've got, and I still go and try to sell Baseten to a bunch of people. And honestly, the desperation from us is even higher now than it was when things weren't working. I think the only reason things changed is that a market arrived and we didn't give up, and there was plenty of capital to keep us going. So my inspiration, and I don't think it will be that inspirational, would just be: don't give up, work with people you like, and follow markets. One of the big things we did in the early stages of this company was not scale too quickly. When markets shift, or fundamental truths change underneath you, the weight of the company is probably what gets in the way of you being dynamic with those things. And for us, from 2019 to 2023, when we raised our Series B at the end of 2023, which is when I'd say I thought, oh, we have something real here, I think we were still 18 people. We were still 18 people, and so we were lucky in that way. Totally. I mean, do you know my story with CrowdFlower? I also had kind of a long slog before the market took off. It takes real skill to keep a company cohesive; even at 18 people, I think it takes real skill to keep people going. And then there's this incredibly underrated skill, maybe the core entrepreneurial skill, of jumping on an opportunity that's in front of you.
Can you maybe talk about what you were seeing and how you did that? And how much of a pivot was it, versus you being ready for it? Yeah, I think a lot of people think it was a pivot. It kind of wasn't. We already had a lot of the infrastructure built for it. It was very funny: from 2019 to 2022, obviously no one was calling it inference back then. We were calling it model deployment and model serving, and we were going around telling everyone that model serving is a commodity, it's easy, we'll give that away for free. But we'd built all of it. What really changed was those three things I mentioned initially: data scientists, small models, internal use cases. It went from small models to big models, which is one massive market expansion that makes serving actually quite hard. Then we went from internal use cases to production, which meant that SLAs and infrastructure mattered a lot. And the third thing: it went from data scientists to engineers, so all of a sudden you have people who actually have the agency to change the things in front of them. I'd say it was a refocusing around those three truths as opposed to a pivot. We had other parts of the product that we just killed, and we said, all right, we're going all in on this, but the fundamental product was still there, if that makes sense. Yeah, totally. And so late 2023, what actually happened? Was it one customer that kind of pulled it out of you? Or what's going on? Yeah. I do give ourselves credit here; I think we did a good job. There were a couple of things. At the end of 2022, ChatGPT happened, and I think that didn't actually have that big an impact on us, except that, one, it made people pay attention to AI, and two, and you probably remember this, at the end of 2022 the narrative around AI really shifted. ChatGPT set two things up which made it very interesting. One, it set the standard for consumers of what they expect out of AI products, so people building things, who are now our customers, had a standard they had to beat. It's pretty crazy that the first mainstream AI product was ChatGPT, because it was so good even from a user experience perspective. It's as if Stripe had been the first company to ever do payments; that would be kind of unreal. On the other hand, they also started creating these developer APIs, which again were high-quality APIs, so the standard for developers was set very, very high. So that was one moment that came out of ChatGPT. The bigger moment was probably Stable Diffusion, to be honest, because all of a sudden you had an open source model that was approximately as good. It still wasn't as good as DALL-E 2 at the time, or whichever DALL-E we were on, but it was approximately as good. And all of a sudden people started paying a lot of attention to the ecosystem that formed around it, and that injected a lot of excitement. What's really cool is that there are two customers I can think of who came to us and were like, this is going to be really cool.
The first one was Patreon, which was experimenting with Whisper for generating subtitles, and we thought, oh, that's very cool. The second was this guy, a friend of ours, called Seth. He took a year off between companies, and I think he called it his year of yes. As part of that, he wasn't super technical, but he figured out how to fine-tune Stable Diffusion to generate music. It's called Riffusion, and he put it out as a hack and it kind of took off. Oh, I remember Riffusion, totally. Yeah, yeah. At the time, I remember he came to us the day before his launch and said, I might need a lot of GPUs. And we said, yeah, sure you will; you'll probably need five A10Gs. I think he ended up scaling up to something like 100 to 150 A10Gs, and we thought that was a mind-blowing number of GPUs at the time. And I think that was a real spur: oh, there's going to be a lot more of this, let's really think about that market as a core market. We refocused around that moment, which was just, all right, let's change everything. So over the course of six weeks, we completely changed the surface area of the product, which is pretty cool. Wow, that's awesome. Good for you for making that call. Yeah, it helps that you don't have that many customers, but still. Still, a lot of people don't. Yeah. I mean, I guess, before we get into the technical stuff: you kind of immediately jumped to something I always think about, which is that I'm the same person. I remember taking a lot of criticism when stuff wasn't working, and then getting praised for almost exactly the same stuff, where, as you said, the inputs are the same.
I just like really relate to that um do you like is there you know any more you want to say about that like do you feel like you've like learned something from kind of watching that experience yeah um i i think i think everyone needs to just figure out what drives them to some extent i think like i i just call like what that what i'm driven by is like it's not really about having a successful company as much or it is about like working with people you care about on problems you care about and like everything else for that change and i think it's really easy to focus on the input when you're like what do i need to do is like well i need to go hire more people i like and let me go find hard problems um to solve um i i'd say um where where i get stuck and like in my and i don't know if you relate to this at all the darkest moments are almost when you're like trying to company build as the as the core focus um if that makes sense it's like no what do you mean by company build like when you when you start chasing photos when you start chasing photos and revenue targets because you know with this yeah with this artificial um you know gregorian calendar that we've come up with or like hey um this actually happens a lot with go-to-market teams go-to-market teams um in startups is you know when you start applying playbooks generic playbooks to things and it's like you're not really coming at it from like a like you you're almost like trying to um treat everything like the science where i think a lot of it is still the art of it which is you know hire good people what what work on stuff you care about um yeah and really optimize around that but that can be a bit idealistic attempt as well so i love that attitude i mean i sort of i think I relate to that when things are small, like, you know, oh, this is just like my project. And I think it's kind of hard to hold on to, or in my experience, it's hard to hold on to that feeling as a company grows. Like how big are you today? 
And are you still feeling like it's a, you know, kind of a fun project with people you care about or Yeah Look we 110 people now We have our company off next week and we do it every six months i think we we less than half that our last company off that it growing it growing people i i i think i'm lucky in that i i am a bit like that um still um one of the really great things about and about on base 10 is you know my my my two co-bounders i've known between 15 and 25 years um these are like very close friends of mine um a lot of the people early in the company are still around um you know even our investors like a lot of like our serial girl who is at who's on our board you know i've talked to every day for six years that's amazing really yeah every day yeah everyday and so you know like it's a i think that is um that keeps that stuff alive i think there's also just a bit of like uh um i've learned this from some other founder which is like i i kind of refuse to do things that i don't want to do and i i i think like one of the like we have to do a lot of like being a founder i think is actually quite challenging just because you you all the problems to some extent fall upon you at the end but like i still kind of like take a lot of agency around like i would only work on stuff that is this is for the company all the stuff i want to do um like it's okay if i skip meetings so that's okay that like that is the that is the only thing i ask for myself of myself and for myself with that agency and that i think that allows it to feel um a bit more like that also like i'll i'll caveat all this is that everything is really fun when you're growing a lot um and and so like the last two years i've felt like that so yeah okay so we're taking this in a much different direction than i thought i'm gonna ask what's the thing that you don't do that you think someone like me would feel like most guilty about not doing um i skip one lunch all the time nice i i i i see one on lunch all the time and um and usually usually there is of the form of hey i just if is it anything pressing come catch me come catch me my desk uh-huh um i yeah that's funny i feel like every founder ceo that i've gotten to know well enough um actually says the same thing like all the way up to like jensen was actually on this podcast saying that and that kind of gave me like you know freedom to feel like I could do that but it's funny because I feel like it was Ben Horowitz like years ago was like it's just outrageous to miss like one-on-ones and everyone just started like feeling like guilty about skipping one-on-ones but it's funny you know it's like you know if it drains energy and like there's nothing to talk about like certainly it doesn't make sense to do like a standard meeting and maybe like I think there's like the um the things that are like our rituals which were important and there's stuff that's just performative at times totally and like the idea that you need to meet with someone that you that you know you work with every day for like 30 minutes to like you know reflect or whatever or problem solve i think it's like it's a bit um you know like we we have a own version of that um with me and my co-founders which is that we have these like this weekly meeting uh we have this weekly meeting um for like 45 minutes on friday mornings um where we just you know it's it would seem like a one-on-one but it is purely like we just you know we shoot the shit you know we we kind of just do whatever we want um in that time um and 
sometimes it's work sometimes it's not and i think like you know that's that that part's a lot more important to me from like a ritual perspective than like the performative part of like uh what are the three things that we need to talk about this week and yet you have a daily one-on-one with your investors sarah what are you guys talking about there but like i engage with you're so much as like a friend right at this point you know like it's like you're kind of blurring i we end up just like blurring the boundaries between um as as founders between everything um and like that i lean a hundred into that which is like you know there's no there's very little boundaries in my life um for less interesting well i feel like someone might like listen to this and think wow like base 10 is really lacking some kind of like operational discipline do you feel like that's like a fair criticism or no i i don't think so at all i think actually like we are um it's a like it is i think we have a lot of operational discipline where it matters um for example like in our sales force and like the way we do sales like it's very very customer centric and you know i'd say if you want to talk to our customers they'd be like hey no one pays as much attention to us as they send us i thought they'd say that um i i i think a lot of the a lot of the things that we are trying to solve for our like how do we just do the most impactful thing possible without draining energy and i think you know for us some at least for engineers all the time those things are very, very large structure, for example, which is like too many meetings, too many one-on-ones, too many right throws, too many right throws. I think on the flip side of it, there's like a healthy tension between what does that need to look like in every company, which may be different as opposed to just like the industrialization of the product dog. Does that make sense? Totally, totally. Yeah. All right. So switching gears to what you guys do. I mean, I actually was looking at your website recently preparing for this, and you do a lot more than I knew. I think you're expanding the surface area of your product lately. But I sort of think of you as like an inference company. Is that fair? That's fair. I say our North Star is inference. That's probably the best. And so I feel like inference, I know a lot of the founders in the space, and I've had a lot of friends kind of come in and out of that category. And I think it's sort of like people feel like it's commoditizing. And I think you even sort of said you sort of thought of it as a commodity. Like, I don't know, is inference a commodity service? Like, what are the differentiators? Yeah, I think you need to step back and think about inference of what. So if you think of a generic inference of an open source model, like if all we did was serve llama behind an endpoint i think that's 100 a commodity um and like you know it's a it's you know given some quality benchmarks um everything and given some performance benchmarks people will go for the lowest price i think there's a lot to um differentiate on a performance perspective still but still like in the fullness of time i believe that, you know, generic inference of vanilla models is probably commoditized. I think the truth is for us, like two or three things is that one, like dedicated deployments of custom models and fine chain models, you know, that is not commoditized. Every workload is slightly different. 
Every model is slightly different, and serving those models, and the tooling around that, is quite differentiated. We differentiate on three different things today, I'd say. One is infrastructure, and that is: hey, what does it actually take to run this reliably at scale? We take a lot of pride in that. We have many nines. We don't go down. The clouds go down; we don't go down. We've built everything to be fault tolerant and pretty elastic, and we align our infrastructure around the customer's needs as opposed to the cloud's needs. So that's the first piece, and we can talk a lot about the infrastructure problems here. The second piece is performance. A lot of inference providers differentiate around performance. We are very, very performant; in any head-to-head, we don't generally lose. Now, when you say performance, do you mean tokens per second? Latency? Not the quality of the model. We mean, hey, how fast does this move? There's a lot of talk around how fast these things move, and I think if you normalize over time, and if you look at the history of software, open source runtimes seem to win. The best-quality, best-performing runtimes will eventually just be the open source ones, because that's how developers want to adopt things and where they want to push things forward. But you still need to meet a minimum bar of performance there. We think we're A+, but we don't think that alone is going to win in the fullness of time; you need to be very good and best in class there. And the third one is developer experience and platform. We think of this as a software problem. Most of our customers have between 3 and 20 models deployed on Baseten, and most of them are serving their customers with different hardware, different scaling needs, and different traffic patterns. We provide a lot of software to manage all of that, and we think that is very differentiated, the same way a lot of other CI/CD tools or software tools add value. So it's infrastructure, performance, and developer experience, and when we look at the intersection of those three, that's where we think a massive inference company exists in the fullness of time. Now, we do other stuff too. Those things are very important, but they happen in the service of doing inference; we want to be the production-grade inference company. We did this out-of-home marketing campaign recently, with these obnoxious buses all over the city, which I thought was great: inference is everything. That is how we think every day when we come to work. We do have a training product, but again, we train those things so they can be served on Baseten to some extent, or we provide fine-tuning scripts so those models can be fine-tuned and eventually run on Baseten. Okay, but so there's infrastructure, performance, and developer experience. Why do you make this distinction between an open source model and somebody's custom model or fine-tuned model? I would think that with, I don't know, your newest open source model, all those things could still differentiate, like reliability.
You know, performance could potentially be better, and developer experience, since you're probably managing a bunch of these things anyway. Yeah. And even with a fine-tuned model or something, it seems like those things could also become undifferentiated, right? I mean, I can imagine if everybody's using vLLM or something to run their models, then maybe it doesn't matter. Yeah, I think you're right, and I think today that is the case. You can differentiate on those things today, and we do. But in the fullness of time, I do think a lot of those tools will know, for a given model type, how to run it very fast, and reliability will have to be table stakes; the unreliable companies won't exist, and what will be left is reliable endpoints that are pretty fast and where the switching cost between them is pretty low. Wait, so are you telling me that in the fullness of time you don't think you will differentiate? Is that what you're saying? No, no, no. Sorry, maybe I should separate this as well: there's dedicated and there's shared. When I think about shared endpoints, I don't think there's much differentiation in the shared market. There's a lot of differentiation in the dedicated market. I see. Yeah, sorry, maybe that wasn't clear. Again, if you want to use a Llama endpoint that a bunch of other people can hit, and it's providing the same tokens per second at the same quality, to me those offerings are very much the same, and that's the shared endpoint business. 99% of our business is dedicated capacity, which is non-multi-tenant endpoints where only that customer is using the endpoint. I see. Sorry, that's actually even a different axis, dedicated versus shared, although I guess you wouldn't have a shared endpoint with a custom model. Yeah, exactly. But okay, in a dedicated situation, why is it more possible to differentiate? Because all those workloads look very different. Look, firstly, not everything is a language model, but let's assume everything is a language model for now. The way it plays out is: I'm deploying a model for use case X. Another customer comes along and says, I want to deploy it like this, and my inputs are slightly different, my outputs are slightly different, my SLA is different, I can only run capacity in this certain cloud, I need things to be HIPAA compliant in this way. All of those form different attributes of the workload, and that gives you ways to differentiate in serving those customers. Interesting. So when I come in and I'm talking to you, you're having a conversation about my requirements and then setting something up custom for me? Yeah, or we're giving you the tools through our software to configure those things yourself. At Baseten, you have full control of your runtime. We can give you the tools to make your runtime better, or we can give you some default runtimes, or we can look at vLLM or TensorRT-LLM and tell you, hey, here are some good configurations to run those. But you can do whatever you want there, and you might change those configurations based on your use case. In a way that wouldn't make sense for another customer. Totally. Okay, so could we dive a little deeper into how inference works on a modern LLM?
First of all, I guess let's talk about how it works and maybe what the possible optimizations are. Totally. So I think there are two different parts of inference that you need to think about: the infrastructure-level problems and the runtime-level problems. On the infrastructure level, it's: hey, I have a workload running across five, ten, a hundred thousand GPUs. How is this thing going to scale? You need to set up your infrastructure to allow inference to be done. What that might mean is setting up your infrastructure in a way that allows the same user to go back to the same GPU, to reuse the KV cache for their workload as much as possible. Or there might be another problem, which we see with some of our customers, which is that I can only use GPUs in a certain region because I want to minimize the number of hops in the world. So you have these Cloudflare-esque problems around running inference workloads, and in that way it's very much an infrastructure problem. We've done a lot of work there. You need to acquire capacity: what happens when you need 2,000 B200s and one cloud can only give you 500? What do you do? How do we give you all the tools to solve that? The second part is a bit more research-y: the runtime-level problems, which is, hey, how fast do these models actually run on a given GPU? That's where stuff like vLLM, TensorRT-LLM, and SGLang comes in, that's where some of the proprietary runtimes from other companies come in, and that's where even the chip-level companies start to come in, changing at the hardware level how inference is done. And then the question becomes: what makes this challenging? There are really a couple of things that matter here. People care about utilization and they care about speed, and there's somewhat of a trade-off there. What do you mean by speed? Speed, you know, the general metrics people care about: time to first token, which is how quickly the first response comes back; time per output token, which is, after that first token, how long each additional token takes to come back; throughput, which is the memory question, meaning how much load can we put through this without degradation of performance; and cost per token, which is the cost of the underlying hardware and the KV cache and how well you're using them. So there are lots of different things to optimize here, and we can talk about those different things.
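To make those metrics concrete, here is a minimal sketch of how you might measure time to first token and time per output token against any OpenAI-compatible streaming endpoint. The base URL, API key, and model name below are placeholder assumptions for illustration, not Baseten's actual API, and streamed chunks only approximate individual tokens.

```python
# Hypothetical sketch: measure TTFT and time-per-output-token from a streaming,
# OpenAI-compatible endpoint. Endpoint URL, key, and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="https://inference.example.com/v1", api_key="YOUR_KEY")

def measure(prompt: str, model: str = "example/llama-3-8b-instruct") -> None:
    start = time.perf_counter()
    first_token_at = None
    chunk_times = []

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # some servers send usage-only chunks with no choices
        delta = chunk.choices[0].delta.content or ""
        if not delta:
            continue
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # prefill is done once the first token arrives
        chunk_times.append(now)

    ttft = first_token_at - start
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0  # average decode step time
    # Rough model of end-to-end latency: TTFT + (N - 1) * TPOT for N output tokens.
    print(f"TTFT: {ttft * 1000:.0f} ms, avg time per output chunk: {tpot * 1000:.1f} ms")

measure("Explain KV caching in two sentences.")
```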
The first one, time to first token, is really dominated by the prefill, which is the first forward pass through the network. Time per output token is dominated by the decode, which is the repeated single-token steps that happen after that. Throughput is a lot about memory bandwidth and quantization: how much memory can you use and how many FLOPs can you process. And cost per token comes back, as I said, to the cost of the hardware, how big the weights are, how much KV cache you're using, and how much work you can reuse. There's so much optimization that can go into this. And when you come to some of these open source frameworks that I'm sure you're familiar with, vLLM and SGLang and TensorRT-LLM, they're all coming up with their own ways to optimize across these things, whether that's quantization, FlashAttention and fused kernels, continuous batching, or speculative decoding. They all have the same research-y mechanisms to do that. These problems at the runtime level today are really, really challenging, and they're still pretty research-intensive. Oftentimes we are pushing stuff that's less than a week old into production, which is pretty terrifying. And probably what's even harder right now, which I'm sure you have heard, is that the talent that knows how to do this is pretty limited, and we're competing with pretty crazy people for that time. I've heard of that, yeah. I've heard that's an issue. So there's a bunch of different ways to make LLMs run fast, and that's what I'd call the runtime problems, and then there's the infrastructure problems. We kind of think of both of these things as, not interchangeable; both are necessary, I'd say. And I guess it's kind of interesting that there are multiple competing open source projects to do this runtime thing. Are there fundamental differences of opinion between them? Situations where one works better than the other? Yeah. We love them all. We love them all. I think a lot of it comes back to that classic trade-off of usability, speed, and control. I'd say, at least from what we have seen at scale, who's pretty good at running inference on NVIDIA chips? It's NVIDIA. So TensorRT-LLM is the lowest level, especially with all the new Dynamo changes they're pushing through, and at scale, once some time has elapsed from when a model has dropped to when you need it, so maybe not day one, but day 90, TensorRT-LLM is by far the fastest and you can do the most with it. And that kind of makes sense; it's the lowest level of what NVIDIA provides. On the flip side, vLLM, and SGLang too, but vLLM especially, really goes towards usability. You can install vLLM and, honestly, anyone can do inference with vLLM, but you're going to take a bit of a performance hit. You can get around a lot of that by configuring it, but that's when you really have to become a mastermind. I'd say SGLang sits somewhere in between.
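As a rough illustration of the usability end of that spectrum, here is a minimal offline-inference sketch using vLLM's Python API. The model name and sampling settings are assumptions chosen for illustration, not a tuned production configuration.

```python
# Minimal vLLM sketch: load an open-weights model and batch-generate from a few prompts.
# Assumes a GPU with enough memory for the chosen model; model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain the difference between prefill and decode in one paragraph.",
    "What does time to first token measure?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

Serving the same model behind an OpenAI-compatible HTTP endpoint is similarly short with vLLM's built-in server; squeezing out the remaining performance through quantization, parallelism, and speculative decoding is where the runtime tuning described above comes in.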
And the question might just become, over time, why do all three exist? To some extent, I think it's just the nascency of the market, and slightly different requirements for the people who are using them. And I think it helps the ecosystem. How long did it take for front-end frameworks to converge? I think we could argue that Next.js dominates today, I think it definitely dominates today, but we're 20 or 30 years into that journey. So it will just take a while for the market to converge, and given how fast the ground is shifting underneath us, it's really important that all these frameworks continue to exist. Do you work with hardware besides NVIDIA GPUs? Yeah, look, the answer is yes, and we've done experiments and everything. We've worked with a lot of the chips coming out of the clouds, we've done some work with AMD, and we've even done some work with some of the new set of chips. I'd say today, what we keep coming back to, and we love them all, and we want a thriving ecosystem there, is that CUDA is just... the reliability and the versatility and the developer ecosystem of CUDA are very, very hard to overlook. So when you imagine you're pushing something out into production tomorrow, the last thing you want to be doing is working with anything besides CUDA. That being said, some of the performance and cost trade-offs that we have seen with some of the other providers are really, really cool and very promising. I think it will just take a minute for them to catch up in terms of utility in production. And I mean, why is that? Because I guess, from where I sit, I've had probably five different companies pitching me on better inference on this podcast, and I've heard probably 15 other pitches in my life. I go out to dinner and I get pitched on, hey, I've got faster inference. It seems like a very testable thing, and the market's there. I look at that and I'm like, okay, there's no market risk if you have faster, or faster and cheaper, inference as far as I can tell. So what do you actually, practically, run into? Yeah, yeah. It's a really good question. I was thinking about that just yesterday, actually. And I think it actually comes down to three or four things.
So one: look, you need the chips, then you need the manufacturing capability to scale up, and then you need the software layer for versatility. For the chips, you need to know what an inference chip even looks like. The second one, manufacturing capability: we need tens of thousands, if not hundreds of thousands, of these; how are you going to do that at scale? And three, how do you build the APIs so that people can mess around with these things themselves? I think all three of those things contribute to this being a very challenging problem for all the folks that you'll probably have on the podcast. But the biggest thing, and I think it's a challenge we'll get over as the market matures, is just that everything is moving so fast. The chip turnaround times are, what, like six to nine months if you're lucky? And with that in mind, in the market we're in, I can't project more than 60 days forward in this business. Who knows what model will come out or what people will want to use. So I think that speed of the market probably just makes things harder. You kind of see it with NVIDIA right now as well, which is that NVIDIA has three or four different chips that people are talking about right now: you have the Hopper series, you have the B200 and the B300, and you have the GB series. Things are moving so fast that customers are also a bit stuck in terms of what they want. I'm sure you've seen all this as well: when you talk to customers, things are moving so fast that the customers are confused about what they want too. So projecting forward nine months is very hard. But I agree, I think in the fullness of time, as soon as things slow down just a bit, we'll get there, I hope. I feel like 'in the fullness of time' is your catchphrase. Yeah, I like it. It's funny, I was talking to a CoreWeave customer recently, and they were talking about how they don't like to change chips. And another thing they told me that was kind of interesting was that they were able to get, over time, a lot more performance out of any particular chip. I was kind of surprised. They seemed like they really did not want to be on the latest chips; they didn't want to move their workloads to new chips. But I mean, it seems like if somebody is working with you, shouldn't you abstract away the hardware that they're running on? Totally. I think that's something we will move towards over time, which is just, you know, big, small, extra small, T-shirt sizing for compute. I think that's okay. Again, everything has just moved so fast that you can't just take a workload that's running on an H100, stick it on a B200, and say, oh look, now I have three times the speedup. Unfortunately, it just takes a bit of finagling: the FlashAttention versions need to change to support FP4 or whatnot on the new hardware, and so on. So I think we will abstract that away. We will abstract that away; it's just that, again, everything changes so fast underneath us that it's kind of hard to do that today.
And I think what customers want changes from a week-to-week perspective as well. Do most customers show up saying, hey, I really want a GB200? A lot of people do. Interesting. Engineers just want the latest and greatest hardware all the time. But we actually see probably two sets of customers. We have customers that show up and say, give me the best hardware I could possibly buy. Give me the Ferrari; I want the Ferrari. And the majority of our customers, especially the ones in the enterprise or selling into the enterprise, are a little bit more pragmatic: hey, I need to solve a problem for my customer, this is my workload, this is the size of my model, this is the SLA I need to meet, could you help me get there? Which I think is probably the better way to think about the problem. And it fits back into the thing you were saying: there is some value in sticking with older hardware, especially if you are meeting the SLA that you care about. Well, I'm amazed to hear that in your case, because in my limited experience with this stuff there's nothing fun about being on the latest chip with the least support and the hardest path to getting things working. So I'm kind of surprised people want that. Do they think there's going to be a better performance trade-off, or is it sort of an emotional connection? Look, I was looking at a customer that we work with that's doing hundreds of millions of tokens a minute, which is just a lot of throughput. When it comes to those really, really high-volume use cases, those performance trade-offs are huge. Even with a lot of the video stuff: for video models, B200s provide a 40 to 50 percent speedup just off the bat, if you can get them running. And so it's those trade-offs that, for their end customers, let them drive a fundamentally better user experience with better hardware. I see, yeah. And again, it comes back to a lot of that Anthropic, OpenAI, Google stuff, where, from a speed perspective, they're setting the bar for what custom and open source models need to be able to hit. And they're all pretty good at that piece, at least the speed part. Let's take a segue into my next question, which I'm sure people ask you all the time and people ask me: who's using open source and who's using closed source in your experience, and how do people think about that trade-off? I'm sure if they're coming to you, they've kind of decided on open source. But I mean, some people ask me, why would anyone ever not use Anthropic or OpenAI? So maybe we'll start there. So, yeah, look, I think everyone's on this curve of maturity. And I think a lot of people start from custom or open source models where they're doing their own things and they want to solve it themselves. They're training their own re-ranking models; they're training their own synthesis models that are purpose-built for what they're doing. So that's one place to start. The place a lot of other people start is that you go with Anthropic, you go with OpenAI, and you have these great models that can do a lot, but they're either too expensive, or...
The speed's fine, but OpenAI and Anthropic have a lot of reliability issues themselves because, again, they're serving massive, massive workloads at an enormous scale. And I think people come to us because they want more control over costs. They want more control over their own reliability: give me dedicated infrastructure at a decent price with some transparency so I know what's happening. Or the third piece, which is just, hey, our customers care that we aren't just piping all this data off to someone who's training models on it, and that matters to us a lot. And those are the reasons, four reasons really: something custom, something cheaper, something more reliable, and data privacy SLAs for the enterprise. That's why people shift from closed to open. And I think the reality of the situation is that everyone's going to use a bit of both. Do you have a prediction, in the fullness of time, where this ends up? I don't know, 40/60, don't know which way? Look, I think there are two worlds we could live in. Either AGI comes, AGI comes and all we have left to do is go on podcasts, because, you know, those AGIs would make a better podcast than me coming out here, so maybe I don't even get to do that. But I think that's a pretty low probability. I actually think that models are already very, very good and the next unlock is probably further away than we think. Why do you think that? That's an interesting point of view. I think we're out of data. I don't know where the data is going to come from. We went through this period over the last five years where we ingested more and more data, so I think there now needs to be an architectural unlock, an architectural unlock, and then we're really back to a research problem again. And that's fine, because I still think we have a thousand x of economic value to unlock from what we've already unlocked. We could stop today and still have more than the Industrial Revolution's worth of productivity gains, which is fine. But going back to your original question, the more reasonable outcome is just that there's going to be a long tail of models, and people want that. You don't need the most powerful model in the world to do everything for you all the time. Are you seeing reinforcement learning start to change your business? Yeah, I think there's a lot of demand there. Again, everything just comes down to a data collection problem now, and I think RL, in a lot of ways, is going to turn into an inference problem. And I'll tell you, we're not there today in terms of first-class support for those things, but all those things are possible using Baseten as a primitive. So I think there's a lot of demand there, especially with folks at the application layer, and this is what's really interesting about working with so many customers at the application layer: most of them are collecting real preference data, really valuable preference data, which they want to use to make their models better. All right.
So I sent over my questions and unlike most guests, you actually engaged with my pre-read and you added some questions yourself that I was kind of intrigued by, or I don't know if this is your team, but you added a question, why inference is the fastest growing part of the AI market, which seems like incredibly obvious that it would be, but I'm like wondering if there's something deeper there that you wanted to talk about? I think my team probably added that. Good job, team. It's a little softball for you. I think actually that's the wrong question. I think it's the wrong question. I think the better question really is why it's the most important market. In a world where we're shifting towards models doing everything and all the value being in the application layer, there just needs to be a hell of a lot more inference. and we're very lucky that we got locked into this market or this market showed up. But it's interesting as a thought experiment it might be the final it might be the last market. If there is AGI we're just going to need a whole even more inference and the only market that will exist to some extent is what does the model need to do and the model can only run on inference. I think that is why it's the biggest market. like why is the fastest running market i think is pretty obvious which is that you know we're just rushing as a society to like inject models into everything we do um and i think that's like you know a good like every day i i now use you know a dozen products that are powered by base 10 in some shape or form and i think like that's somewhat of an indication of you know like how big inference is getting yeah that's incredibly cool I was thinking as you were saying that tool use also might, I mean, like something, do you also like run the tools for your customers? Yeah, and I think the other one is like sample, like the actual code execution, like all these problems are kind of, the way I would say it is like, I was saying this to someone on our team today, which is like most I think most startups have this problem where they go into a market and they saturate it and then they start thinking what else can we build alongside of it um and for like for me honestly it feels like we're just swimming into the ocean and it just keeps getting deeper and deeper and deeper and deeper which is really cool but those are somewhat frightening but like tool use like sandboxes um all these things are like rl rl like all these things are just infamous problems like um once once they're all built out and i think that's um why i think this market just continues to get bigger and compound over 10. I do want to talk about, you know, you had this quote about you got to burn the boats, which I thought was evocative. And I think a lot of people, I don't know, talk about that, but it's actually like really hard to do in practice. Like, can you talk about burning the boats and how that works and how you get people bought in on that? Yeah, we're just really odd emotional people now. i'm joking yep the i i i think like um i i'd say the we're just so future looking honestly which is like we just fall at book all the time like everything we've built until now is the past and like what i owe what i owe the people who work for us our customers and um even like our our investors is chasing you know what we think is the largest opportunity um okay great great But what's the most beautiful boat that you've burned down? In 2022, we killed three products out of four in one year. 
And so we basically spent three years, going back to what you said, Bayside initially had this retool type application builder that went alongside the serving engine. It was kind of bizarre that we built all this thing. and we you know there was like two dozen odd people who spent two and a half years building this and i think we just we just killed it like honestly like we we within six weeks we were like let's off-board customers let's get them new places to do this work um we're going all on this another one was like we launched uh in 2022 we also launched the fine-tuning product called blueprint um and it was early for fine tuning back then um and we didn't like we seven people on the company or six people the company of like that's a third of the company at the time spent six months building the thing uh we launched it it didn't go any of them it didn't really go anywhere we realized we had the abstraction wrong and three months later we're like all right we're not working on that anymore um and i and i say we we constantly do this as much as possible which is hey, rewrites are part of the job, throw away stuff is okay. And we're actually pretty okay, even with let's sprint out this, it's okay if we have to throw it away. So I think that's the trade-off there is that what it allows you to do is to be a little distracted and go on side quests and be okay with that side quest being not super valuable. and as long as you can re-center and go back to the thing you've tried to do. Oh, that's kind of interesting because earlier when we were talking, you talked about sort of like don't give up and like keep going. And I can see how this two are related, but like it is a tricky question to know when to stop. I said don't give up, but don't be emotionally attached is probably the two things I'd say, which is like just be forward-looking to some extent. And I think that's where I've always got stuck when I've been building things. It's like, you know, the worst thing that could happen is you get stuck at a local maxima. A local maxima. And I think I'm acknowledging that to some extent. I think the second piece, sorry, just to go alongside this, is that it's so interesting. I don't know how you think about this because you've been around venture for like how long? Like 10 years, 15 years? Around venture-funded companies a while? Very much time, yeah. I feel like as a venture-backed founder, I personally think you're really committing to a type of company you're trying to build. And I meet founders often who are like, oh, it's not really working, so we're going to start to become cash flow positive. And I'm like, oh, I didn't think I'm in the cash flow positive business for a while here. And I signed up for something that's going to be very, very big and just swinging big. I think that is something that I think everyone on our team war. We try not to take the conservative take as much as possible. It's also interesting that you say that because I think earlier, at the very beginning of this conversation, you were talking about how, how did you put it? You were sort of like, I like to think of it more as this project that I'm doing for this purpose, rather than optimizing for growth or something. 
But I think probably your venture investors are thinking, hey we're gonna optimize for growth there don't you think yeah i i i think so but i don't think that detached as you think because like i i think with the um with respect to the former thing which is like hey i need to be happy to win like you know i'm not thinking about winning but i need to be happy if i'm gonna get there um and i think also like the thing that drives um happiness for me is working on big things. And so it's not necessarily around like, how do I squeeze like the most amount of sales efficiency out of the machine that we have? And, you know, we need a new leader there because, you know, we misquote up a template. It's more just like, hey, are we chasing the biggest thing? Are we chasing important problems to some extent? You're going to make sense. Yeah. All right. I think that's a great place to stop. That was an awesome interview. I really appreciate it. Yeah, no worries. Thanks for having me. thanks so much for listening to this episode of gradient descent please stay tuned for future episodes
Related Episodes

#228 - GPT 5.2, Scaling Agents, Weird Generalization
Last Week in AI
1h 26m

Why Physical AI Needed a Completely New Data Stack
Gradient Dissent
1h 0m

Exploring GPT 5.2: The Future of AI and Knowledge Work
AI Applied
12m

AI to AE's: Grit, Glean, and Kleiner Perkins' next Enterprise AI hit — Joubin Mirzadegan, Roadrunner
Latent Space

What We Learned About Amazon’s AI Strategy
The AI Daily Brief
26m

China's AI Upstarts: How Z.ai Builds, Benchmarks & Ships in Hours, from ChinaTalk
The Cognitive Revolution
1h 23m