

Infrastructure Scaling and Compound AI Systems with Jared Quincy Davis - #740
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
What You'll Learn
- ✓ Laconic decoding - using multiple parallel instances of a model and early stopping - can improve speed, accuracy, and cost of AI systems
- ✓ Composing 'networks of networks' can push the performance frontier of AI systems, especially on highly verifiable tasks like code generation
- ✓ The cost of AI inference has been falling rapidly, enabling new approaches to scaling and composing AI models
- ✓ Efficiency and convergence of these 'compound AI' architectures are key research challenges
- ✓ Foundry.ML is building cloud infrastructure optimized for these types of advanced, composable AI systems
Episode Chapters
Introduction
The host introduces the guest, Jared Quincy Davis, and the topic of 'compound AI systems'
Laconic Decoding
Davis explains the 'laconic decoding' technique of using multiple parallel model instances and early stopping to improve performance
Networks of Networks
The discussion expands to the broader concept of 'networks of networks' and how composable AI architectures can push the performance frontier
AI Infrastructure Scaling
The conversation covers the rapid decreases in AI inference costs and the importance of efficient, scalable AI infrastructure
Foundry.ML's Approach
Davis explains how Foundry.ML is building cloud infrastructure optimized for advanced, composable AI systems
AI Summary
This episode discusses the concept of 'compound AI systems' and how they can be used to improve the efficiency and performance of AI models. The host interviews Jared Quincy Davis, the founder and CEO of Foundry.ML, who explains how techniques like 'laconic decoding' - using multiple parallel instances of a model and early stopping - can push the performance frontier of AI systems in terms of speed, accuracy, and cost. The conversation also covers the broader idea of 'networks of networks' and how composable AI architectures can be leveraged to further enhance AI capabilities.
Key Points
1. Laconic decoding - using multiple parallel instances of a model and early stopping - can improve speed, accuracy, and cost of AI systems
2. Composing 'networks of networks' can push the performance frontier of AI systems, especially on highly verifiable tasks like code generation
3. The cost of AI inference has been falling rapidly, enabling new approaches to scaling and composing AI models
4. Efficiency and convergence of these 'compound AI' architectures are key research challenges
5. Foundry.ML is building cloud infrastructure optimized for these types of advanced, composable AI systems
Topics Discussed
Compound AI systems, Model parallelism and early stopping, Networks of networks, AI infrastructure scaling, AI cost and efficiency
Frequently Asked Questions
What is "Infrastructure Scaling and Compound AI Systems with Jared Quincy Davis - #740" about?
This episode discusses the concept of 'compound AI systems' and how they can be used to improve the efficiency and performance of AI models. The host interviews Jared Quincy Davis, the founder and CEO of Foundry.ML, who explains how techniques like 'laconic decoding' - using multiple parallel instances of a model and early stopping - can push the performance frontier of AI systems in terms of speed, accuracy, and cost. The conversation also covers the broader idea of 'networks of networks' and how composable AI architectures can be leveraged to further enhance AI capabilities.
What topics are discussed in this episode?
This episode covers the following topics: Compound AI systems, Model parallelism and early stopping, Networks of networks, AI infrastructure scaling, AI cost and efficiency.
What is key insight #1 from this episode?
Laconic decoding - using multiple parallel instances of a model and early stopping - can improve speed, accuracy, and cost of AI systems
What is key insight #2 from this episode?
Composing 'networks of networks' can push the performance frontier of AI systems, especially on highly verifiable tasks like code generation
What is key insight #3 from this episode?
The cost of AI inference has been falling rapidly, enabling new approaches to scaling and composing AI models
What is key insight #4 from this episode?
Efficiency and convergence of these 'compound AI' architectures are key research challenges
Who should listen to this episode?
This episode is recommended for anyone interested in Compound AI systems, Model parallelism and early stopping, Networks of networks, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
In this episode, Jared Quincy Davis, founder and CEO at Foundry, introduces the concept of "compound AI systems," which allows users to create powerful, efficient applications by composing multiple, often diverse, AI models and services. We discuss how these "networks of networks" can push the Pareto frontier, delivering results that are simultaneously faster, more accurate, and even cheaper than single-model approaches. Using examples like "laconic decoding," Jared explains the practical techniques for building these systems and the underlying principles of inference-time scaling. The conversation also delves into the critical role of co-design, where the evolution of AI algorithms and the underlying cloud infrastructure are deeply intertwined, shaping the future of agentic AI and the compute landscape. The complete show notes for this episode can be found at https://twimlai.com/go/740.
Full Transcript
In this case, they kind of noticed a quirk of reasoning models, like DeepSeek, like DeepSeek R1, which is that all other things being equal, the longer that they think for a given problem, the more likely it is that they're going to get the answer wrong, which is kind of counterintuitive in some ways. But it's intuitive another way, which is think about like a student on an exam. You know, if it's the same set of exam questions and one student's taking three hours on the first question and one student wrapped up in 20 minutes, probably the person who wrapped up in 20 minutes is going to, you know, have a better score on the exam. All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Jared Quincy Davis. Jared is founder and CEO of Foundry. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Jared, welcome to the podcast. Yeah, thanks for having me, Sam. Really great to be here. I'm looking forward to getting into our conversation. We're going to be talking about compound AI architectures, a theme you've been working on for a while, and how they may transform agentic AI. To get us going, I'd love to have you share a little bit about your background. Yeah, yeah, really awesome. Thanks for having me on the podcast, Sam. Yeah, so my background is I'm the CEO of a company called Foundry, ML Foundry. Foundry is basically essentially a new cloud built from scratch for ML workloads. And I think we've been trying to, you know, increase the amount of AI progress in the world and trying to make it slightly easier for a broader spectrum of companies to do the types of things that currently only OpenAI and DeepMind can do. And we've been approaching that primarily from an infra perspective. But then also, you know, we're pretty big believers in AI systems co-design. And we've been thinking a lot about what type of infrastructure we'll need in the future for what we think kind of the future of models and architectures will be. And so we've been talking about this theme of compound AI systems, trying to push this and promulgate this in the world, trying to help people understand why we think it's an important direction to kind of push the broader frontier. And so I'm excited to talk through all of this. Prior to starting Foundry and doing this work, you know, my background is as an AI researcher primarily. I was at DeepMind on the core deep learning team, you know, kind of thinking a lot about how to scale deep learning approaches, how to make sure they converge at scale, both the systems and algorithmic sides of that. I also did my PhD work at Stanford with Matei Zaharia and Jure Leskovec, kind of also at the intersection of systems and kind of ML theory and ML. And so these are threads that we've been pulling on for quite a while, myself and a lot of members of our team. So it's exciting to, you know, I think it's a pretty fun time to be alive, exciting to see all kind of the progress. There's definitely a lot going on. Yeah, a lot going on. Things have never been more interesting. So it's a great time to be alive. Awesome. Awesome. So let's start from the beginning. Compound AI systems. When you say that, what exactly do you mean? Yeah. Yeah. Great question. Yeah, I think that maybe it's best to try to illustrate this through an example, to talk about why this is an interesting thing.
And one of the examples I find most intuitive is actually a great method that one of my collaborators, Alex Dimakis, came up with. And he calls it laconic decoding. It's kind of really nice. So in this case, they kind of noticed a quirk of reasoning models, like DeepSeek, like DeepSeek R1, which is that all other things being equal, the longer that they think for a given problem, the more likely it is that they're going to get the answer wrong, which is kind of counterintuitive. That's a bit counterintuitive. Yeah, but it's intuitive another way, which is think about like a student on an exam. You know, if it's the same set of exam questions and one student's taking three hours on the first question and one student wrapped up in 20 minutes, probably the person who wrapped up in 20 minutes is going to, you know, have a better score on the exam. There's something kind of like that with DeepSeek where maybe it goes down the wrong branch, like in a beam search, and has to backtrack and, you know, et cetera. And so one very simple method that he came up with to leverage this observation was to basically create 10 replicas of the reasoning model, you know, 5-10 replicas, and then just return the response from the first replica to complete thinking. Super simple, you know, kind of intuitive. And so the amazing thing about this is it actually can push the entire Pareto frontier. So it can be faster because you're making 10 parallel calls, 10 replicas, and returning the fastest one. So it's faster on average than just a single call, as long as you can scale out. It's also higher accuracy, given the point that we just mentioned, on average. But then also, very surprisingly, it can actually be potentially even cheaper on average, which is like, wait, how would that be if you're spinning up 10 replicas? How could it be cheaper? Well, as we know, output tokens are typically more expensive than input tokens for reasons we can, you know, double click on. And also, depending on how long these thinking traces can be when the model gets stuck, or on average, in the distribution of tokens, if you basically do early stopping over these 10 replicas, you actually might produce fewer total output tokens, you know, and actually end up saving cost. And so you just push the entire Pareto frontier with a super simple method. You know, you made 10 calls and had a policy of early stopping on top and kind of a super simple selection function. And that can push the entire Pareto frontier. You didn't have to train a new model. You didn't spend a billion dollars. It didn't take three months. That's kind of inspiring. And to what degree are we talking about in terms of reduced cost, efficiency and greater accuracy? So this is a really important point. So on the cost point, it's relatively simple to understand. We've seen gains of over 1000x, which sounds kind of absurd. But you can have the same performance at a much, much lower cost. And to explain that, it's actually pretty intuitive. The dispersion between the most expensive and the cheapest models is quite large. And so, you know, if you can combine calls to the cheapest model, you know, and actually get the performance of, you know, the kind of frontier system, you know, that actually kind of makes some intuitive sense.
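To make the laconic decoding pattern described above concrete, here is a minimal sketch: fan the same prompt out to several replicas of a reasoning model and keep whichever finishes first. The `call_reasoning_model` coroutine is a placeholder standing in for whatever inference API you use; this is an illustration of the early-stopping idea, not Foundry's or Dimakis's implementation.

```python
import asyncio

# Placeholder: swap in a real inference call (e.g., an HTTP request to your
# serving endpoint). It should return the model's final answer text.
async def call_reasoning_model(prompt: str, replica_id: int) -> str:
    ...

async def laconic_decode(prompt: str, n_replicas: int = 10) -> str:
    """Fan the same prompt out to n replicas of a reasoning model and return
    whichever finishes thinking first; cancel the rest so their (expensive)
    remaining output tokens are never generated."""
    tasks = [
        asyncio.create_task(call_reasoning_model(prompt, i))
        for i in range(n_replicas)
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:  # early stopping over the slower replicas
        task.cancel()
    return done.pop().result()

# Usage (with a real call_reasoning_model wired in):
# answer = asyncio.run(laconic_decode("Solve this problem...", n_replicas=10))
```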
It's kind of a routing. It's kind of an extension of the concept of model routing. But you've introduced a new element there, which is changing to a simpler model as opposed to, you know, comparing equal models. So, but if you're comparing, if we're comparing equal models, before we get to the opportunity to use simpler models, what kind of impact are you seeing with that method? Oh yeah, we've seen really dramatic impact. So maybe just as one headline result, you know, with this work in this paper called Networks of Networks, we looked at the extent to which you can just compose architectures out of calls to a frontier model and push the frontier, push the reliability or quality frontier. And we found that on tasks that were highly verifiable, you could push the frontier quite dramatically. You know, we saw kind of 9% plus gains on these really, really hard, sticky benchmarks, you know, where typically changes across generations of models were only like 1%. So pretty dramatic, pretty dramatic leaps, you know. And we also derived this kind of pretty simple equation, this pretty simple formulation, that allows you to understand to what extent you could push that in the limit. And we kind of proved that and showed that if you're willing to spend a lot of capital and make many parallel calls, then under some pretty reasonable assumptions you can actually push the frontier almost arbitrarily far on tasks that are highly verifiable. And that's something that we're seeing, for example, in the RL reasoning regime. But this is kind of another angle, another dimension to that. And by highly verifiable, would an example of that be like code generation? Code generation, math, these things where the model's taste is better than its execution, or where it's easier to check an answer than it is to generate an answer. Kind of like proofs being the canonical example of that. It is also counterintuitive that you can push this idea arbitrarily far. I would think that there would be like diminishing returns, right? Yeah, there definitely are. It definitely plateaus at some point depending on the fidelity of your verifier, for example, right? Depending on the fidelity of your verifier, also depending on the kind of distribution or variegation of responses from your generators. So if there's mode collapse, if you're using the same model, or if there's not a great degree of diversity in the outputs of your generator models, then this can only go so far. But it's kind of like that, you know, that thought experiment with the monkey at the typewriter writing a book. You know, you can brute force anything arbitrarily far, right? If you have a good enough selector, you can brute force it. But yeah, that's obviously not practical. And so that's where some of the interesting things come in, which is, from a basic research perspective, what types of methods are more efficient, you know, in terms of the cost or the time it takes to converge or to reach those theoretical limitations, those theoretical, I'd say, limits of what you can do. Like, how can you reach that faster? How can you kind of rein that in? You know, those types of things start to get really, really interesting as questions, like the efficiency of these inference-time scaling architecture schemes. We started off talking about, you know, taking a single model, you know, parallelizing it, applying early stopping, that kind of thing. I think the implication is like, that's like a very rudimentary starting place for thinking about composability and networks of networks.
Like, talk about how you evolve this idea and what that lets you do. Yeah, yeah. So just to go back to that example. You know, that example, obviously, it's a heuristic. It's kind of a relatively simple method. It may be brittle, you know. But I think it starts to give some intuition for, oh, well, you know, there is this space, this kind of meta space, you know, of networks of networks. And you can think about what types of strategies are more or less efficient. How can you push the frontier here? You know, it starts to, I think, give people a bit of a picture of what you can do. But to your point, you know, there's this much broader space. Maybe I'll step back a little bit and try to, you know, give people some of the intuition for why we thought infrastructure scaling research would be fascinating in the first place, and to understand, like, what are some of the elements that make this potentially rich. So let me maybe step back and first start with a couple of points of context. The first is that the cost of inference has been falling precipitously for several years. And it's clear that will continue. The cost has been going down roughly 10x per year for the last three years, so 1,000x, to achieve some kind of baseline level of performance. Kind of like, say, GPT-4 level of performance on something like MMLU. You know, it's really, really dramatic. And so now it's starting to, I think, if you kind of extrapolate that trend, which I actually think is one of the trends that's easier to extrapolate in AI. Like, I think we extrapolate other trends that are sometimes a little bit more perilous to extrapolate. Yeah, I think this one's actually slightly easier to extrapolate for a whole host of reasons. But if you extrapolate that trend, then you can start to potentially, in the distance, envisage us creating architectures where, when we say 8B, it's no longer the number of parameters, it's the number of inference calls in some kind of inference system. And that starts to get a little bit interesting. And the question is, by what types of principles should we think about that architecture space? Right, that, I think, starts to elicit a really interesting set of questions. And so that's part one. I think part two is that the ecosystem, in addition to this cost piece, the ecosystem is a lot broader than it used to be. You know, a couple of years ago, there were really only GPT-3.5 and GPT-4 as options. Now, even just within OpenAI, there's this whole model selector, you know, issue that people joke about. There's so many different models, you know. But then also there's other providers. You know, Gemini has multiple generations of models, and then in each generation releases a family of models, now going from Pro to Flash-Lite. And actually, you know, Claude, I think, somewhat started this with Sonnet, Haiku, and Opus in the Claude 3 generation, right? You know, kind of invoking implicitly or explicitly this notion of a Pareto frontier. Gemini is now very explicit about that notion of Pareto frontier and actually says that they define the frontier. That, you know, whether you want high quality and reliability or whether you want low latency and low cost, a Gemini model will give you the best, right? And so I think we've seen that.
We also have obviously open source model providers, you know, Meta with Llama, DeepSeek, there's Grok and xAI. It's a lot more diverse than it was. And I think arguably in some ways more diverse than people expected. At least than some of the people around me expected. And so. I mean, especially considering the early, early observation and conversation around how expensive it was to train these models, there was a period of time where we thought that there would be, you know, only, you know, three model providers in the world because it was just such a Herculean effort to train these. And now we've got this Cambrian explosion of models for, you know, every need. Yeah, that's right. Yeah. I think that's, you know, been really interesting to watch. And I think as a result, across both providers, but then also across models within a single provider family, we've seen this massive dispersion emerge, where the difference in cost between the most expensive model or system, should I say, and the cheapest system is quite wide. There's a massive gulf. And so O1 Pro was, you know, maybe they've changed it, but O1 Pro was $150 per million tokens, whereas DeepSeek R1, you know, was $0.03 per million tokens. That's quite a dispersion, right? You can make a lot of DeepSeek R1 calls for one O1 Pro call, right? And so it kind of starts to get your mind wondering, you know, what could we do? Obviously, routing is a basic idea. You'd say, okay, if it's an easy question, then cheap models are already saturating that level of intelligence. So I'll just send it there. And that's something that one of my collaborators, Lingjiao Chen, did early work on, stretching back to, you know, 2023, 2022. That was really fruitful, and it's a bit of a richer problem than one might think at first blush. But then also you can start to say, beyond just routing among these monolithic models, to what extent can I combine models? You know, can I combine calls to O1 Pro? Or can I even do kind of funky things where maybe I send something to R1 initially, get it to, you know, somehow reframe the question so I'll use fewer tokens from O1 Pro, something like that? And so you can start to envision doing all kinds of funky things. And basic research comes back into play. Things get exciting again. It's not just the most blasé form of scaling things up. Things start to get rich from a research perspective, academics can contribute. It becomes a really dynamic ecosystem again. Yeah, and you can even imagine, like, I'm forgetting the term for it, but you can imagine like a, you know, generative networks of LLM networks, right? Where, you know, these are created on the fly, or they're created, if not on the fly, you know, kind of in a batch mode and then explored for their efficiency. Yeah, exactly. Yeah, exactly. So, you know, these are some of the kind of first-principles reasons why it's interesting. At the same time, as all this dispersion is emerging in terms of cost, we're also finding, hey, things like distillation work extremely well. We've known this for quite a while. I think people are starting to appreciate better and better just how much you can, if you have a specific task that you care about in mind, just how much you can reduce the cost and preserve quality and reliability. Just how dramatically you can push that, how extreme it can be. So we're seeing that.
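Picking up the routing idea raised above (easy questions go to a cheap model, hard ones escalate to a frontier model), here is a rough sketch of what that looks like in code. The `ModelSpec` type, the prices, and the difficulty scorer are all illustrative placeholders I've invented, not details of the routing work referenced in the conversation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelSpec:
    name: str
    usd_per_million_tokens: float   # illustrative price, not a real quote
    call: Callable[[str], str]      # placeholder inference function

def route(prompt: str,
          cheap: ModelSpec,
          frontier: ModelSpec,
          difficulty: Callable[[str], float],
          threshold: float = 0.5) -> str:
    """Send the prompt to the cheap model when estimated difficulty is low,
    otherwise escalate to the frontier model. The difficulty scorer could
    itself be a small classifier or a cheap LLM call."""
    chosen = cheap if difficulty(prompt) < threshold else frontier
    return chosen.call(prompt)

# Example wiring with stand-in models (replace the lambdas with real calls):
# cheap = ModelSpec("small-reasoner", 0.5, lambda p: "...")
# frontier = ModelSpec("frontier-model", 150.0, lambda p: "...")
# answer = route("What is 2 + 2?", cheap, frontier, difficulty=lambda p: 0.1)
```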
And there's also one more point, which I'll invoke, which is a bit more up in the air still. But I think it also seems to me like, even at the frontier, things are very jagged, number one. And number two, the affinities of respective providers and models are starting to become clear, that these different systems have their own affinities. You know, Claude seems to be going more and more after, you know, agentic coding and practical coding, you know, the kind of choices being made in the data set mix and the objective function construction, you know, in the way that the evals are set up, that seem to be saying this provider is going to focus more on this set of tasks, you know, get better at it, you know. Certainly on a model-by-model basis, if not provider by provider. Like, you know, there are coding models, writing models. There are companies out there trying to do domain-specific models for a variety of use cases. You see it in languages as well. You know, the Qwen models from Alibaba are better at, you know, Chinese, idiomatic Chinese, things like that. You're kind of seeing, you know, the fact that these models have different affinities emerging. And right now it might seem like the differences are relatively narrow between models, and they're all kind of at the frontier, and the jagged peaks are relatively shallow. They're not that high, rather. But I think when you project into a more agentic world, compound systems compound errors. And so these small differences accumulate. If you have many, many steps or long-horizon tasks, these small differences become big differences. It's like that, you know, that kind of funny example of a 1% difference per day over a year, you know, that type of thing. You start to see big, big gulfs emerge. Although, you know, to some degree that implies like a uniformity of performance per model per use case. And I'm not sure that we necessarily see that. Claude may be on the broad average perceived as better for code, but for a lot of specific things that a given user is trying to do, you know, maybe Gemini is better, maybe O1 is better. And so I would think that a challenge that one might want to address is finding ways to use these compositional networks of LLMs to get better average performance on all tasks, as opposed to, like, do really well on some benchmark task or something like that. That's exactly right. And I think the fact that, to your point, even within some broader kind of regime or some broader task family, like practical coding, which is hard to define, you know, one model may be better on average, but then there might be big gulfs if you further, you know, stratify that. And that's important. That's actually why this is such a rich thing. If it was just, this model is better at coding, this model is better at that, then that'd be a relatively simple graph that you'd build. You know, a relatively simple routing graph. If it's at the level of a single individual question, if users' own preferences come into play, things start to get a bit richer.
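The "small differences compound" argument above rests on simple arithmetic, which is easy to check. The per-step accuracies below are invented purely for illustration, and the calculation assumes each step succeeds independently.

```python
# Small per-step differences compound over long-horizon, multi-step tasks.
for per_step_accuracy in (0.99, 0.999):
    for steps in (10, 100):
        print(per_step_accuracy, steps, round(per_step_accuracy ** steps, 3))

# 0.99**100 is roughly 0.366, while 0.999**100 is roughly 0.905: a gap of
# less than a percentage point per step becomes a gap of over 50 points
# on a 100-step task (assuming independent steps).
```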
I think that one kind of piece of work that we did that, I think, illustrates this really well is this work that my colleague Lingjiao Chen led, called LLMSelector, which basically was asking the question of, if you have a multi-step pipeline task, you know, some task with multiple steps, and let's say you have a bunch of different models you can choose from, will you do better by, you know, mixing these models to do different parts of that pipeline? Or by taking the best model, you know, the single model that did the best on all steps of the pipeline? Meaning if I only have, let's say, GPT-4, you know, or let's say O3 now, for a given task, versus if I can, you know, mix O3 and Gemini, et cetera. Will the only-O3, only-Gemini-2.5, only-Claude-4 system do better? Or will the kind of hybrid system do better than the best of those? And it turns out that the hybrid system does better. What's an example of the type of multi-step tasks that we're talking about? Yeah. So there's a whole set that we chose partly to, you know, show this is a general thing, not task specific. But, you know, a lot of the most exciting tasks that I think everyone's still trying to work on are like the coding, kind of agentic coding tasks, like, you know, those types of benchmarks like SWE-bench and things like that, where, you know, I think everyone's really excited about these types of tasks at the moment. But there's a few others in kind of math and chat. There's a few others, like kind of somewhat artificially constructed ones as well, to illustrate this. But I think that the kind of SWE-bench types of tasks are the ones that are extremely practical, extremely useful and really exciting. Yeah. So that raises another issue, which is it's not even the task that we're specifically talking about. It's the approach to the task. So you talked about a multi-step. What I interpreted you saying was using different models for each step of a multi-step approach to solving a task. Like SWE-bench, you can solve those one-shot or you can, like, break them up into more of an agentic flow and solve them at a variety of different granularities. And so I was trying to get a sense for, like, what kind of granularity are we talking about here? You know, I think one of the points that I'd highlight is that when you start to think about these compound systems, there's multiple ways you can optimize them. And that's why I think it's starting to spawn a lot of new research. It's starting to kind of engage the academic community again, you know, where they felt left out previously. You know, you can optimize prompts over fixed models in a fixed structure, kind of DSPy style. You know, our collaborators, you know, Omar and Matei, you know, did some great work on that, which is very popular. You can optimize the weights of a verifier or router. You can optimize model selection. You can optimize the architecture of the NoN, right? We've been talking about, you know, that laconic decoding example, where it's kind of an ensemble in some sense, right? You can optimize that, the width of it, you know, the structure of it, et cetera, the type of function that is your aggregation function or your selection function, et cetera. And when you said NoN, you're saying network of networks. Network of networks, yeah, exactly. I'm just now, I'm trying to make this a thing, I suppose. I was like, you know, so I'm just assuming, acting like it's a term everyone knows.
Oh, exactly, network of networks. Right? You can also, you know, change the hyperparameters of the model calls. You can say, I'm going to change the temperature. Like maybe it's the same ensemble of GPT-4, but I'm changing, you know, the generator temperature to elicit greater diversity among their responses that I can then use, right, to further explore the space. So you can also do distillation if there's one part of the system that is linked to a lot of cost. You can fine-tune or distill, right? So we've also shown there's many dimensions. They all have their own cost, infrastructure, time, etc. implications, and it's rich to think about: how would I search this? What's the most effective way to navigate this space? Optimizing prompts can get pretty expensive pretty quickly, even for shallow pipelines. You know, if you try to do that over some large network of calls, that might be really difficult depending on your approach and how you're searching, right? So this starts to get really, really cool. This starts to be a really neat area. And, you know, there's a lot of room to gain, given that dispersion we talked about, given that the cost between the cheapest models and the largest models is so vast, given that these models can give you an answer that is one output token if you frame the question correctly, or it can give you, you know, a million tokens of output, you know, if you frame it wrong. All these things contribute to saying, okay, there's a rich space here. I think people can, I think it's starting to come into view. People can see it now a little bit. And, you know, basic research has a place again in a new way, and you can kind of do experiments at small scale and develop an intuition, because some of these approaches just are small scale. Like, your best-of-N graphs or your ensemble graphs don't have to be massive to push the frontier. So that's kind of this domain. That's my pretty biased way of looking at it. I think the term compound AI systems is a little bit nebulous still, you know, and there's a few ideas that are, you know, kind of being jumbled into this banner. I think that another idea that comes up a lot is kind of tool use, which we haven't talked about. We've been primarily talking about this kind of inference-time scaling architecture approach to it. But there is also tool use, and we can double click on that a little bit as well. But yeah, I think that gives you some intuition for this family of questions that we have been really inspired by. Before we dive into tool use, in the LLMSelector paper, you mentioned that they identified that this hybrid system can outperform a monolithic system. Did they also provide any guidance for folks that, you know, wanted to go down this path, like how to optimize the use of different models in a given system? Yeah, yeah, LLMSelector actually is now a framework that has, you know, some algorithms in there that can kind of automatically help optimize this. That's right, they can search the space. You know, it's still relatively, quote-unquote, research, and so I think there's a lot of work to do to push the efficiency of this, you know, et cetera. But yeah, there's actually a code base associated with the paper that people can find and start using. And a number of people have been doing research extending it. Yeah, so I think this is going to be a really rich area, and it's already becoming rich. Yeah.
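The hybrid-versus-monolithic question discussed above can be made concrete with a toy multi-step pipeline where each step can be assigned a different model and the assignment is searched over. This is a generic sketch with placeholder models and a user-supplied scoring function; it is not the LLMSelector API or its optimization algorithm.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Placeholder model callers keyed by name; swap in real API calls.
MODELS: Dict[str, Callable[[str], str]] = {
    "model_a": lambda prompt: "...",
    "model_b": lambda prompt: "...",
    "model_c": lambda prompt: "...",
}

def run_pipeline(task: str, steps: List[str], assignment: Dict[str, str]) -> str:
    """Run a multi-step pipeline where each named step may be handled by a
    different model, according to assignment {step_name: model_name}."""
    context = task
    for step in steps:
        model = MODELS[assignment[step]]
        context = model(f"Step '{step}':\n{context}")
    return context

def search_assignments(task: str, steps: List[str],
                       score: Callable[[str], float]) -> Tuple[float, Dict[str, str]]:
    """Brute-force search over per-step model choices, scoring each full
    pipeline run with a validation metric. Exhaustive search is exponential
    in the number of steps, which is why smarter allocation strategies matter."""
    best_score, best_assignment = float("-inf"), {}
    for combo in product(MODELS, repeat=len(steps)):
        assignment = dict(zip(steps, combo))
        s = score(run_pipeline(task, steps, assignment))
        if s > best_score:
            best_score, best_assignment = s, assignment
    return best_score, best_assignment
```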
Maybe just to make another point here, I'll just note, and this is a broader branch maybe we should go down, maybe a point around, you know, co-design, kind of systems and algorithm co-design, to give a little bit of intuition for why it starts to get interesting and why, from my perspective, you know, at the cloud and scheduling level, this is really fascinating. Imagine two very different systems. You know, one system is, you know, kind of a vertical inference-time scaling: you're basically producing longer and longer chains of thought to improve reasoning capabilities or to, you know, kind of push performance, and you invoke more compute at inference time to push quality. I mean, that requires more and more HBM. You know, probably what you want in that case is a kind of scaled-up system. You want high bandwidth memory. Exactly, you want more and more GPU memory so you can keep all that in context. You want a bigger system that's connected via NVLink. Maybe you want an NVL72, like NVIDIA's newest system, kind of the big box, right? There's another case, which is kind of more that AlphaCode 2, kind of more classical thing, which is maybe one of the other ways that you can invoke more compute at inference time: horizontal inference-time scaling. Maybe I want to instead say, I'm going to spin up a million replicas. This is literally what they did. A million replicas. And I'm going to ask each coding question of each replica, right? And I'm going to have a million candidate responses. I'm then going to have some protocol for filtering these million candidate responses down to the best one. Or to the best, in the case of AlphaCode 2, best 10 that I can then try one by one and see if the tests pass. But you have some kind of best-of-N construction. These are very different. You know, in the best-of-N case, I might want a million GPUs for 400 milliseconds to do these inference calls. I don't care if they're interconnected. They can be part of different InfiniBand domains. I don't really care. In the vertically scaled case, I want a highly interconnected system for 10 contiguous minutes. You know, and everything should be part of the same domain. It's one contiguous block of compute. These are very different cases. They also might have different implications, depending on the task, for the performance I can expect, but these are very different cases from a systems perspective, right? You know, and so one of the things that's cool for us is saying, okay, these things should probably be priced differently, they should be scheduled differently. You know, if there's too much of one versus the other, then it actually becomes a little bit better. It's like, hey, there's a lot of contiguous compute blocks already allocated that I can't disrupt; I can fit, you know, kind of less-contiguity-requiring workloads in between, you know, et cetera. So you can kind of nudge things, you can kind of nudge to use this broader pool of resources really efficiently and say, hey, I've got a lot of stranded capacity because I've got a lot of contiguous, you know, work going on, a lot of contiguous systems, so I've got a lot of stranded stuff. So if you have a stranded workload, you know, be my guest, come fill this. You know, actually the costs are going to be relatively low because it would otherwise be stranded and unutilized, and then vice versa. So it gets really, really neat.
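Here is a bare-bones sketch of the horizontal inference-time scaling pattern just described (generate many independent candidates, then filter them down with some checker such as a test suite), in the spirit of the AlphaCode 2 example above. The `generate` and `passes_checks` callables are placeholders for your own model call and verifier, and the candidate counts are illustrative, not AlphaCode 2's actual numbers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def horizontal_scale(prompt: str,
                     generate: Callable[[str], str],
                     passes_checks: Callable[[str], bool],
                     n_candidates: int = 1000,
                     keep_top: int = 10) -> List[str]:
    """Spin up many independent generations (no interconnect needed, unlike
    one long chain of thought on a tightly coupled system), then filter the
    candidates down to a shortlist that can be tried one by one."""
    with ThreadPoolExecutor(max_workers=64) as pool:
        candidates = list(pool.map(generate, [prompt] * n_candidates))
    shortlist = [c for c in candidates if passes_checks(c)]
    return shortlist[:keep_top]
```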
And it's a part of the thing that we're trying to do with Foundry, but also this research on compound systems with our broader set of collaborators, which is trying to, you know, make sure that as we get into this world that's highly agentic, with this diverse, almost ecosystem of systems occupying their own ecological niches, we actually have the infrastructure that can support this, you know, in the right way, in an elegant way. So hopefully that makes some sense. No, it does. It calls to mind for me, call it the transition over a prior era from supercomputing to distributed systems, right? There was a time in which, in order to run any large computationally intensive workload, you needed computer systems with shared memory. And then eventually we got to things like Beowulf clusters that had some degree of shared infrastructure. And then over time, more and more of those workloads that required this, like, highly integrated system got chipped away. And now, you know, you've got people running a lot of the things that used to require these supercomputing systems on Kubernetes clusters and highly distributed independent systems. And a lot of that was changes in algorithms. A lot of that was the development of tooling. And just hearing you talk about this, you can kind of see the same thing playing out on the AI side, where, for lack of the infrastructure work or the algorithmic work and the frameworks, for some workloads you might just want, like, NVIDIA's biggest box, but more and more of that's going to be using parallelized types of implementations in a very distributed setting. 100 percent. Yeah, that's a really good thing you're raising here. Yeah, I think maybe two ideas. One would be, I think there's a pendulum, right? And it's almost like a competition. You know, there's almost like a deeper way to explain this, but essentially I think there's a lot of things that people talk about, which is, hey, will more of the work be inference or will more of it be training? And I find a lot of these questions, you know, I don't think they're the right questions. For example, I think that these things will always balance each other out. So, for example, I think one of the things that we all know is, if you're doing a lot of inference, then you might actually want to train your models so that they're cheaper to inference, meaning distill them, meaning overtrain a small model, kind of pushing the Chinchilla scaling laws somewhere they weren't initially intended to go. So you, you know, overtrain a small model so it'll be cheap to inference and the lifetime costs are lower, et cetera. So there's kind of a natural way in which these things will balance out, you know. And so I think what this kind of hints at is that point I've been kind of hammering, which is around co-design. I don't think that people fully appreciate the extent to which the systems that we have have dictated the types of research that have been ascendant over the history of deep learning. I think, frankly, deep learning overall, part of the reason deep learning is the ascendant ML theoretic paradigm is because it scales so well. It's kind of a bitter lesson type notion. It scales so well. It assimilates compute so effectively. And so as Moore's law progressed, deep learning is one of the methods that could take advantage of that.
You know, same thing with the methods within deep learning that have been really ascendant. You know, just a super dense transformer attention computation versus kind of a neater LSTM type of thing that's not as efficient at, you know, using the fact that we have these really, really dense parallel systems that are, you know, initially designed for graphics in some sense, but are really, really good at, you know, these kind of dense matrix operations, et cetera. Like, I think people don't fully appreciate this. And so, you know, one of the things that I guess is exciting to me about, you know, the next little while is I think we are going to see, to the extent that people all converge on one approach, you know, vertical inference-time scaling, whatever it may be, I think we're going to see that actually opens up, you know, kind of creates, as long as there's a system that can actually manifest this, like what we're doing with the scheduling, if there's a lot of contiguous blocks of work, that actually frees up a lot of gaps. And any type of workload that can fill those gaps kind of occupies an ecological niche to itself, right? And isn't facing competition in the same way and, you know, can be cheaper. And so I think that these things will balance out and the pendulum will continue to swing, as it has for quite a long time. I think you invoked kind of MapReduce and the history of these things. And I think that's very much going to play out. I think we see the same thing with async kind of work versus real-time work, you know. I think you'll see that as methods are more and more real time, there'll be a benefit, a computational benefit, to having things that can run in batch mode, async, in the background. And, you know, I think there'll be a very, very diverse and interesting landscape going forward. And so I think if you're a systems and ML person, there's never been a better time to be alive. And I think for people who are on one side or the other, I'd encourage them to try to get more familiar with the other side of that. Yeah, so, you know, and kind of thinking about all that we've discussed with regard to compound AI systems, like practically speaking, if you are, you know, a machine learning engineer or researcher and you're trying to actually do something, like how do you take these ideas and put them into practice? Yeah, for sure. I think that for now, the one primitive or the one idea I'd encourage people to think of is just ensembles or best-of-N, you know, as an introduction, as the community, you know, gets more and more intelligent. I think that in the case where you want to push the reliability or quality frontier, which I think is where most people are. I think most people aren't trying to optimize costs yet. In my biased view, you know, they're not yet at that stage. They're trying to get it to work. You know, they're trying to make the reliability good. They're also trying to cross that demo-to-real-world chasm where you start being judged not by the outlier good examples but by the outlier bad examples, right? They're trying to make sure that everything works. In that case, they want GPT-6 early. They want GPT-7 early. I think one of the really simple ideas that people can try to employ is just calling the frontier models multiple times and trying to combine the answers, at least on those verifiable tasks. That's a really simple thing that works pretty well.
And actually in this Networks of Networks paper, we're able to characterize when this will work well and by how much it will help in terms of three simple numbers. The numbers being some base difficulty, or just the base likelihood that you get the answer correct with one call. And then also, basically, a soundness and a completeness number, which are characteristics of a verifier: if an answer is right, how often is a verifier model able to certify it as right? And in a case where an answer is wrong, what is the fidelity of the verifier in terms of saying, hey, this is wrong, and rejecting it? If you have those three numbers, a base correctness probability, a verifier soundness or certification probability for correct answers, and a verifier rejection probability, then a very simple ensemble or best-of-N structure with generators and a judge will push the frontier pretty reliably. Obviously, it will add cost as well. But, you know, these models are actually overall pretty cost-effective, at least at, you know, tasks that aren't many, many, many steps or multi-task. You know, and so I think that's the thing I would encourage people to try. And we actually have this framework called Ember that we created, trying to make these primitives really, really simple to construct and express and efficient to run. You know, we're trying to make this easier and easier for people to construct these things. And yeah, I think that base ensemble is already a great place to start and also has other nice benefits, like you can use the disagreement between ensemble members to get some sense of, okay, is this thing hallucinating or not? You know, that's already a pretty good place to start to try to employ these ideas in your own work, to get things that maybe weren't working to work. Yeah, so that's what I'd encourage. And, you know, but also I'd encourage people to, you know, follow this area of research and, you know, start being creative themselves. I think we're in the early era. It's kind of like 2010 deep learning, you know, where a million flowers are blooming and there are live ideas and stuff was just starting to work. And people are really excited. I feel like it's, you know, 2012 again. But in this, you know, in this new kind of networks of networks setting, not neural networks. And so I encourage people to say, hey, it's 2010 again, you know, because that was a fun time, you know, the 2010s, you know. Like, yeah, get excited. Do research. If you're an academic, you can push the frontier now. You don't need to be at a frontier lab. You know, I think that's what I encourage people to think about. With regards to Ember, what is the, like, describe that as a framework to work with. What are the ergonomics of it, the developer experience? It kind of came out of our own research on these architectures, on these inference-time scaling architectures. And we wanted it to kind of be, for networks of networks, what Torch was for neural networks. So we wanted it to be fast to run, fast to write. The contributors, a few of them were the original contributors behind Ray and Spark. And so Ember is from that same family. And so we're reusing that Spark banner of fast to run, fast to write. And so things that would take a long time or, you know, have all kinds of little pitfalls, like constructing these big graphs, these big almost-like-neural-network graphs but out of inference calls.
We want to make that really simple and concise, and I also want it to be efficient, in the sense where, just like with neural networks, there's vectorization, you know, some parallelization, et cetera. You can do that here. Or maybe, if you have a large graph of calls, you want to do some type of topological sort and parallel dispatch rather than just doing a for loop, right? And so it'll run faster and, you know, won't be super slow. So we want it to be concise and also pretty efficient to run these things. And so that's what it is. It just has a little bit of like a JAX-slash-PyTorch ergonomics. But then there's also like a kind of Unix-y, string-based compact syntax that someone built as well. But primarily it's kind of like JAX or PyTorch, but instead of neural network modules and nn.Linear and nn.Dropout, things like that, it's these network-of-networks modules and non.ensemble, non.judge, things like that. So it's just a bunch of primitives, you know, tested, validated, you know, trying to make it easier for people to do this work. And we've been using it with collaborators spanning a pretty broad range of universities and also companies. And there's a lot of really cool work going on that I think will be coming out over the coming weeks and months from the community. Some projects I'm pretty energized by that I don't want to steal their thunder by leaking any results. But yeah, I think that's the goal of that, is to make it really easy to search this space. Just like Torch made it really easy to search that neural network space in the 2010s. And kind of looking through the code on GitHub, or sample code on GitHub, the basic ensemble seems fairly straightforward. I think where some of the complexity might come in is in terms of how I apply this to my use cases, particularly the voting and the judge elements. Can you talk a little bit about how one should think about building those out? Yeah, I think the most basic variant already works pretty well. So I would encourage people not to, you know, overthink it for the first variant. The most basic thing is just, you know, these models are already pretty smart. And so you don't have to have a really fancy prompt, you know, to start to get some advantages. You can just have a number of generators, let's just say as a default. I think the default in the system is like 10, you know, nine or 10. I think it's actually typically an odd number, depending on the aggregation function. And then you have a judge, and the judge is, you know, some bigger, more powerful model. The generators, you know, can be that same model or some cheaper model. And basically you just ask the question of all of the generators, tell them to generate candidate answers, and then you just feed those candidate answers to the judge and say, hey, here's some advice from your advisors, you know, your council of advisors, you list it, and given all this, you know, produce a final response to this original question, and you give it the question. That very simple construction is often pretty helpful. You know, people can test it a couple times in their setting, you know, and see if it helps. But it's often quite helpful.
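For readers who want the "council of advisors" construction just described spelled out, here is a plain-Python sketch: several generators draft candidate answers and a stronger judge synthesizes the final response. It is written against placeholder callables and is not Ember's actual module API.

```python
from typing import Callable, List

def ensemble_with_judge(question: str,
                        generators: List[Callable[[str], str]],
                        judge: Callable[[str], str]) -> str:
    """Basic construction: several generators draft candidate answers, and a
    (typically stronger) judge model synthesizes a final response from its
    'council of advisors'."""
    candidates = [g(question) for g in generators]
    advice = "\n".join(f"Advisor {i + 1}: {c}" for i, c in enumerate(candidates))
    judge_prompt = (
        "Here is some advice from your council of advisors:\n"
        f"{advice}\n\n"
        "Given all of this, produce a final answer to the original question:\n"
        f"{question}"
    )
    return judge(judge_prompt)
```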
Yeah, I'm thinking about this in the context of a use case that I've worked on, which is, surprise, surprise, podcast transcription, right? And, you know, what I've found was that, like, particularly for a technical, you know, podcast like this, there are a lot of, you know, names and places and things like that that, you know, your off-the-shelf transcribers would get wrong. And so the approach that I ended up taking, that worked really well, was to scrape a lot of information about a guest, like the stuff that they worked on online, and then go, like, essentially paragraph by paragraph and ask a model, you know, given all this information we have, like, what in here is probably a transcription error and not right. And it works really well. But in light of this conversation, I'm thinking that another approach would be, you know, just do it 10 times and, you know, have something, like, try to rationalize across those 10 times, because it was often the case that, you know, it's just the randomization, the random nature or probabilistic nature of this, where sometimes it would be right and sometimes it would be wrong. And the additional machinery that I built, like, it increased the likelihood that it was right, but it wasn't because I was necessarily introducing new information. It was just, like, biasing it towards that new information, if that makes sense. I know exactly what you mean. Yeah, 100%. Yeah, two ideas come strongly to mind here. One is, in this 'Are More LLM Calls All You Need?' paper, you know, we're able to kind of characterize what you're describing a little bit, but also find out when this doesn't work and when it does work. And it's actually very nice intuition. Basically, what we changed between the Networks of Networks paper and the 'Are More LLM Calls All You Need?' paper is the nature of the aggregation function, right? So they're both ensembles, but in one case you're doing best-of-N, and in the other case you're doing most-common-of-N. And these have very interesting theoretical properties. Most-common-of-N is like, you know, democracy, right? It works really, really well. It's a quorum type of. It's a quorum. Exactly. We found that that simple idea does work really well, but only in cases where it's an easy question and the default model would have gotten the answer right, but there's some probability that it hallucinates. In that case, the variance reduction of the ensemble quashes that hallucination probability. And so you kind of converge on 100% accuracy. And my intuition is that that probably is not the best approach in my case because, you know, oftentimes it's like some obscure acronym or something that it would get probably, you know, 20% of the time right, but not 80% of the time right. Exactly right. And so in the other case, where it's a hard question, the variance reduction actually hurts you, right? It removes the outlier probability that you get the answer right. And there's actually some funny organizational design implications, like for breakout ideas, you probably don't want consensus or decision by consensus or committee, but I'm extrapolating a little bit, but I think it's kind of fun to think about. That's a property of this type of voting-based or quorum-based system, you know, or consensus-based system, rather. In the other case, with the best-of-N, you don't necessarily see this, right? It has its own properties. And that is actually more monotonic in terms of the more ensemble members you have, the more calls you have, the more that kind of leads to a performance increase, regardless of whether you're in the hard regime or the easy regime, which we can talk about. It has to do with that verification-versus-generation complexity asymmetry notion and the soundness and completeness notions, which are characterized in that paper.
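The easy-versus-hard intuition about voting above can be checked with a few lines of arithmetic. The function below computes the probability that a strict majority of n independent samples is correct, as a simplified stand-in for "most common answer wins"; the 0.8 and 0.2 base rates are invented for illustration.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct
    when each sample is right with probability p (a simplified stand-in for
    'most common answer wins')."""
    k = n // 2 + 1  # smallest strict majority
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Easy question (p > 0.5): voting quashes the occasional hallucination.
print(round(majority_vote_accuracy(0.8, 9), 3))   # ~0.98, above the 0.8 base rate
# Hard question (p < 0.5): voting removes the lucky outlier that was right.
print(round(majority_vote_accuracy(0.2, 9), 3))   # ~0.02, below the 0.2 base rate
```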
But it's kind of funny that, number one, there are differences in the behaviors of these sorts of systems. And then also, you can think about how that applies to your case. I think it's, yeah, it's a fun set of things. But yeah, I think in your case, that's a really neat example, this kind of, yeah, the transcription error resolution thing. Yeah, that's neat. Yeah, one thing that occurs to me in this context is, you know, say we're going for, you know, best-of-N, and do we end up with a case where now my, you know, single judge is a single point of failure? Like, that thing hallucinates and it destroys my whole, you know, it doesn't matter how many generators I have because I'm really dependent on the judge. Yeah, that's totally right. So there's a couple of ways that you can do this. So the construction we study in the Networks of Networks paper is actually a specific type of judge, right? So this actually is a funny question. Like, what should the judge's prompt be? How should the judge work? The construction that we study, for theoretical reasons, is what we call a verifier-based judge. It implies, though, that your problem can be verified. Like, it's a verifiable task. It does, but actually, even if it can't be verified, it's an interesting construction. So the way it works is, let's say you have 10 candidate responses. You can show them one by one to a judge, and the judge is just trying to produce: is this correct or is this incorrect? Thumbs up, thumbs down. That's all the judge can do. That's all the judge can do. And so now that starts to get interesting, because if the judge says everything's wrong, then you return a random answer, or you return the first thing that the judge says, yeah, this looks good. Now, the funny thing about that really simple protocol is that it doesn't necessarily make things worse versus, like, your average case, the initial generator likelihood, unless your judge is, like, more likely to certify a wrong answer as correct than a correct answer as correct, which is kind of extreme. But you can look at those probabilities. It's kind of funny. I kind of like this because I like cryptography a lot, I like theoretical computer science a lot. It's kind of funny that those kind of P-versus-NP ideas, you know, of, like, how easy it is to verify something versus generate it, those cryptography ideas from, like, probabilistically checkable proofs and PCPs, come in here. Very rarely is it going to be the case that you have some model that's so ill-conditioned that it's, like, more likely to certify a wrong answer as right than a right answer as right. But that's something you can study probabilistically and empirically, you know, on some subset of questions, a holdout, et cetera, and kind of try to get a sense of that, right? But as long as that's not, like, really skewed in a bad way, then, you know, you're actually pretty likely to, you know, with high probability, see a gain from this verifier-based judge construction, right?
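Here is the verifier-based judge protocol just described, reduced to a few lines, together with a back-of-the-envelope accuracy formula. The `certify` callable is a placeholder for a cheap thumbs-up/thumbs-down judge call, and the closed-form expression is my own derivation under an independence assumption (each candidate correct with probability p, certification probability s for correct answers, rejection probability r for incorrect ones); it illustrates the shape of the argument rather than reproducing the exact formulation in the Networks of Networks paper.

```python
import random
from typing import Callable, List

def verifier_judge(candidates: List[str], certify: Callable[[str], bool]) -> str:
    """Show candidates one by one to a judge that can only answer thumbs-up /
    thumbs-down; return the first certified answer, or a random candidate if
    the judge rejects everything."""
    for answer in candidates:
        if certify(answer):           # thumbs up
            return answer
    return random.choice(candidates)  # judge rejected everything

def best_of_n_accuracy(p: float, s: float, r: float, n: int) -> float:
    """Probability the protocol above returns a correct answer, assuming
    independent candidates: p = base single-call correctness,
    s = P(certify | correct), r = P(reject | incorrect)."""
    q = 1 - p * s - (1 - p) * (1 - r)   # P(a given candidate is not certified)
    if q >= 1:                          # the judge never certifies anything
        return p
    certified_correct = p * s * (1 - q ** n) / (1 - q)
    fallback_correct = (q ** n) * (p * (1 - s) / q) if q > 0 else 0.0
    return certified_correct + fallback_correct

# A 60%-accurate generator with a decent verifier climbs from 0.6 at N=1
# toward roughly p*s / (p*s + (1-p)*(1-r)) as N grows.
for n in (1, 3, 10):
    print(n, round(best_of_n_accuracy(p=0.6, s=0.9, r=0.8, n=n), 3))
```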
And so I think that highlights that there is this design space. A normal judge might be bad: it might be overwhelmed by context in some way, there might be other issues, or it might just be more expensive. This verifier-based construction, it's kind of funny, its properties are really clean theoretically, in a way that you can kind of guarantee and be confident about.
Does this verifier-based judge, limiting it to thumbs up, thumbs down, mean that you could use much cheaper models for verification? Or is there still a requirement that you use a more powerful model as your judge?
Yeah, you can definitely use much cheaper models for verification, as long as that soundness and completeness story is correct. It doesn't have to be a powerful model, that's 100% right. It's a function of the likelihood that you certify a correct answer versus the likelihood that you certify an incorrect answer. But then another nice benefit of thumbs up, thumbs down is that input tokens are much cheaper than output tokens. So there's something nice about the fact that the output is just...
Kind of a token compression in your judge layer.
Exactly. If the output is just thumbs up or thumbs down, and you're just returning an answer that was already generated by the generators, that plays into the cost equation as well, which is interesting to start to think about.
But you didn't talk about parallelizing that judge layer. I was envisioning turtles all the way down: highly distributed networks of networks of networks, with each layer being very parallel. Is that something you've looked at?
100%, yeah, exactly right. To your exact point, you could say maybe the judge itself should be a network, or maybe this whole little primitive of an ensemble plus a judge should be a node within another network with its own judge. It's trees, right? And so it starts to get pretty rich: you start to have deep networks of networks instead of deep neural networks, and so on. And then you start to think about lost-in-the-middle dynamics. We know these models have somewhat limited context, so you can't feed an infinite context length, or even if you do, quality diminishes. So maybe you do want to break things up and have smaller domains where fewer generators go to a single judge, and then you aggregate. It's like representatives, right? A whole architecture layer emerges. Or maybe some notion of skip connections starts to emerge, where something way down below got quashed by a judge layer, so you need to feed it up. You get back to deep neural network concepts like vanishing gradients and skip connections. These things come back in cool ways, which I think is really exciting to start to think about.
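A rough sketch of that "trees of ensembles plus judges" idea is below. It reuses the verifier-judge selection from above and treats each node's children as either plain model calls or further nodes; the `Generator` and `Judge` callables are hypothetical placeholders, and this illustrates the composition pattern rather than any actual architecture from the papers or from Foundry.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Union

Generator = Callable[[str], str]   # a single model call
Judge = Callable[[str], bool]      # thumbs up / thumbs down

@dataclass
class EnsembleNode:
    """One 'ensemble + judge' primitive. Children may be plain generators or
    other EnsembleNodes, which is what turns the composition into a tree:
    a deep network of networks rather than a deep neural network."""
    children: List[Union[Generator, "EnsembleNode"]]
    judge: Judge

    def __call__(self, prompt: str) -> str:
        candidates = [child(prompt) for child in self.children]
        for answer in candidates:            # verifier-based judge protocol
            if self.judge(answer):
                return answer
        return random.choice(candidates)     # fallback if everything is rejected
```

Keeping the fan-in per judge small is one way to sidestep the lost-in-the-middle concern: each judge only ever sees a handful of candidates, and cheaper models can sit at the lower layers of the tree.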
And it reminds me of the term I was searching for before: neural architecture search is the idea that comes to mind here, except it's network-of-networks design, a highly automated process. I've seen evolutionary approaches to this and other approaches to searching the space; it can get pretty interesting. But I don't get the sense that we're quite there yet.
Not quite there yet. It's still the very early days. A lot of the work so far is like those early days of perceptron and MLP research, where we had these wide but relatively shallow networks with relatively few parameters. Over time we scaled them up, things got richer and richer, we started running into vanishing and exploding gradients, and we started thinking about ways to combat that. I think we're in those very, very early days still. Partly because the infrastructure doesn't support doing richer research: the latency is going to kill you if you try to make a thousand calls to GPT-4 with nested for loops. It's going to get pretty ugly pretty quickly, and it's also going to get pretty expensive pretty quickly. But as this dispersion emerges more and more, as distillation works better and better, as we have cheaper and cheaper systems, and as you have infrastructure like Ember that makes it faster to instantiate and run these types of ideas, I think there will be more and more research, and over the next months and years I think it'll start to get wild: we'll have multi-billion parameter networks of networks with very intricate structure.
And you kind of alluded to this earlier, but you can almost see a world in which you've got some problem, you've defined your problem, you have your data, and then, as opposed to training a model, you evolve a network of networks and then distill that network of networks down to a simpler model or a simpler network of networks. So I'm wondering, and I guess we're getting even further out ahead here, do you track any research that looks at network distillation? Are we anywhere there yet, or is that still to come?
Yeah, I think we see early forms of this already in some of the reasoning research. Actually, I want to make two points about ways in which compound systems are already around us, but that we don't really recognize as compound systems and haven't taken to their limit. One is speculative decoding and the other is what you just mentioned, distillation of networks of networks.
Would you add MoEs to that mix as well?
In some sense, yeah, that's definitely true, although maybe in a slightly different way. Let me actually start with speculative decoding, because it's so common, it's everywhere. With speculative decoding you have a funny little compound system where there's a small drafter model and then a big verifier model, and the idea is that you can get greater efficiency.
It pushes out the frontier in terms of latency versus quality in a way that's provable: you're not losing quality, but you're getting an expected improvement in latency, in speed, in tokens-per-second throughput. The way it does it is that you have the drafting model running ahead, and then the verifying model is, in parallel, rejecting or accepting multiple tokens at a time. And this is a funny case of at least a very simple hybrid system, a small model and a bigger model being combined to push the Pareto frontier for the overall system in terms of cost, latency, et cetera. And it doesn't actually increase cost by as much as people might expect because of this fact about a GPU, which is kind of funny: you want to oversubscribe it, you want to keep it utilized, the cost is mostly CapEx, et cetera. So having that smaller model doing things in parallel that can then be verified by the bigger model doesn't increase your cost the way people might imagine. That's one very simple example of a compound system that's all around us but that we don't think of that way.
And another is what you're talking about, distilling a network. I think we're already kind of doing this when we have many, many replicas in a standard RL pipeline. You spin up many replicas, you ask them questions, they all produce candidate responses, maybe one of them is correct, and that's the trace, the rollout, that you end up training on, or overweighting in your training. That's a funny case of using compound systems within the classic RL framework. Rather than using that training data to train a single model, you can imagine that system itself being something you just do at inference time, and there's obviously a whole spectrum in between. And by the way, this type of thing has been done multiple times. In a lot of the chain-of-thought research, one of the basic things people have been doing is saying: let me do multi-step rollouts of the current state-of-the-art model, get it to generate ten successive refinements based on some question and its initial reasoning, then take that entire thinking trace and use it as training data to train the next generation of the model to produce that whole ten-step reasoning trace in one shot, and then bootstrap off of that, and do it again, and do it again. And obviously there's all kinds of potential trouble in that.
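For readers who haven't seen it, here is a heavily simplified, greedy sketch of the speculative decoding loop described above: a small drafter proposes a short block of tokens and the big model keeps the longest prefix it agrees with. The `draft_next` and `target_next` callables are hypothetical single-token interfaces; real implementations verify the whole block in one batched forward pass and use probabilistic accept/reject rather than exact greedy matching, so treat this purely as an illustration of the drafter/verifier structure.

```python
from typing import Callable, List, Sequence

def speculative_decode(draft_next: Callable[[Sequence[int]], int],
                       target_next: Callable[[Sequence[int]], int],
                       prompt: List[int],
                       max_new_tokens: int,
                       k: int = 4) -> List[int]:
    """Drafter speculates k tokens ahead; target accepts the agreeing prefix
    and substitutes its own token at the first disagreement. Output quality
    matches the target model; the win is that the target can check a whole
    block at once instead of generating one token per forward pass."""
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1) drafter runs ahead by k tokens
        ctx = list(tokens)
        draft = []
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) target verifies the draft (one batched pass in a real system)
        for t in draft:
            expected = target_next(tokens)
            if t != expected:
                tokens.append(expected)   # replace the bad guess, then start drafting again
                produced += 1
                break
            tokens.append(t)
            produced += 1
            if produced >= max_new_tokens:
                break
    return tokens
```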
But I think you can see the contours, you can see how that would work and how that type of thing would extend. So I think that's already been in the works, implicitly at least, for a while.
Yeah, 100%. It's pretty interesting.
Yeah. Something that I don't think everyone knows or has intuition for: we mentioned earlier how shocking it is that distillation works so well, just how well distillation works. I think even on the hardware side, people don't appreciate how well ASICs work, application-specific integrated circuits: when you have a task in mind and you can design the chip for it, how well that whole approach works. And I think that's part of why I believe in this compound systems, system-of-models, menagerie-of-models type of vision. The GPU in some sense is a proof point that an application-specific system, built in a co-designed way with an end use in mind, can really bend the curve on cost.
Yeah, it's interesting that there's one lens in which a GPU is a general-purpose building block, but then there's another lens in which you can look at it as an application-specific building block for graphics.
Yeah, it very much is, right? And I think it was kind of obvious that the CPU was not the right form factor for a lot of use cases in a fundamental way, and that's why you need accelerated computing, that's why you need the GPU, in a fundamental way. I think people haven't taken that to its limit. Obviously there's some natural bound; you wouldn't necessarily need infinite variation and infinite specialization, but there are at least different poles. And we've seen this in the last few years of deep learning, where the TPU is very much explicitly described as a co-designed system: what level of precision do you need, the precision format, the design, the extent to which it's well suited to parallelism, but then also the models themselves and the actual learning algorithms and how tolerant they are to noise from lower precision, et cetera. All of these things co-designed into one system. I don't think people have really appreciated this. So when it comes to that question of whether the future will be compound systems or a single model, I basically point people to the CPU versus the GPU to say there will at least be a couple of different poles. There will at least be that kind of pairing of a small, highly distilled model with big models, if not something even richer. It goes back to this theme that I'm going to keep pushing, which is, number one, basic research is back, there's a lot of work to do; and number two, it's never been a more interesting time to do co-design and to have both the systems and ML, the engineering and research, perspectives on these things.
And bringing this all full circle: given this viewpoint about the way ML systems are going to evolve, and this kind of scaling out of distributed inference, it makes perfect sense that you'd start a company at the compute layer doing inference.
Yeah, for sure. The way I thought about it, at least, is that it was pretty clear to me that compute is really a root-node problem for ML.
I was very much an algorithms, deep learning person, but it's pretty clear that part of the reason deep learning became the dominant ML paradigm in the first place is the way it assimilates compute. So in some sense deep learning is a root-node problem for all these downstream applications across health and education and materials and robotics, all these things that I think are really important and interesting, you name it. But the problems that are upstream of deep learning are largely systems problems. And it was pretty clear to me that the cloud as we currently know it, and its economic model, was not well suited to deep learning, and it was really going to narrow our creativity and limit the types of things we could do. It was already hurting companies, where they would have to raise tons and tons of money to buy a long-term contract. The cloud just wasn't working the way it was supposed to. The whole promise of the cloud, elasticity, on-demand, infinite compute on tap, not needing to capacity plan, that whole notion was not being fulfilled in the AI cloud context. All of the desiderata of the original cloud, none of them were really being fulfilled in the AI cloud context. Almost all the complexity of the infrastructure was being pushed onto the users themselves, as opposed to the cloud handling it for you so you can just focus on making your beer taste better. None of it was fulfilled in the AI cloud context. And you need to really fundamentally rethink the way the cloud works, re-envisage a lot of these systems, the business models, et cetera, in order to fulfill the original promise of the cloud in the AI cloud context. It was clear to us that we were not on the path to that happening, that things were not moving with the right velocity, either in the right direction or at the right speed, across these multiple angles. So it was, okay, we need to push this, otherwise we're definitely going to reach limitations at the research level. And so we started pushing on that. I think I was fortunate to have both the systems perspective, at least enough understanding of the cloud and how scheduling works and those types of things, and also the algorithms side, to see where there were gaps where people on one side or the other weren't thinking about how you need to design the full system. We've got a lot of work to do at Foundry to fully realize our vision, but I think we've made a lot of progress so far and helped a lot of companies. For certain types of workloads we've been able to cut the cost by 12 to 20x, particularly for workloads that are amenable to running in a preemptible fashion, or being checkpointed, or running in a heterogeneous way, or running in a batch mode where they just need six hours within the next 12 hours and don't care which six hours. For those types of workloads, we've been able to create a lot of surplus.
Yeah. When I look at Foundry and think about it in the context of other companies that are going after similar ideas, I think of Modal and Fireworks.
I would characterize Foundry as really pushing innovation at the economic interface of inference, whereas these other companies are doing something different: Modal is pushing the serverless idea, and Fireworks is more traditional, where you just abstract away the call and use it in a similar way. But there are some interesting ideas in what you're doing, like setting the price for your workload, essentially saying this is how much it's worth to you, run it when you can. That's an interesting concept, and there are a bunch of other marketplace-oriented ideas built into the way you're approaching things.
That's right. Yeah. One way I'd characterize it is, you know, Jensen and others have talked about how there are a couple of different regimes you might fall into. In one regime you really care about low latency, and in the other you really care about high throughput. Let's just take the life of a researcher, let's not even go to the end user of ML. In the life of a researcher, I have verification, training, and debugging workloads I'm running throughout the day. In that context, if I have a five-minute, 10-minute, 20-minute, one-hour queuing time on the cluster, I'm very unhappy. I want it to be snappy. I don't need many nodes, I want one node quickly, and these are short-lived workloads; I'm willing to pay a lot more than the typical rate for a GPU to get it faster, for example. In the other case, maybe I have a workload that I kick off at 8 p.m. as a typical researcher, and I come back the next morning at 10 a.m. and want it done. In that 14-hour span, I don't really care when you finish the workload; it's a six-hour workload, finish it whenever. With maybe one caveat: I might want it to start quickly so I can see the loss curve going down and confirm there's no error. But at that point, as soon as I see the curve starting, I'm going to close my laptop and walk away. I just want it done by the time I come back. And if you tell me you could do that 10 times cheaper than if I needed it done in one shot, I'm like, yeah, sure, that's totally great. If other people can do their work and there's 10 times more effective capacity on the cluster because of that, that sounds awesome. Why not? And conversely, for that interactive work during the day, if, because you're being efficient and putting some things on batch, I can get my stuff snappily, even if it's a little more expensive than the batch rate, I'm fine with that. So there's this heterogeneity. Same thing with these agentic cases. There are going to be things like Operator where I want it to be really fast; it's way too slow today, I want it to be as quick as possible, go and perform a task in real time, et cetera. And I have other things that I'll launch async that can run in the background, like deep research or async coding updates, as long as they're high fidelity. And so there are these different modes.
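To make those two modes concrete, here is a purely hypothetical sketch of what a workload spec that exposes this flexibility to a scheduler might look like. This is not Foundry's actual API, and the prices are made-up numbers; the point is just that the submitter declares how latency-sensitive the job is, what deadline it has, and what it's worth to them.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass
class JobSpec:
    """Hypothetical workload description: what a scheduler would need to know
    to trade off snappy interactive jobs against cheap, deadline-flexible ones."""
    name: str
    gpu_count: int
    run_time: timedelta                # compute needed once the job is running
    deadline: Optional[timedelta]      # finish within this window; None means as soon as possible
    preemptible: bool                  # can be checkpointed, paused, and resumed
    max_price_per_gpu_hour: float      # what the work is worth to the submitter (made-up numbers)

# Daytime debugging run: one node, right now, happy to pay a premium for low queue time.
debug_job = JobSpec("debug-run", gpu_count=1, run_time=timedelta(minutes=20),
                    deadline=None, preemptible=False, max_price_per_gpu_hour=6.00)

# Overnight training: six hours of work, any time in the next 12 hours, deeply discounted.
overnight_job = JobSpec("overnight-train", gpu_count=8, run_time=timedelta(hours=6),
                        deadline=timedelta(hours=12), preemptible=True,
                        max_price_per_gpu_hour=0.80)
```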
And sometimes, just like in the search case with Google, you have workloads like MapReduce and indexing that are background things you're doing explicitly for the purpose of making serving fast when you need it. You can have a distilled model, or a really fast model, or just a retriever at inference time, because you did all this work up front, pre-computing things. So there are these different regimes, and I think you need an infrastructure, a cloud, that can handle those regimes efficiently and pack them all together. It's kind of like playing Tetris. It would be much easier to play Tetris, a crazy cheat code, if you could delete blocks that were an awkward shape and making things hard, and have them come back later. If you could take a block and split it into sub-blocks, or turn a block into sand when you need to. If you could make a falling block fall slower, or just pause it while it's falling because it's inconvenient. If you could do these things, obviously, that would be a massive cheat code in Tetris and would let you pack the space much more effectively. I think something similar holds in the cloud context, and we want an economic model that lets people say: hey, if your workload can be deleted and come back later, there's a massive economic benefit we can give you for that, based on the effect you're having on the overall system. That's what we want to be able to express at the industry level, which obviously requires a ton of things. There are some very heavy virtualization, networking, and scheduling elements to this, but also UX and business-model elements. Being able to express that is a really, really powerful thing for the ecosystem as a whole. We have a parking lot analogy I could give that conveys some of this even more intuitively, but I think you get the overall intuition.
As a sidebar, if someone vibe codes that Tetris game, I really want to play it. It sounds really cool.
Yeah, for any listeners out there, that's an open invitation, a challenge rather. These types of questions, these systems utilization questions, were pretty esoteric at one point. Back when some of my collaborators, some of the senior PhD students in my lab, were doing work on intelligently using spot, reserved, and on-demand instances for a workload, we were saving some costs, but these were like $20K workloads. Those were the olden days. Now these companies are spending more on compute than on people, in many, many cases, from the biggest companies like Google and Microsoft, to the smallest startups, to companies in the middle like OpenAI that are very late-stage privates. It's the primary economic budget in their P&L, and as a result the dollar amounts are much, much larger. We're talking tens of billions, hundreds of billions of dollars of infrastructure spend, and these workloads are significantly more expensive. So these questions that were previously esoteric, about compute utilization, are now central questions. I think they're pretty fundamental to the future of the field, and that's partly why we said, okay, we need to build this the right way.
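As a toy illustration of the Tetris idea, here is a small, self-contained packer that slots deadline-flexible, checkpointable jobs into whatever GPU capacity is left over after latency-sensitive work has been placed. It is a deliberately naive earliest-deadline-first heuristic, not Foundry's scheduler; the interesting bit is that preemptible jobs can be split across non-contiguous hours, which is the "turn a block into sand" cheat code.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class FlexJob:
    name: str
    gpus: int            # GPUs needed whenever the job is running
    hours_needed: int    # total hours of work; checkpointable, so hours need not be contiguous
    deadline_slot: int   # must be finished before this hourly slot

def pack_flexible_jobs(free_gpus: Dict[int, int], jobs: List[FlexJob]) -> Dict[str, List[int]]:
    """Earliest-deadline-first packing of flexible jobs into leftover capacity.
    Jobs that cannot be fully placed before their deadline are skipped and
    any tentatively claimed capacity is released."""
    schedule: Dict[str, List[int]] = {}
    for job in sorted(jobs, key=lambda j: j.deadline_slot):
        placed: List[int] = []
        for slot in range(job.deadline_slot):
            if len(placed) == job.hours_needed:
                break
            if free_gpus.get(slot, 0) >= job.gpus:
                free_gpus[slot] -= job.gpus
                placed.append(slot)
        if len(placed) == job.hours_needed:
            schedule[job.name] = placed
        else:                               # didn't fit by the deadline: give capacity back
            for slot in placed:
                free_gpus[slot] += job.gpus
    return schedule

# Example: 8 GPUs free in most hourly slots, only 2 free during a busy daytime slot.
free = {0: 8, 1: 2, 2: 8, 3: 8, 4: 8, 5: 8}
jobs = [FlexJob("overnight-train", gpus=8, hours_needed=3, deadline_slot=6),
        FlexJob("batch-eval", gpus=2, hours_needed=2, deadline_slot=4)]
print(pack_flexible_jobs(free, jobs))   # both jobs fit around the busy slot
```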
Do you do this via a kind of translation layer on top of another cloud?
Phenomenal question. So we do both. We have some infrastructure that we own and operate, but it's relatively small, and then we also partner with other clouds. We have a partnership we're going to be announcing relatively soon with one of our partners that shows how we do this, and we'll talk about the way in which we're partnering. Basically, it's a win-win thing. The parking lot analogy makes this clear: if we can pack more work in, we're creating surplus. We're creating more revenue for the owner of the compute, we can cut costs for the person whose job is being packed in there more efficiently, and within that surplus we can also capture upside. So it's a really nice structure. We want our layer to run across as much AI compute as possible, ideally on every unit of compute, every single GPU, and over time literally anything that can do useful computation at sufficient scale. Different workloads need security or contiguity to different degrees, and that's all managed by the scheduler, which is partly why this layer, the orchestration and scheduling layer, needs to be smart. So yeah, exactly right. Obviously we could use only our own compute, but there's a limit to how much we can scale that, particularly via venture funding. So we ideally run as a layer across all compute and partner with others to create more surplus in this way. And I think if we can do that, not only are we creating a lot of economic value, but it will also lead to the ML space being so much more dynamic, so much richer as well.
Well, Jared, this has been a great convo. I really appreciate you taking the time to talk through all the many things that you're working on. Very cool stuff.
Yeah, thank you, Sam. Really fun conversation, and it's fun, you know, your mind; you're pretty creative and generative, so this was a fun one.
Awesome, thanks so much.
Thank you.
Related Episodes

Why Vision Language Models Ignore What They See with Munawar Hayat - #758
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
57m

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
48m

Proactive Agents for the Web with Devi Parikh - #756
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
56m

AI Orchestration for Smart Cities and the Enterprise with Robin Braun and Luke Norris - #755
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
54m

Building an AI Mathematician with Carina Hong - #754
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
55m

High-Efficiency Diffusion Models for On-Device Image Generation and Editing with Hung Bui - #753
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
52m