

Closing the Loop Between AI Training and Inference with Lin Qiao - #742
TWIML AI Podcast
What You'll Learn
- The purpose of AI models is to serve the product, not just achieve good offline metrics. Product A/B testing is the ultimate judge of model success.
- Decoupling training and inference systems leads to compatibility issues and slows down the development lifecycle. A cohesive system that aligns training and inference is necessary for both fast experimentation and large-scale production.
- Fireworks focuses on providing an end-to-end developer platform, starting with inference as the foundation and expanding to support various customization techniques like reinforcement fine-tuning.
- Fireworks does not build foundation models, as the open model quality is already high, and instead focuses on providing a seamless platform for post-training tasks like fine-tuning and deployment.
AI Summary
The podcast discusses the importance of seamlessly integrating AI training and inference for effective product development. The speaker, Lin Qiao, shares lessons learned from her experience leading the PyTorch team at Meta, highlighting the need for a cohesive system that aligns training and inference for both fast experimentation and large-scale production. Fireworks, the company she co-founded, aims to provide a GenAI SaaS platform that optimizes the entire AI lifecycle, starting with inference as the foundation and expanding to support various customization techniques like reinforcement fine-tuning.
Key Points
1. The purpose of AI models is to serve the product, not just achieve good offline metrics. Product A/B testing is the ultimate judge of model success.
2. Decoupling training and inference systems leads to compatibility issues and slows down the development lifecycle. A cohesive system that aligns training and inference is necessary for both fast experimentation and large-scale production.
3. Fireworks focuses on providing an end-to-end developer platform, starting with inference as the foundation and expanding to support various customization techniques like reinforcement fine-tuning.
4. Fireworks does not build foundation models, as the open model quality is already high, and instead focuses on providing a seamless platform for post-training tasks like fine-tuning and deployment.
Topics Discussed
AI inference, AI training, Enterprise generative AI, PyTorch, Reinforcement fine-tuning
Frequently Asked Questions
What is "Closing the Loop Between AI Training and Inference with Lin Qiao - #742" about?
The podcast discusses the importance of seamlessly integrating AI training and inference for effective product development. The speaker, Lin Qiao, shares lessons learned from her experience leading the PyTorch team at Meta, highlighting the need for a cohesive system that aligns training and inference for both fast experimentation and large-scale production. Fireworks, the company she co-founded, aims to provide a GenAI SaaS platform that optimizes the entire AI lifecycle, starting with inference as the foundation and expanding to support various customization techniques like reinforcement fine-tuning.
What topics are discussed in this episode?
This episode covers the following topics: AI inference, AI training, Enterprise generative AI, PyTorch, Reinforcement fine-tuning.
What is key insight #1 from this episode?
The purpose of AI models is to serve the product, not just achieve good offline metrics. Product A/B testing is the ultimate judge of model success.
What is key insight #2 from this episode?
Decoupling training and inference systems leads to compatibility issues and slows down the development lifecycle. A cohesive system that aligns training and inference is necessary for both fast experimentation and large-scale production.
What is key insight #3 from this episode?
Fireworks focuses on providing an end-to-end developer platform, starting with inference as the foundation and expanding to support various customization techniques like reinforcement fine-tuning.
What is key insight #4 from this episode?
Fireworks does not build foundation models, as the open model quality is already high, and instead focuses on providing a seamless platform for post-training tasks like fine-tuning and deployment.
Who should listen to this episode?
This episode is recommended for anyone interested in AI inference, AI training, Enterprise generative AI, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
In this episode, we're joined by Lin Qiao, CEO and co-founder of Fireworks AI. Drawing on key lessons from her time building PyTorch, Lin shares her perspective on the modern generative AI development lifecycle. She explains why aligning training and inference systems is essential for creating a seamless, fast-moving production pipeline, preventing the friction that often stalls deployment. We explore the strategic shift from treating models as commodities to viewing them as core product assets. Lin details how post-training methods, like reinforcement fine-tuning (RFT), allow teams to leverage their own proprietary data to continuously improve these assets. Lin also breaks down the complex challenge of what she calls "3D optimization"—balancing cost, latency, and quality—and emphasizes the role of clear evaluation criteria to guide this process, moving beyond unreliable methods like "vibe checking." Finally, we discuss the path toward the future of AI development: designing a closed-loop system for automated model improvement, a vision made more attainable by the exciting convergence of open and closed-source model capabilities. The complete show notes for this episode can be found at https://twimlai.com/go/742.
Full Transcript
The purpose of those models is to serve product. The model by itself doesn't have any meaning. If model training doesn't move the needle for product, it's not meaningful. It's not useful. Product A/B testing is the ultimate judge of whether your model investment is paying off or not, is successful or not. So actually the fast iteration experimentation loop has both training and inference combined together to declare victory. All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Lin Qiao. Lin is CEO and co-founder of Fireworks AI. Before we get going, be sure to hit that subscribe button wherever you're listening to today's show. Lin, welcome to the podcast. Thanks for having me, Sam. I'm super excited to dig into our conversation. We're going to be talking about a pretty broad variety of topics, but really digging into AI inference and enterprise generative AI. Before we get going, I'd love to have you share a little bit about your background. Yeah. So I started my career as a researcher after I finished my PhD. My research work turned into three different products at the company I worked at, and I became a software engineer. And then I joined Facebook, now Meta, to drive the AI initiative. It's kind of full circle in my career to do research, go build systems, and come back to the system-research co-design space. So that has been a very fun journey. And the visibility we got from driving the AI work at Meta really gave us confidence that this is going to be a huge industrial transformation. And that's our motivation to create Fireworks, to help all developers in the industry use AI in a very easy way. Awesome. So what is Fireworks and what led you to start the company? Fireworks is a GenAI SaaS platform where you do not need to set up anything, do not need to worry about how many GPUs you need to deploy to run your application, where the location is, how to manage reliability, cost, performance optimization. None of those. You just onboard to Fireworks and integrate with your product, and boom, it will optimize itself. It will help you save costs and scale quickly to millions of developers or billions of consumers. It will manage all the production operation complexity and let you focus on application product development. Let's maybe dig a little bit deeper into that. Talk a little bit about the way you expose the platform to developers. Yeah, so a little bit about our background. We have a pretty big founding team, seven of us, who have been working in the AI space and building AI infrastructure for both training and inference for a long time at Meta and Google. So we really see that developer needs are not uniform. They break into at least two phases, if not more. In the most simplistic way, the first bulk of developer work is experimentation. And that's where you want fast iteration, simple integration, lots of control, and to validate your hypothesis, what can work, what may not work, and fail fast. So the speed of iteration is the most important. Productivity is most important. Once you find product-market fit, maybe one out of 10 ideas survives for you to scale into production, then it's a completely different set of problems to tackle. That's where you need to worry about cost efficiency to have a viable business. You need to worry about latency to bring this interactive experience of a consumer-facing, developer-facing product live.
You need to worry about quality at the next level: hey, how does model quality really feed into the product needs? So all of that becomes a different set of concerns. So we have seen issues when you only focus on one versus the other. Meaning only the fast iteration or only the high-efficiency scaling. Right, right. So because if you only focus on large scale, then people who are at the early stage of experimentation will not be able to use that platform. You cannot force a later-stage product into an early stage in the funnel. And second, if we only focus on the early stage, then when people need to scale, they need to hunt and find another platform to leverage, and there's a lot of migration overhead for them. So building a seamless, easy-to-transition, multi-stage product for developers is absolutely necessary, and we validated that during the PyTorch days. A big part of the founding team had been building PyTorch for five years at Meta. And PyTorch is now the most popular AI framework in the industry, especially when it comes to GenAI. Almost all models are written in PyTorch and deployed in PyTorch into production. I was actually going to ask about PyTorch. So let's maybe take a second and go there. I feel like you undersold your background a bit. You were the head of PyTorch and kind of guided that team through a pretty interesting time in the evolution of that product. I would love to have you talk a little bit about lessons from that experience that are important to you as you, you know, first framed out the problem that you articulated for Fireworks, but then as you're building the company, like what did you learn that was most salient to your current experience? Yeah. So I think there are a lot of lessons learned, I would say. I can imagine. Because at that time when we started, there was no hardware, no software, no team focusing on AI first, right? Meaning at FAIR or? Yeah, at Meta in general. Obviously Meta was on the cutting edge, on the forefront of creating a technology space, innovating there ahead of the industry, so there was a lot to figure out. And during that process we tried many things, some ideas didn't pan out, and we learned a lot, and that kind of led to where we are today. I think one of the big lessons is we thought these two loops, as I just talked about, the fast iteration experimentation loop and the production loop, could be completely decoupled. We should use a completely different system, optimize for different things, and build a bridge across. Okay. By building a bridge, that means we are going to do conversion. We're going to convert the research outcome from doing experimentation, or a lot of model tuning, training results, convert that into production, a different set of systems optimized for production. And then you get the better of both worlds, right? The idea sounds plausible. However, this conversion takes forever because these two systems are not designed to be compatible with each other. A lot of compatibility is numeric compatibility. You want to preserve the quality, the high-quality results from experimentation, from customization, from training, whether it's pre-training or post-training, into production. Any conversion is going to cause loss of precision. But at the same time, it's additional work to do this conversion. And that can actually slow things down significantly. The second is, if you look at the product lifecycle, model training doesn't, the quality work doesn't finish at offline model training.
As in, you have a bunch of eval data sets and you measure the offline metrics and the metrics look good, you're happy? No. So actually, you have to continue to deploy the model into inference and point the product to the model to start to do A/B testing, because the purpose of those models is to serve product. The model by itself doesn't have any meaning. If model training doesn't move the needle for product, it's not meaningful. It's not useful. Product A/B testing is the ultimate judge of whether your model investment is paying off or not, is successful or not. So actually, the fast iteration experimentation loop I just mentioned has both training and inference combined together to declare victory. And then you know, hey, do I have product-market fit or not? And then I know, we know, whether we need to scale into production or not. And then the second loop starts. So even the first loop requires training, inference, fast iteration. And if you break it into, oh, training uses one system, inference uses another system, then you get stuck, because that velocity, you lost it right there. So that's where having a cohesive system that does training-inference alignment and crosses into deployment quickly is a necessity for experimentation. That's a big lesson learned. And the second is the inference system for experimentation needs to be the same as the inference system for large-scale production. Because once you find product-market fit, the immediate thing the product team is going to do is, oh, I routed 1% of my traffic for A/B testing. It looks good. I want to ramp it up quickly to 10%, 20%, 50%, 100%. No time. I cannot wait. So that's why the whole end-to-end transition needs to be extremely smooth. And even if you could wait, the product team, once you did whatever conversion, the product team would still want to start at 10% and ramp up slowly because that's part of their testing process. Yes. And if this conversion takes a month, they're like, oh, my God, we are just doing nothing but waiting for this to happen. And this is not working for me. So just so you know, the product team is heavily depending on the fast iteration with their users to understand whether the new product design is landing well with the customer or not, because there are a lot of hypotheses. Which part of a feature becomes popular, drives user engagement, drives time spent, drives a lot of metrics? It's not validated until they see the A/B testing results. So that's the nature of fast iteration. So there's a fast iteration with the end user also that needs to happen. That cannot be slowed down. So that's where this whole GenAI SaaS platform design cannot just be one-sided, focused on any one phase of the development lifecycle, of the production lifecycle; it needs to be end-to-end, a very smooth transition. So I will say that's the biggest lesson learned. And that's why at Fireworks, we always think about ourselves as building towards the end-to-end developer platform. And we start with inference because inference is the foundation of this end-to-end. Inference is the foundation because if you think about post-training, we build on top of open models. One thing I want to clarify is, at Fireworks, we do not do pre-training. We do not build foundation models, because today the open model quality is really good, especially this month, which has been a gift shower of high-quality models. We can talk a lot more. I'm very passionate about that.
So in the open model world, a lot of people are doing tuning and customization. And part of the popular tuning technology is called RFT, reinforcement fine-tuning. And if you take a deeper look at reinforcement fine-tuning, it is training, and it is inference for rollouts. And actually, even for tuning, it has a lot of inference dependency. And obviously after tuning, you need to deploy to small-scale A/B testing, that's inference, and then scale to full production traffic, that's also inference. So that's why inference is the foundation of this end-to-end. We started with inference at the beginning. And then we added supervised fine-tuning. We added reinforcement fine-tuning, all different kinds of tuning and customization technology built on top of inference. Interesting. When I think about inference as a starting point and then your evolution to more of an end-to-end platform and the needs of post-training, I think of, like, the very different abstractions that you might offer to do that. Like, you know, for inference, I might just want to plug into LangChain or do, like, an OpenAI-compatible thing. But then for fine-tuning and training, you know, I might want to take a different approach or have some, you know, very tailored user experience, since it's much newer and, you know, there's not a dominant kind of approach to that user interface, or developer interface, I mean. Can you talk a little bit about the abstractions that you offer developers and, like, you know, how you've managed to offer a best-in-class experience but also, you know, help them avoid lock-in or remain compatible with other directions they might want to explore? Yeah, absolutely. So the inference abstraction and the post-training abstraction are completely different. So I will start with inference. Again, think about the people using inference for pre-product-market usage. They just want to validate whether GenAI is the right tool to solve their product problem or not. So that's why we chose the OpenAI-compatible API. Obviously, everyone chooses OpenAI compatibility. It's just kind of easier to speak the same language. But then when they graduate from the product-market-fit validation phase to scale, then there is a lot of customization for inference itself to be high speed, cost efficient, and also the quality needs to be really good. So we call it a three-dimensional optimization. We built a 3D optimizer to drive the customization for our developers without them having to understand deeply how to customize their inference deployment to satisfy their needs. So that is a layer of abstraction we created. Very similar to the concept of a query optimizer for databases. SQL databases. Yeah, for SQL databases. Basically, the data engineers or analysts just need to write SQL, which is an abstract way of describing semantically what the query is, and the database engine has a rule-based or heuristic-based optimization path to convert that into a very efficient execution, right? The idea is very similar, but database query optimizers do one-dimensional optimization; they only optimize for latency. So we drive three-dimensional optimization for inference across quality, latency, and cost, all at the same time. When we just started, the bad news is it's an extremely complicated optimization space. The space is big because the techniques we need to tweak, pick and choose from, the options we need to pick and choose from, have a lot of components. Each component has its own subset of options. And they stack with each other, leading to more than 100,000 different combinations.
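As an aside on the OpenAI-compatible interface mentioned above, here is a minimal sketch of what the experimentation-phase integration could look like from the developer's side, using the standard `openai` Python client pointed at an OpenAI-compatible endpoint. The base URL and model identifier below are assumptions for illustration, not values confirmed in the episode.

```python
# Minimal sketch: calling an OpenAI-compatible inference endpoint.
# The base_url and model name below are illustrative assumptions,
# not confirmed values from the episode.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # hypothetical model id
    messages=[{"role": "user", "content": "Summarize why training/inference alignment matters."}],
    temperature=0.2,
)
print(response.choices[0].message.content)
```

Because the interface mirrors the OpenAI API, swapping providers or models during experimentation is mostly a matter of changing the base URL and model name, which is the "speak the same language" point made above.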
And then it becomes a search problem. Optimization is always a search problem. It's a search problem with a very large search space. And then how do you find, how do you navigate that search space to find this one option efficiently? That is the problem we solved. Can you give some concrete examples of points in that search space or, you know, the different, I'm assuming we're talking primarily here about different backend configurations. But maybe the first thing to do is to have you elaborate on, you mentioned these three dimensions, but those three dimensions are kind of a front end for a lot of dimensions of change on the backend. Like, what are all of those dimensions of change? So there are many dimensions. For example, for inference, we do fully disaggregated inference. That means we chop the model execution by the system bottleneck, right? Some parts of model execution are fully bottlenecked by compute. Some are fully bottlenecked by memory access. Some are fully bottlenecked by networking. And we disaggregate them and scale them independently, right? And then how to scale them really depends on the workload pattern and the performance characteristics that the product team is looking at for the product requirement. So that is one way to look at it. On top of that, there are various different numerics you can apply in terms of precision. And when you lower precision, this technique is called quantization. You can quantize so many different things. And there are a lot of options in that space. Not only can you pick different precisions, you can also quantize weights, quantize activations, quantize communication. You can quantize many things. But at the same time, when you quantize, by default, your quality goes down. Sometimes if you quantize naively, your quality goes down by a lot. So a lot of product use cases cannot make that tradeoff. They cannot trade off quality for speed, quality for cost. They want to preserve the quality, not degrade it, and then improve the other two. So then we have a way to preserve the quality while doing this optimization. We also have different kernels implemented for optimizing for different context lengths, long context, shorter context, and for generation speed. There are different options we can pick and choose from there. Obviously, we also have optimization for different hardware backends because different hardware has its unique strengths we can leverage. And yeah, so that's where the search space is very big, and you cannot manually try all these combinations. So it needs to be done in a systematic way. It also sounds like there are aspects of what you're describing that don't seem to lend themselves to doing that kind of optimization on the fly, meaning it's more like design-time optimization, or maybe better, deploy-time optimization. Like, I have specified a model. I've told you how I'm going to use it. Now I press the deploy button and then you figure out the best way to deploy it, as opposed to, I'm sending in an inference request and it could happen in one of N ways based on, you know, the content of that request. Is it more the former than the latter? It's a combination. Some of the optimization we can do during runtime. Some of the optimization we're going to push to before we deploy. Yeah. So it's like how compilers work. There's JIT. There's AOT. So, yeah. So we're doing a combination of both. Okay. Okay.
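To make the "3D optimization as search" idea above concrete, here is a toy sketch (not Fireworks' actual optimizer) that enumerates a few of the configuration dimensions mentioned in the conversation, quantization choices, kernel variants, and disaggregated prefill/decode scaling, and picks the cheapest candidate that satisfies quality and latency constraints. All candidate values, the cost model, and the thresholds are invented for illustration.

```python
# Toy illustration of deployment-config search over the dimensions discussed
# (quantization, kernel choice, disaggregated scaling). The candidate values,
# cost model, and thresholds are invented for illustration only.
from itertools import product

weight_precisions = ["bf16", "fp8", "int4"]
kv_cache_precisions = ["bf16", "fp8"]
kernels = ["long_context", "short_context"]
prefill_replicas = [1, 2, 4]      # disaggregated prefill scaling
decode_replicas = [1, 2, 4]       # disaggregated decode scaling

def estimate(config):
    """Return (quality_drop, latency_ms, cost) for a config -- a stand-in
    for real benchmarking and quality evaluation, not a real model."""
    wp, kv, kernel, pre, dec = config
    quality_drop = {"bf16": 0.0, "fp8": 0.3, "int4": 1.2}[wp] + {"bf16": 0.0, "fp8": 0.2}[kv]
    latency = 120 / dec + 80 / pre + (10 if kernel == "long_context" else 0)
    cost = (pre + dec) * {"bf16": 1.0, "fp8": 0.7, "int4": 0.5}[wp]
    return quality_drop, latency, cost

candidates = []
for config in product(weight_precisions, kv_cache_precisions, kernels,
                      prefill_replicas, decode_replicas):
    q, lat, cost = estimate(config)
    if q <= 0.5 and lat <= 100:          # preserve quality, meet a latency target
        candidates.append((cost, lat, q, config))

best = min(candidates)                    # cheapest config that satisfies constraints
print("chosen config:", best[3], "estimated cost:", best[0])
```

A real system would replace the exhaustive loop and the toy `estimate` function with measured benchmarks and smarter search, but the structure, a large combinatorial space filtered by quality/latency/cost constraints, matches the description in the episode.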
And then it sounds like most of what you've talked about is infrastructure-tuning types of optimizations, both architecture and tuning types of optimizations. And they are independent of, not independent of the model, but they kind of assume a static model. And now I'm thinking of this in contrast to, like, at Google I/O, Google announced a Vertex model optimizer, and the idea is to provide a front end to all of the model choice that a user has. And so instead of specifying a specific model, they say, I want to prioritize speed here, or I want to prioritize quality here. The dimensions of those two offerings are overlapping, but the approaches sound very different. Are you also thinking about that kind of model abstraction? That's a very interesting topic, because oftentimes we see a desire from many developers to be model agnostic when they build a product. They don't want to get, you know, just bundled with a model, because this space moves so fast. Almost every week there's a new model being announced, whether it's closed or open, some model breaking some record somewhere. And there is a huge amount of desire to be model agnostic. Within a given product, there's, you know, a given product might have a dozen or more tasks, and each of those tasks may be best served by a different model. Absolutely, yes. And the interesting observation is it also depends on the product lifecycle itself. In the early stage of product development, when people are not sure which parts of the product feature can be hardened yet, they typically would pick a powerful model to begin with, just to ensure the best quality. And over time, they get a lot of user feedback. There's, again, A/B testing, there are product metrics to monitor each part of the feature, what people's reactions look like, and they continue to polish it and tune it, to the point where it starts to emerge, oh, those features have hardened. And typically, those features have a narrower problem definition. And that's where they start to specialize and customize. So we see a clear transition, and a continuous transition. It's a continuous process. It's never done in one shot: from, hey, I have a new product idea, I want to try it on the most powerful model, to, oh, now I know exactly what I want to build, and I'm going to narrow down and use a specially customized model. And then the customization process can be, I pick another smaller specialized model off the shelf, or, now I have bootstrapped my product, I have a lot of data from production to tune a model specific to my application. And there's a lot of power to unleash from there, because no one has that data. Not the frontier labs, nobody. And with that data, you can overfit your model to your application. That's what you want to do. Actually, in this case, overfitting is good. And therefore, the model is designed for your product, makes your product better, and your better product drives more user engagement. You collect more data, make the model better, better product, and so on. So there's a virtuous cycle we start to build. So we see a lot of success for early pilots being able to master that virtuous cycle, and they emerge as leaders in the current GenAI app space. So what we are convinced to do, and to offer value to this community, is to make those tools accessible to everyone. So it should be easy. It should be standardized. But when it comes to standardization, it's a messy, it's a very messy place.
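One simple way to keep the "model agnostic, different model per task" idea discussed above out of application code is to route each product task through a configurable model identifier. The sketch below is a hypothetical illustration of that pattern; the task names and model ids are invented, not platform values.

```python
# Toy sketch: keep the product model-agnostic by routing each task to a
# configurable model id, so swapping models later doesn't touch app code.
# Task names and model ids are invented for illustration.
TASK_MODELS = {
    "summarize_ticket": "accounts/fireworks/models/llama-v3p1-70b-instruct",   # hypothetical
    "classify_intent":  "accounts/example/models/my-finetuned-classifier",     # later: customized model
    "draft_reply":      "accounts/fireworks/models/qwen2p5-72b-instruct",      # hypothetical
}

def model_for(task: str) -> str:
    # Early in the product lifecycle every task might map to one powerful model;
    # as features harden, individual entries get swapped for specialized models.
    return TASK_MODELS[task]

print(model_for("classify_intent"))
```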
Let's talk about one technique that's getting very popular, and that we also offer on our platform, called reinforcement fine-tuning, RFT. So the most critical part of RFT is to write an evaluator that takes a model output and gives a score, right? That score is called a reward. You give a very big positive score; that means the model is doing a great job and you want to nudge it in that direction. Or a big negative score means the model is really bad, the result is BS, and you want to strongly discourage it from going in that direction. A lot of rewards you can write in code, and they can be executed in isolation. But a lot of times the reward cannot be written in code. You have to call into an application-specific internal API to extract knowledge and information that is deeply coupled with the product. Okay. Because a lot of the time, that kind of score generation or reward generation depends on internal state. What are some examples of those types of rewards? Or evaluation functions? Yeah. For example, if it's a math problem, right, or if it's a coding problem, it's very verifiable. You know absolutely, here is the ground truth, right? So for highly verifiable rewards, sometimes you can just code it into your reward function itself, or you call into a, you know, ground-truth API to get the result back. So that's one way to do it. But sometimes the reward is subjective, right? The reward may need to come back from the product. It's a multi-turn chat and people engage with the product, thumbs up, thumbs down. There's a way to read that, hey, you know. So then it becomes a reward-environment integration problem, which is messy. So there are so many different places you can extract signals from and do integration. Whenever it becomes an integration problem, then it's a lot of work. It's just a lot of work. So I think this is a big space calling for standardization, and we are going to put in our effort to drive towards that direction, because we believe this is a real void and a much-needed effort to be put together by the entire community, not just us. So we'll launch something towards that direction very soon. And it sounds like that might be an open-source type of tool? Absolutely, yes. Okay. Absolutely. I mean, the only way to run this is through open sourcing. We're very familiar with open source given our background. And for anything that benefits the whole, much bigger community, open source is the best tool. You know, we started off talking about kind of this convergence you observed from the PyTorch days between inference and production and the desire for teams to do online testing, A/B testing of a model once it's put into production. And what it sounds like you're describing here from a reinforcement fine-tuning perspective, for these very application-specific problems, so again, not coding, not math, is almost bringing the fine-tuning into your online application operations and, you know, doing online training of the model. Like, how close are we to that type of an environment where, you know, the model is constantly being trained, it's constantly looking at signals that users are giving, thumbs up and down, and using that to refine itself? Yeah. So I'm very bullish on that direction. Right. I think we should have that. The biggest challenge today is, in order to have that, people need to write down what the expectation is, for the model to grow into that expectation. Okay.
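The RFT evaluator described above is a function that takes a model output and returns a score (the reward). As a sketch of the verifiable case mentioned, a math problem with a known ground truth, here is a hypothetical reward function. The signature and scoring scale are assumptions for illustration, not a specific platform's evaluator API.

```python
# Hedged sketch of a verifiable reward function for reinforcement fine-tuning.
# The signature and score scale are illustrative assumptions, not a specific
# platform's evaluator API.
import re

def math_reward(prompt: str, model_output: str, ground_truth: str) -> float:
    """Score a model answer against a known ground truth.

    Returns a large positive reward for a correct final answer,
    a large negative reward for a wrong one, and a mild penalty
    if no parsable answer is found.
    """
    # Take the last number in the output as the model's final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return -0.2          # no answer at all: mildly discourage
    answer = numbers[-1]
    if abs(float(answer) - float(ground_truth)) < 1e-6:
        return 1.0           # correct: strongly encourage this direction
    return -1.0              # wrong: strongly discourage

# Example usage
print(math_reward("What is 12*7?", "Let's compute: 12*7 = 84. The answer is 84.", "84"))  # 1.0
```

The subjective case described in the episode, rewards derived from thumbs up/down or other product signals, would replace the string check with a call into application-internal APIs, which is exactly the messy integration problem Lin points to.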
It's like, if we want to test our code, we need to first write unit tests, right? And a unit test is basically, I expect this piece of code to behave this way, right? And I'm going to test it, and then I know, oh, whether the code is doing a good job or not, right? So that's kind of similar to the reward. I hear you starting to describe this scenario where what you're saying is, like, we've just kind of kicked the reward can down the road. Like before, it was like this micro reward. And then once we integrate this all in, you know, we still have to describe this reward function, but it's more of, like, this long-term, over-time reward function that's maybe just as hard or harder to describe than the micro reward? Is that kind of the direction you're going? I don't think we can kick the can down the road, because without a good evaluation mechanism defined, many things are harder. All the things we talk about are going to be harder. For example, earlier we talked about, hey, people want to be model agnostic. But without evaluation criteria defined, when they switch models, they don't have a way to measure. Should they switch or not? Is anything going to regress or not? Is it going to improve? I don't want to be surprised. So that is evaluation. And evaluation is also needed for RFT, because you need to have a way to tell the model, in comparison to expectation, whether it is doing a good job or not. So that's where, before we do anything, I think one of the common practices that is spreading through the industry is to have developers write evals. So that's a starting point. Without that, a lot of the things they do are like shooting in the dark. A lot of the things they do, they will start vibe checking, but vibe checking is not reliable. It's not consistent. It's not repeatable. It's not systematic. So I think that before anything, I'm very bullish about that future we talked about, closing the loop automatically. But the first thing is to enable everyone to write clear evaluation criteria. So I think that's a transition the industry is moving towards. Many app developers are very familiar with the product analytics stack. Hey, I can start to trace this and build data pipelines to generate product metrics. And I know how to do A/B testing. And then I know how to look at the results and make decisions and do a lot of product-level experimentation. But how do I write evaluations for this model? That is a new skill I need to learn. I haven't worked on that before. But we see people picking up that best practice faster than ever. So this is kind of an exciting moment, but at the same time, it's a starting point for doing this automatic closed-loop thing. So earlier, you weren't necessarily saying that the metrics or reward function for this closed-loop thing was going to be necessarily harder. What you were starting to say was that, you know, we need to be evaluating in order to even hope to get there. And then once we, you know, have as a community, as a practice, like, a discipline of evaluation for models, then, you know, starting to insert this kind of RFT or whatever the technology is into kind of a live optimization loop, then it becomes, you know, not just easier, but possible in the first place. Yep. Yeah. Interesting. Interesting.
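The "write evals instead of vibe checking" point above can be made concrete with a minimal, hypothetical eval-harness sketch: run a fixed eval set through a model-calling function and report a pass rate against per-example checks. The helper names and dataset format are assumptions for illustration, not any particular framework's API.

```python
# Minimal eval-harness sketch: a fixed dataset, a per-example check, and a
# pass-rate metric -- the kind of repeatable criterion the episode contrasts
# with "vibe checking". Names and the call_model helper are assumptions.
from typing import Callable, Iterable

def run_eval(
    call_model: Callable[[str], str],
    dataset: Iterable[dict],        # each item: {"prompt": ..., "check": callable}
) -> float:
    """Return the fraction of examples whose output passes its check."""
    results = []
    for example in dataset:
        output = call_model(example["prompt"])
        results.append(bool(example["check"](output)))
    return sum(results) / max(len(results), 1)

# Example usage with a stub model and two toy checks
dataset = [
    {"prompt": "Return the word OK.", "check": lambda out: "OK" in out},
    {"prompt": "What is 2+2?",        "check": lambda out: "4" in out},
]
stub_model = lambda prompt: "OK, the answer is 4."
print(f"pass rate: {run_eval(stub_model, dataset):.0%}")
```

Because the dataset and checks are fixed, the same harness can be rerun against any candidate model, which is what makes model switching measurable rather than a surprise.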
We started to talk about kind of interfaces and abstractions, and you kind of dug into inference and talked about it being OpenAI compatible. On the post-training side of things, it's a lot more open there in terms of the field. I don't know that there's quite an accepted standard. You know, when you think about kind of designing that experience, what are you emphasizing? What's important from a developer or user perspective? And how have you seen, you know, what have been the required elements for folks that are having the most success with tuning? We are aiming at a broad set of developers, but in the adoption curve of a new technology, it almost always starts with power users. People who are self-learners. They can go really deep and pick it up and adopt it. So for the power users, the pattern is they want to control everything. They want to tweak a lot of knobs without being blocked out of optionality. And if you give them a higher level of abstraction, they will be so frustrated and they will not like it. Instead, they want to go as low as possible. So the system design is just kind of building blocks, exposing knobs for them to tune. But then when you want to reach a broader audience, lots of people don't want to tune low-level knobs. They want to focus on developing product and just have simple integration, and things should just work. And by things should just work, it means out of the box, it has a pre-configured setup that can bring you to 80%. It may not be 100% or, like, 90%, but 80% to begin with is good enough. So that's a high-level, simpler abstraction for a broader audience. We have tested this hypothesis many times, and this setup is the best way to cover both ends. And the good news is the higher-level abstraction can be built on top of the lower-level abstraction. It's kind of a staggered system, and in that sense, it's not wasted effort to build the lower level. Yeah, so we believe in this design, and we also think what's happening in the training space is also following that pattern. What are some of the knobs and levers that folks operating at the low level are wanting to configure? So for example, in reinforcement fine-tuning, there are many parameters, just the tuning parameters alone. But also, for the initial setup, for example, we have a web app where you can write your evaluators, the code of the reward function, and then you can test with your initial small data set. That's kind of easy. That's simple. And then once you are confident with the direction of the evaluator or reward function, you start to upload a real data set. And then you start to kick off a training job. And the training job has a lot of parameters that you can set, but we have the default ones. You don't have to worry; just kick off a training job and go with our default ones. And then we also have default metrics you can look into, but you can add your own metrics. Oftentimes, like, next-level researchers want to add a lot more detail, because the metrics can be overwhelming. If they're not actionable, if you look at those charts and you don't know, like, what do I do, it's going up or going down, even if I understand it but I don't know what action to take to change it, then it's overwhelming, it's useless. So that's what I mean by the first mile for a broader audience. And, you know, we designed it that way, but people can click open the parameters, as sketched below.
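The episode describes a tuning workflow where a training job ships with sensible defaults but power users can open up and override the low-level knobs. Here is a hypothetical sketch of that layered-defaults idea expressed as a config object; the parameter names and default values are invented, not Fireworks settings.

```python
# Sketch of the "defaults you can override" pattern described for tuning jobs.
# Parameter names and default values are illustrative assumptions only.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TuningJobConfig:
    base_model: str
    dataset_path: str
    # Sensible defaults most users never touch:
    learning_rate: float = 1e-5
    epochs: int = 3
    lora_rank: int = 16
    rollout_batch_size: int = 64   # relevant for RFT-style rollouts
    kl_penalty: float = 0.1

# Simple path: rely on the defaults entirely.
job = TuningJobConfig(base_model="my-open-base-model", dataset_path="data/train.jsonl")

# Power-user path: open up the knobs and override a few of them.
tuned_job = replace(job, learning_rate=5e-6, lora_rank=64, kl_penalty=0.05)
print(tuned_job)
```

The design point is the one made in the conversation: the simple path and the expert path share the same underlying building blocks, so effort spent on the low-level layer is not wasted.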
They can set their own thing if they want. So that's kind of, hopefully that will bring a wide range of experience to a broad audience that way. I'm imagining that ultimately, I guess this is going back to that kind of closed-loop discussion we had, ultimately, I want to, you know, just kind of hook the system up to my eval parameters, my, you know, data flows and say, you know, go optimize this, do a new training run every week or, you know, whatever, every day, every hour. And just, you know, keep going until the performance is better. Like, do you have a sense, you know, so as I say that, I'm wondering, like, is that interface and user experience, like, is that even well defined? Is it, you know, once everyone has their evals in place and gets comfortable with fine-tuning, you know, are most of those problems solved? Or, like, what is left to solve for us to get to this, you know, self-tuning type of ideal that we've discussed? Yeah, for the self-tuning, I think automating that to 70%, 80% will happen pretty quickly. But optimizing that to 99% will take a much longer time. So I think that makes sense. It's a natural progression. But we believe in this direction because, if we take a big step back, in this new wave of GenAI, the product development philosophy is also changing. Many of the people we see who have been building a great product with a huge amount of adoption over a small amount of time do not think of the model as a commodity or just a utility you plug in as is. They think about the model as a critical part, a critical asset, of their product. And they are actively bridging a very interesting gap. The interesting gap is that the data they collect from the application cannot be seen by or made accessible to the frontier labs developing models. And you can think of two parallel threads of data distribution that are happening. One is the frontier labs, whether they're closed or open source model providers. They have a dedicated research and data team to curate data based on some assumptions of what applications are going to use this model for. So they curate data based on some data distribution, and that's what the end result, the model quality, is going to focus on. In a different company, the application product developers are building new user experiences, and that produces another data distribution. By design, these two are completely different from each other. By design, they diverge from each other. Because, hey, they solve different problems and they're different entities; it's different. So the mismatch between these two data distributions is going to result in worse quality, worse latency, worse cost, right? You leave a lot on the table. And we firmly believe there's a huge void in that space to standardize, to bring more tools into the platform to close the loop. Because we believe a large portion of data actually is not on the internet. A large portion of data lives with the application itself. And if you do not leverage that, then the data is the moat, the data flywheel is the moat, and you just leave that on the table. And whoever can leverage that quickly, in the best way, is going to emerge as a winner faster. So that's the part we want to standardize and productionize soon. We said that we would come back to this early on in the conversation. And this seems like as good a place as any, but also relevant to your last comment and thinking about the model as a commodity versus an asset.
And that is open models versus closed models and all of the activity that's been happening in the model domain. You know, talk a little bit about, you know, first, commodity versus asset and open versus closed. Like, there's not necessarily a one-to-one relationship between these. There's a complex relationship between these, potentially. How do you think about, you know, if your model is going to be an asset, do you want it to be open or do you want it to be closed, or based on an open model or a closed model? Yeah. You know, if we take a big step back, the model quality is essentially data, right? It's data quality. That's pretty much clear, because I don't believe any lab has a perpetual secret sauce in terms of, hey, for the same amount of data, they can just make the model better in a meaningful way and no one else knows how to do it. As you can see, people move around a lot, and they bring their knowledge with them, and they spread the knowledge across the industry. So then essentially it's about data. And the public data on the internet has been exhausted. And then there are many data labeling companies that are pretty much providing services to all the labs. And then in that sense, there's not much differentiation. Until they get acquired by Meta and then they lose their contracts with the other labs. There are 10 different data labeling companies, right? So no one has the unique advantage, right? It's because it's such a high-potential space and a lot of investment comes in. And that means a lot of companies are going to work in a similar space to bring supporting services to the broader audience. And then if you think about that, if all the labs essentially have access to the same amount of data, then the only difference is, in the app space, they cannot have access to those application-specific data. The app space of data is going to be verticalized because that is an asset. That is a unique, competitive edge, and no one will share it outside. Then the next-level question is, how do you leverage that and turn that into your own model advantage, right? In that sense, an open model is much easier to tune, because there are a lot of tools built around open models in the community, which moves really fast. There are many stories. And the reason I believe in the open source community is there are many stories, even for the originator of some open model or open source project, where at some point they flip to closing it up. They build open software, then they close it up for internal use. And their case shows again and again that the open source project moves much faster than the closed source project. I've seen that at Meta. So Meta is a really open source friendly company, especially at the infrastructure level. But there were cases where a project was open source, and there was too much maintenance to keep the community up, and they decided, hey, internal demand is huge for that project, let's just close it and focus on internal needs. And it turns out the open source side moved much faster, because the sheer amount of resources is much bigger no matter what. So that's what I believe, and as a company we're betting in that direction, betting on customization based on open models, because the entire community is working around that. And we want to provide easy access through the SaaS platform for people to deploy without considering or solving any of the infrastructure problems.
So that's our deepest value proposition, and now we're not just an inference provider, but also deeply in the customization and tuning side. What are you most excited about in terms of, you know, recent models? Like, you know, on the one hand, you earlier expressed excitement about all this cool stuff coming out. But then you also articulate a bit of, you know, the underlying model is still a commodity. It's when you, you know, customize it with your own proprietary data, then it becomes this asset. You know, kind of rationalize those two positions. Am I correct that you kind of feel both ways about models? I feel excited about open models because we know that there's a wall on model capability. Or there's a wall on data, on the public, you know, lab-accessible data, right? There's a wall, right? Yeah. And that just means, you know, the models more or less are going to converge on some dimensions. And that just means the open models are going to be catching up with closed models closer and closer, which is happening today. And all the model providers, when they announce their model, they will show benchmarks against the best, right? It doesn't matter if it's a closed model or an open model. They all compare themselves against the best across the industry. And they will say, hey, we beat the best on those metrics. That makes me excited, because now they can see it. Before, the open models only compared with open models, and compared with each other. Now, they declare victory against the closed models. And it happens more and more often these days. So that's a sign that the models are all converging. Under the assumption that models are all converging, then customization with open models becomes much more interesting, because you do not need to customize to catch up with the closed models. You can customize to beat the closed models. So that's a very interesting shift in the dynamics happening across the industry right now. I believe that direction never changes. And we have been betting on the open model world converging with the closed model world very, very closely. And which recently announced models are you most excited about? I'm assuming Kimi K2 and Qwen, the new Qwen Coder, and those were what you had in mind when you lit up earlier. Yeah. Yeah, we are super busy, super swamped these days. But yeah, I think summer will be busy. There will be more open models coming up from various different labs. And is the idea, in putting this in terms of your busyness, that as each of these new models comes out and demonstrates its capability on benchmarks, then you have to do some work to support them on your platform and make them available to your users? Yeah, we always aim at supporting those models from day zero. Whenever they launch, it's available on our platform so people can evaluate those models quickly. And then we will roll out the surrounding functionality. For example, a lot of those models are really good at tool calls. And we need to enable tool calls to make them work well with our service APIs, with on-demand and some kind of reserved throughput setup and all of this. And the context window length has become bigger, and we need to enable that properly. And obviously, a longer context window just means more pressure on processing those tokens, generating those tokens, and so on. So, yeah, a lot of next-level optimization. Some models come with new model components that, you know, we need to do next-level optimization for.
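Tool calling is mentioned above as one of the add-on capabilities that has to be enabled per model. Since the serving interface discussed earlier is OpenAI compatible, exercising it could look roughly like the sketch below; the endpoint, model id, and tool definition are hypothetical assumptions, not values from the episode.

```python
# Hedged sketch: exercising tool calling through an OpenAI-compatible
# chat completions API, the kind of add-on capability the episode says
# must be enabled per model. Endpoint, model id, and tool are assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
                api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                       # hypothetical tool
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="accounts/fireworks/models/kimi-k2-instruct",  # hypothetical model id
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)
# If the model chose to call the tool, inspect the structured call it produced.
print(resp.choices[0].message.tool_calls)
```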
And yeah, so I think many labs give us great surprises, as they are very innovative. When they train models, they put inference into consideration, how to accelerate inference during training time, the different model components that run much more efficiently, more economically. Yeah, so I think this community is very active right now, and I'm really looking forward to more state-of-the-art model launches from the open model world. Can you talk briefly about, when you're presented with a new model, what the engineering workflow is like to support that new model? So our inference engine has a lot of building blocks. So first of all, we need to pick the right building blocks to assemble for a particular model. And then, as I mentioned, this 3D optimization has many options. And then we need to go through this search space and find an option we like. And we run a lot of quality testing to make sure the quality is correct. When we assemble those pieces, there should not be numeric errors that lead to accuracy issues, so it's more of a focus on quality correctness at the beginning. And then there are other functionalities: there is constrained generation correctness, there are many add-on capabilities on top of the raw inference, there are tool calls, there is the LoRA adapter, there are many other things we also need to make sure work well with that particular model. And then we launch it, and then we start to do continuous optimization. So that's the typical workflow. Order of magnitude, how long does it typically take to support a new model? A couple of hours. Like, yeah. And, you know, besides the things that we've talked about, the new models, what else are you excited about looking forward for GenAI? I do think making customization accessible to all developers is the most exciting trend. And we are fully committed to making that happen. But I think it's not just us. The bigger community is actively thinking about that. There are many players covering different parts of this journey for customization, actively building tools. I do think that's where the next phase of heavy development and adoption is going to happen. People will not just be satisfied with using off-the-shelf APIs. They want to actually fully leverage the production data they curated in a closed loop. And this is a different loop, but the mentality or mindset is similar to closing the product analytics loop as before. And this is a kind of new loop that they're going to close. And therefore they use their own data to have their own model and to build their own leading position in a heavily competitive space of application development. So yeah, we would love to see those app developers continue to be very successful in leveraging their own data. Awesome. Well, then thanks so much for taking the time to share a bit about what you've been working on and how you've been approaching inference and tuning for GenAI. Very cool stuff. Great conversation. Thank you so much. Thank you. Thank you.
Related Episodes

Rethinking Pre-Training for Agentic AI with Aakanksha Chowdhery - #759
TWIML AI Podcast
52m

Why Vision Language Models Ignore What They See with Munawar Hayat - #758
TWIML AI Podcast
57m

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757
TWIML AI Podcast
48m

Proactive Agents for the Web with Devi Parikh - #756
TWIML AI Podcast
56m
The CEO Behind the Fastest-Growing AI Inference Company | Tuhin Srivastava
Gradient Dissent
59m

AI Orchestration for Smart Cities and the Enterprise with Robin Braun and Luke Norris - #755
TWIML AI Podcast
54m