Latent Space

Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)

Latent Space • swyx + Alessio

Thursday, October 16, 2025

What You'll Learn

  • OpenPipe was founded to distill expensive large language models like GPT-4 into cheaper, more accessible versions for production use.
  • The company gained initial traction quickly by providing a managed flow to capture and fine-tune models, but faced challenges as model prices dropped over time.
  • LoRAs (low-rank adaptation) emerged as an attractive technology for fine-tuning, allowing for more efficient deployment and per-token pricing, but fell out of favor as fine-tuning itself became less popular.
  • OpenPipe was ultimately acquired by CoreWeave, providing an exit for the founders after about two years of operation.

AI Summary

The episode discusses Kyle Corbitt's journey from leading Startup School at Y Combinator to co-founding OpenPipe, a startup focused on distilling expensive large language models like GPT-4 into cheaper, more accessible versions. It covers the initial traction and challenges OpenPipe faced as model prices dropped, the rise and fall of technologies like LoRAs, and the acquisition of OpenPipe by CoreWeave.

Key Points

  1. OpenPipe was founded to distill expensive large language models like GPT-4 into cheaper, more accessible versions for production use.
  2. The company gained initial traction quickly by providing a managed flow to capture and fine-tune models, but faced challenges as model prices dropped over time.
  3. LoRAs (low-rank adaptation) emerged as an attractive technology for fine-tuning, allowing for more efficient deployment and per-token pricing, but fell out of favor as fine-tuning itself became less popular.
  4. OpenPipe was ultimately acquired by CoreWeave, providing an exit for the founders after about two years of operation.

Topics Discussed

#Large language models  #Fine-tuning  #Model distillation  #LoRAs  #Startup acquisitions

Frequently Asked Questions

What is "Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)" about?

The episode discusses Kyle Corbitt's journey from leading Startup School at Y Combinator to co-founding OpenPipe, a startup focused on distilling expensive large language models like GPT-4 into cheaper, more accessible versions. It covers the initial traction and challenges OpenPipe faced as model prices dropped, the rise and fall of technologies like LoRAs, and the acquisition of OpenPipe by CoreWeave.

What topics are discussed in this episode?

This episode covers the following topics: large language models, fine-tuning, model distillation, LoRAs, and startup acquisitions.

What is key insight #1 from this episode?

OpenPipe was founded to distill expensive large language models like GPT-4 into cheaper, more accessible versions for production use.

What is key insight #2 from this episode?

The company gained initial traction quickly by providing a managed flow to capture and fine-tune models, but faced challenges as model prices dropped over time.

What is key insight #3 from this episode?

  • LoRAs (low-rank adaptation) emerged as an attractive technology for fine-tuning, allowing for more efficient deployment and per-token pricing, but fell out of favor as fine-tuning itself became less popular.

What is key insight #4 from this episode?

OpenPipe was ultimately acquired by CoreWeave, providing an exit for the founders after about two years of operation.

Who should listen to this episode?

This episode is recommended for anyone interested in large language models, fine-tuning, and model distillation, as well as anyone who wants to stay up to date on the latest developments in AI and technology.

Episode Description

In this deep dive with Kyle Corbitt, co-founder and CEO of OpenPipe (recently acquired by CoreWeave), we explore the evolution of fine-tuning in the age of AI agents and the critical shift from supervised fine-tuning to reinforcement learning. Kyle shares his journey from leading YC's Startup School to building OpenPipe, initially focused on distilling expensive GPT-4 workflows into smaller, cheaper models before pivoting to RL-based agent training as frontier model prices plummeted.

The conversation reveals why 90% of AI projects remain stuck in proof-of-concept purgatory - not due to capability limitations, but reliability issues that Kyle believes can be solved through continuous learning from real-world experience. He discusses the breakthrough of RULER (Relative Universal LLM-Elicited Rewards), which uses LLMs as judges to rank agent behaviors relatively rather than absolutely, making RL training accessible without complex reward engineering.

Kyle candidly assesses the challenges of building realistic training environments for agents, explaining why GRPO (despite its advantages) may be a dead end due to its requirement for perfectly reproducible parallel rollouts. He shares insights on why LoRAs remain underrated for production deployments, why GEPA and prompt optimization haven't lived up to the hype in his testing, and why the hardest part of deploying agents isn't the AI - it's sandboxing real-world systems with all their bugs and edge cases intact.

The discussion also covers OpenPipe's acquisition by CoreWeave, the launch of their serverless reinforcement learning platform, and Kyle's vision for a future where every deployed agent continuously learns from production experience. He predicts that solving the reliability problem through continuous RL could unlock 10x more AI inference demand from projects currently stuck in development, fundamentally changing how we think about agent deployment and maintenance.

Key Topics:

  • The rise and fall of fine-tuning as a business model
  • Why 90% of AI projects never reach production
  • RULER: Making RL accessible through relative ranking
  • The environment problem: Why sandboxing is harder than training
  • GRPO vs PPO and the future of RL algorithms
  • LoRAs: The underrated deployment optimization
  • Why GEPA and prompt optimization disappointed in practice
  • Building world models as synthetic training environments
  • The $500B Stargate bet and OpenAI's potential crypto play
  • Continuous learning as the path to reliable agents

References:

  • Kyle Corbitt on LinkedIn: https://www.linkedin.com/in/kcorbitt/
  • Aug 2023 - From Prompts to Models: https://openpipe.ai/blog/from-prompts-to-models
  • Dec 2023 - Mistral 7B Fine-Tune Optimized: https://openpipe.ai/blog/mistral-7b-fine-tune-optimized
  • Jan 2024 - S-LoRA: https://openpipe.ai/blog/s-lora
  • May 2024 - The Ten Commandments of Fine-Tuning in Prod: https://openpipe.ai/blog/the-ten-commandments-of-fine-tuning-in-prod and https://www.youtube.com/watch?v=-hYqt8M9u_M
  • Oct 2024 - Announcing DPO Support: https://openpipe.ai/blog/announcing-dpo-support
  • AIE NYC 2025 - Fine-tuning 500M agents: https://www.youtube.com/watch?v=zM9RYqCcioM&t=919s
  • AIEWF 2025 - How to train your agent (ART-E): https://www.youtube.com/watch?v=gEDl9C8s_-4&t=216s
  • Sept 2025 - Acquisition announcement: https://openpipe.ai/blog/openpipe-coreweave
  • W&B Serverless RL: https://openpipe.ai/blog/serverless-rl
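The relative-ranking idea behind RULER can be sketched in a few lines: an LLM judge orders a group of candidate rollouts against each other, and the ranks become group-relative rewards. This is a hedged illustration only, not OpenPipe's implementation; the judge model, prompt wording, and reply format are assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

def rank_group(task: str, trajectories: list[str]) -> list[float]:
    """Ask an LLM judge to order rollouts best-to-worst, then map ranks to rewards in [-1, 1]."""
    numbered = "\n\n".join(f"[{i}] {t}" for i, t in enumerate(trajectories))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system",
             "content": "Rank the candidate agent trajectories from best to worst. "
                        "Reply with only a JSON list of candidate indices."},
            {"role": "user", "content": f"Task: {task}\n\nCandidates:\n{numbered}"},
        ],
    )
    order = json.loads(resp.choices[0].message.content)  # assumes a bare JSON list, e.g. [2, 0, 1]
    n = len(trajectories)
    rewards = [0.0] * n
    for rank, idx in enumerate(order):
        rewards[idx] = 1.0 - 2.0 * rank / (n - 1) if n > 1 else 0.0  # best -> +1, worst -> -1
    return rewards
```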

Full Transcript

Hey everyone, welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by Swix, editor of Latent Space. Hello, hello. And we're so excited to have Kyle finally in the studio. Welcome. Hey, I'm very excited to be here. Kyle, you're CEO, founder, co-founder. Co-founder, CEO. Of OpenPipe, which started two years ago and recently got acquired by CoreWeave. Congrats. Thanks. where I think you might be our first like started and exited founder that we've had on the pod. Maybe-ish. I don't know. I'm not keeping it. Especially on that timeline. Well, I don't think I was exited when we, I don't remember if we set this up before or after we announced we were getting acquired. I specifically pinged you because you got, I think you got acquired. You've been on my list to watch. Obviously you've spoken three times at AIE. and you've been on my list of like, when is it a good time to have an open pipe or fine tuning or our discussion? And then you got acquired and I'm like, okay, yeah, that's a good, that's a good time to talk about it. Also, because I think like it gives us a window to talk about acquisitions, consolidation, like what should be an independent company, what maybe doesn't have to be anyway, but we'll maybe do this chronologically. So we don't, we don't get too far ahead of ourselves. You were famously director of startup school. Yes. Maybe for people who don't know, like what, what is startup school? Did that make you become like fall in love with the data color orange? I'm wearing an orange shirt for those who are listening. A very bright orange shirt. This is my conference shirt. And I felt like, you know, it was appropriate for the pot as well. So yes, I was at, I was at Y Combinator for about four and a half years and led the startup school team there. So startup school, it's, it's changed over the years. It meant one thing before I was there. It means another thing now, but during the time I was at YC startup school was basically all of the external facing a lot of the content, certainly all of the tech. So it was things like we had a MOOC, effectively, where founders could come in, they could learn about how to start a company, they could get advice from YC founders, YC partners. We had a co-founder matching service that we built, which actually worked really well. We got a lot of people through. Our total, I guess technically I can't, that probably doesn't matter anymore, but a very large fraction of the batches that went through YC while I was there, were directly attributable to people that we found and end up recruiting to YC through their experience too at startup school. So that was kind of what we were working on. Yeah, I always kind of consider it as like the scout program for YC. Yeah. Right, like the YC before the YC. Any notable, like famous people that met as part of your co-founder matching? Because I'm always very negative on those things because like it's like online dating. Like the chances of success is super low. Yeah. But when it works, it's really nice. You know, that's a great question. When I left, so we launched that product probably nine months before I left. And so I don't know what the long-term outcomes were of that specifically. Yeah. So you left YC, you spent a year in kind of the wilderness. You went to YC S23. What's that journey like? You know, I was very excited about AI things in general. So I left YC, I guess, beginning of 2022. And I was trying out a bunch of different things. 
Ended up landing on what turned into OpenPipe in early 2023. This was, let's see, so I'd been working, so my co-founder is my brother, my little brother, which has been a fun journey on its own. We were looking at different ideas, and one thing we realized was we actually started the company immediately after the GPT-4 launch. And what we saw as the opportunity in the market at the time, which has changed since then, was GPT-4 was insanely expensive and extremely powerful. But there was an opportunity to distill specific workflows from GPT-4 down to much smaller, much cheaper models. And there was like a very clear value prop there, given how expensive GPT-4 was. It was hard to deploy in production, but you could sort of like take those abilities and deploy them much more cheaply. So that was kind of the first thing we built was this kind of very managed, very clean distillation flow. What was that process like in the beginning to like get people to actually care? Because I'm assuming most people are doing experimentation, but like didn't really have these large production workflows that they needed to like distill down. And then I think maybe once we got there, the models get cheaper and faster. So what was like the initial, you know, six, nine months of the company through the evolution of the model? Yeah, so it worked. It was great. So, I mean, it did take us a while. I guess we formed the company early, maybe March of 2023. By the time we launched our product, it was August, I want to say. There were some like different things we were trying in between. And actually, it was not hard to find people and get them excited. There weren't very many. I mean, this was even late 2023. There weren't very many people in production, but anyone who did have production workflows, it was extremely painful. Like, you know, they were paying hundreds of thousands of dollars a month to open AI. So it was very easy to convince them to try this out. And so we got our first three customers after launching probably within a month. And we were doing significant revenue over the next six months. We actually got to a million in ARR over about an eight-month period following that launch. So by the latter part of 2024. So actually, yes, initial traction was super strong, very clear value prop. But then, as you were alluding to, there was just this slow march of the frontier model token prices just dropping over and over by 3, 5x over and over again, which kind of ate away at our value prop over time. What was the process of fine-tuning the model? Because even the open models were not that great. And so what were maybe the bottlenecks? Like instead of having three to get to like 30 customers, did you feel like in the beginning, it was like a matter of like just the market growing, like the open source models not being good enough, like the fine tuning not being simple, efficient enough? The pain point, I guess, repeating what I said before, was the price was too high on the closed models. But you couldn't just drop in an open model and replace them because like you're saying, the quality was quite bad, especially as you're moving to smaller model sizes. But larger models, open models weren't even available at that time. So that's kind of where the value prop was, was like, hey, the closed models are too expensive, at least the ones that are performance enough. to do the job. The open ones are not good enough. We have like a very clear managed flow. The way the flow worked was quite simple. 
You simply put in our SDK. It's a drop-in replacement for the OpenAI SDK, and it's capturing: you continue to use GPT-4 in production for a period of time, and we're capturing the requests and responses. And then we had just a very clean managed flow where it's like, okay, at some point you say, hey, I want to distill this down, and you train on that. And then we provided an API that was a direct drop-in replacement. You would just change kind of the inference URL, you were using your own model, and your app continued working. Yeah. I think the market analysis here, because I was also exploring starting a business around that at the time, and that's why I ended up not investing, was basically: you get squeezed between the GPU providers, who also want to do fine-tuning as a service, because then that makes people more sticky, and the labs, who keep putting out distilled versions of, whatever, many versions of their models. What was the analysis on the NeoCloud side? Because you kind of also want to host the inference. Yeah. Honestly, we, like I said, felt very squeezed from the frontier labs that were putting out just more capable models at lower cost. I did not see the competition ever really materialize from the NeoClouds, from the GPU providers. Everybody had an offering in fine-tuning. When we talked to customers, nobody used them because they just were really hard to use. So I do think that, like, you know, call it a product thing, I guess. Like, it's not their focus. Yeah, who cares. Yeah. Interesting, developer experience matters. It does, yeah. Still does. I don't know, maybe it doesn't matter anymore now that we just have coding models do everything. No, it still does. Like when you have Thinking Machines launching an API and people getting excited about the API, you're like, yeah, okay, that's just pure developer experience there. That's fair. Yeah. What's the, I'm just going through the chronological list here, what's the Mistral 7B fine-tune? Kind of like one of the big inflection points in the history of the company? It's like, okay, this is a good open model at the 7B size? Or is it just, yeah, Mistral and Mixtral. That was like a golden period for fine-tuning startups, because Mistral was a credible open source model. Yeah, they were really strong models, better than the Llama 2 that they were effectively replacing, and they also had a super open license, which, I think the licensing has become maybe less of a concern over time at the margin, because people are getting used to it, maybe. But at the time, that was like a pretty big deal that they had this fully open Apache 2 license. And, you know, yeah, maybe they have their own IP issues with how they trained it. I don't know, I have no inside information there. But at least the guarantee they were making to people using their model. Yeah, I call this Mistral washing. As long as it comes from this, you know, sparkling region of France called Mistral, it's okay. Don't ask about what goes into it. There's plausible deniability. Exactly.
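To make the drop-in flow Kyle describes concrete, here is a minimal sketch using the standard OpenAI Python SDK. The proxy and inference URLs, API key, and model IDs are placeholders, not OpenPipe's actual endpoints; the point is simply that only the client configuration changes between capturing traffic and serving the distilled model.

```python
# Hedged sketch of the drop-in-replacement pattern (placeholder URLs and model IDs).
from openai import OpenAI

# Phase 1: keep calling GPT-4 in production, but route traffic through a logging
# proxy so request/response pairs can later be used as distillation training data.
client = OpenAI(base_url="https://logging-proxy.example.com/v1", api_key="sk-...")
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
)

# Phase 2: once a smaller model has been fine-tuned on the captured traffic,
# swap the inference URL and model name; the application code stays identical.
client = OpenAI(base_url="https://my-inference.example.com/v1", api_key="sk-...")
resp = client.chat.completions.create(
    model="my-distilled-7b",  # hypothetical fine-tuned model ID
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
)
```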
Arm's length connection there. Yeah. Okay, there was this Mistral period. Jan 2024, you talked about S-LoRA, and there was a period of time where LoRAs became more important. I feel like they then became less important, and I don't know what's like the rise and fall of LoRAs for you as a business. Yeah, so LoRAs have really, so if you're predicated on the fact that you're doing fine-tuning at all, LoRAs have very, very attractive properties relative to doing a full fine-tune, right? Because if you're doing a LoRA, at training time it helps some, you're using less memory to train, but where it really helps you out is at inference time. Because if you're doing LoRAs, then when you deploy for inference, you can multiplex, you know, basically an arbitrarily large number of LoRAs on the same GPU deployment, and that lets you do things like per-token pricing as opposed to GPU-hour pricing. It just gives you much more flexibility at deployment time. I'm actually still a LoRA bull, for the record. You know, you're talking about the rise and fall. I think LoRAs, you know, their future is still out there. I mean, they're cool again because of Thinking Machines. Yeah, I felt very vindicated by that blog post, for the record. Just, I guess for listeners, Thinking Machines put out a blog post a week or two ago doing quite a lot of research on the tradeoffs between LoRAs and full fine-tuning in various different training regimes. I think the reason LoRAs were uncool for a while was mostly just because fine-tuning was uncool. Like, I think if you're doing fine-tuning anyway, LoRAs are still, you know, in many cases the way you want to do it. But not that many people are doing fine-tuning. As a marketing guy, LoRAs had bad marketing. Like they were just like, oh, you can't afford full fine-tuning? Here's like the Walmart store-brand fine-tuning. No, that's fair. There is some of that. I think we didn't have a huge issue. Like, we've had to do some user education, like, hey, just try it. I think for the types of training runs that we're interested in, where it's like, hey, I'm doing a relatively lightweight customization of an existing model for a specific task, there's really no downside to using a LoRA. And there's a lot of upsides from an infra-simplicity point of view. I agree that there's like a branding issue around that. Hopefully the Thinking Machines blog post kind of, you know, helps with that. And, you know, I think there are different hyperparameters for LoRAs, like the rank, that you can use to make yourself happy. The fact that John Schulman was like, no, we're actually banking the company on this, at least for now, is a pretty big vote of confidence. You know, I think it's surprising that no one had done the research prior to them, and I was talking to someone at Thinking Machines prior to their launch who had come from one of the big labs, and what that research was like, oh no, everyone doing post-training research inside this big lab uses LoRAs. I mean, not for the full run, but when they're doing their experiments, they'll just use LoRAs on a base model to run the experiments, and it works fine. For listeners of the pod, that was leaked in one of the pods that we released, but it's up to you to find it. Cool. And then, so then it was the first World's Fair.
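For readers who want to see what the multiplexing Kyle describes looks like in practice, here is a minimal sketch using vLLM's multi-LoRA support. The base model and adapter paths are placeholders; this illustrates the general serving pattern, not OpenPipe's stack.

```python
# Hedged sketch: many LoRA adapters served from one base-model deployment,
# which is what makes per-token (rather than per-GPU-hour) pricing workable.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="mistralai/Mistral-7B-v0.1", enable_lora=True)  # single GPU deployment
params = SamplingParams(max_tokens=256)

# Different customers/tasks hit the same deployment with different adapters.
out_a = llm.generate(
    ["Summarize this support thread: ..."], params,
    lora_request=LoRARequest("customer_a", 1, "/adapters/customer_a"),  # placeholder path
)
out_b = llm.generate(
    ["Extract invoice fields: ..."], params,
    lora_request=LoRARequest("customer_b", 2, "/adapters/customer_b"),  # placeholder path
)
```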
You talked about, you probably don't need fine tuning as a fine tuning founder. Basically, I think your talks are really good. I would recommend people watch all of them. What I pulled out was you had a piece of advice. So your talk title was obviously somewhat intentionally clickbaity, but your actual advice on when people should fine-tune is when it's cost latency or quality consistency that you that you really care about yeah i mostly stand by that i don't think it's changed and the biggest one we see today and this is true for kind of like classical sft it's also true for the rl stuff we're doing today crossing my fingers it's not always the thing but the main one i see that really drives fine-tuning is if you have to move to a smaller model and it's typically for latency reasons and this is usually like real-time voice so if you're sort of forced into a smaller model anyway then there's a very high chance that that doing some tuning on that model is going to get you like it will be necessary basically to have a successful deployment so we see that a lot coming from customers that again have those latency requirements there's other reasons as well sometimes for whatever reason you really have to deploy on a single gpu you have to deploy within your own cloud and you want a you know you basically have to use a smaller model to do that so basically in the case where you're forced to a smaller model anyway then fine-tuning it is often necessary i would say for 90 of use cases where you aren't forced to a smaller model, then it's still not a good ROI. And you probably shouldn't invest in it today. How do you quantify these things? So costs, right? Could always be lower. So is there kind of like a threshold of like cost to ROI? Because it's also hard to figure out how much it's going to cost you to fine tune because you need to get the data and all of that. Like, do you have a mental model of that? This is sort of like a function of the total amount of overhead required. I'd say there's two parts on the cost side, and then there's multiple parts on the benefit side. On the cost side, the main things you have to think about are the upfront effort required to get an actual training system set up for your task. And that can be quite variable, but I would say at a minimum, you're going to have to dedicate a couple of weeks of a fairly competent engineer's time. and if you have and if you have like a very complex system and you're doing rl and you need to set up a whole environment it could be a lot longer it could be you know a couple of months of time so that's just like a fixed cost you have to pay there's also like an ongoing carrying cost where once you've committed to doing fine tuning it does make other parts of your stack less flexible less nimble because whenever you're updating your prompt or like you're adding new context or whatever like now you have to like you know spend a few hours training a model and that's just going to like slow down your iterations like which is a real cost in many cases that's the larger cost so you only want to do that if like the benefits are large enough the dollar cost i would say is basically never a factor um it's just so much less than the time the amount you're spending this engineer to to do the work but it's not i mean it's you know each of these runs is between five and a couple hundred dollars um and it's just you don't have to do that many of them Yeah, because most of the data is like first party. Yeah. Right. Okay. When was the switch to RL? 
Was it when o1-preview came out? You were maybe like, okay, it's time to move on from SFT? Yeah. So that was a big moment for us. There were all the leaks before that about Strawberry and all this, and a lot of people talking about, okay, how are they doing it? We realized through that that, okay, someone's figured out how to make RL actually work with LLMs, which was not a thing. I mean, it was a thing some people had played around with before that, but it wasn't something I think many people were thinking about. And so our bet at that point was, yes, let's figure out whether this works for tasks specifically. And the space, we just, I think it's important to kind of tease out different parts of the market. I think with the release of o1, and this has been proved out many times with releases since then, I think there's now a very strong consensus that, okay, on the frontier model, general purpose model side, investments in RL are paying off. I don't think most people would argue with that. Especially as you're getting into these agentic tasks and training them to do that, it seems very clear. Well, obviously, the big labs are paying ridiculous amounts of money for these environments and everything. But also, they're actually getting really good results. The models coming out, you know, we're seeing it especially on the coding model side, but in other contexts as well, we're seeing that especially agentic use is working way better because of this. So I think even late 2024 it was pretty clear that RL was going to work in that context, and then the question in our mind was, can we apply this in a different segment of the business, which is kind of task-specific customization? And so the question is, does that work well? How much effort does that take? Is it going to be something that ends up being unnecessary because, oh, the big labs can just train on every single task and the base models are going to be just good at everything, and so there's, you know, no benefit to it? So those were kind of the open questions in our mind, but it seemed like there was at least a good enough bet that, you know, we wanted to try it out. Yeah. And you had this agent reinforcement training framework and you did the email agent. It's kind of like the first proof of concept. Was that obvious to do email? Was it obvious to call it that way? What was the behind the scenes? How should we package this? So what I told our team, and this was, we decided to go all in on RL in January of 2025. And we'd been doing some experiments before that. We released before that kind of an RL model that would, you know, generate Hacker News titles from articles, which was a fun project. So we'd done a little bit before that, but that was kind of like, hey, we're going to bet the company on this. Not in a literal sense, like we could have done something else later, but like, this is the thing that we're going to spend all of our time working on for at least a few months. And what I told our team at that time in January '25 was, there's probably like a 25% chance that this is the right direction, in the sense that a year or two years from now, all the companies, you know, everyone doing inference, should be doing RL and task-specific training so that their models are just way, way better at their task. That's a relatively low chance, but it was sort of like one of those big-if-true things.
Like, if that is true, if it turns out that just doing RL on your task is something everyone should be doing, and that teaching these agents continually, teaching them through experience, is just going to be a huge benefit, then being the first people working on that would be a really, really awesome position to be in. So that's how we thought about it: less than 50% chance, but really big outcome if so. I think since that time, and I've been very transparent with this, with our team and when I'm talking to other people, I don't think the chance that that is the right approach is 100% yet. I think that we're still in the process, even after going through this, of figuring it out. But the probabilities in my mind are going in the right direction. Like now, I think, today, I was actually just thinking about this in another conversation, I think the chances that everyone, or, you know, everyone who's deploying an agent at scale, should be doing RL with it, either as part of a sort of pre-deployment step or even continuously as it's deployed, that that's the pattern this is going to get to, I'd say there's like a 55, 60% chance that that's just the better thing to do. And that's informed by kind of our experiments working with customers. So anyway, not a hundred percent, but, going all the way back to your question, no, it was not obvious. It was an informed bet. You know, it's still a bet, but one that I'm feeling pretty good about right now. One thing I think is tricky, just as you're onboarding onto this space, is all the math. I remember reading the DPO paper, I think they were at NeurIPS for 2023, and people were very excited about it. Some of it's just being pretentious for a paper, but some of it's actually real complexity. You know, you don't have a PhD or a prior sort of ML background. How do you sort of come to grips with it? What were the best ways to get around it for you? I would probably push back on that a little bit. I don't think the math is actually that complicated. I think that when you see the PPO equation or something with all the symbols, if that's your first intro to it, then it feels very complicated. But I think if you were to show that exact same equation as just code, maybe not PyTorch code, because that you also have to understand, but if you just did the naive implementation in Python and showed someone, like, hey, this is kind of how we're computing the loss here, someone who was a strong engineer, I think it's actually quite grokkable. So yeah, I mean, I don't think the barrier to entry is that high. I think you just have to believe you can do it and then spend some time staring at it. That would be what I would recommend: you know, you can read the papers and look at the equation. I think actually this is one area where LLMs have been super helpful. If I'm reading a new paper and I look at one of those equations and I'm like, I don't understand how this new term they introduced corresponds to these other terms, then I can dump all the context around it into, you know, GPT-5 and say, hey, can you write this out in Python for me and show me what they're doing differently?
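In that spirit, here is roughly what "the naive implementation in Python" of a PPO-style loss looks like. This is a generic textbook sketch of the clipped surrogate objective over per-token log-probabilities, not OpenPipe's training code, and it omits the KL penalty and value terms a real trainer would include.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss, averaged over tokens (all inputs are lists of floats)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                 # pi_new(token) / pi_old(token)
        unclipped = ratio * adv
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * adv
        total += min(unclipped, clipped)                  # take the pessimistic bound
    return -total / len(advantages)                       # negate because we minimize
```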
And that's super helpful for kind of like my background, I guess. Yep. The way I put it is I wish that all these papers would just publish with pseudocode or just straight up Python instead of math. Yeah. Because you actually just need to look at the implementation. Yeah, totally. I know like Jeremy Howard's been beating this drum for years and I mostly agree with him. Well, I mean, there's a little website called papers with code and like people just keep not following it. I remember interviewing the DPO guys when they're at NeurIPS and it was just like, they were just very obsessed with like proving. in principle, equivalence to PPO. And it was very hard to follow. I'll definitely say that. And I think now, obviously, at some point, GRPO kind of took over the general consensus. It was very strange because I think when DeepSeek first started talking about it, it was viewed as an optimization. They tend to just generally coach everything as an optimization. But I think the later insight, which I think you touched on in one of your blog posts, was that no, it actually makes comparisons independent rather than global. And that's actually what unlocks some models like self-supervised RL. Yeah, I mean, it's interesting. There's real pros and cons. If you're moving from PPO or something similar to it to GRPO, there are some big pros. I mean, one pro is just sort of like operational simplicity. Like there's a whole extra model you need for this value model you need for PPO. that you can throw away with GRPO. And that just makes your life easier. You don't have to train that model, but also there's no hyperparameters around that model that you have to configure. So that's nice. Another thing is the benefit that you're talking about, which we've observed. So the way GRPO works is you have to do a set of different trajectories or a set of different rollouts all in parallel with the exact same environment, the exact same conditions, and then you score each of them. And GRPO uses the differences in those scores to promote the trajectories that did better and sort of like decrease the probability of the ones that did worse because they do it in sort of a group relative way the only it lets you be a little bit looser with how you score them potentially like you don't have to necessarily have a globally aware scoring function you just need some scoring function that is able to distinguish between this small set of things you have in front of you and then that's easier that's easier for a human you know if you if you tell a human which of these who choose which of these is better it's easier for to do than say like is this one good or bad in absolute terms yeah so that's nice the big downside the huge downside of grpo and i think actually the reason why grpo actually is is likely to be a dead end and we probably will not be continue using it indefinitely the fact that you need to have these parallel rollouts in order to train on it is actually the like that makes the data generation much more complicated because you need a fully reproducible environment to be able to do these sort of parallel rollouts. And it turns out in practice, that's like getting that set up is the hardest challenge today with getting RL working is like actually designing this robust, reusable, you know, environment that you can run all of this training in most companies. And that's not true. Like sometimes that's easy to do. 
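The group-relative trick described above can be sketched in a few lines: GRPO never needs an absolute notion of "good", only scores that can be compared within one group of parallel rollouts from the same prompt. A generic illustration, not any particular library's implementation:

```python
def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own group's mean and std."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four rollouts of the same prompt, scored by a judge: only their relative order matters.
print(grpo_advantages([0.9, 0.4, 0.4, 0.1]))  # the best rollout gets the largest positive advantage
```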
Like, like there's certain situations where you can do that, but for the work we do, at least where we're training agents on real code bases to like operate, like, you know, real applications, it turns out it's like really, really hard to sandbox those things in a way that's like totally reproducible. And PPO now in practice, a lot of times when you're training with PPO, you also will use an environment like that because it lets you do a bunch of runs and be more data efficient. But at least in principle, you have the option with PPO, you can actually like purely train on like, say real production traces of like real people interacting with your app. And so you don't have to have a simulated environment at all, which makes the deployment like much easier. Can you double click on why it's hard to do the sandboxing because in principle we just capture all the inputs yeah well you don't need to just capture all the inputs you need you need a system that reacts the same way your production system does that's and in many different ways and um so let's say you're you're Airbnb right and I'm bringing this up because this is like an example of one that like you know companies have gone out and built sandboxes like if you're Airbnb and you're trying to um you want to train an agent to like maybe you're not Airbnb fine you're you're a company like us that's trying to train an agent to do really well at operating Airbnb and booking on your behalf, right? You have to build a copy of the Airbnb website that reacts to you as the user the exact same way that the real one does with the same failure modes, right? Because if you don't include the same failure modes and bugs they have, then when one of those bugs comes up in production, your agent's going to have no idea what to do with it. It's just going to fall over. You also need to simulate, if this is a sort of cooperative agent, right, where it's getting human input as well and kind of working with the human to get something done, which in practice is the way a lot of these are deployed. you also need to simulate the user and i mean you can do the naive thing and just say oh we're going to have a separate llm that you know with a system prompt that is like the user simulator and we do that but it's like okay but like the breadth of ways a user might respond there's like a lot more diversity in that than the actual diversity you'll get in practice when you have this like simulated user and so then it's like okay well is this environment close enough to how a real user would interact that like you know if a user says something different that it's going to know what to do and the answer in many cases is no if you're just purely training on kind of an LLM user simulator, it's going to have its own idea of what the correct way to answer is. And the breadth of a way a human might respond in this situation is wider and your agent just may not be able to deal with that. Do you feel like it's hard to build the simulations as a company that needs to build the product that lets everybody do it? Or do you feel like even for the individual companies that own the code base that are domain experts in their own product, it's still just a very hard infrastructure problem? I think it's still very hard. You know, like ideally all companies should have this anyway, because they're getting, you know, if you're doing end to end testing, like theoretically, if you're following best practices, you would have one of those set up. 
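A side note on the "separate LLM with a system prompt that is the user simulator" approach mentioned above: a minimal sketch might look like the following. The simulator model and prompt wording are placeholder assumptions, and, as Kyle notes, such a simulator's diversity is usually much narrower than real users'.

```python
from openai import OpenAI

client = OpenAI()

USER_SIM_PROMPT = (
    "You are simulating a customer trying to book a short apartment stay. "
    "Answer the agent's questions tersely and occasionally change your mind."
)

def simulated_user_reply(conversation: list[dict]) -> str:
    """Role-play the end user: return the next user turn given the conversation so far."""
    # From the simulator's point of view, the agent's messages are the "user" turns.
    flipped = [
        {"role": "user" if m["role"] == "assistant" else "assistant", "content": m["content"]}
        for m in conversation
        if m["role"] in ("user", "assistant")
    ]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder simulator model
        messages=[{"role": "system", "content": USER_SIM_PROMPT}] + flipped,
    )
    return resp.choices[0].message.content
```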
When we talk to enterprises almost universally, that's like not something that really exists. So there are some startups like there's some companies we've talked to that do have it and we can just like use that. But it's a very, very small number that actually have an environment like that. And I think it's hard to do. And like there's lots of like weird bugs that don't show up in an environment like that. And even if they do have a testing environment, they don't have it populated with full realistic data, which is also important so that it understands how to interact. So I think in practice, it's hard in both cases. Maybe it's easier for the company, but at the same time, depending on the quality of the company's engineers, it might not be easy for them either. Yeah. How do you classify the types of environments? So you have formal environments, like a compiler. You can put in there. You don't need to do any work. They just work. then you have this kind of like rl environment startups in a way that are building a bank environment they're building these things that are not digital twins or whatever term of like the actual environments but they're like close to it and then on top of it you have helping people trying to build the exact replica of their thing there's obviously value in like the formally verified ones we verified that do you think there's value in this like rl environment startups that are building like somewhat generic but test-specific environments. And then if none of those work, then what do we do instead of GRPO? I guess the question. Yeah, I suspect there is value in that. You know, I think the folks buying those environments and training on them in the big labs would have the best knowledge on how well they work. I think they probably work okay. I think they probably also are like, you know, and we'll see maybe with the next generation of models released, like how well they transfer. I would say so far, it seems like they don't train well enough. Like if you use, you know, OpenAI's agent interface, it's like, okay. Or if you use the computer use products that everybody's putting out, they're like, okay, but like not reliable enough to like actually like let go do something interesting unsupervised in the world. And I think if the environments they were training in were high enough fidelity, then they would be good enough in the same way that like coding agents can go much further. because I think that in that case, we do have environments that are much higher fidelity because it's a much simpler environment in a lot of ways. It's like, it's a code base. It's like maybe running a web browser. Like it's much easier to capture the full realistic environment in that context. For those who are interested, when you make a reference to our environment startups selling to the big labs, they're selling it for a lot of money. Yeah. Like at least seven figures, right? That's my understanding, yeah. I'm not a buyer. Please drop data points because people who are not in Silicon Valley don't know this. And it's probably the current thing in VC, which is our environment startups. Anyway. A lot of them. There's like 20 of them, apparently. But it's like a small number. I know that, yeah, all the labs are buying ad hoc. But in a way, it's almost like they don't even care. It's not a product. It's like they're basically paying the company to build an environment ad hoc for them. It's a very services business. Services business. 
But I mean, if you're spending like a billion dollars on a training run... You can specialize, like, we are the one that does e-commerce. Like, we are the e-commerce experts, so come to us for e-commerce. Go to the other guys for social media. Go to the other guys for, I don't know. But I'm curious, your take is, how do you need to get the data out to make it fit in your training run? Especially when you get to these larger labs, I think they have very sophisticated post-training pipelines, and I don't know if there's a way to just build a company where you just send them a CSV of data. It needs to be very integrated. But I'm curious what you've seen working with customers, too. So for RL, the whole way this works is, you know, it has to sort of be getting feedback from the real environment. So I don't see a world where it's as simple as, hey, you know, there's like a CSV-type approach. I guess you could encode anything as a CSV if you try hard enough. For our RL work, you have to be looking at real runs, ideally of your actual agent in its current state, within an environment as real as possible. So you have to look at, actually, and like the data format's actually super simple. It's just basically a list of, you know, chat completion messages. It's effectively, whatever tool calls, yeah, exactly, yeah, it's whatever your agent will be seeing and doing when it's running. So the getting-the-data part is not hard. But what's hard is, when you're doing one of these runs and your agent makes a tool call, okay, now that tool call has to connect, you know, somehow it's got to get data back from something, and that data has to look like it will look in real usage. So setting up that whole part of the system is the challenge. And then, just for a reference for more people: WebArena is my first instance of this kind of thing, where you literally have a Docker container that has a clone of Reddit, a clone of Wikipedia, a clone of GitLab, a clone of a CMS, and a clone of an e-commerce place. And I think since then there's Mind2Web, maybe, I don't know if there are other large, well-known academic environments where people are basically using these as benchmarks, but probably it's also pretty useful for training. So if you want to check out those things, you can definitely check there. I think the question for you is, as someone who bet on SFT, then you bet on RLFT, and now you see these guys making a lot of money, why didn't you go there? It seems to me like that definitely is a services-heavy business at the moment, as it's presently constituted. I'm sure that these companies are all developing different kinds of secret sauce on how to do this more quickly, so that's part of it. I don't particularly enjoy services businesses. But, you know, I also kind of feel like we will move towards a world where either the big labs... like, it's one of those businesses where the only customers right now are, like, whatever, four big, maybe six big labs that, you know, are training these models on environments. And I don't think... I'm a little... Right, what's the TAM? Yeah. But, you know, like, look, you can say the same about Scale AI and all of their competitors that are, you know, many-billion-dollar companies that have basically the exact same customer set. So, yeah. It may work out. Yeah.
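The "list of chat completion messages" format Kyle describes is essentially the agent's own message history, tool calls and tool results included. A single trajectory could look roughly like this (field names follow the standard OpenAI chat format; the booking tool and the reward field are hypothetical illustrations):

```python
trajectory = {
    "messages": [
        {"role": "system", "content": "You are a booking assistant."},
        {"role": "user", "content": "Find me a place in Lisbon for next weekend."},
        {"role": "assistant", "content": None, "tool_calls": [{
            "id": "call_1", "type": "function",
            "function": {"name": "search_listings",          # hypothetical tool
                         "arguments": '{"city": "Lisbon", "guests": 2}'},
        }]},
        {"role": "tool", "tool_call_id": "call_1",
         "content": '[{"id": 42, "price_per_night": 120}]'},  # what the environment returned
        {"role": "assistant", "content": "I found a listing at $120/night. Want me to book it?"},
    ],
    "reward": 1.0,  # e.g. a judge's score for this rollout
}
```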
I don't know if you want to do a small shameless plug for Varys. Oh, yeah. I mean, so Varys, one of our portfolio companies, they work with the people building the agents, not with the model, on like their internal tool call loop. So they can observe all the internal traces and build the data to then have like a open pipe do the RFT on the thing. I think in the enterprise, we've seen a lot of that, especially for chatbots. It's like the less sexy use case, but like their work with a lot of financial services company where their customers go in there and say, what's my balance? Like, when did I do this transaction? And those are all tool calls, you know, and they need a way to test and improve that behavior. And the models haven't got in that much better because these tools are like badly documented. They're like badly named. I think that's kind of like the problem with a lot of the agent builders that are not AI native companies. It's like they just put this like very generic tools in the thing and then they expect it to work like magic. And the simulations kind of help them also have the usual compliance things. It's like before shipping this, we tested that it doesn't give financial advice. We tested that, you know, there's all these different things. So I'm curious to see how much the companies generalize. I think like there's a lot of success in like highly regulated environments because of different requirements. But I'm curious if you have a different way to segment the market of like when you think about RL, there's like environments that are like low stakes. There's like environment that are like high stakes. There's environment that have implicit rules that are made by the SEC or other government agencies. How do you think about it? Yeah, I don't know that that segmentation is necessarily the most relevant. I'd have to think more about that segmentation, whether it's, you know, there's like a strong difference in how useful RL is across those sectors. Where I see the segmentation is something basically just like capabilities based where it like hey if I trying to do something that like much more advanced and you know maybe like long horizon then RL can probably give me a much better behavior And I might almost think that like, yeah, those sort of like more compliance, like I feel like in those kind of environments, you probably don't want your agent doing very much because then it's like you can't make any guarantees about what it might do. And so you're probably not doing these long horizon things. And maybe RL is not going to get you what you want. But I don't know. Yeah, I haven't thought about it too much. I think a lot of the customers don't necessarily end up doing RL anyway. It's almost like the simulation and the environment. It's like a way for them to understand the paths that the agent can take. And less about we need to then use that data to do fine tuning. But I think it's going to be a spectrum. What replaces your PO? 
Yeah, it's a good question. We need the alpha. Yeah, I mean, I don't know is the short answer. I do think this is a fairly high-salience question in the research community. I think there's a lot of folks trying to figure that out. Every paper has a variant. Yeah. But I think, you know, the big question is, are we doing, you know, normalization based on grouping or in some other way, right? That's, I would say, I would claim we're just going to keep calling it GRPO as long as the normalization is done within a group, even though, yeah, there's a lot of things that probably should get their own names, a lot of things that have tried to get their own names and have failed on the marketing side. Yeah. I think something that doesn't require group-level normalization, which a lot of, you know, older things didn't, probably works. But I think the older things also are really finicky. So there may be other kinds of simplification, and I don't know exactly what those will be. Where do you put the prompt optimization thing? We did a DevDay episode and we mentioned GEPA, and then everybody came out of the woodwork on Twitter. Yeah, exactly. Tell me, have you or people you talked to tried GEPA? I want to know, like, what? I read the paper. I'm just like, look, the prompt-layer updates are not the same as weight updates; they're just comparing apples and oranges. And I talked with a few people I respect on this, on the RL side, and they kind of validated that the way these grad students market their papers is: their thing beats the current hot thing, and the current hot thing is GRPO. But like, they're just not that comparable. I disagree with that. I actually think they are comparable in the sense that, it depends on for what purpose, right? But if I'm a company trying to get the best performance out of my agent, I don't care if you're changing my prompt or if you're changing my weights. If you get better performance on my agent, you know, I'm happy. On that front, I do think they're comparable, and we've evaluated, I mean, we evaluated... So their answer was, you are going to do both. If you really want max performance, you're going to do both. Yeah, we evaluated everything from DSPy, and we evaluated GEPA as well, and it just doesn't work. Okay. Like, okay, those are going to be the fighting words. Yeah, GEPA doesn't work. It didn't work on the problems we tried it on. It just didn't. It got like a minor boost over the sort of more naive prompt we had, and it was just like, okay, our naive prompt with our model gets maybe 50% on this benchmark, and GEPA got to 56, and we do our own, we get to like 96. I mean, it was just not even comparable. And so maybe we were holding it wrong. So both sides are claiming skill issue, right? So what they would say is, you probably used it wrong. And then the RL people are saying that probably the GEPA guys, when they set up the GEPA benchmark, it wasn't a very fair comparison, which is exactly what my source said. It's hard to tell. You know, everyone is trying to get to some version of the truth. Yeah. What I will say is, we want it, I mean, I don't know if I would go so far as to say we want it to work, but we certainly want to know if it works. That's actually very relevant. If it's more efficient to get there, then you should probably do it. That's, yeah.
It's actually kind of more credible now that you're, you know, part of a larger CoreWeave, that you're not obviously... Because I think GEPA maybe makes OpenPipe like less relevant. I would totally disagree with that. Okay. Because the level we see ourselves operating at is actually, we're not like RL bros trying to figure out the use case for RL. We're like, hey, we're working with all these enterprises, all these big companies we're talking to, and we're trying to figure out how we make their stuff work better. And so, personally, I'm very motivated: if something like GEPA works, okay, let's build a product around that. That's how I think about OpenPipe, at least. No, I mean, that's a good clarification to make. Even more so, you actually took a sincere look at it and you concluded that there was nothing to do, nothing to build. Well, you know, maybe we were holding it wrong. So we had Shunyu on the podcast a while ago, and I think he's been a proponent of automatic prompt optimization and this idea that you can do a lot more in the prompts than you can do in the weights. And in principle, I'm biased, inclined to believe that something like DSPy, something like GEPA works. So I'm very surprised to hear this. Yeah, like, we keep trying it. You know, we tried the MIPROv2 stuff that was hyped before that also. Okay, I should not bury the lede on the best argument for this, which is: GEPA basically models how the big labs do their system prompts. It's genetic evolution, you know, and they sort of incrementally evolve based on the overall evals that they have. It's slow because it's done by humans, but GEPA theoretically automates this. Okay, hold on. Is the... do the big labs have something? No, no, no, this is philosophically the same. I'm not saying... Oh, sure. But like, you're injecting a whole lot of human intuition and kind of potentially out-of-band... We have the best model in the world, which is humanity. Yeah, or like smart humans. And now we're doing GEPA with dumb LLMs. Right, but they're also, like, the humans can bring in out-of-band information that maybe is not captured in the actual, you know, eval. Like they can be like, oh yes, technically this did well on the eval, but it's not really, you know... I would suspect that a lot of that ends up getting injected through that human being in the loop. Yeah, yeah. I've always been very surprised at how these guys work on their system prompts, which are tens of thousands of words long, and there's no ablations. They just kind of pick what seems to work and then chuck it in there, and that is the Claude system prompt. Can't argue with success. Is GPT-5 the first model that had a prompt optimizer by one of the large labs? I believe so, but I don't... Claude Workbench had this like a year and a half ago, if you see it that way. It just wasn't fully automated, but it was extremely good for its time. I kept telling people about it, nobody believed me. Do we know if they used it internally? Claude Workbench? Yeah. Okay. Why not? Oh, I don't know. Like, just, my experience, you know, knowing a lot of people at these labs, is they launch a lot of products because some team is super excited about this product. But I wouldn't put that much weight on it just because they launched it. For some measure of used internally, I am sure. The people I talk to are biased. I don't know if they fully explored that.
Yeah, no, I think it's just interesting that now it's acknowledged that the LLM can improve your prompt. And so I think GEPA is also riding this wave of, okay, maybe we can do this programmatically. But I also think the long tail of people just prompt really badly, and so I think there's some value there. Versus once you go into RL, you already have a more sophisticated audience, you know. Like, who gets to do GRPO? People that are really smart. Who gets to do prompt optimization? Like, everybody's trying to do it. So yeah, that's right. Maybe our baseline was... I know your naive prompt is probably, you know, top 10 percentile of prompts that people put into these LLMs. I'll take it. And then the other thing that comes to mind, as you were talking about injecting things out of band and all that, I think it's a broader trend that I'm tracking for the World's Fair '26, which is the move to online evals. The way that we do evals today is probably too locked down. You're kind of fighting the war that you already know should be fought, and you're not fighting the wars that you don't know about because you didn't plan for them, whatever. How can we sort of move more online evals into our GEPA process? Maybe that's what it is. That part I'm much more bullish on. And we can make the analogy, we can pull in RL intuition here, which is: if you're doing GEPA on a static data set of, oh, this is the input, this is what makes it a good or bad output, then as you're updating your prompt, your information, the data you're training on, becomes less useful. Because it's generated, because it's based on the problems you were running into before. And that's the same problem you have with RL, where you have this concept of being off-policy, where as you're doing training, you really want to be training on rollouts that came from the latest version of your model. Because if you train on some that came from further back, then it's sort of stale data, it's no longer representing the current issues with your model, and so if you try and correct for the issues that existed back then, it may not actually be helping you that much. And I think, you know, for either RL or prompt optimization, that's definitely true. I think one way to apply that in practice is exactly what you're saying, where you're using the actual data from your real evals. You have some way of saying, hey, either people are flagging these, or, no, I'm flagging these, or some way of saying, this was a good or bad output. I totally agree with you that if you're bringing that into your process, I'm much more optimistic that you're going to get good results. Yeah. And the pipelines are not set up. Like, this is analytics and UX people being drawn into the ML process, which they've never done before. If I had to pick a big theme for next year, this is going to be it. No, I agree. And I mean, I think that all of the sort of observability platforms see that and are trying to figure out what the right shape is. I haven't seen the right shape yet, but yes, it seems like a theme for next year. Statsig? Maybe. Yeah, I haven't used them, but OpenAI seems to like them.
Yeah, I mean, I do think buying an experimentation platform makes sense. And, like I've said before on the podcast, I'm very bullish on model routing as a feature but less bullish on model routing companies, because of exactly stuff like this where it's just going to get absorbed into the model. This plumbing is a very big part of building the process, you probably don't want to outsource it, and it's not that hard — it's not rocket science. You're just connecting pipes and making sure things are set up so it's easy to use that data.

I have a question for you, a general question. What fraction of tokens generated by, say, the end of 2026 do you think are going to come from open-source models versus proprietary models?

Oh, that's a fun question. We have an answer from Ankur from Braintrust, which was: it's 5% and going down. I think it's going to go up, because of the amount of enterprise adoption of open models that I'm seeing. And also... Because there's a lot of demand. The enterprises would much rather be on open models if they could actually get the performance they're looking for. Yeah — for cost, for privacy, all that stuff. And honestly, basically, we may have hit quote-unquote AGI in the sense that the average LLM is capable of the work of the average human — not the best human, but the average human. Sure. It's actually pretty decent at customer service, and it's pretty decent at, I don't know, transcribing things from PDFs, whatever. So yeah, I think that share should rise. But people who believe it will rise to 50% are out of their minds.

And I think it's really a two-part question: we should take coding out. Once you take coding out, yeah, it can be like 15–20%. But with coding included, it's still going to be very low, because these max plans are so subsidized and so many tokens are being generated — for Anthropic that's like, you know, 50% of the revenue.

Is your claim that coding will mostly be closed models because the tokens are subsidized, or because the models are just so much better?

I mean, I'm paying 200 bucks a month and I'm spending thousands of dollars' worth. By accident — I pay with my credit card and I spend like a hundred bucks in an hour. And this is the thing no one wants to talk about with Anthropic: Anthropic went from 1 billion in revenue to 5 billion, and it was like, ooh, yay. And then — what's the margins?
You see a number like that and I'm going, what's the margins? They say it's like six percent. And there you are, part of the six percent, abusing everything. I'm not abusing it — you're the loss leader — it's not like I'm rotating accounts, I'm just using the product. Yeah, but through you, people hear about Claude Code, they pay the $200 a month and then they don't use it — they pay for your influence. Thank you. Thank you, everyone, keep doing it so I don't have to go away.

But I think it's hard to see a world in which Qwen Coder or whatever model replaces that, because between quality and cost — to generate this amount of tokens for 200 bucks a month, I don't know how anybody can offer that. Together, Fireworks — they cannot really offer it at that price, and the quality is not as good.

But the reason they can't offer that price is because of the subsidies, right? Which is not long-term sustainable.

I mean, it's interesting, because both Anthropic and OpenAI are building their own infra, right? And they're going to get to a place where they have idle GPUs that they own, and so they will also be incentivized to run at 100% utilization. So they will subsidize some of it — the same way that if you go on SF Compute, you pay a buck forty for an H100 instead of the $2.20 listed price on AWS. So I think it will continue. But again, it depends on whether they actually have the $500 billion like they were saying — which I think they do. Just to be clear, I think Stargate will go online. But once it goes online, then it's like, well...

If they figure out how to pay for $500 billion worth of compute, then they probably can subsidize for a while.

I think they have the 500B. They're going bigger. Isn't it obvious? What do we mean by "have"? At the start of this year, when they announced Stargate, people were like, oh, you don't even have 10 — Elon was like, you don't even have 10, whatever. And then Satya's like, I'm good for my 80.
But now we're seeing all the money start coming in, and it's probably on the order of 200, 300 billion that you could get raised and committed, and they're going to get the rest. It's fine. I think the plan is actually — can I just say, I love this industry? It's like, yeah, they've got two or three hundred billion, and what's another couple hundred billion? There's no other industry in the history of the world where... Yeah, it is stupid. But also, do you doubt it? I don't. Literally, after last week — maybe two weeks ago — with the whole Oracle, NVIDIA, and then even the AMD deal, I'm like, oh, these guys have not only locked down Stargate one, they're working on Stargate two, wherever that is. The sheer ambition is freaking crazy.

There is still one more shoe to drop, which is the non-sovereign-wealth funding that OpenAI needs to get, which they've promised to drop by the end of this year, and my money is on them having to do a coin. I'm not a crypto guy at all, but, you know, an OpenAI coin — this is the one AI founder that has his own coin already. And he needs more money, and he said they will come up with new, innovative financing methods. What else is there? I mean, they're already in the token-selling business. That's a great line — buy an OpenAI token, it translates to a GPT-5 token. Are you sure it's a stablecoin? You'd have to get a lot of political buy-in, I think, to take that level of — what, from the White House that is the most crypto-friendly since the dawn of time? Well, I guess Elon's out of there now, so maybe they can make the friends. Yeah, I think it's doable. We'll see — who knows? For what it's worth, this is a me theory; I don't have any insider information.

Should we go back to RULER? Yeah, sorry. Right — OpenPipe. Anyway, we were saying: I think this story takes us to July '25, when you released RULER, which we called easy mode for RL rewards. And then, shortly after, you got acquired in September. So maybe you want to talk through the summer — what was the vision, and then how did the acquisition come together?

Yeah, absolutely. So I mentioned my initial opinion of how likely this direction was to work was maybe 25%; we're up to 55% or so now, and RULER is actually the big update that got me from the 25 to the 50. So let me give some context there. Basically, there are several problems you have to solve if you want to use RL successfully. Some of them are just really dumb, basic things — hey, you've got to get the infra working, and the libraries have all really sucked and been built by PhD students who don't know how to build reliable software. So there are all these practical issues we're working through; that's one thing, and that's what we're trying to solve with ART. But even after you've got that solved, you've got major issues — which is that you have to know whether your agent, or whatever system you're putting RL on, is actually doing a good job, right? That's fundamental: you have to have a reward; you have to know what doing well or poorly means. Sometimes that's easy to do.
If you're solving a math problem or something, you can come up with a dataset of math problems with known solutions and check whether the answer is the same. On the coding side, there's been a lot of innovative work: there's a lot of open data, and I think the approach a lot of companies take is to find existing test cases and then break them — so there's a way to figure it out; you can run the test case and see whether your code fixes it or not. In a lot of other domains it's much more murky: what is a good job versus a bad job? How do I know if I did a good job? And you really need that information. So we tried a bunch of different things.

RULER is a library that we released. Which — let me get it — Relative Universal LLM-Elicited Rewards. Thank you, yes. The way it works basically depends on the GRPO insight we were mentioning earlier: GRPO has this nice property where you don't have to have an absolute judge of the truth, you just have to judge relatively. So, simplifying a lot, it's basically LLM-as-judge on a whole group. You say: okay, this is the task I'm trying to achieve, here are four different runs of an agent trying to achieve it — which of these did best? — and it stack-ranks them. And it turns out that works phenomenally well with GRPO. Way better than I expected, way better than anyone I talked to before we actually tried this expected — because when the LLM-as-judge is used this way it can sort of self-ground. It's just giving relative ranks, so it doesn't have to have an omniscient view of what good or bad looks like.

That has worked on basically everything we've thrown it at. We've done it with a bunch of client projects, we've done it with a bunch of our own customers, and it basically just works. Honestly, I kind of feel like the reward-assignment problem is fairly solved, which is fantastic.

Does just any LLM-as-judge get you off the hook? We've tried it with so many things. One of the results we published: we used Qwen 2.5 14B as the model we were training, and as the judge we used Qwen 2.5 32B — which is fine, but much worse than any frontier model, right? Even with that combination, we were able to get our agent doing state-of-the-art — better than any frontier model — on the task we tried it on, even with an extremely weak judge model. So in practice it really doesn't depend on having a really great judge model. It's just not something we've had to worry about since then at all; that's checked off. So that gave me a significant update: okay, this is actually something people can apply, and it's now packaged up so people can just use it. We open-sourced everything — you can use it off the shelf, and if you stick it in your trainer and run it, it will probably just work.

That leaves the remaining problem, which I guess we were talking about earlier in a different order: the environment problem. That's the one big remaining piece that we don't know yet how to automate or remove, and it requires a lot of manual work for every single task.
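A minimal sketch of the RULER-style group scoring just described: an LLM judge scores a whole group of rollouts relative to each other, and the group-normalized scores become GRPO-style advantages. This is not the actual ART/RULER API; `judge_llm` and the prompt format are hypothetical stand-ins.

```python
import json
import statistics

def judge_llm(prompt: str) -> str:
    """Hypothetical judge call -- any reasonably capable model works; the result
    described above used Qwen 2.5 32B as the judge. Replace with your client."""
    raise NotImplementedError

def score_group(task: str, rollouts: list[str]) -> list[float]:
    """Judge the whole group at once, so scores only need to be consistent
    relative to each other, not absolutely calibrated."""
    numbered = "\n\n".join(f"[{i}] {r}" for i, r in enumerate(rollouts))
    prompt = (f"Task: {task}\n\nHere are {len(rollouts)} attempts:\n{numbered}\n\n"
              'Return JSON {"scores": [...]} with one score in [0,1] per attempt, '
              "higher meaning the attempt achieved the task better.")
    return json.loads(judge_llm(prompt))["scores"]

def group_advantages(scores: list[float]) -> list[float]:
    """GRPO-style advantage: (score - group mean) / group std."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0        # avoid divide-by-zero when all scores tie
    return [(s - mean) / std for s in scores]

# usage: sample N rollouts from the current policy, score them as a group,
# then weight each rollout's tokens by its advantage in the policy-gradient update
```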
For listeners, this is why I kind of refer to it as self-supervised: because it removes more and more of the human judgment. And the history of machine learning, all the way from, I guess, the start of ImageNet, is really that insight — you should take humans increasingly out of it and scale up the data you can just throw in there with no supervision. Yeah, totally. It's really awesome.

Are you bullish on dedicated LLM-as-judge models? Have you looked at those? Bespoke Labs — we did an episode with them, and they're really trying to carve out a niche there.

We've looked into it. We've trained some ourselves and we've also used some off the shelf. There's an evaluation benchmark the AI2 people put together, RewardBench, which tries to benchmark models on serving as — and reward models are LLM judges, in your mind? The same thing? They're mildly different depending on the task: LLM-as-judge is usually more product-facing, and reward modeling is much more specific, within a chat task — that used to be the old meaning of reward model. I don't know, maybe the terminology has changed; I think they're pretty equivalent. I understand that — I can see it, as an aside.

Anyway, so RewardBench is kind of that, and we've tried a bunch of models off it. The thing is — my meta take on this is that any task that is extremely common is going to end up as a specific part of the training data for the frontier labs, and LLM-as-judge is something everybody's doing in so many different contexts that you have to assume all the frontier labs have a bunch of LLM-as-judge-style tasks they're training their models on. And I do believe that if something makes it into their training data in a more-than-minor way, they're going to do at least as good a job as a dedicated model. So I don't think there's a lot of alpha in dedicated LLM judges. Let me caveat that: if you've got a very, very specific task that's weird, has weird requirements, and you have a lot of data on what's good or bad, then training a reward model for your specific task could still work — or fine-tuning an LLM judge on your specific task could work. I'm pretty bearish on "hey, this is a model trained as a generic LLM judge, it can be used to judge anything." I just don't think you're going to beat the frontier labs on that.

Yeah. One other version of this that is not quite an LLM-as-judge, but that some people are thinking about — and something we're working on for a future episode — is world models. Sexy. Yeah, very sexy. First applied in video, as far as I can tell, with Genie — Genie 1, 2, 3 — and now with code, and potentially with virtual cells for AI bio. Any exploration there that's interesting to you?

Yeah, we've been playing around with it a little bit. It's one of the directions I'm fairly optimistic on for solving the environment problem specifically, because if you think about it, a world model is a simulated environment — that's its whole purpose, right? So if you get one in an LLM-like thing — not a Docker container — yes — then it's, you know, hallucinating, generating, imagining the responses you'll get from the world.
So you can imagine: if you had a really, really great world model that you're training against, the agent you're training would go out and make some tool call, and the world model would generate, "Hey, this is probably what that tool call returns." And if you have a smart enough, strong enough one, it could keep its own effective internal state of the changes you've made so far and how they affect things. So we've played around with it some. If we can get it to work really well, that could be a solution to the environment problem: you take a bunch of production traces and use them to condition your world model so it understands your specific system and its failure modes, then train against that world model — and the resulting agent you train would then be able to perform in your real environment. So I do think it's a really interesting area of research.

Yeah. And did you see the Meta Code World Model work? I don't think I saw that one. Okay, yeah, it was like two weeks ago; we just confirmed the speaker for AIE Code in November. And it's really interesting — the world model is... Oh, sorry, you're talking about the Meta one? Yeah. Okay, yes, I did see that one. I said a lot of syllables; it may not have parsed. But yeah, it's literally having a debugger as the environment — as the world model — and opening up the execution trace to the model so it can see what's going on, see the state, and track the state as the code executes. That seems smart, and it exploits the unique situation of code environments, where we can actually do these things.

Yeah. I think the way they envision that model being used is a little different — I'm curious, I'll have to see the talk. But my understanding from that paper is that the goal they're imagining is almost a pre-training step: now that this model understands code really, really well, we can use it as basically a code-generation model, or a coding agent of some kind. Okay. Yeah.
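Stepping back to the world-model-as-environment idea Kyle sketched a moment earlier, here is a hedged illustration: an LLM "world model," conditioned on production traces, imagines tool responses so an agent can be rolled out and trained without touching the real system. Every name here is hypothetical; this is a sketch of the idea, not any released implementation.

```python
def world_model_llm(prompt: str) -> str:
    """Hypothetical call to a model conditioned/fine-tuned on your production traces."""
    raise NotImplementedError

class SimulatedEnv:
    """Replaces real tool execution with world-model 'imagination', keeping the
    running transcript as the environment's effective state."""
    def __init__(self, system_description: str):
        self.history = [f"SYSTEM: {system_description}"]

    def step(self, tool_call: str) -> str:
        self.history.append(f"AGENT CALLED: {tool_call}")
        observation = world_model_llm(
            "\n".join(self.history)
            + "\nGiven everything above, what would this system most plausibly return "
              "for the last tool call? Reply with only the tool output."
        )
        self.history.append(f"TOOL RETURNED: {observation}")
        return observation

# training-loop sketch: roll the agent out against SimulatedEnv instead of production,
# score trajectories with a judge (e.g. RULER-style), and validate the trained policy
# against the real environment before deployment.
```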
Which I think makes sense — that's almost more like a different kind of pre-training, I would say. The way I'm interested in applying world models is as its own end, where the goal is to come out of this with something that simulates the world. That's not something you really need in code at all, because it's so easy to just run code: you don't need to model what will happen if you execute this code, because for training purposes you can just execute it and see what happens, right?

But it closely models how we think about code when we code: we mentally execute the code as we type and ask, is that what we really want? Yeah, I don't know. Anyway, it's the first model out of Meta, really, since the MSL reorganization, and we know, just based on our context, that they're very, very interested in code models as a path to AGI — which I'm also, of course, very interested in.

I know we've kept you here for a while — let's wrap up with the acquisition. A lot of people say companies are not sold, they're bought. What was that process like for you? Did it just happen? What was the behind-the-scenes?

Yeah, so that was driven mostly by the Weights & Biases founding team — Lukas, and Shawn in particular. They had recently been acquired by CoreWeave, and CoreWeave was looking to continue growing up the stack. So they approached me and said, hey, no pressure, but this is an area we think is really promising — would you like to work here? That's how the conversation started. It was long, and it was pretty painful: there were points as late as the week before we actually signed where it was unclear whether it was actually going to happen. So that part was super painful. However, we've been there a month now, we just shipped a product yesterday that I'm super excited about, and it's been fantastic working there so far. I was very concerned — okay, yes, this is great, we make a lot of money by selling our company, but is the work environment going to really, really suck? I figured that was just a risk we'd have to take. It's been fantastic, honestly way, way better than I could have imagined.

Do you go down to the office, the one down here? I was there today. I'm based in Seattle, and they have a small office up there that we work from. The Weights & Biases office in San Francisco is fantastic — if you have the chance, go visit. They do all the hackathons and co-working things. Yeah, there's a hackathon going on in a month or so. Every week there's a hackathon.

So do you consider yourself working for Weights & Biases, or CoreWeave, or both — and OpenPipe too? Yeah, so I report to the Weights & Biases founders; that's where we sit in the org chart. Branding-wise, they're trying to say that everything that isn't being sold to the big labs is kind of Weights & Biases, so the stuff we're launching is Weights & Biases branded, not CoreWeave branded as much. They're still figuring it out.

And what's the product you launched? We launched serverless reinforcement learning. Basically, it lets you offload all of the GPU management.
You don't have to worry about crashes and out-of-memory errors and scaling up and down — we handle all of that for you. You just define your environment and your reward function, and then every time you run a step, you ship back to our backend: hey, these are the trajectories, these are the rewards, now update my model. And we make it work for you. It makes it way easier.

Yeah. Okay, very Thinking Machines-y. I love the Thinking Machines launch; I think they have a really good idea. It's also very validating. How did this take so long to appear? Yeah — but I've felt this way about everything. There are so many things that should exist. I just think there are still not enough smart people working in this space; it feels like there's still a lot of low-hanging fruit nobody's picking.

Okay. One thing I saw from your post: your North Star as the RL team at CoreWeave is to build a world where every agent learns continually from its real-world experience. So you're touching on the hot topic of the moment, continual learning. What else do we need to get there?

I super believe that, and that's basically the vision. I keep talking about these percentages — 25, 50 — and if we get to the world where we build that, then the advantages are huge and they're clear, and everyone should just deploy their agents that way. We want to be the team that builds the software that makes that easy to do.

I talk to a lot of engineers at our customers who are trying to deploy agents. It's so easy to get the initial prototype — something that kind of works well. It is so hard to get from that to something you're confident is reliable enough to actually deploy in production. And when you look at what those failure modes are, it's like, oh yeah, we know if it gets into this situation, or gets these kinds of inputs, it behaves funnily. You can update your prompt to address that, but that's not scalable, because at a certain point it starts breaking other things, and you don't know what it's breaking. You really want some way to just say: okay, look, this thing you did there, that was the wrong thing — adjust this behavior when you get into this situation, and otherwise carry on, right? That's what we can do with RL, and that's what we can do with continual learning. We don't have to have this concept of making the perfect model up front that solves everything. Instead, I'm trying to make a model that's good enough to deploy in production, and then when these errors come in, I say — very analogous to how you train a human employee — no, actually, that's not what you should do in that situation; fix that and carry on. And that's just going to make this whole process so much easier.
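A hypothetical sketch of the client side of a loop like the one Kyle describes — define the environment and reward locally, ship trajectories and rewards to a managed trainer, and get an updated model back each step, which is also the plumbing a continual-learning deployment needs. This is not the actual Weights & Biases / OpenPipe API; the endpoint, payload, and `env` interface are assumptions.

```python
import requests

TRAINER_URL = "https://example.com/api/rl"   # hypothetical managed-trainer endpoint

def run_episode(model_id: str, env) -> tuple[list[dict], float]:
    """The part you own: roll the current model through your environment
    and score the trajectory with your reward function."""
    trajectory = env.rollout(model_id)            # assumed env interface
    return trajectory, env.score(trajectory)

def train(env, model_id: str, steps: int = 100, group_size: int = 4) -> str:
    for _ in range(steps):
        episodes = [run_episode(model_id, env) for _ in range(group_size)]
        # ship trajectories + rewards; GPU scheduling, OOMs, and checkpointing
        # are the managed backend's problem, not yours
        resp = requests.post(f"{TRAINER_URL}/step", json={
            "model": model_id,
            "trajectories": [t for t, _ in episodes],
            "rewards": [r for _, r in episodes],
        })
        model_id = resp.json()["updated_model"]   # always roll out the latest checkpoint (stay on-policy)
    return model_id
```

In a continual-learning deployment, the same loop would be fed by flagged production traffic rather than a fixed training set, so each update targets the agent's current failure modes.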
And I think that today there is something like 10 times as much AI inference that could exist as actually exists, purely from projects sitting in the proof-of-concept stage that have not been deployed — because there's a huge bucket of those. And it's all about this reliability issue: it works in controlled circumstances, and there are areas where it doesn't. So if we can solve this problem, that 90% of the addressable inference market is going to come online, because we've solved that problem. That's what we want to do. I'm super excited about it, and I think we have very concrete ideas on the specific pieces we need to make it work — we just have to execute against them.

Do you feel like online RL is more susceptible to reward hacking, especially as you shorten this loop and spend less time looking at the different checkpoints?

I'm not that worried about it, and the reason is that reward hacking is quite easy to detect once it starts happening. Once the model has found some hack, it starts doing it all the time — "oh yes, this worked great, I'm just going to keep doing it" — so you notice very quickly: whoa, it's doing this thing. And assuming you're using, at least in part, an LLM judge to determine which outputs are good and bad, it's easy to throw in an extra term: hey, that weird thing it keeps doing — if it does that, that's bad, give it a low reward. We've done this with a bunch of customers. Reward hacking does happen, but you see it, you adjust your reward prompt, and it goes away.

What's one thing from YC that guided you through your entrepreneurship journey, and what's one thing you find you disagree with YC on?

Oh, that's a good question. One thing that I really identify with, and that I've tried to do a good job of, is what they phrase as "hold your problem tightly and your solution loosely." That's what you did. Yeah — spend a lot of time thinking about what problem people are trying to solve, and don't be too bought into the way you're solving it today. I think that's super important; it's very easy to get that balance wrong if you're not thinking about it consciously. Something I disagree with — that's a good question. There are lots of things I disagree with, but I don't have it cached in that direction in my brain. I've definitely disagreed with lots of specific pieces of advice, but I don't have a great answer right now.

I'll bridge it for you in case something comes up. Sam Altman's line is that everything he said as president of YC was wrong for OpenAI. Right? Like, do B2B — they ended up doing B2C. Ship products often — they ended up in stealth for three years.

Yeah. Actually, I think that second one does resonate with me a lot. We have tried to ship really quickly and just follow the gradient of the market. If I do another startup — and maybe this is just me being beaten up by the market too much — I think at some points I probably would have done better to be heads-down and execute on my vision for longer, and go for the more ambitious thing.
But that would take longer to prove value, which is definitely not the YC way. Though I think if you have a good vision and good taste, that can work quite well.

Yeah — we'll see what that is whenever it comes out. Thanks for your time; this was a great overview of everything.

This has been a super fun conversation. Thanks to both of you. Awesome.
