The Startup Powering The Data Behind AGI
Gradient Dissent
What You'll Learn
- Surge was founded in 2020 to address the challenges of obtaining high-quality data for training AI and ML models, which Chen experienced firsthand working at companies like Twitter.
- Surge's approach is to provide more sophisticated, higher-quality data solutions, rather than focusing on commodity-style data labeling that many competitors offer.
- The company has achieved over $1 billion in annual revenue in just 5 years, without any venture funding, by focusing on customers who value quality data over scale.
- Scaling a data labeling business requires significant investment in technology and quality control processes, not just hiring more human workers.
- Many competitors in the data labeling space still rely on very manual, spreadsheet-based processes, with little focus on improving quality and efficiency through technology.
AI Summary
The podcast discusses Surge, a startup founded by Edwin Chen in 2020 that provides high-quality data collection and labeling services for AI and machine learning models. Chen shares his experience working at tech companies where he faced challenges in obtaining the necessary data to train models, leading him to start Surge. The company has grown rapidly, reaching over $1 billion in annual revenue without any venture funding, by focusing on providing sophisticated, high-quality data solutions for engineers and researchers, rather than commodity-style data labeling.
Topics Discussed
- Data collection and labeling for AI/ML
- Startup growth and funding
- Quality control and technology in data services
- Challenges in training AI models
Frequently Asked Questions
What is "The Startup Powering The Data Behind AGI" about?
The podcast discusses Surge, a startup founded by Edwin Chen in 2020 that provides high-quality data collection and labeling services for AI and machine learning models. Chen shares his experience working at tech companies where he faced challenges in obtaining the necessary data to train models, leading him to start Surge. The company has grown rapidly, reaching over $1 billion in annual revenue without any venture funding, by focusing on providing sophisticated, high-quality data solutions for engineers and researchers, rather than commodity-style data labeling.
What topics are discussed in this episode?
This episode covers the following topics: Data collection and labeling for AI/ML, Startup growth and funding, Quality control and technology in data services, Challenges in training AI models.
Who should listen to this episode?
This episode is recommended for anyone interested in Data collection and labeling for AI/ML, Startup growth and funding, Quality control and technology in data services, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
In this episode of Gradient Dissent, Lukas Biewald talks with the CEO & founder of Surge AI, the billion-dollar company quietly powering the next generation of frontier LLMs. They discuss Surge's origin story, why traditional data labeling is broken, and how their research-focused approach is reshaping how models are trained. You'll hear why inter-annotator agreement fails in high-complexity tasks like poetry and math, why synthetic data is often overrated, and how Surge builds rich RL environments to stress-test agentic reasoning. They also go deep on what kinds of data will be critical to future progress in AI, from scientific discovery to multimodal reasoning and personalized alignment. It's a rare, behind-the-scenes look into the world of high-quality data generation at scale, straight from the team most frontier labs trust to get it right.

Timestamps:
00:00 – Intro: Who is Edwin Chen?
03:40 – The problem with early data labeling systems
06:20 – Search ranking, clickbait, and product principles
10:05 – Why Surge focused on high-skill, high-quality labeling
13:50 – From Craigslist workers to a billion-dollar business
16:40 – Scaling without funding and avoiding Silicon Valley status games
21:15 – Why most human data platforms lack real tech
25:05 – Detecting cheaters, liars, and low-quality labelers
28:30 – Why inter-annotator agreement is a flawed metric
32:15 – What makes a great poem? Not checkboxes
36:40 – Measuring subjective quality rigorously
40:00 – What types of data are becoming more important
44:15 – Scientific collaboration and frontier research data
47:00 – Multimodal data, Argentinian coding, and hyper-specificity
50:10 – What's wrong with LMSYS and benchmark hacking
53:20 – Personalization and taste in model behavior
56:00 – Synthetic data vs. high-quality human data

Follow Weights & Biases:
https://twitter.com/weights_biases
https://www.linkedin.com/company/wandb
Full Transcript
You're listening to Gradient Dissent, a show about making machine learning work in the real world. And I'm your host, Lukas Biewald. Today, I'm talking with Edwin Chen, who's the CEO of Surge. I was really looking forward to talking to Edwin for a long time for a number of reasons. I mean, one thing is that Edwin has built an incredibly valuable business with no venture backing in a really short amount of time. So I think he started this data collection business in 2020. In 2024, he crossed a billion dollars in revenue, which is just spectacular success. But that's probably not even the top reason I wanted to talk to him. The human data collection business is this really important part of building AI systems that's not talked about enough. I actually got into that business in 2006, 2007. I started a company called CrowdFlower that did this data collection in the early days. And around the time I sold CrowdFlower is around when Edwin started Surge. And the data collection business has changed a lot over the years. But one constant has been that it's a huge spend that people building high quality models do. So Edwin has these front row seats into what most of the foundation labs, most of the foundation model builders, what they're doing and how they're thinking and how they're building their models. So he has a lot of insights. He doesn't do a lot of podcasts. So we were really lucky to get him. I hope you enjoy this conversation. So, you know, you're the first guest we've had in the human data generation space. And it's, you know, it's a space, obviously, you know, I was in for a long time. So I'm really curious about this. Can you first of all start by telling the story of Surge, what you were thinking when you started it and how it's gone? Yeah, I can give the founding story. So basically, I used to be an ML engineer at a bunch of these big companies. And the problem I kept running into was that we just kept facing all of these issues getting the data that we needed to train our models. So for example, I used to work on our search and ad systems at Twitter. And one of the first things I wanted to do was build a sentiment classifier. So, you know, sentiment analysis is a super simple problem. And all we needed was 10,000 tweets to train our models. But our human data system at the time was literally two people we'd hired off of Craigslist, working 9 to 5. So we had to wait a month to get started. And then we had to wait another month for them to label the tweets in a spreadsheet because our tools were just terrible. And when we finally got the data back, we were just seeing that it was complete junk. They didn't understand slang, like, you know, "she's such a bad bitch." So they were labeling this negative. And they didn't understand hashtags and all these other aspects of tweets. And so what ended up happening was I just spent a week myself labeling all these tweets because that was actually so much faster and better. And I think one of the things that we often said was that these were really simple things. Like, yeah, at the end of the day, sentiment analysis isn't all that complicated. And at the same time, we just had this bigger problem we were trying to solve around how we wanted to optimize our ML systems for the right objectives. Like when I first started working on Twitter, this was the old days when it was purely a chronological timeline. And so one of the things we wanted to do was make it easier for users to discover who they'd actually care about.
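As an aside for readers who want to picture the kind of task Edwin describes, here is a minimal, purely illustrative sketch of a tweet sentiment classifier in the style of that era: a small set of labeled tweets feeding a bag-of-words model. The tweets, labels, and scale are stand-ins, not Surge's or Twitter's actual pipeline.

```python
# Minimal sketch of the kind of sentiment-classification task described above:
# a small set of labeled tweets feeding a simple classifier. The example tweets
# and labels here are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# In the story, roughly 10,000 labeled tweets; here, a toy stand-in.
tweets = [
    "she's such a bad bitch",                 # slang: positive, often mislabeled as negative
    "ugh, my flight got cancelled again",
    "loving the new update #awesome",
    "this is the worst customer service ever",
]
labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words features plus logistic regression: roughly what a simple
# sentiment model of that period looked like.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(tweets, labels)

print(model.predict(["what a great thread #blessed"]))
```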
And so the question was, how do we train our recommendation algorithms? And the obvious choice was clicks and retweets at the time. Like, you just train your algorithms to produce as many clicks and retweets as possible. But the problem is, we tried doing these things and it turned out to be this incredibly negative feedback loop. Like, once you optimize for clicks, you get the most click-baity content in the world rising to the top. You get lots of racy content, you get lots of bikinis, you get lots of listicles about 10 horrifying skin diseases, and so on. And so we wanted to train our models on all these deeper principles instead, where we'd ask human raters to label tweets and recommendations according to our product principles. But if we couldn't even get simple sentiment analysis right, we definitely couldn't get this more complex data at the quality and scale we needed. So yeah, this was a problem that happened over and over again at Google and Facebook too. And so eventually I just realized it's just something I needed to go out and build myself. All right. So what year did you start Surge? So we started Surge in 2020, right in the middle of a pandemic. Oh, nice. And so I feel like 2020, I guess, you know, there were some established data labeling, data generation companies at that time. Did you have a particular take on the space? Yeah. So our take on the space was that all of these solutions out there, they were basically focused on this idea of very commodity labeling, very low skill. Like the example I often give is: take the problem of drawing a bounding box around a car. Like, yeah, you and I, we can all draw bounding boxes around cars. Almost like a three-year-old can draw a bounding box around a car. The bounding box that I draw isn't going to be any different from the bounding box that Terence Tao draws or Einstein would draw. It's like a very, very low ceiling on the complexity of data that is required. In contrast, if you think about all the things that we want to do today, like, yeah, we want models that can write poems. We want models that can solve relativistic equations. There's almost like an unlimited amount of intelligence that we want to feed our models. And so all the other solutions at the time, yeah, they were designed for a very low-skill, commodity style of labor. And so it was not focused on quality at all. Instead, it was focused on scale. I see. And so how did you source the people in the beginning?
So literally the first few people that were on our platform, they'd actually just been... it's kind of funny, but I've been working on this problem for a long time, and so I already had a network of people who were really interested in doing this kind of work. And so when they heard that I started Surge, I mean, a lot of it was also me in the beginning, but it was also, you know, all these labelers that I'd somehow accumulated throughout the years. That's cool. And then who were the early customers? So the early customers were a lot of tech companies. So we had this idea that we wanted to really, really focus on engineers and research scientists who really, really understood the quality of data. And so, again, I'd kind of been working in this space for a while, and so there were just a lot of friends or contacts I had at all these companies who had been dying for this kind of higher quality, kind of next generation solution, where we could do far more advanced tasks than what was possible at the time. And so, yeah, there were a lot of companies like, I think, Airbnb, Twitter, a lot of startups in the search and algorithms space. And at that time, maybe 2020, I mean, that was actually right around when I was leaving the space. I remember it seemed like what was really taking off then was sort of autonomy and, you know, robotics, a lot of vision applications. But it sounds like you were more focused on text. Is that fair? Yeah. So we were always focused on language and what I call behavior from the beginning. I wouldn't say it was necessarily text per se. Like, there are a lot of complex problems in the image space, too, especially nowadays. But what we didn't want to focus on, so what we've never done, is very, very simple image labeling tasks or very simple bounding-box-style tasks, where I just don't really think that there's any intelligence really needed to build such solutions. And so we always focused on this higher complexity, higher skill, higher sophistication space. Well, I would say, I think the image labeling, when you actually dig into it, is more complicated, you know, for the record, but I take your point. I guess, it seemed like you really built Surge under the radar. Was that an intentional decision? I mean, I think you might have said recently that you're over a billion dollars in yearly revenue. Is that right? Can we say that? Yeah, yeah. I mean, we were over a billion last year, and we hit our billion-dollar ARR number a while ago. I mean, that's like an incredible achievement, you know, in just five years. And you did it with no outside funding. Is that right? Yeah. That must be... that's a historic level of growth without funding, I think. I mean, did you sort of intentionally avoid the VC path? So I think one of the things that was really important for us was we wanted customers to be buying us because they really, really believe in having high quality data and not because they saw us mentioned in some TechCrunch article. We really wanted partners who had the same vision we did and who could go out and show the world how important data actually was.
But again, if you contrast with the kinds of data people were looking for five, ten years ago, people, again, really weren't focused on quality. They treated data as this kind of commodity where, like, the researchers themselves would barely look at the data. They would just let their vendor management teams kind of handle the entire process because they just wanted to outsource it. And so, yeah, we just had this idea where we really, really wanted to focus on people who believed in quality and were buying us for that reason. But so, I mean, to get to a billion dollars plus in revenue, you know, you've obviously had to scale your system. So that goes beyond hiring a few people that you've known for a long time. So can you talk about how you've scaled your processes as your scope has expanded? Do you think it's more important to scale the technology that's enabling this? Or is it internal human processes? Or is it all in the hiring? Or is it in the management of what's going on? As much as you can say about how this works, I'd love to hear it. Yep. Yeah, so I think the thing that people underestimate in this space is how much technology you actually want to build. People tend to think that humans are smart, and so if you just throw 10,000 or 100,000 humans at a problem, that will solve it. It's a little bit crazy to me. Sometimes I'll interview candidates from some of the competitors in our space, and when they describe to me what they're building and how they operate, it's just incredibly, incredibly manual. Like, they are kind of just body shops at the end of the day, and they literally have no technology. Like, if you asked them, could you tell me the quality of this worker? Could you tell me how good this worker is at this bigger task? Could you show me a dashboard with how your quality is improving over time, and what A/B tests you run, and what algorithms you're building to improve it? They literally can't, because all they're doing a lot of times, even some of these so-called technology companies, is literally just dumping data into spreadsheets, and their employees, their engineers, are literally either creating the data themselves or reviewing it themselves. And so there's almost no technology that's going on. And so I think what you have to realize about this space is that quality control actually is incredibly difficult. And so if you want to get the highest quality out there, you need to find, or you need to create, a lot of these really sophisticated algorithms to detect the highest quality data that you can. And it's incredibly complicated, because in this space that we have with LLMs today, you really want LLMs to be good at every task in the world. So it's not just a single domain. It's making sure they're really good at poetry, and at the other extreme, you really want to make sure that they're really good at physics. And so how do you find the highest quality data in order to train the models, and then how do you also remove the worst of the worst? And so, yeah, you kind of can't just throw warm bodies at it. You really need to build a lot of technology to manage it. Look, I mean, you know, I'm kind of coming from a different era of this, you know, labeling, where a lot of labels were simple, you know, kind of like, you know, yes, no, or, you know, kind of simple tasks where it's really clear, you know, what's right and what's wrong.
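One very basic version of the quality-control idea Edwin alludes to, scoring each labeler against "gold" items with known answers and flagging the outliers, can be sketched as follows. This is a hedged illustration only; the worker IDs, judgments, and 0.7 threshold are invented, and real systems layer far more sophisticated signals on top.

```python
# Hedged sketch of one basic quality-control signal: compare each labeler's
# answers against gold items with known answers and flag likely low-quality
# workers. All data here is invented.
from collections import defaultdict

# (worker_id, item_id, label) judgments; gold maps item_id -> correct label.
judgments = [
    ("w1", "t1", "positive"), ("w1", "t2", "negative"), ("w1", "t3", "negative"),
    ("w2", "t1", "negative"), ("w2", "t2", "negative"), ("w2", "t3", "positive"),
]
gold = {"t1": "positive", "t2": "negative", "t3": "negative"}

correct = defaultdict(int)
seen = defaultdict(int)
for worker, item, label in judgments:
    if item in gold:
        seen[worker] += 1
        correct[worker] += int(label == gold[item])

for worker in seen:
    accuracy = correct[worker] / seen[worker]
    status = "ok" if accuracy >= 0.7 else "flag for review"
    print(f"{worker}: gold accuracy {accuracy:.0%} -> {status}")
```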
But I think, you know, you're doing much more complicated tasks. And like you're saying, I think, you know, even getting two people to agree on high quality and low quality as the task gets more complicated is, I think, more and more difficult. And so can you talk a little bit about what technology you actually have to build to manage quality for, like, a generic, you know, complicated expert task involving language? So let me start actually by contrasting with a very old school take on data and data quality, and how we just think about the problem differently. So, yeah, let me give you an example. So let's say you wanted to train a model to write an eight-line poem about the moon. And the way most companies think about it is, okay, well, let's just hire a bunch of people from Craigslist or through some recruiting agency, let's ask them to write poems. And the way to think about quality, again, going back to the image annotation days, it's like, okay, is this a poem? Is it eight lines? Does it contain the word moon? They just check all these boxes in the same way you might check, okay, this is a cat, this is a dog. They're just checking boxes, and they're saying, sure, this is a great poem because it follows all the instructions. It follows all these checkboxes. Then what happens is, okay, you get these terrible poems that feel like they're written by kids in high school. Right? Like, a kid in high school, sure, they can write an eight-line poem about the moon, but is it a great poem? Is it an evocative poem? Is it the type of poem that a Nobel Prize laureate would write? No. And so checking boxes doesn't work. And so some other companies might be like, okay, sure, these people on Craigslist don't have any poetry experience, and so what I'm going to do instead is hire a bunch of people with PhDs in English literature. But what they don't realize is this is actually also terrible. Like, a lot of PhDs, again, going back to what I was saying about even people with MIT CS degrees, a lot of PhDs, they're not good writers or poets. Like, you think of people like Hemingway or Emily Dickinson. They definitely didn't have a PhD. I don't think they even completed college. And so I think we think about quality completely differently. Like, what we want isn't poetry that checks some boxes and uses complicated language. Again, we want the type of poetry that Nobel Prize laureates would write. And so what I think you need to do is, like, you need a mindset shift. Like, one of the things we often think about is that there are certain people who are trained up in this very objective domain, like this domain of computer vision, in a sense that lacks all of these nuances and lacks all this inherent subjectivity.
And so what we want to do instead is recognize that poetry is actually really subjective and rich. Like, maybe one poem is a haiku about moonlight on water, and another poem is something with internal rhyme and meter, and another one is somehow focusing on the emotions behind the moon rising at night. And so you actually want to capture that there are a thousand ways to write a poem about the moon, in a way that there aren't a thousand ways to draw a bounding box around a car, or a thousand ways to label something as a cat or a dog. Like, there isn't a single correct way to write this poem, and each different way that you write it, or each different preference that you have for different types of poetry, should give you different insights into language and the mind and imagery and human expression. And I talk about poetry a lot, but it's not just poetry. Like, if you think about math as well, there are a thousand ways to prove the Pythagorean theorem, and again, each way you're proving the Pythagorean theorem, it's kind of based off of different insights into the mathematical reality of the universe. And so one thing that happens is that when you think about quality the wrong way, you get kind of commodity data that optimizes for things like inter-rater agreement. So again, going back to the computer vision world, like, sure, if all you're doing is labeling images of cats and dogs, you want high inter-rater agreement. You want people to agree that this is a cat, and you want people to agree that this is a dog. One of the very, very easy ways that you do quality control in this old school world is just by seeing whether there's agreement with a majority. But in this new gen-AI world, again, there's no way that you can ask people to write 1,000 poems and then take a majority vote and get the best poem. What actually happens when you optimize for high inter-rater agreement is you get the lowest common denominator of things that people want, which actually turns out to be kind of trashy and low-quality and unengaging a lot of the time. And so you just need to think about quality in a different way in order for your data to really, really embrace human intelligence and creativity. I mean, one of the things that we found when I was working in this space is that maybe the biggest issue was actually eliciting, you know, from a customer what they really wanted. Like, I think, you know, and obviously we're doing simple examples, but you know, you think about putting a bounding box around a car. It seems so simple to an ML researcher, especially one who hasn't looked at a lot of specific data examples. It is really hard. I think you're one of the rare ML researchers that really wanted to look at data. I was too, and that's why I got into the space. But you think, okay, putting a bounding box around the car is really simple, but okay, what if the car is occluded? Then do you put it around where you think the car is? What if it's a reflection of a car in a mirror? Do you still want the bounding box around that car? What if it's a billboard with a picture of a car and it's not a real car? So even in the simplest cases, as soon as you start to look at real world data, it gets way more complicated. I remember working with customers, there would actually be a long process of eliciting from the customer what they wanted. Some customers would write these giant documents trying to enumerate every case and exactly what they want.
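To make the contrast concrete, here is a small sketch of the "old school" quality signals discussed above: majority vote and raw pairwise inter-rater agreement on a categorical cat-versus-dog task. The rater names and labels are made up. The point of the example is that this machinery has no analogue for open-ended work; you cannot take a majority vote over a thousand different poems.

```python
# Sketch of old-school quality signals: majority vote and average pairwise
# inter-rater agreement on categorical labels. Data is invented.
from collections import Counter
from itertools import combinations

# Three raters labeling the same five images.
ratings = {
    "r1": ["cat", "dog", "cat", "dog", "cat"],
    "r2": ["cat", "dog", "cat", "cat", "cat"],
    "r3": ["cat", "dog", "dog", "dog", "cat"],
}

# Majority vote per item.
items = list(zip(*ratings.values()))
majority = [Counter(votes).most_common(1)[0][0] for votes in items]
print("majority labels:", majority)

# Average pairwise percent agreement across raters.
pairs = list(combinations(ratings.values(), 2))
agreement = sum(
    sum(a == b for a, b in zip(x, y)) / len(x) for x, y in pairs
) / len(pairs)
print(f"avg pairwise agreement: {agreement:.0%}")
```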
But I think those, too, kind of end up, you know, being hard to reason about, exactly, you know, what you want. So if you followed those instructions to the letter, you'd end up in some ridiculous cases, which isn't actually really what, you know, the person wanted in the first place. I'm kind of curious how you think about that problem. What we always try to do inside is try to understand the goal or the principle that the researcher has. And so when we understand the goal or principle, or how they're going to use the data, then we can almost put ourselves in their shoes, and instead of asking them, okay, what happens when this car is occluded? What happens when this car is in a reflection? We just think about, okay, deriving from first principles, given what they told us, what do we think that the user, what do we think that the researcher wanted? Like, I mean, this is where, I don't know what I could envision, but okay, maybe if you know that the data is going to be used for some LiDAR model, then you would just know that this makes sense and this doesn't. And so having that higher level goal, instead of a mechanical list of instructions, that is kind of always what we aim for. And it is hard, because, you know, at the end sometimes researchers don't know either. But this is where I think we as a company, we often also try to have a strong opinion on what the best type of data is, as opposed to almost blindly accepting whatever people tell us. A lot of times we actually do like to get into a feisty debate about what type of data would be most useful to build their models. So then do you hire a lot of ML researchers? Are those the people that you want working directly with the customer? Yeah, we have a lot of ML researchers. I think one of the things that we often think about is we almost consider ourselves a research company, just that instead of researching, like, LLMs in the way that, you know, other frontier labs might, we're more about researching the data. Totally. And has it been, I mean, I really admire, you know, how you've kind of avoided, you know, status games and flown under the radar, but I would think that in doing that, it might make it harder to hire. Like, do you have a different strategy for hiring, or do you think that maybe it doesn't matter? You find the people that really connect with your mission. Yeah, so I think one of the things that we often think about is we don't want people who are just joining us in order to notch another brand on their resume. Like, sure, there are a lot of people out there who do that, and we're missing out on those candidates. And I think they also tend to be people who want to build large teams, they tend to be people who want to build empires, and it almost turns into hiring for the sake of hiring. Where, okay, so, yeah, why do you want to hire this person? Well, I needed to hire this person because the other person doing this job is spending all their time interviewing. And so why is that person spending all their time interviewing? Well, it's because somebody told them that they needed to hire an internal tooling team. And so why did that person need to hire an internal tooling team? Well, just in order to make the engineers more productive. Well, why are the engineers not productive enough? Because they're spending all their time in meetings.
Why do they spend all their time in meetings? It's because we hired all these people and they need to communicate with them. And so I think there are a lot of benefits if you start from having people who really, really believe in your mission. And then, at least for us, we've been able to stay smaller, with a much smaller team. How big is your team? So we're a little over 100 people. Wow, that's an incredible revenue per employee. Congratulations. I'm curious, you've gone through this really fast change, I think, from being an ML researcher, essentially, to running a very significant, large company. Where have you felt stretched the most? What's been challenging? So the thing that I found most challenging is sales. I mean, probably any researcher will tell you that. Just the concept of having to go out and hawk your product is a little foreign to me. So I think it is, I think we're lucky in a sense that our product actually, at the end of the day, it is for researchers. And so kind of having that research mindset, like, at the end of the day, what we're trying to build, it's almost like that Disney quote. You know, we don't make movies to make money. We make money in order to make movies. So in some way, what we're doing is we're not trying to generate revenue. We're trying to generate data that will help AGI. Like, one thing that we'll often do is, sometimes when customers or when new companies come to us and they ask us if they can work with us, and if their goal is kind of just unaligned with AGI, then we'll actually just say no to them, because we don't want the revenue, and we want to focus on the AGI companies in a sense. So I think, again, going back to not raising, the fact that we don't have an external board, the fact that we don't have VCs who are just dying to make as much money as possible, I think that gives us sort of freedom that allows us to focus on the most important problems, which has allowed us to kind of maintain that research focus. Totally. So you only work with companies focused on building AGI? Yeah, so, like, for example, if a company came to us and they were saying, yeah, we just want to train, I don't know, let's say I'm a newspaper and I just want to train, I don't know, like a category classifier. Yeah, we'd just say no. What if I want to make, like, an automatic video generator? Would that be in your wheelhouse, or is that too...? Oh, yeah, I mean, we would in that sense, just because building such video generators is kind of part of building AGI. So, yeah, we'd do that. Look, I mean, I imagine, you know, five years ago the data being collected was pretty different than, you know, the data collected now. Like, I would think, well, tell me if it's true, but I would think some of the tasks that you'd be doing five years ago would be easily automated by LLMs today. It's just been such an astonishing pace of improvement. Can you talk about how the types of tasks have changed over the last few years? Yeah. So, some of the types of work that we do are very, very different.
So, when we first started, a lot of our work was in tasks like search evaluation or content moderation, whereas today it's almost purely LLM work. So that's one big difference. And then even within LLMs, there has been this obvious trend towards higher complexity, higher sophistication, higher expertise. So I can give a couple examples. So, for example, there's been a big increase in multimodal complexity. So a few years ago it was all text data, just conversational text assistants, but now we do a lot of work with images and audio and video. And I think the interesting thing is, you actually want the models to understand all of these modalities simultaneously. Like, one of the things you might want to do is, okay, I'm taking a video of something on my phone, and now I'm asking the model to create a program based off the video on my phone that simulates this in real life. So, yeah, there's been a lot of increasing multimodal complexity. There's also been a big expansion in languages. Like, you know, at first people were naturally focused on English-only work, but we actually work in over 50 languages now. And I think what's also interesting is it's very hyper-specialized. We support coding in Argentinian Spanish. And we support legal and financial expertise in Bolivia. And it's because, even today, I think a lot of the models, they're actually just not that good, surprisingly. Like, they're not that good at the different nuances of different languages or different dialects or different cultures. And so I think there's actually still a lot more progress to be made there. And then probably the biggest shift is just the depth of expertise that a lot of the work requires. Like, yeah, you see the models winning IMO gold medals now, and you see them doing all these incredibly advanced tasks. And so you actually really, really need serious thinking power behind this. And so even today, some of the tasks that we do, they involve spending days or even weeks solving these really, really interesting problems. And so it's a very, very far cry from, you know, tasks five, ten years ago where you might spend five seconds labeling a task. So do you actually have people that, like, can solve, like, Olympiad-level math problems, like creating those problems and then solving them? Yeah, yeah. It's funny, this is kind of an aside, but I've been sort of surprised at the scores the latest models are getting on these Olympiad problems. When I put in kind of more fun, you know, brain teaser problems that my friends pass around, they often can't do them. Do you have a sense for why there's that disconnect? Yeah. So I think a big problem with a lot of the frontier models today is that they've basically been benchmark hacked. So you have all these benchmarks out there. And a lot of them just aren't really good. They're like overly academic or they're overly synthetic. And a lot of these benchmarks, they have a single, going back to my point earlier, they have a single objective answer.
And so the models have been narrowly constrained to be good at these very narrow, objective problems, when, yeah, like, a math problem in the real world, or the problem that you would ask as a frontier mathematician, they're not going to be closed-ended problems. They're going to be open-ended explorations. So, yeah, I think a lot of it stems from this kind of benchmark hacking that's going on. So if you were going to build a benchmark today to compare frontier models, what kinds of things would it include? So what we always say is that the gold standard for evaluating models really is human evaluations, where you just can't fully automate it. Like, people try to build these leaderboards where they take kind of automatic verifiers, but automatic verifiers, they still kind of only work well in these very objective domains. Or they've tried building benchmarks like LMSYS, which I think is an absolutely terrible leaderboard that has basically, I think, set the industry back by at least a year. So I can go into that in more detail, but I think a lot of the benchmarks out there, they are flawed, either because they're built using low quality data, or they are just overly synthetic, overly academic, overly objective in a way that the real world isn't. Okay, well, I mean, now I want to hear this. So why do you think LMSYS, you're saying that benchmark is not just bad, but it set the industry back. Can you tell me more about that? Yeah. So basically LMSYS is this, if you don't know what it is, it's like this popular leaderboard of LLMs. And what happens is that the way it works is people go onto it, like literally anybody around the world. Like, I mean, what we often hear is that it's literally high schoolers, middle schoolers who can't access LLMs any other way, and they're going onto the LMSYS arena as their only mechanism. And so what happens is they go onto this chatbot arena, they'll enter a prompt, and then they see two model responses. And then they vote on which one's better. But if you think about it, they're not taking the time to read or evaluate these model responses at all. They're literally just looking at responses for, like, two seconds and just picking whichever one catches their fancy. And so the models could have made everything up. They could have completely hallucinated everything. They could have not followed the instructions at all. And they'll just vote on it because, okay, in order to progress, they need to vote on something. And, oh yeah, this model has emojis and it has a lot of bold formatting, so it just looks really impressive. Like, one of the things that we've learned is that the easiest way to improve in this arena is simply to make your model responses a lot longer and to, like, double the number of emojis that you have. And if you actually think about some of the models that have been released in the past year, like, think about Llama. Like, yeah, they, or at least the version of Llama 4 that was optimized for LMSYS, like, if you actually looked at it, it exactly matched all of these patterns.
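For readers unfamiliar with how quick side-by-side votes become a ranking, here is a generic Elo-style sketch of the mechanism behind arena-style leaderboards. This is not LMSYS's actual implementation, and the model names and vote data are invented; it just shows how noisy two-second preferences still move ratings.

```python
# Generic sketch: pairwise "which response is better?" votes turned into a
# leaderboard via Elo-style updates. Not any particular arena's real code.
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# Each vote is (winner, loser) from a quick side-by-side comparison.
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]

for winner, loser in votes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], a_won=True)

# Longer, emoji-heavy answers that win quick votes climb, regardless of accuracy.
print(ratings)
```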
And so one of the phenomena that we often hear about from researchers is they will tell us, my VP has told me I'm only going to get promoted if we advance our model by 10 points on the leaderboard. And they will look at the leaderboard data themselves, and they'll see that, yeah, a lot of the responses that are preferred in the LMSYS datasets, literally the model that has followed instructions worse, literally the model that has hallucinated everything, these are the responses that are getting preferred, just because they have a lot of formatting and emojis. And so what these researchers tell us is, yeah, I want to work on improving fundamental capabilities in my model. Like, I want to work on improving them in coding, I want to work on fixing their hallucination issues. But instead, if the only way I'm going to get promoted is by optimizing for this leaderboard, and the easiest way to optimize for this leaderboard is to make my model hallucinate so that it just generates these crazy answers that, even though they're completely wrong, are just very, very compelling to kind of amateur raters who are only spending two seconds, then, yeah, that's what they're going to do. And so we've often actually seen models, like, if you look at some of the models that are at top spots on the leaderboard today and compare them to how they were performing six months ago, you actually see that they're actually worse in many ways. Do you think, so, like, another thing that kind of strikes me is, like, you know, we have these different models made by totally different organizations. Like, you know, you have X, you know, run by Elon and making Grok. And, you know, you look at Anthropic and OpenAI. These clearly have different cultures. And we've had different folks from these organizations on the podcast. And yet the models seem to have this kind of surprisingly consistent tone to me. They sort of, like, you know, they feel a little bit like annoyingly positive, like a little bit, you know, Boy Scout-y, like, yes, thank you, I will answer your question. And I sort of imagine that that must somehow be trained in at some point in the process. And I wonder if it's in, like, the pre-processing. Like, you know, is that just sort of what high quality data on the web, you know, looks like? Or is it sort of intentionally inserted later in the process? Is somehow the annotation that you're doing contributing to that sort of consistent... I feel like we'll look back on it as almost, like, this LLM tone of 2025. Yep. I mean, I think it's kind of a combination of everything, in that, like, for example, we actually don't teach the models explicitly to... We generally don't teach the models to follow proper grammar. Like, we generally don't teach the models to use em-dashes. But it's kind of just there, maybe in pre-training, where these kinds of behaviors are baked in. I think in the future, the models will become more and more differentiated, where you will be able to tell that they have a noticeable difference in style. Like, actually, even today, I personally, maybe because I've looked at this data enough, I can actually just tell from reading the model responses themselves which model is which.
Like, some of them will just have certain prefaces, some of them will just use certain words or certain stylistic patterns that I can actually generally tell. Interesting. Can you think of any tells? Is it more subconscious, or could you cite some specific things that let you know it's a particular model? Some of them will just use certain phrases like "absolutely" more often than others. Some of them will use emojis in a way that others wouldn't. Llama, so many emojis. I totally agree with that. Some of them will do a thing where they just repeat three adjectives in a row. You know, some of them use a lot more markdown than others. When you look at the way that these different organizations collect data, are there, like, striking differences, or does it feel like everyone's sort of chasing the same types of data? I think there actually are very, very striking differences between the different companies. Like, every company almost has their own philosophy on the right way to both train and evaluate their models. So, yeah, I think it's actually been surprising how different they are. Do you have, like, a point of view on how you think it should be done? Like, if you're running an AGI company, I guess, what would you do? Is there something you'd do differently than what you are asked to do? We had this very strong view that RLHF data was a lot more effective than SFT data. Maybe you should just explain the difference here. Yep, so the general answer: the way SFT data works, just to give an example, would be... And this is fine-tuning data, supervised fine-tuning? Yes, so SFT stands for supervised fine-tuning. And so the way it works is, okay, let's suppose I wanted to train my LLM to become better at poetry. The way it would work would be, okay, you as a rater would write a prompt. So the prompt might be, yeah, write a Shakespearean sonnet about cheese. And then you would just literally write a Shakespearean sonnet about cheese, and that would be a demonstration to the model. So, yeah, that's essentially supervised fine-tuning. In contrast, in RLHF, so RLHF stands for reinforcement learning from human feedback, the way it works is you write a prompt. So you might take the same prompt, write a Shakespearean sonnet about cheese, and then you ask the model to generate two responses. So these might be two different models, like two different model checkpoints, but model A would spit out an answer and model B would spit out an answer. And then what you would do is you would just rate, in some sense, which one is better. Like, okay, model A is better because, I don't know, it had emotion behind it, because it was better written, because it actually followed the format of a sonnet. And so then you're basically teaching the model, okay, A is better than B. And so the benefit of this is that, I mean, there are many benefits, but just to enumerate some of them: one, it's a lot more efficient. Oh yeah, like, even if you are a Nobel Prize laureate, it takes you a lot of time to write a poem about cheese. Second, it helps teach the model all these latent preferences, and it helps teach the model what's good but also what's bad. And, yeah, so basically we had this very, very strong view that, I mean, SFT is important sometimes. Like, one of the things we often tell customers is that you often want to use a little bit of SFT data in order to more efficiently
bootstrap your models into a phase where RLHF is useful. But we just had this view that RLHF was a lot more effective, and so we actually kind of steered our customers in that direction. Like, if you actually read the Llama 2 paper, you'll see that one of the discoveries that they described is, yeah, actually at the beginning all the researchers thought that SFT data would be more effective as well, but then they ran a bunch of experiments and they actually found that RLHF was just so much more effective. Yeah, so that was one of our beliefs in the past. You know, when you look into the future, like, you roll things forward a few years, what kinds of data do you think you're collecting now that you won't be collecting, and what new kinds of data do you think you'll be collecting? Yeah, so I actually don't think that we're going to stop collecting any of the data that we're still collecting right now. Like, one of the things that we've often found is that, I mean, the portions may change, just in the same way that, sure, as you become a more sophisticated adult, you don't need to be trained as much on arithmetic the way you did when you were a little kid. But what we often found is that if you don't persist with at least some amount of this data, models often just regress in very surprising ways. Like, I won't name names here, but some of the frontier models, they just make the most bizarre mistakes these days. And I think it's just kind of a little bit wild, and that's just because the data mixtures have been tuned in very certain ways. But I'll say some of the things that we expect more and more, some of the ways we expect the data trends to continue, are in ways that they haven't before. So I'll just list them out. So one is even higher expertise problems than we have today. So, like, right now there's a lot of work being done in kind of, I would call it, graduate-level STEM work. But if you actually want the models to make new scientific discoveries, you're going to need to progress beyond, you know, what even the average PhD can do. So I think there'll be kind of super, super STEM work that will happen in the future. So that's one of them. Another will be... But wait, can I ask, so, like, super STEM work, would that be, like, literally, I'm going to go out and discover some new, like, principle of chemistry just to train a model? Like, how would that even manifest? Oh, yeah, I mean, I think it actually is exactly that. Like, imagine that you are a Stanford professor and you're literally working on, I don't know, your latest frontier research. What you want to be doing is, yeah, you want to be collaborating with an AI to generate new hypotheses, to help you test your experiments. And so it's this idea of a scientific collaborator, in the same way that we have a coding collaborator right now in the form of Claude Code and whatnot. So I think there will be things like that that are going to be increasing. But how would you collect the data? Would you get a Stanford professor and have them just score responses? I guess scoring responses would be easier than generating responses. Yeah, I mean, it wouldn't necessarily be... So there are a lot of varieties, different types of data that we collect nowadays, so it's not just scoring.
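To make the SFT-versus-RLHF distinction from a few exchanges back concrete, here is a sketch of the two data shapes as simple records. The field names and example text are illustrative rather than any particular lab's schema.

```python
# Sketch of the two data shapes contrasted above. Field names are illustrative.

# SFT (supervised fine-tuning): a rater writes the ideal completion themselves.
sft_example = {
    "prompt": "Write a Shakespearean sonnet about cheese.",
    "completion": "Shall I compare thee to a wheel of brie? ...",  # human-written demonstration
}

# RLHF preference data: the models generate, the rater only judges which is better.
preference_example = {
    "prompt": "Write a Shakespearean sonnet about cheese.",
    "chosen": "Shall I compare thee to a wheel of brie? ...",      # response from checkpoint A
    "rejected": "Cheese is great. It comes from milk. The end.",   # response from checkpoint B
    "rationale": "A follows the sonnet form and has real imagery; B ignores the form.",
}

# Judging is cheaper than writing, and the (chosen, rejected) pair carries a
# signal about what is bad as well as what is good, the two benefits mentioned above.
print(sft_example["prompt"] == preference_example["prompt"])
```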
But yeah, in a sense, yeah, it's literally a Stanford professor, or some other person who's capable of understanding frontier physics, who's collaborating with the models and then basically teaching them the right signals. Yeah, so certainly that's one domain. Another domain is just longer and longer horizon tasks. Like, today the models can, sure, they might be able to perform tasks that a human could have done in, you know, 5, 10, 15, 30 minutes, an hour. Some of their time horizons, I think, will just get very, very long. And so, yeah, what are the types of things that you want the models to do that would take a couple days or even weeks? So that's another domain. A third domain, almost going back to your point about the language that these models exhibit today, is how do we actually make the models really, really good at kind of long-form creative writing? Like, even today, it's kind of interesting, even today I would say that I don't like any of the poetry, or I don't like any of the short stories, that the models write. Even though they have, like, perfect prose, they're just not creative enough. Like, they've almost collapsed to this very, very vanilla, generic dimension. And so I think there will be more work there. It's just that, yeah, it kind of loses out to maybe higher-economic-value tasks that people want the models to do instead, but I think it actually is an important thing for a frontier model to exhibit this kind of creativity in a non-generic way. And then I guess a fourth one would just be more and more actions in the real world. Like, agents have kind of percolated, or maybe percolated is the wrong word, but agents have obviously exploded in popularity over the last few months. But they're still at a very, very nascent level. And so just the concept of AI models that can plan and reflect on their actions and try new ideas and so on, and actually, again, actually affect things in the world, I think that will be an increasingly important concept. I guess one of the things that we might not have expected models to do a priori is put a lot of their reasoning into code. I think we've kind of seen a trend where models use code to do the reasoning more and more. You can kind of see it, like, in the chat window. Do you think that's kind of an artifact of the limitations of models today, or do you kind of expect that trend to continue? I actually think it's going to continue. I do think that the code that they're writing is very important for certain types of verifiability that you just might want the model to exhibit. So, yeah, just in a way that, if I could write a lot of code for my day-to-day life, I would. So, yeah, I actually think it's a very, very important capability for models to have. Do you have a sense, I guess I have this intuition that the amount of code used to train these models is kind of increasing as a fraction of the overall data. Do you think that's accurate? Like, are you generating more and more code for your customers than other stuff? Yeah, yeah, we are. I guess that's a lead-in to another question I have, which is, you know, do you think that the trend towards, is the trend towards reasoning models changing the types of data that you're collecting? Yeah, so definitely the reasoning models have led to a bunch of new types of collection methods.
And so I think probably the biggest one is this concept of creating RL environments from scratch, where essentially a lot of what we're doing is we're basically building these video game-like universes with interesting tools and data sets that the models need to solve. So, just to give an example, you can imagine a universe consisting of a simulated AI startup. And so in this environment, or in this universe, you basically have a bunch of, let's say, Gmail messages, and Slack messages, and Jira tickets, and GitHub PRs, and code bases, and so on. And so then what you need to do is, you basically want this universe to be as rich and complex and, like, high fidelity to challenging tasks in the real world as possible. So, for example, you can actually imagine in this universe, suddenly AWS goes down, and suddenly Slack goes down, and it's like, what do you do? What does the agent do? Like, I can't use Slack, so I need to figure out how to work around that and figure out how to solve this problem. And, yeah, I think there are a lot of interesting things that we're basically building into these RL environments. How do you handle, like, the subjectivity of these models? I think one of the reasons... I spend a lot of time, I have a five-year-old and she really likes to generate stories, so we spend a lot of time, I've gone, like, really deep trying to get these models to write interesting stories. And I agree with you, I don't think they really write interesting stories. And I think a big part of it is, I think they're trying to maybe, like, average the preferences of a lot of different people. But, you know, my five-year-old's taste is really different than, you know, my taste in stories. And, you know, we're actually, like, asking the model together. So maybe it's not even possible for it to make, you know, both of us kind of happy with the story, let alone, you know, the average person. So it seems like that might require, like, a different type of labeling if you actually want to get, like, interesting art out of it. Because, like, good art probably shouldn't, you know, get thumbs up from every single person that looks at it. Yep. Yeah, I mean, I think there's this concept of personalization that is actually very, very much unexplored with the models still. And there's kind of surface-level personalization where, yeah, the model knows that you live in Ohio versus you live in New York City. But there's this concept of, there are just a lot of unexplained latent preferences that you have that are kind of hard to articulate. And how do you get the model to somehow learn those from the data, for every single person, and then apply those to the responses that they generate? I think that actually is a really interesting concept that still hasn't been fully explored yet. Just to give an example: you are a mathematician, here are the restaurants that you should try in Paris. Sometimes the models are just going to make these weird little personalization generalizations. And so I think there's still a lot of work that remains to be done with personalization. And do you think that's because it's hard to collect good personalization data? I'd imagine, to put yourself in the shoes of someone that you're not, the data is never going to be as accurate. I think that's part of it. But I mean, I think a big part of it is that all the frontier labs have only so many things that they can focus on.
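As a toy illustration of the simulated-AI-startup environment Edwin describes above, here is a minimal RL-style environment sketch in which an agent acts through tools and the environment can knock a tool out mid-episode. The class name, observation format, and reward logic are all assumptions for illustration, not Surge's system.

```python
# Toy sketch of an agentic RL environment: the agent acts through tools, and
# the environment can take a tool down (Slack goes down) mid-episode, forcing
# a re-plan. Purely illustrative.
import random

class StartupEnv:
    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)

    def reset(self):
        self.tools = {"slack": True, "aws": True, "email": True}
        self.steps = 0
        return {"inbox": ["Jira ticket: prod is erroring"], "tools_up": dict(self.tools)}

    def step(self, action: dict):
        self.steps += 1
        # Random outages force the agent to work around missing tools.
        if self.rng.random() < 0.2:
            self.tools[self.rng.choice(list(self.tools))] = False
        tool = action.get("tool")
        tool_is_up = self.tools.get(tool, False)
        reward = 1.0 if tool_is_up and action.get("resolves_ticket") else 0.0
        done = reward > 0 or self.steps >= 10
        return {"tools_up": dict(self.tools)}, reward, done

env = StartupEnv()
obs = env.reset()
obs, reward, done = env.step({"tool": "email", "resolves_ticket": True})
print(obs, reward, done)
```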
And so they just haven't quite doubled down on personalization yet. Okay, so let's talk about synthetic data. I think everybody's dream is always to be able to automatically generate useful data. We've seen a lot of people using LLM-as-a-judge for many applications, which is a kind of synthetic data. Do you do synthetic data generation? And how do you think about that? Yeah, so I actually do think synthetic data is really useful in some places. But I think a lot of people actually overestimate what synthetic data can do. So I can give a couple examples. So, like, right now, there are a bunch of models that have been trained really heavily on synthetic data. But similar to what you and I were saying before, that's partly why they're very good at these very academic, homework-style, benchmark-style problems, and they're actually really, really terrible at real world use cases. So synthetic data has made models good at synthetic problems, but not real ones, that kind of phenomenon. Actually, I've got to think of an example now. So I remember, maybe a year or so ago, we were running human evals for one of the researchers that we work with. And our human evals showed that their models suddenly tanked. And when we dug into it, like, when we talked to the researcher, it turned out that they had just trained their model on, like, 10 million or 20 million synthetic math problems. And so what they hadn't realized was that by training the model on these 10 to 20 million synthetic math problems, and they're, like, synthetic math problems in a very, very narrow domain of math, what they hadn't realized was how much that was making the model worse at basically every other type of task. Like, you know, there are only so many very closed-ended, SAT-style math problems that you want the model to solve. There's only so much you care about that. And so the model actually just became worse in basically every other domain. So, yeah, that's an example there. And then, like, another thing we actually often hear from companies is that they often tell us, yeah, I spent the past year training my models on synthetic data, but they've only now just realized all the problems that that's caused. And so they actually then spend multiple months throwing a lot of it out. Like, a lot of them will actually tell us they've thrown out 10, 20 million pieces of synthetic data because they found that even just a thousand pieces of really high quality human data is more useful. Like, the human data is both, I think, more diverse, more creative, as opposed to, you know, kind of just getting 10 million pieces of the same thing over and over again. And so actually what ends up happening in practice is that, yeah, companies will try using synthetic data for certain problems for, like, six months, and then a lot of the work that we do ends up being cleaning up the synthetic data. You know, you kind of have front row seats to, you know, what most of the labs are doing. I mean, do you have a point of view on if we're kind of at a moment where performance is sort of, you know, stuck, with kind of needing, you know, new methods to get to the next level?
You kind of have front-row seats to what most of the labs are doing. Do you have a point of view on whether we're at a moment where performance is stuck and we need new methods to get to the next level, or whether the current strategies of pre-training and then reinforcement learning to learn reasoning will get us to a much better set of models?

Yeah, I definitely don't think we're stuck at all. I think there are a lot of new methods that are just appearing and a lot of new data that hasn't been collected yet. Even from some of the early experiments we've run ourselves, I think we're going to see massive progress in some new domains very, very soon.

Can you describe what some of the new methods are at a high level?

Yep. Even just some of the methods I was describing in terms of RL environments and some of these new RL techniques: they're still kind of new to a lot of researchers in industry. It's basically this concept of exposing the models to all these new environments they haven't seen before.

What would be an example of a new environment they haven't seen before?

Even just the example I mentioned earlier, where you have a simulated AI startup and then suddenly the environment loses access to Slack and loses access to AWS. How does the agent continue operating in that environment, and how does it solve the problem? That's something that just doesn't appear, in some sense, in any pre-existing data or any data we've generated before. It's an entirely new setting where the model needs to go through all these different actions, reflect on what it's done, and find new ways of solving a problem. And especially when you expose it to very messy types of data that it may try to retrieve, data that, again, it may not have encountered before, the models can fail in very unique ways, but I think in ways they can be taught to progress through.

What do you think about the ARC benchmark? That's kind of an interesting one: deceptively simple problems that I think you or I would have no trouble with, certainly easier than math Olympiad problems, that are more about visual pattern recognition. Do you think those are likely to get solved in the next year or two?

Yeah, I will admit that the ARC benchmark is very surprising to me as well. If you had asked me, before I'd ever seen the ARC benchmark, whether models could solve it, I'd have said, yeah, absolutely. So there's something about these problems that I still don't quite understand, something about why models can't solve them. It's actually very interesting, because right now we're doing a lot of work to generate ARC-style problems, so it'll be interesting to see how much that helps the models improve. But right now I don't have a good understanding of why models can't solve them.
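Since Chen mentions generating ARC-style problems, here is a hedged sketch of what "ARC-style" means mechanically: small colored grids plus a hidden transformation the solver must infer from a few input/output pairs. The `make_arc_style_task` generator and its color-remapping rule are hypothetical stand-ins for illustration, not a real ARC task or Surge's generator.

```python
import random


def make_arc_style_task(seed: int, n_examples: int = 3, size: int = 4):
    """Generate a tiny ARC-style task: infer a hidden color remapping.

    Each grid is a size x size array of color indices 0-4. The hidden rule
    is a random permutation of colors applied cell by cell.
    """
    rng = random.Random(seed)
    colors = list(range(5))
    shuffled = colors[:]
    rng.shuffle(shuffled)
    rule = dict(zip(colors, shuffled))          # the hidden transformation

    def random_grid():
        return [[rng.choice(colors) for _ in range(size)] for _ in range(size)]

    def apply_rule(grid):
        return [[rule[c] for c in row] for row in grid]

    train = [(g, apply_rule(g)) for g in (random_grid() for _ in range(n_examples))]
    test_input = random_grid()
    return {"train": train, "test_input": test_input, "test_output": apply_rule(test_input)}


task = make_arc_style_task(seed=7)
for grid_in, grid_out in task["train"]:
    print(grid_in, "->", grid_out)
print("solve this one:", task["test_input"])
```

Real ARC tasks use far more varied rules (symmetry, object counting, gravity-like moves), which is presumably part of why humans find them easy to eyeball while models still struggle.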
Do you have an opinion on whether open-source or closed-source models are going to win in the long term?

My guess is that, at least with the current state of things, closed-source models will continue winning. And part of that is because LLMs are just so valuable that if you try to build open-source models, the way incentives currently work, eventually you're going to be forced to close-source them. So if we want truly open-source models that are really, really good, we almost need a different kind of incentive structure to make sure that happens and stays in place. Otherwise, if you look at the history of other open-source models, they have tended to get more closed over time.

What are you referring to?

I mean, even think about Meta, which has been thinking about making their models more closed. Models are just so expensive to train, and people want to fully capture the value, so if someone ever builds a truly, truly good open-source model, I think it won't remain open source for very long, unless you can change the incentive structure in some way that I haven't figured out yet.

What do you think is the ratio of spend on data to spend on compute in the training of a large model, and how do you expect that to change over time?

I definitely think it should be a lot higher. Sometimes people almost skimp on gathering data.

What do you think it is today, roughly?

I actually don't have a good sense myself. I think it varies a lot depending on the lab, but it can be anywhere from one percent to ten percent. One thing we've often heard is that some researchers are really, really good at using human data: they know how to come to us to gather it, and they know how to apply it in their own work. And what they often tell us is that some of their counterparts don't know how to use human data, so they just move a lot slower. They may try somewhat wild ways to get around the fact that they don't have any human data, and it ends up being this kind of complex slowdown for them. So I hope that ratio will get higher in the future. But yeah, I think there are a lot of ways in which people still underestimate human data.

Awesome. I appreciate your time, and congrats on building a fantastic business.

Yeah, it was a great chat. Thanks so much. Bye.

Thanks so much for listening to this episode of Gradient Dissent. Please stay tuned for future episodes. Thank you.