Gradient Dissent

What’s the path to AGI? A conversation with Turing Co-founder and CEO Jonathan Siddharth

Gradient Dissent • Lukas Biewald

Thursday, November 7, 2024 • 54m

What You'll Learn

  • Turing's mission is to accelerate AGI progress by providing access to high-quality human intelligence, especially coding and reasoning skills.
  • Turing has built a large developer cloud with over 3.7 million vetted software engineers that it uses to supply training data and services to help enterprises leverage large language models.
  • Turing's early focus on geo-labor arbitrage and automating talent sourcing, vetting, and matching gave it a competitive advantage in scaling its developer cloud.
  • Coding tokens are important not just for pre-training language models, but also for post-training to enable reasoning, symbolic manipulation, and integration with other systems.
  • Turing is also building applications for enterprises that leverage large language models, allowing its developer talent to both train the models and build real-world applications on top of them.

AI Summary

The podcast discusses Turing, a company founded by Jonathan Siddharth that is focused on accelerating the progress of Artificial General Intelligence (AGI). Turing's key insight is that the next bottleneck for AGI advancement is not compute power or data, but rather access to high-quality human intelligence, especially in the form of coding and reasoning skills. Turing has built a large developer cloud with over 3.7 million vetted software engineers that it uses to provide training data and services to help enterprises leverage large language models for their business applications.

Topics Discussed

#Artificial General Intelligence (AGI) • #Large Language Models (LLMs) • #Developer cloud • #Talent sourcing and vetting • #Coding and reasoning skills

Episode Description

In this episode of Gradient Dissent, Jonathan Siddharth, CEO & Co-Founder of Turing, joins host Lukas Biewald to discuss the path to AGI. They explore how Turing built a "developer cloud" of 3.7 million engineers to power AGI training, providing high-quality code and reasoning data to leading AI labs. Jonathan shares insights on Turing’s journey, from building coding datasets to solving enterprise AI challenges and enabling human-in-the-loop solutions. This episode offers a unique perspective on the intersection of human intelligence and AGI, with an eye on the expansion of new domains beyond coding.

✅ *Subscribe to Weights & Biases* → https://bit.ly/45BCkYz

🎙 Get our podcasts on these platforms:
Apple Podcasts: http://wandb.me/apple-podcasts
Spotify: http://wandb.me/spotify
Google: http://wandb.me/gd_google
YouTube: http://wandb.me/youtube

Connect with Jonathan Siddharth: https://www.linkedin.com/in/jonsid/

Follow Weights & Biases:
https://twitter.com/weights_biases
https://www.linkedin.com/company/wandb

Join the Weights & Biases Discord Server: https://discord.gg/CkZKRNnaf3

Full Transcript

You're listening to Gradient Dissent, a show about making machine learning work in the real world, and I'm your host, Lukas Biewald. Jonathan Siddharth is the CEO and co-founder of a company called Turing. Turing is a company you might not have heard of, but they've become increasingly important in the LLM ecosystem, because LLMs increasingly rely on code as training data, and Turing is the leading provider of code as training data to the leading AI labs. They don't just do that. They also provide services to help enterprises unlock the power of LLMs and use them for real enterprise applications. So there are really two threads here that are interesting to explore, and Jonathan really lets me go deep on how his business model works and the data he's providing. I found this episode really insightful. I hope you enjoy it.

So we actually worked together, and you said that I was your only boss. But on the last podcast that we recorded, our guest Guillermo and I kind of nerded out about early versions of Linux for 30 minutes, and then somehow that got left in the recording, and now my friends are making fun of me that it's not like an AI podcast, it's like a Linux-distro-circa-late-'90s podcast. So I feel like we should fast-forward to now and then maybe try working backwards and see how that goes. So maybe you could start by telling us what Turing is today, and kind of what you believed about the world, what your insight was in starting the business.

Yeah, thank you, Lukas. I'll talk about what Turing is today and the insight that led to starting the company, both of which are sort of winding in the same direction, but it is somewhat different. So Turing today is focused on accelerating AGI advancement and deployment. Our thesis is that AGI progress used to be blocked on compute and data. Compute is in a largely good spot, with NVIDIA scaling up and Apple, Google, Microsoft all creating custom chips. There are lots of exciting startups creating custom chips and hardware. So I feel like the world realizes the prize that's at stake. So compute is in a good spot. Data, on the other hand, hit a bit of a wall a couple of years back, because all the leading foundation models were trained on the same subset of the internet, like Common Crawl, C4, GitHub, arXiv, and some proprietary mixtures of that data for pre-training. And all the basic tokens are kind of eaten up, right? So our view at Turing is that the next unblock for AGI is going to come from scaling human intelligence to feed into these LLMs, right? Like we need more intelligent tokens. So the bottleneck for AGI progress used to be compute and data. Now it's human intelligence. And we think human intelligence is key for three reasons. First, it turns out, Lukas, that when these models get better at coding, they get better at a wide variety of other tasks, like symbolic reasoning, logical reasoning, math, et cetera. So clearly coding tokens are really important. And it's kind of interesting: it's important both during pre-training and post-training. And it turns out that teaching a model to code is a little bit like teaching a person to fish. They can do anything. Take doing math, right? If a model knows how to code and you give it a math equation, it doesn't just have to do next-token prediction if it knows how to call an API for a calculator, or better yet, if it knows to write some Python code to compute a result.
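
To make that last point concrete, here is a minimal sketch of the "write code instead of predicting digits" idea: rather than answering arithmetic token by token, a code-capable model emits a small Python program that gets executed, and the result is read back. The model call is faked and every name is illustrative; this is not any lab's actual tooling.

```python
# Minimal sketch of the "write code instead of predicting digits" idea from the
# conversation. Everything here (function names, the fake model call) is
# illustrative, not any lab's actual tooling.

import ast


def fake_llm_write_code(question: str) -> str:
    """Stand-in for a code-capable model: returns Python that computes the answer."""
    # A real model would generate this snippet; we hard-code one for the example.
    return "result = (1234 * 5678) % 97"


def run_generated_code(snippet: str) -> object:
    """Execute the generated snippet in an empty namespace and pull out `result`."""
    ast.parse(snippet)            # cheap syntax check before executing
    namespace: dict = {}
    exec(snippet, {"__builtins__": {}}, namespace)  # no builtins: tiny sandboxing gesture
    return namespace["result"]


question = "What is (1234 * 5678) mod 97?"
code = fake_llm_write_code(question)
print(run_generated_code(code))   # exact arithmetic, no next-token guessing
```
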
And search engines have also figured out that there is value in having coding in the query execution loop. So a lot of natural language queries on a search engine can be better served if there is an agent in the loop that can write some code to download data sets, analyze the data, and present the results back to the user. So coding is helpful for that. And secondly, OpenAI came up with this technique called process supervision, where there's value not just in supervising the outcome, but in supervising the reasoning chain that leads to the outcome, so that the model comes up with a chain of thought that's endorsed by humans. So that's the second reason human intelligence is key, because we kind of have to harness these reasoning tokens from humanity's collective intelligence and feed them to these LLMs. And third, when you're solving problems with generative AI in the enterprise, you again need human intelligence in the form of domain intelligence, specific to a business workflow or that industry, that needs to be distilled into LLMs. So our thought was that human intelligence is going to be the blocker. How do we remove that blocker? I mean, step one, you need a way to find the world's smartest humans in software engineering, data science, and a wide variety of other knowledge domains at scale. You have to find humans at scale. You have to vet them and match them to the right projects. And then you probably need some product scaffolding to help these humans do evals, supervised fine-tuning, and reinforcement learning with human feedback at scale to keep quality high. And that's exactly what we did. That's what we built at Turing. We had this interesting journey where we had already built much of this infrastructure over the last six years, because what we are building is what we call a developer cloud. We have like 3.7 million software engineers at Turing who are vetted. And we do two things. On one hand, we are training AGI by supplying high-quality coding data and other advanced STEM knowledge work data to train these models. And on the other hand, we are building applications for Fortune 500 companies that use these LLMs. So we'd already built the infrastructure to find software engineers at scale, to vet them at scale, to match them at scale. So we found ourselves in a really good spot when it turned out that these LLMs were in desperate need of coding tokens. It's kind of like NVIDIA had this infrastructure for making GPUs. And of course, high-end gaming is a good use case for GPUs, but AGI is another exciting use case. And we have the world's largest developer cloud and the infrastructure to find these smart humans, vet them, match them. And it turns out that a really highly leveraged application of software engineers is to train these LLMs. And that's what we do.

And so how much of that was something you drew up back in 2018 when you were starting the company, and how much of that was kind of jumping on an opportunity that presented itself? I mean, either way, it's incredibly impressive. But when I talked to you back in 2018, I don't think you were, at least publicly, talking about the developer cloud being applicable to training LLMs. I mean, I don't think that at that time anyone really knew how useful coding examples would be in terms of making things like GPT do reasoning well.

It was very, very opportunistic, Lukas. So we'd started off by building this. Our goal was to build an AI-powered tech services company.
Use AI to find software engineers, vet them, match them, and amplify their ability to do projects. We did not imagine that software engineers would be such a key input to large language model post-training. So we did not predict that. But the way we are set up, with this distributed team in the cloud and using AI to efficiently sift through, I mean, we evaluate like 50,000 software engineers every month. And because we use product to source talent, vet talent, match talent, it became very easy for us to really scale up this talent cloud for these different use cases, and adapt it also to now find PhDs in physics, chemistry, math, biology, and other types of knowledge work like marketing, finance, et cetera. So it definitely was a black swan to us. But the way that we were engineered made it easier for us to attack it than, I don't know, an Accenture or another large tech services company. We were just always product-first, and it just helped us. The AGI workload really tested our ability to scale demand really, really fast. And we were always set up to tap into the world's labor arbitrage geos, like in Asia, Africa, LATAM, parts of Central Europe. And this market rewards that, because it's almost like optimizing for price performance. You want Silicon Valley caliber talent, but when you're very cost efficient, it helps all the AI companies get these highly intelligent tokens at a much more reasonable cost than if you were not smart about where you sourced it from. Our mission, though, Lukas, is still the same. It's always been unleashing the world's untapped human potential. And today we're doing that in these two different ways: train AGI and build applications with generative AI. In fact, the talent that we bring on board, these software engineers who are creating high-quality data for evals, SFT, RLHF, one of the reasons they're attracted to Turing is because they don't just get to train the models to get better at thinking, reasoning, and coding. They also get to build applications on top of them when we are working with our Fortune 500 clients. That's a big draw.

And now if you go back to 2018, I think you're not the only company that would have had the kind of pitch that you're talking about. But obviously you did really well at finding lots of great talent and getting them efficiently into your developer cloud. What do you attribute that to? Like, what were you doing differently in the talent sourcing, and even deploying that talent, at that time?

I think there were a few things that we did differently. The first was we indexed very hard on geo-labor arbitrage. There were other talent platforms that were much more focused on one country. Our thesis always was there's incredible talent all over the planet in places where nobody's looking, and if we can find them very efficiently, and it would take data science to find them, it's great for the talent, because often we find they could be making 20% or 30% more money than their current job, and it's obviously great for the companies that are leveraging that talent to build products more efficiently. And the reason, Lukas, was my first startup, which I started with my co-founder, Vijay. The biggest lesson from that company was we were forced to look for Silicon Valley caliber talent, but in places like Poland, Ukraine, India, China, Canada, where there wasn't as much competition. And this was like in 2012, when the world wasn't really thinking about remote work or distributed teams.
We did that out of necessity, because it was really hard to compete with Google, Apple, Facebook, Amazon for high-quality engineering talent if you were restricting your hiring to just the Bay Area. So we had this inherent DNA, we internally would call it Moneyball for talent: just find these super smart humans from all over the planet. So that DNA from 2018 was key, the fact that we were willing to do that, and we would lose business at the time. There were some customers who would say, hey, if you don't have talent that's from this one country, then maybe we can't work with you. Or they would want talent that is working from their offices. And we were inherently distributed. So this focus on labor arbitrage was number one. Number two was this belief that we could automate the sourcing, vetting, and matching of talent. And we were using supervised machine learning. We would have software engineers be vetted along three dimensions: different types of roles, like front-end, back-end, mobile, AI and data science, DevOps, et cetera; different tech stacks, like React, Node, different flavors of JavaScript; and different seniority levels, like individual contributor level three, level four, tech lead, tech lead manager, et cetera. And our thesis was that supervised machine learning would work. And we saw it, Lukas, as almost like an information retrieval problem from the web search days. In web search, you're matching query-document pairs. Here, the document is a developer. So we thought of the problem as vetting a developer, as building what we call the deep developer profile, a detailed, comprehensive, continuously updating vector representation of a developer, and learning weights to match the right developers to the right projects from this really large pool.

So I guess this vetting process probably involves mostly writing code. Is that fair?

Today, it's mostly writing code. It used to be writing code as well as a bunch of multiple-choice questions on specific tech stacks, but then those things are not really LLM-proof anymore. So we kind of had to move away from some of that. We also invested quite a bit in evaluating soft skills of developers, to varying levels of success.

How do you automate that?

We haven't managed to fully automate that. For some of the soft skills, we have a final interview with a human at the end of the process, when they've been fully validated. When we say soft skills, obviously there is one portion which is just English comprehension and instruction following, stuff like that. But there were other things as well, like culture fit. Some engineers want to work in a company like Apple, some engineers want to work in a company like Meta. When I say a company like Apple, I mean a culture that's much more focused on quality and getting things right, versus another culture where you want to move fast and break things, and another culture where you're operating like a startup, where you want to write code and ask questions later, just move really fast. So we had to figure out these different personas. We've done a variety of other tests too. There was a time when we did a test where we would have the developer do a standup on a project over a period of a week. The project itself was relatively time-boxed, like it was some 20 hours of work or so, but the developer would be working on it, usually part-time, over a period of days.
And we were checking the regularity of their stand-ups. Like, is this person diligent? They were given instructions to share an async stand-up at a specific time, and we would check if they are doing that. How regular and reliable are they? How well do they communicate? Stuff like that.

And so, I mean, walk me through the moment where you got your... How did you discover the business of providing code to people like OpenAI? I think not everyone listening to this would realize that these models, like you were saying earlier, are heavily trained on code examples, and there's a lot of demand for high-quality code examples. I think you're probably the leader in generating this, but this market didn't even exist, I think, two or three years ago. So I'm really curious to understand how you saw that opportunity and how you leapt on it. Was it a casual conversation with the OpenAI team, or what happened?

Yeah. So it just turned out that I think there were some papers that were already starting to come out about how having coding data in the pre-training mix would help these models get better at not just code generation and code completion, or like in-domain tasks, but also these out-of-domain tasks like symbolic reasoning, logical reasoning, arithmetic, et cetera. And we had demand from a couple of labs that wanted to start working with us on post-training. So post-training, usually when we are working with a foundation model company, starts with evals where we are helping companies. First, we would ask them, what do you want your model to get good at? If a foundation model company says coding, then maybe we pick Python and JavaScript. Usually those are the two that almost everyone's interested in. We have a taxonomy. Imagine for Python, if the root node is Python, there's Python, Python data science, Python machine learning, supervised machine learning, NLP, and maybe writing code with TensorFlow versus PyTorch versus scikit-learn. And we would do these really deep human evals where we have humans coming up with different prompts, questions to ask these LLMs, and then do a diagnosis of where the LLMs stand today. Are they red, yellow, green for the leaf-level nodes of the taxonomy? And then our team would generate data with SFT, partnering closely with the researchers, and then they would fine-tune the model again. We would do another round of evals, and then maybe do RLHF. It's a very iterative process. But I would give credit to the AI labs for figuring out that this was such a valuable signal. And then they partner closely with us. We generate proprietary human data for all these labs. And for every lab, Lukas, a way to think about it is we almost built like a custom factory for them. There is a dedicated team that we set up. And we built custom tools in many cases, and in many cases they're using our tools to keep quality high. And in our tools, there would be an AI upstream of the human that's generating some synthetic data to make the human's workload minimal, and there is an AI downstream of the human that's checking the quality of the work. And we have these different quality measures. So that's where we started. And I think it just naturally played to our strength in terms of us being strong at quality per dollar, like price performance, because of this global talent pool that we had. And we had begun as an AI-powered tech services company, so we already had that consulting layer. Our head of R&D used to be a VP of Engineering at Meta.
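
A rough, hypothetical sketch of the red/yellow/green eval readout over a skill taxonomy that Jonathan describes above: each leaf topic gets a pass rate from human-graded prompts, and that rate is bucketed into a status color that tells the team where to focus SFT data generation next. The taxonomy, numbers, and thresholds are all invented for illustration.

```python
# Hypothetical sketch of the red/yellow/green eval readout over a skill
# taxonomy: per leaf node, humans grade model answers, and the pass rate is
# bucketed into a status color. Taxonomy, results, and thresholds are made up.

taxonomy = {
    "python": {
        "data science": ["pandas", "visualization"],
        "machine learning": ["scikit-learn", "pytorch", "tensorflow"],
    }
}

# Fraction of human-graded prompts the model got right per leaf topic (made up).
pass_rate = {
    "pandas": 0.91, "visualization": 0.74,
    "scikit-learn": 0.55, "pytorch": 0.38, "tensorflow": 0.42,
}

def status(rate: float) -> str:
    if rate >= 0.85:
        return "green"
    if rate >= 0.60:
        return "yellow"
    return "red"

for branch, leaves in taxonomy["python"].items():
    for leaf in leaves:
        print(f"python / {branch} / {leaf}: {status(pass_rate[leaf])}")
# Red leaves (e.g. pytorch here) would be prioritized for SFT data generation next.
```
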
So he's built a team that would work very closely with all these AI labs. I think one insight for us was that this is a problem better served as a managed services product versus a SaaS product. And I think Silicon Valley oftentimes has this blind spot when it comes to services. I feel like this pattern matching kind of happened over the last couple of decades: services bad, SaaS good. And we just completely bucked that trend. We were unabashedly an AI-powered services company, where it's a services layer leveraged by software under the hood, versus some of our competitors who try to have a platform that just sells data, where it's just product, there is no service layer. I think our insight was recognizing that you need a combination of services and product. So it was just easy for many of the AI companies to partner with us. We would literally meet them in person, work closely with them, customize the product as it needed to be customized for that particular client. Like 80% of the product is the same, but 20% gets customized. So what we have for one lab is slightly different from another lab and another lab and another lab. And the word of mouth spread. When people move around between these labs, they would always pick Turing for coding and these advanced STEM topics. And now we are expanding to multimodality, other types of knowledge work like marketing, finance, et cetera, across a wide variety of domains like retail, healthcare, CPG, et cetera. The way I think about it is distilling human knowledge and skills into LLMs. That's really what we're doing, transferring intelligence from human minds into machine minds.

So look, you've put out press releases around increases in developer productivity, actually, from these tools. So when you're working on providing data for some of these labs that are literally building the tools behind products like Copilot and Cursor, are you also using Copilot and Cursor and things like that to generate that data? And is that an issue at all?

So one press release we put out was actually in partnership with the Gemini team, where we did an A/B test just to see what happens when developers use Gemini Code Assist, which is their code completion tool. So in those cases, we actually do, because we are doing two things: partner with AI labs to train models, and work with Fortune 500 companies to deploy solutions around these models. This was in the deploy-solutions bucket. So when we deployed this with a customer, we saw a 33% lift in developer productivity in terms of pull requests merged per developer per week. So we did an A/B test. And we work with almost all the foundation model companies on one side. And on the other side, we are, again, picking the right models to help our enterprise clients be successful in solving their specific problems. And usually they have some constraints. Like, they might already be using a specific cloud provider, and they may want us to use a specific model. Or in some cases, we might recommend what model they use. So it's not been an issue yet. Was that a specific type of issue you were thinking about?

Well, I was just wondering, I mean, there are a lot of kind of recursive-seeming strategies to improve model quality. And I guess in some sense, good code is good code. But, you know, I wonder if there's an issue, for example, with using OpenAI to generate high-quality code that's then fed into the next OpenAI model.
You can imagine that the fixed point here might not be the optimal strategy, because it's sort of training on its own reasoning. Although lately that seemed like less of a problem. I'm curious if that seems like an issue to you at all.

So typically, different model companies have different policies in terms of whether you can use their model to train other models, and the answer is usually no. So we don't do that. But the same model provider could in some cases want to use a bigger model to bootstrap a smaller model, or to generate some data for a smaller model. So when I speak with some researchers in the industry, we see this happen a bit, where somebody is using a bigger model to bootstrap a smaller model. But typically, to take a frontier model forward, it's tough to do that without bringing new human intelligence tokens in. In some cases a simulator can give you some synthetic data, but we don't see this to be as big of an issue. Typically with our clients, Lukas, we provide them this high-quality data, and they train, they fine-tune the models. So different labs take different approaches in terms of what they do, and they're quite secretive about it. So we also keep it very confidential in terms of what each lab does.

Totally, totally. Do you feel like code examples are becoming an increasing part of model training? I mean, I feel like there are certainly rumors that these models are increasingly trained on more and more code, as people have observed that code works really well for reasoning. Also, do you expect that the models could eventually be trained entirely by bootstrapping? Could the model generate code and then test it on its own and cut your workforce out of the business? Do you imagine that might be in our future?

Yeah, that's a great question, Lukas. It's one of the things that I talk to a lot of researchers about, and ask, to try to get a feel for where the industry is heading. So far, what I hear is that, I think, everyone's realized that coding is very helpful. So the use of coding tokens is going up. And one nice thing about code is that in some ways you can validate the output. Like, you can see if it executes or not, if it compiles or not. You can have some checks with some test cases. So that's good; those are some positives. At least so far, we haven't seen a way to completely use synthetic data instead of human data. I think synthetic data can definitely amplify the data that we get from humans. It's been an active area of research for the last couple of years. But unlike with AlphaGo and some of these game-playing systems, where you have a nice reward function that you can use for self-play, here that doesn't seem to have been the case. But we are looking into it very closely. It's something that we are always mindful of. And with AI, it's very hard to predict stuff more than a few years out. But it's something that we actively track. To my knowledge, at least today, there isn't a way to do this effectively without that human input. But I think there are ways to be more efficient with how that human input is gathered.
For example, when we are working, say, in our tools for RLHF, when the model gives two outputs and the human has to pick which output to prefer, we built scaffolding in our product where the human can, in addition to picking the better output, run the code automatically. We have like three checks. We first check if the code compiles and executes. Like, is there a missing library? Is there some other dependency that you have to fix? So we kind of help the human fix that very quickly. We recommend what packages to add when the human is trying to check these two outputs. We then have some copilots to help the human check whether the functionality matches what the prompt asked for. And that's obviously hard to really automate. You know, sometimes the prompt is asking for the application to do three things, and the human needs to make sure it does all three, but they checked maybe two things and missed the third. So we want some type of AI assistance to help the human make sure that they didn't miss the third part of the spec. And we also try to help them generate test cases, make sure the humans are checking all the edge cases. It's kind of a hard computer science problem to really solve fully automatically, but we built our own product scaffolding to make it easy for the human to write correct code and pick the right model output if it's RLHF. And the infrastructure that we had built to evaluate software engineers, we just used it, because we already had this container on the web for testing people's ability to write code. We lifted and dropped it in here. But when I meet researchers, Lukas, this is the number one question I ask them, for code, because we want to be ahead. If synthetic data is the key, we want to make sure it's baked into our offering first. So far, what I hear is that, at least right now, this is the state of the art, but we're always checking to see if it changes so that we can be ahead of it.

Did you see in the GPT-o1 paper the notion that you could run for longer and get back better results? I mean, that seemed very evocative to me. And it sort of implied a kind of obvious bootstrapping strategy of running the model for longer to generate training data for the short-term model. Did that kind of worry you at all? I mean, that seems like a strategy maybe you could actually employ to generate high-quality data.

So when you say run a model for longer, do you mean like a bigger model generating training data for a smaller model? Or, yeah, what do you mean exactly?

Well, it's like they were saying that, they didn't explain all this, right? But you could kind of wait longer for the result and get a higher quality output. And the timescale and the quality wasn't really specified, but they kind of showed that. And you could sort of imagine, you know, putting agentic systems on top of any of these models and kind of using repeated iteration to generate higher and higher quality data without any human involved. Which kind of implies, okay, then there's some data here that you could take and use to train the model. Maybe that would take the human out of the loop finally. I'm not sure.

Yeah, I'm not sure as well. I don't know the details of how the o1 model was trained.
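
Under stated assumptions, here is a small sketch of the kind of automated checks described a little earlier that sit around a human RLHF rater: first, does the candidate code even run (surfacing, say, a missing dependency), and second, which behaviors required by the prompt does it satisfy, via test cases. It mirrors the described flow only loosely and is not Turing's actual tool; all names are illustrative.

```python
# Rough sketch of automated checks around a human RLHF rater: (1) does the
# candidate code execute at all, (2) does it pass functional test cases drawn
# from the prompt's requirements. Not Turing's tool; names are illustrative.

import subprocess
import sys
import tempfile


def executes(candidate_code: str) -> tuple[bool, str]:
    """Check 1: execution, surfacing e.g. a missing library to the rater."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=10)
    return proc.returncode == 0, proc.stderr


def passes_tests(candidate_code: str, tests: list[tuple[str, object]]) -> list[bool]:
    """Check 2: run (expression, expected value) test cases against the code."""
    namespace: dict = {}
    exec(candidate_code, namespace)
    return [eval(expr, namespace) == expected for expr, expected in tests]


candidate = "def dedupe(xs):\n    return list(dict.fromkeys(xs))\n"
tests = [("dedupe([1, 1, 2])", [1, 2]), ("dedupe([])", [])]

ok, err = executes(candidate)
print("runs:", ok, "| tests:", passes_tests(candidate, tests) if ok else err)
```
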
The thing that I knew was, one common technique a few folks seem to be using, just outside-in from what's publicly published and stuff, is this notion of, I mean, there's always been this work on hidden chain of thought, where you're generating different reasoning chains and searching and picking the right chain, knowing when that chain of thought goes nowhere and knowing when to backtrack. So that was happening. In general, what I hear from researchers is that we are still not quite sure how to teach the teacher. And a really powerful model can definitely, thinking longer, generate better solutions that could be good training data for fine-tuning a smaller model. But it's still not clear how you would take a frontier model and teach it something new or novel without this intelligence infusion that comes from somewhere. It could come from synthetic data if you have a really good simulator of the world that could help you generate stuff. It still didn't seem obvious how you would do that. A lot of people still believe these models are still not exactly generalizing; they are kind of interpolating from the training set. So at least I don't know if there is a way to use synthetic data to bootstrap a frontier model like an o1.

Okay. So given your position here in late October 2024, working with all these labs and giving them training data to improve the models, do you have a sense in your head of what kinds of tasks models do well and do poorly? I mean, we're all kind of wrestling with that, but I feel like you're right there on the front lines trying to give them the data to get to the next step. So I'm really curious how you would describe where the models are working and not, both in code and outside of code.

So I think, firstly, I'm super excited for where we are today and where we're going to be headed over the next couple of years. I think in code, it's going to be more complex real-world projects where we're going to have to help the models get better. The cool part is the scaling laws seem to continue to hold. Bigger model, more data, more compute, and the models just seem to keep getting better. So either through human intelligence data with companies like Turing, or with synthetic data, hopefully they keep getting better. So on the code front, I would say it's complex real-world projects, because today we are still in the copilot phase, where the LLMs are amplifying the productivity of a software engineer. I'm yet to see an agent really work well in production. You can't really hire an LLM as even your IC3, individual contributor level three, software engineer today. An AI can pass the vetting, but it still can't do the job. And even in coding, what a software engineer does in front of their code editor is a small portion of their day-to-day. Sometimes they're in a meeting with their manager. They're in a meeting with a product manager. They're probably figuring out what to build, or reviewing a PRD and figuring out how to estimate the complexity for different projects. I think all of those non-coding-related aspects of a software engineer's workflow will also need to be automated. So I think we're still very early in coding, and when we look at languages beyond Python and JavaScript, we are still pretty far away. I think the true test will be when you can hire a software engineer GPT, and that can actually do the job. And today, we're still in demo land.
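
The hidden chain-of-thought search Jonathan alludes to at the top of this passage, generating several reasoning chains, scoring them, keeping the best, and discarding dead ends, can be caricatured in a few lines. Both the chain generator and the scorer below are random stand-ins for a model and a verifier; only the best-of-n selection structure is the point.

```python
# Toy sketch of "generate several reasoning chains, search, and pick the right
# one". The chain proposer and the scorer are stand-ins (a real system would
# use a model to propose chains and a verifier or reward model to score them).

import random

random.seed(0)

def propose_chain(question: str) -> list[str]:
    """Stand-in for a model proposing a multi-step reasoning chain."""
    steps = random.randint(2, 4)
    return [f"step {i + 1} toward answering {question!r}" for i in range(steps)]

def score_chain(chain: list[str]) -> float:
    """Stand-in for a verifier / process reward model scoring a whole chain."""
    return random.random()

def best_of_n(question: str, n: int = 8) -> tuple[list[str], float]:
    """Sample n chains, keep the highest-scoring one, discard the dead ends."""
    candidates = [propose_chain(question) for _ in range(n)]
    scored = [(score_chain(c), c) for c in candidates]
    best_score, best_chain = max(scored, key=lambda pair: pair[0])
    return best_chain, best_score

chain, score = best_of_n("Why does more test-time compute help?")
print(f"kept a {len(chain)}-step chain with score {score:.2f}")
```
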
I mean, the demos work well, but we still have significant distance to go. In non-coding, we are much, much farther away. In coding, you could say we are kind of at the level of a smart intern. When you look at different functions like marketing, sales, finance, HR, et cetera, for every industry, retail, healthcare, CPG, life sciences, et cetera, I feel like we are still nowhere in terms of actually automating key workflows to deliver measurable business impact. But we'll get there. I'm very, very optimistic about the impact. When I speak with Fortune 500 CEOs, CTOs, CIOs, they're all excited to get productivity gains from LLMs, but they're still very early in deployment. I feel like only when they deploy these systems in production do we have a good active learning feedback loop, where we'll actually see what real-world prompts these LLMs encounter in the enterprise. And then we'll have some good data for evals, for what we have to get better at that we're not yet good at. I feel like in consumer, we have some relatively good feedback loops, at least with the success of products like ChatGPT, Gemini, and Claude, and many others. But in enterprise, I feel like we still don't have that right feedback loop yet. I would say in terms of the journey of zero to one, to me, it still feels a lot closer to the beginning than to the end. These models are just going to get so much smarter. When you can have an assistant that's in the background, either through your AirPods or on your phone or on your computer, that you can always keep talking to, always be problem-solving with. And when these models have really rich context on you and your prior context, I mean, I really like the Andrej Karpathy abstraction of the LLM as a computer, the way it could write things to a file, know when to look up stuff about you, where to look up stuff about you. I really feel like we are going to approach the equivalent of Her, both for personal and enterprise use. And I think we're still doing relatively simple tasks. I think the tasks will just get more complex across all of these domains.

But now you actually sit in an interesting position that's kind of similar to Weights & Biases, where you're simultaneously selling to the foundation-model-building labs and companies, but at the same time also selling something to the Fortune 500, to the kind of traditional enterprise companies trying to use these GenAI systems, right? And in fact, I think you're helping these companies actually make useful things from these GenAI products. So, I mean, could you talk about some of the specific successful use cases that you've seen in enterprise GenAI? Because I think, you know, some people are super excited by the opportunity, and a lot of other people are thinking, man, does this work for anything right now in enterprise? So some successful applications, I think, would be really interesting to talk about.

Yeah. So I can talk about three applications, and then a fourth one that we're working on. So very briefly, one was a coding copilot for a large software company, where they needed a RAG system built on top of their own code base. And we partnered with Google Gemini on this. So they got custom code completions specific to their code base. And this was the company that saw the 33% lift in developer productivity.
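
For the coding copilot Jonathan just described, the retrieval half of a code-base RAG system might look roughly like the sketch below: index code chunks, pull the most relevant ones for a query, and prepend them to the generation prompt. The embedding step is faked here with bag-of-words overlap; a production system would use a real code embedding model and the company's own repository, and nothing here reflects the actual deployment.

```python
# Minimal sketch of the retrieval half of a code-base RAG copilot: index code
# chunks, retrieve the most relevant ones for a query, prepend them to the
# prompt. "Embeddings" are faked with bag-of-words overlap for illustration.

def tokenize(text: str) -> set[str]:
    return set(text.lower().replace("(", " ").replace(")", " ").split())

code_chunks = [
    "def load_config(path): ...  # reads YAML config used by all services",
    "class BillingClient:  # wraps the internal billing API",
    "def retry(fn, attempts=3): ...  # shared retry helper",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Score chunks by token overlap with the query and return the top k."""
    scored = sorted(code_chunks, key=lambda c: len(tokenize(c) & tokenize(query)), reverse=True)
    return scored[:k]

query = "how do I call the billing API with retries?"
context = "\n".join(retrieve(query))
prompt = f"Relevant code from our repo:\n{context}\n\nTask: {query}\n"
print(prompt)  # this prompt would then go to the code-generation model
```
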
I actually think that 33% lift was an underestimate, because it was still primarily in a code completion context and not a code generation context. I think in code generation, the impact will be even more. So we are working with a large insurance company on an underwriting copilot and a claims processing copilot. Usually it's these workflows around what they call intelligent document processing, where there's a bunch of unstructured data that these companies have that has to be processed in a specific way, and you'd probably need a good retrieval system also built on top of it, a retrieval-augmented generation system on top of it. And almost all of the time, these use cases are human-in-the-loop. And my hunch is that 90% of enterprise applications, maybe even more, will be human-in-the-loop systems. They won't be fully autonomous systems. And we are also working with a healthcare company on an audit and compliance copilot, where they have some very specific compliance requirements they have to follow. These requirements are written down in a bunch of documents. I feel like a lot of the use cases are about natural language understanding of this unstructured text, in the form of documents and contracts that need to be reviewed and analyzed by people, which today takes a lot of time and very costly human cycles, and which I think will be automated. There are also large financial institutions that have this goal of usually increasing assets under management. And today, given the cost it takes to service the client, the minimum floor that they have for the net worth of an individual that you manage is relatively high, because know-your-customer type workflows are still relatively labor-intensive. Somebody has to ask for a lot of documents, analyze a lot of documents, verify a lot of stuff. But I think with LLMs, again, it's the sort of intelligent document processing pipeline, where an agent asks for the right documents from a client, an agent does a first pass at checking stuff, backing off to a human to do the final check. So those are the usual use cases that we see. A lot of it is still in code generation and this intelligent document processing type stuff. There are some use cases I see around people wanting to build multimodal assistants to solve very specific problems. For example, there's a large automotive company we are talking to, and they have these mechanics working on cars, and they wanted a way for a mechanic to be able to take a picture of a part. And they have so much data on, hey, when this is the problem in the car, this is what you would do to fix it. And that proprietary data that they have accumulated over decades, they're going to use to fine-tune a model to train their people in the field. I mean, a junior technician will probably be able to do the job of a more senior technician with this model that can take an image or a video as an input and come up with a set of actions for what you should do to fix that particular situation. I think any function where you have to deal with a lot of paperwork, a lot of unstructured documents, probably governments will also become a lot more efficient.

So what have you learned? What are you learning in this process? It sounds like at least the code completion application has kind of gone from proof of concept to production.
Has anything else, like is that underwriting application, is that really working and having measurable impact? Or are most of these kind of pre-deployment?

So most of these for us today are in the proof-of-concept stage, where we built it and the internal team likes it and thinks that it'll be useful. We're in the process of scaling it to production. I think what has happened this year is, last year, all these larger enterprises were still in a state of shock after the ChatGPT moment. Now I think people are ready to start doing these pilots. And many of these Fortune 500 companies have set up internal centers of excellence for AI, where their mandate is to roll out an AI platform. Imagine Vertex AI or WatsonX or a variety of other competitive products. And their goal is to showcase to their management the powerful GenAI use cases that they've discovered. So we're seeing that happen. I think it's still in the proof-of-concept stage, to be honest, in most of these cases. And there is one thread, which is deployment of copilots across these different functions, but those are relatively simple deployments. I think we're still early in deploying custom fine-tuned models trained on proprietary data and sort of baking in proprietary workflows. It's promising, though. Like, the underwriting copilot was a success. That's when they asked us to build a claims processing copilot as well. I think many of them are still grappling with how to measure ROI exactly. And I think one tricky thing will be the fact that the workforce needs for many of these companies could be lower if these systems operate at scale. And I think that creates some tension.

There's a lot of... Let me give you an example. I mean, you know, when we think about a 33% improvement in productivity, that's a really, really astonishing improvement. I mean, I think it's fashionable to talk about 10x, but honestly, making developers 30% more productive has a dramatic effect on the business. So have you followed that customer after that happened? Like, did they decide, okay, we could have a third fewer engineers? Or did they suddenly start shipping a lot more? Or, you know, I'm kind of wondering if maybe the engineers just sort of take a little more time off, and somehow the productivity improvements kind of diminish over time. Like, what really happens in a company like that?

Yeah, that's a great question, Lukas. In this particular client's case, I should follow up and see what happened.

Are you able to measure your own developers' productivity? When you saw that 33% improvement, or whatever it was, do you feel like for yourself you could do better than that?

Yes.

Do you measure it? Do you have a number?

So we're doing some experiments now. I'll report back when we have some numbers. I don't have a number right now. But the 33% lift was a very limited way in which we did it, where it was tied just to coding. There are so many other ways to intelligently summarize stuff for a developer. You can imagine having an agent for a developer go attend a meeting, go attend a stand-up, deliver a stand-up on behalf of the developer, so that the developer stays in a flow state. You can imagine a product manager creating a PRD using an LLM with the right prompts and the template for the company, coming up with better ways to estimate stuff.
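
The 33% figure debated in this exchange was defined earlier as pull requests merged per developer per week, compared between the control and treatment groups of an A/B test. The back-of-the-envelope sketch below shows only that arithmetic; the numbers are invented and do not come from the study.

```python
# Back-of-the-envelope sketch of the productivity metric discussed above:
# pull requests merged per developer per week, compared between an A/B test's
# control and treatment groups. Numbers are invented; only the arithmetic
# mirrors how a 33%-style lift would be computed.

control_prs_per_dev_week = [3.1, 2.8, 3.4, 2.9, 3.0]    # no code assistant
treatment_prs_per_dev_week = [4.2, 3.9, 4.0, 3.8, 4.3]  # with code assistant

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

lift = mean(treatment_prs_per_dev_week) / mean(control_prs_per_dev_week) - 1.0
print(f"observed lift: {lift:.0%}")  # ~33% with these made-up numbers
```
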
I feel like there is, and also just raw code generation. When I tinker with stuff for fun, Lukas, I like to play with some of these LLMs. And I found there was a project that would have taken me, you know, seven or eight hours to do in a world without LLMs, where I'm messing around on Stack Overflow and doing stuff, and it took me like 45 minutes.

Okay, but I do think that a lot of kind of wannabe technical CEOs like me and you are a little delusional, because I think it's perfect for that kind of application. Like, when I'm trying to do some kind of toy thing, especially in a language I don't know that well, it's just like, oh my God, these LLMs are incredible. But then when I'm trying to work on an existing large code base with a lot of, you know, cruft, and having to communicate with other people, I don't know, is it really a 30% improvement? I'm just, I'm really not sure. I'm trying to understand what the real productivity improvement is. I guess you seem like you have the best chance of knowing. So I want to know your estimate.

Yeah. Yeah. So I'll get back to you with what we measure and come back with. But my gut estimate would be, if we did something optimal today, it would be 50% or higher with the right scaffolding and the right workflow. But you're exactly correct, Lukas.

Describe the right scaffolding. I want to hear about the right scaffolding. Is it really like agents go to stand-ups? What should I do? Give me some advice, some parting wisdom for my engineering team to make them 50% more efficient, please.

And firstly, I also agree with you that the estimate that I gave for myself, you're correct that, you know, for wannabe technical CEOs, that will not translate to production in a multi-person team. You're exactly right about that. The only reason I mentioned that is, in the test that we did, it was much more about code completion. We did not do a Cursor-style code editor that's optimized for code generation itself. And I just felt like my hunch, my gut, is that it'll be higher than that narrow, copilot sense. So let me think. Probably for different levels of engineers, Lukas, I'd probably start by just doing a time audit of their entire day. My guess is that an IC engineer and a tech lead and an eng manager at Turing have very different days, very different workflows. Our eng managers spend a lot of time in documentation and in meetings with other leads. I think my step one would be to pick one persona. Maybe it's an IC software engineer. Perfect. I'd do an audit of their day to see where they spend their time. So the obvious thing would be to use an LLM-optimized code editor with a good chat interface that they can use.

Do you have a favorite? I mean, I'm imagining Cursor here, but what should they use?

Let me talk to my team and get back to you. My own personal biases are different, but I know our team also really likes Google Cloud Workstations, but that's also because we're on GCP and it's in the cloud and it's just easy to work with. And it integrates with a lot of other stuff.
Let me get back to you on this. It'll be my homework. I'll get back to you on the recipe for 50%. I don't want to say something off the top of my head.

Give us a little write-up, we'll put it in the show notes. That would be awesome. I think it's a great place to stop. Thanks for the interview.

Awesome. Have a good one. Great, thanks.

Thanks so much for listening to this episode of Gradient Dissent. Please stay tuned for future episodes.
