Gradient Dissent

R1, OpenAI’s o3, and the ARC-AGI Benchmark: Insights from Mike Knoop

Gradient Dissent • Lukas Biewald

Tuesday, February 4, 2025 • 1h 12m

What You'll Learn

  • The R1 and R1-Zero models, similar to OpenAI's o-series, represent a paradigm shift from previous language models that relied on scaling up pre-training and memorization.
  • These new models have demonstrated a significant increase in their ability to adapt to novel situations, as shown by their performance on the ARC-AGI benchmark, which tests a system's ability to solve problems it has not seen before.
  • The key difference is that these new models use a 'chain of thought' approach, where they take more time to think through a problem step-by-step before providing a final answer, rather than just quickly generating a response.
  • This increased ability to recompose and apply knowledge in novel ways is a major advancement over previous AI systems, which were often limited to narrow domains and could not easily adapt to new tasks or situations.
  • The episode also discusses the implications of these advancements for the development of more robust and reliable AI systems that can better handle unexpected situations.

AI Summary

This episode discusses the recent advancements in AI models, particularly the R1 and R1-Zero models from DeepSeek and the o-series models from OpenAI. The key focus is on the ability of these new models to adapt to novel situations, as demonstrated by their performance on the ARC-AGI benchmark, which tests a system's ability to solve problems it has not seen before. The episode also explores the differences between these new reasoning-focused models and the previous generation of language models that relied more on memorization and pattern matching.

Key Points

  1. The R1 and R1-Zero models, similar to OpenAI's o-series, represent a paradigm shift from previous language models that relied on scaling up pre-training and memorization.
  2. These new models have demonstrated a significant increase in their ability to adapt to novel situations, as shown by their performance on the ARC-AGI benchmark, which tests a system's ability to solve problems it has not seen before.
  3. The key difference is that these new models use a 'chain of thought' approach, where they take more time to think through a problem step-by-step before providing a final answer, rather than just quickly generating a response.
  4. This increased ability to recompose and apply knowledge in novel ways is a major advancement over previous AI systems, which were often limited to narrow domains and could not easily adapt to new tasks or situations.
  5. The episode also discusses the implications of these advancements for the development of more robust and reliable AI systems that can better handle unexpected situations.

Topics Discussed

#Reasoning models • #ARC-AGI benchmark • #Chain of thought approach • #Generalization and adaptability • #AI safety and robustness

Frequently Asked Questions

What is "R1, OpenAI’s o3, and the ARC-AGI Benchmark: Insights from Mike Knoop" about?

This episode discusses the recent advancements in AI models, particularly the R1 and R1-Zero models from DeepSeek and the o-series models from OpenAI. The key focus is on the ability of these new models to adapt to novel situations, as demonstrated by their performance on the ARC-AGI benchmark, which tests a system's ability to solve problems it has not seen before. The episode also explores the differences between these new reasoning-focused models and the previous generation of language models that relied more on memorization and pattern matching.

What topics are discussed in this episode?

This episode covers the following topics: Reasoning models, ARC-AGI benchmark, Chain of thought approach, Generalization and adaptability, AI safety and robustness.

What is key insight #1 from this episode?

The R1 and R1-Zero models, similar to OpenAI's o-series, represent a paradigm shift from previous language models that relied on scaling up pre-training and memorization.

What is key insight #2 from this episode?

These new models have demonstrated a significant increase in their ability to adapt to novel situations, as shown by their performance on the ARC-AGI benchmark, which tests a system's ability to solve problems it has not seen before.

What is key insight #3 from this episode?

The key difference is that these new models use a 'chain of thought' approach, where they take more time to think through a problem step-by-step before providing a final answer, rather than just quickly generating a response.

What is key insight #4 from this episode?

This increased ability to recompose and apply knowledge in novel ways is a major advancement over previous AI systems, which were often limited to narrow domains and could not easily adapt to new tasks or situations.

Who should listen to this episode?

This episode is recommended for anyone interested in Reasoning models, ARC-AGI benchmark, Chain of thought approach, and those who want to stay updated on the latest developments in AI and technology.

Episode Description

In this episode of Gradient Dissent, host Lukas Biewald sits down with Mike Knoop, Co-founder and CEO of Ndea, a cutting-edge AI research lab. Mike shares his journey from building Zapier into a major automation platform to diving into the frontiers of AI research. They discuss DeepSeek's R1, OpenAI's o-series models, and the ARC Prize, a competition aimed at advancing AI's reasoning capabilities. Mike explains how program synthesis and deep learning must merge to create true AGI, and why he believes AI reliability is the biggest hurdle for automation adoption. This conversation covers AGI timelines, research breakthroughs, and the future of intelligent systems, making it essential listening for AI enthusiasts, researchers, and entrepreneurs.

Show Notes:
https://ndea.com
https://arcprize.org/blog/r1-zero-r1-results-analysis
https://arcprize.org/blog/oai-o3-pub-breakthrough

🎙 Get our podcasts on these platforms:
Apple Podcasts: http://wandb.me/apple-podcasts
Spotify: http://wandb.me/spotify
Google: http://wandb.me/gd_google
YouTube: http://wandb.me/youtube

Connect with Mike Knoop: @mikeknoop

Follow Weights & Biases:
https://twitter.com/weights_biases
https://www.linkedin.com/company/wandb

Join the Weights & Biases Discord Server: https://discord.gg/CkZKRNnaf3

Full Transcript

You're listening to Gradient Dissent, a show about making machine learning work in the real world, and I'm your host, Lukas Biewald. This is a conversation with Mike Knoop, who is both an AI researcher and an incredibly successful entrepreneur. He started a company called Zapier about 15 years ago with a very small amount of funding and grew it into a very large business. He then got up to speed on the frontier of AI research and recently started an organization called Ndea, which is one of the new research labs working on the forefront of AI. This is a really interesting conversation. We go into business and how AI fits into business, and then also some of the details on how the new R1 and R1-Zero models work, especially on the ARC Prize, which he funded and made popular. I really hope you enjoy this conversation. Why don't we start with R1 and work backwards? Okay, sounds good. Let's do it. So R1, obviously this suddenly famous model coming out of China. Do you want to maybe just describe that briefly and then talk about R1-Zero as well and pontificate a little bit? So, yeah, there were basically two actual models that got dropped, R1-Zero and R1. They are very similar in sort of nature to what I would call OpenAI's o1 model. They're, you know, these reasoning models, and they were trained in a similar way. We know for sure how R1 and R1-Zero were trained because DeepSeek chose to open source their sort of training methodology for R1 and R1-Zero. They really did move the science forward here. I did see a public comment from, I think, Mark Chen, who leads research at OpenAI on this stuff, that he shared that he thinks it's similar in spirit to how o1 was trained as well. So I think there's probably some pretty good agreement here between sort of the ideas that have gone into creating both of these systems. And fundamentally, the whole o-series from OpenAI, this R series from DeepSeek, are a paradigm shift from the type of AI systems that we've seen in the past. For example, from OpenAI, the GPT series, from 3 to 3.5 to 4 to 4o. These are all the same sort of broad paradigm of scaling up pre-training, where we're trying to make these models more intelligent by feeding them more data, making the models bigger. Now, it was also the case that 4o was actually a little smaller, you know, there's probably some distillation that went into 4o to make it more efficient. But roughly, we're sort of in the broad paradigm of making models smarter by trying to give them more human data. And they're effectively memorizing answers. This means that they really have no ability to adapt to novelty. You may have heard of the ARC Prize, the ARC-AGI benchmark. What this is, is it's a benchmark that tries to assess an AI system's ability to solve problems that it hasn't seen before. The data is highly resistant to just being able to memorize the answers. Even if you're given the training set for ARC, you can't just memorize the training set and solve the test set. This is kind of another underappreciated sort of public point. So it legitimately is very hard. That's why it went unbeaten, V1 went unbeaten for sort of five years. And, you know, in December, we had this big news moment with OpenAI's o3. You know, they had the 75% score on ARC v1. They had this 85% score with this really, you know, expensive high-compute performance version of it.
And this was showing that this kind of new reasoning type of system has a fundamental new capability that we've not had in computers before: they have the ability to adapt to novelty. And this actually has some implications beyond just that quirky fact. This will actually lead to more sort of robust and reliable AI systems too, which we can touch on. It matters a lot for this agent stuff. But wait, before we go down that path, we're like immediately off track and I love it. But when you talk about novelty, how do you define that? Because I feel like there's a lot of debate on whether these systems are just memorizing or not, and they're obviously not just memorizing text and then regurgitating exactly the same text. I mean, it's quite clear there's some, you know, adaptation. So yeah, there's some form of generalization that comes from compression, right? Because it's not literally a database, which would be like a lookup table. Right, right, exactly. You know, if you had one parameter for every fact, okay, you could just have a database. You know, these GPT-style class systems do compression. That is where you get some of the interesting generalization coming from. The sort of claim that I would have is that the amount of generalization they're capable of doing is fixed because the architecture is fixed. Effectively, you've got the transformer underlying architecture in every language model up until, you know, last year. And that means that your sort of level of intelligence is fixed. Yeah, you can memorize more, but the amount of generalization you have, the amount of adaptability you can do from your training data to a new novel situation, has been fixed, and that did not change up until o1. o1 was the first prototype we saw last September of a system that actually had a legitimately increased amount of intelligence in its ability to be given a fixed amount of input information and be able to do more things accurately that are further away from its training data. But wait, so would you say that, for example, AlphaGo couldn't generalize? It seems like in the domain of Go, it could reason pretty well and adapt to novel situations. That's not true for ARC, too, right? I mean, ARC's been around for five years. There's a lot of, you know, state of the art on ARC, and ARC from a pure solver standpoint, if you disregard the, you know, frontier LLMs, was at something like 50% coming out of the ARC Prize 2024 contest. But they're very, very domain-specific solvers. You know, they required researchers to have a, they basically, you know, these systems have all of their G, the G factor, the generality, being put into the sort of system that solved ARC during the contest, and that came from the researcher's brain, right? Because the researcher's thinking about the problem. They're trying to model the problem and say, okay, here's what I'm trying to get this computer to do. I'm going to encode my own understanding of the problem into the system, into the architecture, in order to get this thing to work. And it ends up being fairly, you know, constrained in domain. And it doesn't generalize well. And this is true for AlphaGo too. You know, we've had AI systems for years and years and years that can be superhuman at games.
but like the fact has historically remained that, you know, what can you do that these AI systems can't? It's the fact that I could sit you down and teach you a new card game, a new board game, in like a couple hours and get you up to human-level proficiency. You know, I could take you into a totally new game, I could go teach you how to drive a car you've never driven before and get you up to proficiency probably in a couple days. This ability to adapt on the fly to a situation or type of problem that you've never seen before and never trained on is what has been historically very unique and special to you as a human relative to the AI systems that we have today. And it's probably worth saying, like, I don't know how you, ARC is, I think, better looked at than described in words. But, you know, I think it's... Yeah, we should overlay like a puzzle somewhere here. Totally, yeah. It's for sure worth overlaying a puzzle. But, you know, for someone just listening, you know, I think what's astonishing about ARC is maybe how easy these puzzles seem to be and how these systems do actually fail on these easy-looking puzzles. But maybe you could describe a little bit more of what it is. It looks like an IQ test. It's a grid of colors. It's a 2D grid of colors. You're given some examples of inputs and outputs, and your goal is to find the rule, to find the pattern, what's the common consistent rule between these inputs and outputs, and then apply it on a test that you're given. And yeah, you're right. It's easier to visually see than to describe. But ARC basically challenges you to, on the fly, recompose knowledge that you've acquired throughout your life, basically on what we call these core knowledge priors. These are things like symmetry and rotation and object detection and tracking and basic understandings of physics. ARC requires you to kind of abstract and compose those core knowledge priors on the fly to a task that you've never seen before. And this is what the sort of historical LLM scaling paradigm has never been good at. Like GPT-4 scored, for us, like 4%, for example, on the ARC data set, in contrast to the really impressive performance we started to see from these o systems. Okay, so what changed from like 4o to these o systems that are doing well? Like what did they do differently to make it work better? Yeah, I mean, fundamentally, the strongest thing I can say is from a capability standpoint, I can make a very strong capability assertion, and then I can have some informed speculation on how. You know, the capability assertion, you see this in just the score. If you look at ARC v1, right? For over a five-year period from when it was first introduced in 2019 up through basically last fall, the best sort of LLM, GPT-4, got 4% on it. And then o1 came out, o1 Pro came out, o3 came out, and you saw the score rapidly go from 4% all the way up to 75, 85% on this really extreme high-end performance version of o3. It looks like a straight line. Like, it's pretty nuts. And this is actually a really good thing to see in a benchmark, actually. This means there's more signal in the benchmark, and it truly is doing a capability assertion. You know, it's a little tougher to understand capabilities by looking at these kind of monotonically, smoothly increasing benchmark scores that go up slowly over time. When you see something have a sharp bend, you know that something has distinctly changed. And I think that's the case with these reasoning systems.
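Since the puzzle format is only described verbally here, a tiny sketch may help. Below is a minimal, hypothetical ARC-style task in Python: a few input/output grid pairs, a candidate rule, and a check that the rule reproduces every training pair before it is applied to the test input. The field names and the toy rule are illustrative, not the official ARC schema.

```python
# Minimal sketch of an ARC-style task, assuming a simplified format:
# grids are lists of lists of color indices (0-9), and a task gives a few
# train pairs plus a test input. Field names are illustrative.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [{"input": [[3, 3], [0, 3]]}],
}

def mirror_horizontally(grid):
    """Candidate rule: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A candidate rule only counts if it reproduces *every* training pair exactly.
fits_all = all(
    mirror_horizontally(pair["input"]) == pair["output"]
    for pair in task["train"]
)
if fits_all:
    prediction = mirror_horizontally(task["test"][0]["input"])
    print(prediction)  # [[3, 3], [3, 0]]
```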
And specifically, the thing that they have done is they have added the ability to recompose knowledge that they've been trained on in their foundation models at test time. OpenAI calls this test-time compute. It's kind of the broad paradigm of: we want to use more compute at test time to think before we just jump to an answer. You know, GPT-4o, you start spitting out tokens in 500 milliseconds and, you know, that's its answer. Whereas, you know, the intuition here is we want to allow these systems to have more time thinking up front before they have to render a final answer. And the way they think is by what's called this chain-of-thought paradigm. You might have heard "let's think step by step," right? This prompting technique that came out literally like three years ago. This paper actually is very special to me. It was one of the things that first got me to go all in on AI. But this COT moment that happened in literally January of 2022, like, we're just sort of downstream of that moment still. We're trying to figure out how we can apply this paradigm of chain of thought, of having the models basically think out loud to themselves in a sequence, where you say, okay, hey, what's the next step to solve this problem? Okay, now take that next step. What's the next step? Okay, now what's the next step? And you do this sort of in a big chain. And then the model is able to use that entire sort of chain-of-thought trajectory to ground its final answer. And this is what o1 does. This is what R1 does. This is what R1-Zero does. There are some differences between R1-Zero and R1 we should talk about that are very important. But roughly this is a... Before we go there, maybe let's take a moment to be, like, astonished that this works, right? Because you're saying that these LLMs are compression algorithms that, you know, kind of can't reason. They can only sort of, you know, do some kind of limited, you know, generalization. They're now generating text, and the chain-of-thought, you know, paper had them generating text in steps. So these kind of dumb models are generating text now in steps, and suddenly you claim they can reason like that. That doesn't seem obvious that that would work. I don't think it is. And it's one of the reasons why, like I said, I went all in on AI back in 2022. There was this, like, up until January of 2022, and I co-founded Zapier 15 years ago, but working on it, I was an exec at the time, I was running our product and engineering org, like half the company, building our new product stuff, and I saw this, and I had been paying attention to AI, and I saw this chain-of-thought paper that came out, and I thought I had a good perspective of what LLMs could do and couldn't do up to that point, and then this paper dropped where it was like, oh, just by asking the model to think out loud, you saw these performance scores on the reasoning benchmarks at the time showing really, really large spiking behavior. And they'd grown from 30% to like 70% or so. And that was my kind of, oh, shoot moment. Are we on track for AGI with this technology possibly? And I felt that was really important to know just from a Zapier standpoint, like should we start using this in our products? And also, you know, just from a human eye, I just want to know. I think this is some of the most important technology that has come to the world. And so I kind of went all in to start understanding the sort of paradigm here.
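As a rough illustration of the plain chain-of-thought recipe described above (ask the model to think out loud, then ground the final answer in that trace), here is a hedged sketch. `call_model` is a stand-in for whatever LLM API you use, and the prompt wording is illustrative rather than the exact phrasing from the paper.

```python
# Minimal sketch of plain chain-of-thought prompting. `call_model` is a
# placeholder for an LLM client; nothing here is a specific vendor's API.
def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def answer_with_cot(question: str) -> str:
    # Ask the model to think out loud before committing to an answer...
    cot_prompt = f"{question}\nLet's think step by step."
    chain_of_thought = call_model(cot_prompt)
    # ...then ground the final answer in that full reasoning trace.
    final_prompt = (
        f"{question}\n\nReasoning:\n{chain_of_thought}\n\n"
        "Therefore, the final answer is:"
    )
    return call_model(final_prompt)
```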
One interesting thing about that old-school version of chain of thought, and Zapier, like, shipped with chain of thought in fall 2022 as well, with a very similar paradigm of iteratively prompting the model to think out loud, is that it has a very low degree of ability to adapt. And we see this even with o1 and R1, you know, the scores on ARC are only about 15%. It is a big step up from the sort of GPT-4o class, you know, the 5% territory, but it's still a relatively weak amount of adaptation. The thing that has really made these things work with o1 Pro and o3, and this is getting into informed speculation, is they've added search and sampling on top of the COT generation process. They're not just asking for a single COT and then saying, okay, now give me the answer. They are sort of generating multiple COT steps in parallel and saying, okay, which one's the best one? Okay, use that one. Now go to the next step. Now ask for a bunch more, pick the best one, go to the next one. They're effectively doing this like program synthesis or program search at inference time. And this insight has allowed a significantly higher degree of adaptation, and that's what's getting up to the 75, 85%. And, you know, to your original point, this is astonishing. Like, it's a big update. I think these systems demand serious study. And it's also why I'm very excited to see R1 and R1-Zero open sourced, because I think that will allow more people to do the science. Totally. And before we get into R1 and R1-Zero, what's the human performance on ARC and where are these best models at today? Like when you say 85%, put that in context of human-level performance. Smart humans can get basically 100% on ARC if you want. I see. The data we have is if you take two, let's call it STEM graduate humans and put them in front of it, they'll get 100%. Like I think the actual data we have is like 98, 99% across the two. This has been one of the actual flaws of V1, that we haven't had strong assertions of human capability. Something we're fixing in V2, we've actually been working on V2 for years now, actually. We put a lot of effort into building it last summer and we're putting the final touches on it right now and we're going to launch it with ARC Prize 2025 this year. One of the things we have with V2 is actually strong human-study baseline testing to make confident assertions that every single puzzle in the benchmark is solvable by humans, in order to justify the easy-for-humans claim. I do think that that is the spiritual guidance of the ARC benchmark in the future. I mean, it's true for V1, will be for V2 and V3 and all future versions, that we want ARC to represent this concept of things that are easy for humans and hard for AI. And I think that's the gap we're driving to zero, because I think if you could get that gap to zero, you legitimately, I think it's gonna be hard for anyone to claim we don't have AGI. If you can't find a single thing that humans find easy but computers find hard, I think that's a reasonable goal to set and a target to shoot for. And this is in contrast to how a lot of other frontier benchmarks work. Dan Hendrycks' Humanity's Last Exam, or Epoch AI's FrontierMath. They're investing in making these ever more difficult benchmarks. And I think that's a fine thing to do, by the way. I don't discount that effort at all. I think it's useful.
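Mike is explicit that this is informed speculation, so the sketch below should be read the same way: a toy version of "sample several candidate chain-of-thought steps, score them, keep the best, repeat." `sample_next_step` and `score_step` are placeholders for an LLM call and some verifier or reward model; nothing here reflects OpenAI's actual implementation.

```python
# Hedged sketch of chain-of-thought sampling plus search (best-of-N per step).
import random

def sample_next_step(problem, partial_chain):
    # Placeholder: in practice, an LLM samples a candidate next reasoning step.
    return f"step {len(partial_chain) + 1} (variant {random.randint(0, 9)})"

def score_step(problem, partial_chain, step):
    # Placeholder: a verifier or reward model rates how promising a step is.
    return random.random()

def search_chain_of_thought(problem, max_steps=5, samples_per_step=8):
    chain = []
    for _ in range(max_steps):
        # Sample several candidate continuations...
        candidates = [sample_next_step(problem, chain) for _ in range(samples_per_step)]
        # ...keep only the highest-scoring one, then extend the chain.
        best = max(candidates, key=lambda s: score_step(problem, chain, s))
        chain.append(best)
    return chain  # the full trajectory used to ground the final answer
```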
But I think there's something more important to understand from a capability assertion standpoint around, well, what do humans still find easy? I think that's more revealing of something we're missing in these systems still. I totally agree, although it's funny. I mean, it's been such a long line in my lifetime, and especially in the last year or two, of these benchmarks just being like, okay, this is human-level performance, if we get there, we're going to have AGI. And then, over and over and over and over, you know, beating these benchmarks. Like, do you, I guess it's really hard to make this claim with certainty, but do you feel like ARC could be the last of this genre? Like, is there something else out there where, you know, AI solves this, but then there's something left as a benchmark that it can't do that humans can do easily? What I can say more confidently is that we're going to run the gap of easy for humans, hard for AI to zero. You know, I think that is up to the world to decide, is that AGI? I personally think it is, but that's the design philosophy we have for V2. That's going to be the design philosophy for V3 and future versions. I'll say this, V2 is going to look probably pretty similar to V1 in terms of domain because we've been working on it forever. It is harder for computers, but it's still easy for humans. V3 will likely be something that looks very different from what we've done so far. I think it will look like ARC still, but I think it's going to test different capabilities that humans still find quite easy that the sort of current benchmarks don't. And it will likely also include a sort of formal version of measuring efficiency, which I think is also going to be a really important thing we care about getting up to AGI. Okay, so R1 and R1-Zero, can you describe what those are and why they do well on ARC? So R1-Zero and R1 are both kind of in the spirit of being trained on COT, chain of thought, and then generating basically a single COT and then giving us an answer. I think the really important thing to know about is what differs. Why are there two models? Why are R1-Zero and R1 separate? This is a very interesting thing. Like, they didn't have to release R1-Zero. In my view, R1-Zero is a more important system to understand than R1. And for this reason: what does the zero stand for? The zero stands for no human data in the training loop for R1-Zero. Like AlphaZero, I guess, it's an homage to that probably, right? Yeah, they're using it. They're training it purely on RL, reinforcement learning, where they're using domains in math and coding to create verifiers, where they can have the model, DeepSeek v3, generate a COT, and then they can use a formal domain, like literally just run a computer program, and feed back, was that right or wrong? But just to be clear, before we go down that path, so they're not, obviously, there's an earlier training step where there is human data in it, right? I mean, they can't, I mean, how does it learn language? There's a foundation model, DeepSeek v3. This is not a forever given, though. I think this is going to be an engineering trade-off that AI system developers have to make in the future, which is how much knowledge do we put in the foundation model that we start from, and how much do we have the system sort of generate itself at test time, basically at runtime? I think likely what you'll sort of...
My sort of expectation is you're going to use these reasoning-style systems basically to generate new knowledge and add it back into some ever-growing corpus of knowledge that they have. And you'll probably use those as your starting points in future training runs and future inference systems. So yes, it is the case. Obviously, GPT-4o was trained on human data. DeepSeek v3 was trained on human data. But this is not likely going to be the forever case in terms of the capabilities of these systems. I think you will see, and this is why R1-Zero is important to look at. If it's able to bootstrap itself up in knowledge from literally first principles, like math, you know, give me the Peano arithmetic operators and I'm going to bootstrap up to Calc 3. Like that is something legitimately that a system like R1-Zero could potentially do using RL without a human signal. Okay, so they train the system on, they train like a foundation model on human data, and then what's this next step that they're doing with R1-Zero, just specifically? So they're generating chains of thought, and then, yeah, it's a single chain of thought, you know, step by step, taking a full chain of thought, and then the whole thing is being used to get the final grounding for a final response out of the same DeepSeek v3 model. And how does the feedback get incorporated offline at training time? So what I just described is kind of like test time, right? It's like, okay, user inputs a query, I'm going to generate one COT, very long, and give you a final answer. Yeah, let's go back to the feedback loop at training time. This happens offline, right? The developers did this months ago, you know, where they took the DeepSeek v3 model and had it generate lots of COTs, and they had another computer program looking at all those COTs and giving the sort of training-loop feedback of, was that a good thought or a bad thought? And that RL signal is basically being used to sort of fine-tune the ultimate R1-Zero model. And then in the case of R1-full, they actually also allow human experts to label those. But in the case of the Zero, how do they have the model know if it's a good thought or a bad thought? This is what's different between Zero and one. So R1 uses a grader model, where they have a neural model basically giving feedback and saying, in text form, good job, bad job, kind of style stuff. R1-Zero doesn't do this. It's purely a symbolic verifier, right? So they are saying, I'm going to take this, you know, potential code that you've just output and try to symbolically run it and use it to give a feedback signal, like, is it good or bad? And that's generally what you're seeing in this RL space, right? You want to get a sort of final answer, and then you're going to symbolically try to verify that and make a hundred-percent-reliable assertion of whether that was right or wrong, which you can do in domains where you can run a computer program to verify, you know, a Python program, a math equation, a piece of code, does the code compile, does it get the answer you expect. Yeah, you've got benchmarks where you have all this stuff set. This is different from, yeah, some of the o1 stuff where they're actually using a process grader model, which is a separately trained neural one. Again, this is all informed speculation.
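A hedged sketch of the symbolic-verifier idea described for R1-Zero: the reward comes from mechanically checking the model's final answer against a known ground truth rather than from a neural grader. The answer-extraction format and binary reward here are assumptions for illustration, not DeepSeek's published recipe.

```python
# Hedged sketch of a symbolic verifier reward for RL on chains of thought.
# The "ANSWER:" convention and 0/1 reward are illustrative assumptions.
def extract_final_answer(completion: str) -> str:
    # Assume the model is prompted to end with a line like "ANSWER: 42".
    for line in reversed(completion.strip().splitlines()):
        if line.startswith("ANSWER:"):
            return line.removeprefix("ANSWER:").strip()
    return ""

def verifier_reward(completion: str, ground_truth: str) -> float:
    # Binary, fully symbolic feedback: right or wrong, no model in the loop.
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

# In training, this scalar would feed an RL update (e.g. a policy-gradient
# step) computed over many sampled chains of thought per prompt.
```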
They haven't shared any of this, so these are all details I'm just trying to pick up and understand based on these systems, but that's what I understand: they have a separate actual LLM model that's giving feedback in their training loop. So in the case of R1-Zero, it's pretty cool they've been so transparent about how they do it. Is it possible that there are domains, like computer programs, where it's really easy to do kind of chain of thought and then verify, and other domains where it's harder? And so you end up with models that are sort of specializing towards the domains where it's easier to tell if the result is accurate or not? This is the bet, right? Like, can R1-Zero scale up without having to add humans in the loop? The evidence we have is that o3, which does way better on ARC, like basically effectively beats the V1 dataset, required supervised fine-tuning from humans to get that to work at a high enough degree of efficiency to make it tractable. We don't have any evidence yet that you can bootstrap a pure RL-based LLM chain-of-thought system to get there. That's what you're probably going to see happen though in 2025, I would expect. I mean, I think another obvious thing that's come up with these models that people have noticed is that they were a lot cheaper to train by... Yeah. By orders of magnitude. Why, do you have a thought on why that is? I haven't looked deeply into this. You know, the only commentary I probably can offer here that's interesting is, you know, the ticket price is what people often are comparing here, like what's the commercial pricing rate of o1 versus what the DeepSeek commercial rate is for the model they're hosting. And, you know, because R1 and R1-Zero are open, you can actually run them on your local infrastructure. You don't have to go pay the rate. They're giving you basically that cost, like it's very close to the actual cost. And OpenAI has got margin. They're a company. They've got researchers to feed. They've got future research to invest in. Like, you know, they're building a business there. So my, again, informed speculation is that there is quite a bit of margin built into o1's inference costs right now in order to fund future R&D. And the true costs are probably more comparable than most people expect, at least for R1 and R1-Zero versus o1. I think there's this other debate of, how much did DeepSeek v3 cost to train? You know, are we hiding the fact that this headline number of, oh, it cost $5 million to train R1 and R1-Zero, obfuscates the fact that there was a bunch of money that went into training the foundation model? And that is accurate. I have no idea how much the foundation model cost to train. I haven't looked into this. I have no sense of what it costs. I don't know what GPT-4 costs either. But I would maybe make the single comment that, from an inference cost standpoint, I would broadly put these R systems and o systems in somewhat the same bucket from a cost basis, broadly speaking. Got it. Okay. All right, well, thanks for jumping in. I feel like maybe we should take a step backwards. I think, you know, when I first met you, it was maybe more than a decade ago. You know, you're the founder of a company called Zapier, which I've long admired, but maybe some people have not heard of. You want to talk about what Zapier does?
Yeah, Zapier is an automation company. We try to deliver automation software that's very easy to use for non-technical folks in order to automate, you know, largely parts of their business. We're used predominantly by, you know, individual people, either in their own life, on small teams, individual teams within large organizations. It's intended to be very, very easy to use, and that's in contrast with historically the automation systems that have existed out there. We're used today by, I think, several, I don't know, three, four million businesses in the US. We have a pretty large international presence as well from a customer-base standpoint. And the core idea is you can connect the apps you use. So you've got business software you use like Gmail or Slack or Salesforce or whatever, and, you know, businesses build internal workflows around these processes. They often have humans that have to shepherd data between the different steps, make decisions, and this is something Zapier can completely automate for you, and it's easy enough to use for, you know, line-of-business type folks that you don't have to get an engineer involved. And I think some of the notable things about Zapier also is, I mean, this might be unique. I mean, you raised a tiny amount of money from VCs, and then you got off the VC train like almost nobody does and got to incredible scale. I mean, it's been an amazing return for your initial investors. And I think you're one of the first companies. We debated even raising the money in the first place. What's that? We debated even raising the money in the first place. You know, it was a very practical decision, frankly, to sort of do it. You know, the three founders, me and Bryan and Wade, you know, we're all from the Midwest. I grew up in St. Louis. And, you know, there's no venture capital market in the Midwest, really. Maybe there's a little bit, but like, you know, 2010, nothing. And so how do you build a business? Like, build a useful service product, you sell it, you use the proceeds to invest back into the business. And so that was kind of our bias and how we initially operated. And we went through YC and we were really debating, should we, we've got this demo day moment, everyone else is gonna raise money, should we? And we did office hours, actually, funny enough, with Sam Altman, who was the guy who we chatted with, and we were like, should we raise or not? And I think the question he asked us was, well, look, what's the constraint on the business? And it was a good question, because I think, you know, the legit constraint on the business for Zapier at the time was that Bryan and Wade and I were all waking up and doing support until, like, noon every day, and so it was taking away time to make the product better, to not have as much support in the first place. And we were like, oh, well, we should go hire a support person, and we didn't have enough cash, you know, on hand yet to go hire that person. So we decided to go, you know, raise a small round in order to, you know, be able to hire someone ASAP, to, you know, give us more product energy and time back. And the funny anecdote, sad anecdote, is that by the time we got the round raised, got the money in the bank, found the person we wanted to hire, got them on board, started payroll, and they had their first paycheck, revenue had actually caught up enough to just pay them directly. And so I'm pretty sure you could trace the million dollars that we raised in lineage all the way through to today.
It did allow us to activate, though, and start, which I think is honestly the biggest value we got out of YC, which was just the activation moment to really go full-time on it. Amazing. And you were also remote-first from the beginning, right? Yeah. Another weird thing Zapier did. Yeah, we were a globally remote team and have been since 2011. I think the only other companies at the time that I knew of were, like, WordPress. Automattic was fully remote. Yeah. 37signals. I think those are the only two that we knew about, at least. And then I think kind of the third interesting thing about Zapier from my perspective is you were very early to use LLMs. I think you were the first company that I saw with real LLM use cases. Can you describe what those were and what that experience was like? Yeah. I think this comes back to the chain-of-thought paper that came out in January 2022. Like I said, my background is in engineering. In college, I did mechanical engineering. I did optimization research, which ends up being the exact same math as all this deep learning stuff. I didn't figure this out until 2017, but as soon as I realized, you're on the Gradient Dissent podcast, actually. Yeah, exactly. As soon as I figured this out, I was like, oh, okay, I know how this works. It demystified it a lot. And so I started paying more attention to the research side, but Zapier was growing, I had other priorities, you know, we had to grow the company, introduce new products, yada yada, and so I really wasn't paying that much attention. You know, I read the GPT-2 paper when it released, played with it. I read the GPT-3 paper, gave a whole presentation to the company on it. And, you know, and then it's like, okay, cool. We can do some maybe basic stuff around the margin. Cool tech, but that's it. And then this January 2022 chain-of-thought paper, Jason Wei's paper, came out. And that's what really got me to say, oh, this might actually be extremely relevant now for what Zapier customers are trying to do. And I actually went to Wade, who's the CEO, and I said, Wade, I need you to take back my half of the company. You need to go run product engineering, because I need to just go do AI research here at Zapier and figure out what does this mean for us, for our business, for our customers. And so for a good six, twelve months, me and Bryan, the CTO, just coded all day long and tried to understand what can this tech do, what were the limits. And that got us, and this is like summer 2022, so we're still four or five months before even ChatGPT came out, and we had built, like, tree-of-thought prototypes. We had built a version of ChatGPT internally using the technology. Like, we had pretty much prototyped all the kind of foundational pieces in probably three or four months and identified that probably the most obvious place Zapier could start playing first was this concept of tool use. Could we equip LLMs that are frozen weights, right, that don't have the ability to take action in the real world, could we equip them with the tools on Zapier's platform, all the actions, all the search endpoints that we have, and allow them to do more? And that kind of activated us really early on to start building and delivering AI products. And I think that's the reason why we were so early.
And it also gave me a pretty, you know, this leads into the later story, but it was also, I think, what allowed me to start seeing the limitations of this paradigm really early as well, because, you know, I talked to hundreds of Zapier customers trying to deploy this AI technology in the middle of their automations to do stuff. And, you know, Zapier has been deploying AI agents now for two years, so I've just gotten to hear what people want from this tech and where it does not work. And the number one problem that they all tell me, it's consistent across the board, is: the promise is there, I get what it can do for my business, but I just don't trust it enough yet to go hands off. And again, Zapier's an automation product. It's different from ChatGPT, right, where you're typing on a keyboard and you get a response you can audit. Zapier is running on a server offline. You're not watching or monitoring it. And that was the feedback: I just don't trust it, because the reliability is not high enough in how it's operating yet to not put a human sort of in the loop. And this feedback was so loud and unchanging from GPT-3.5 to 4 to 4o. And it was in contrast with all this, you know, dogma around scaling hype that was crazy for 2023 into 2024, and it just wasn't matching up with my lived reality. And I was like, okay, how the heck do I explain this? You know, I have two sets of facts that are incongruent. And that's when I kind of rediscovered Francois' podcast with Lex from back during COVID, which is when I think I first listened to him and learned about the ARC benchmark more. I'd actually been thinking about it a little bit, but really dug in and read the On the Measure of Intelligence paper that he published in 2019. And that was kind of my aha moment, because I thought that paper did a really good job of articulating the sort of promise of the technology that we've seen and why it's important and impressive, but also where the fundamental limits of just scaling up pre-training memorization were going to hit. It was leading to all the facts that I was seeing from customers. And once I got to that conclusion, I was like, well, clearly the ARC benchmark is the number one most important benchmark in the world. More people should know about it. And going into last summer, at least, it was a relatively obscure benchmark at that point. And then you made a prize for the benchmark. Yeah. Do you want to talk about that? I think this phase is where you and I, like, you made the introduction to Francois, which, thank you for doing that. I think that's hopefully been a very helpful thing for the world. I think you got hopefully high leverage out of that introduction here. Well, I got you on the podcast, so that's great. The world thanks you, Lukas, for doing that. And, yeah, so I'd been maybe surveying, researching the Bay Area at least for a couple, maybe like a year or six months at that point, on like, hey, I think this is the most important benchmark, have you ever heard of it? And the awareness rate on ARC was relatively low. It was like 10, 20% of people I met had heard about it. Most of the people who thought they'd heard about it had actually heard about the old bad version from, like, the Allen Institute that had long been beaten by language models. And it was just this, you know, relatively obscure thing.
You know, it had, Francois and this other sort of Swiss lab, Lab42, had been running a small version of a contest for several years going into that point. So there was some evidence that it was a robust benchmark. It wasn't, you know, completely obscure, but certainly in the AI industry the awareness was quite low. And I was like, well, this is direct concrete evidence. It's probably the only major concrete public evidence we have that there are fundamental limits to pre-training scaling. Every other benchmark in the world was saturating faster and faster except for ARC. And I was like, awareness is clearly the problem. And so after your introduction, I flew up to Seattle and I got lunch with him, and I was pitching him on my ideas of how to beat ARC. I had some ideas, and it was a fun chat. But I also had some pretty critical questions, like, well, why do you think awareness is so low? Why aren't you working on it more? And he had really good answers for all the key questions I had. And I walked out of that, I had one question at the very end of my notes that I put together when I was flying up, which was: pitch ARC Prize. And I was like, okay, I think it is true that the reason awareness is low is because it's legitimately hard. Every frontier lab has given it a try. And that increased my confidence that it was a really important benchmark that we should grow awareness around. So I was like, well, one way I know we could probably do that is with a prize. I had just seen Nat Friedman and Dan run the Vesuvius Challenge the previous year, which was super successful at growing awareness around this kind of obscure problem and growing the status, growing the interest in it, to get people to shift gears and work on it. And so I was like, I bet we could do something similar. And that's where ultimately ARC Prize came from. And so what happened when you launched the ARC Prize? Was it successful in getting new engagement with the problem? Well, I think going into June 2024, again, like 10% of AI researchers had heard about ARC. I think coming out of 2024, like as of December 31st, everyone in tech has probably heard about ARC. So I think we solved it. Honestly, my true response is I have been continually surprised at how much energy there has been around the benchmark. You know, I'll give you a concrete example of this. We were going into the end of the contest. It was in like early November. And, you know, we had a bunch of teams on the leaderboard. One of the, you know, requirements for winning the cash prize was that you open source your progress. This is an idea of trying to re-baseline the progress in the community each year. And the number one team, you know, who was kind of the runaway winner, they had been at the top of the leaderboard all summer long, you know, had this like 55% score or something like this, and they emailed us the week before, like five days before the contest ended, and were like, hey guys, we're not sure we want to open source our solution. And I was like, oh man, shoot, is this going to work? Did we get something fundamentally wrong about how to structure this contest to help make progress toward AGI go faster?
And in the next 72 hours after that call happened, at the close of the contest, there were like two other teams that shot up the leaderboard from, I don't know, 10th place up to second place. The number two team actually was right at neck and neck with the number one team. There were like three different papers that got dropped within 24 hours of the contest close, where they had just timed their ARC research papers to drop at the same time and enter the paper contest. And it was just this phenomenal amount of energy right at the end of the contest that had been kind of hidden from us. We just didn't see it at all. And I think that happened again when OpenAI's o1 model came out. There was just this incredible outpouring of demand for us to test o1 on ARC that I didn't expect. I mean, we had like thousands of people on Twitter begging us to go test this thing. So we did, and it was really important. I'm glad we did it. But it was just like, there have been these moments of surprise, I guess, for me, in terms of its relevancy and I think how much awareness we've been able to grow around it. What kind of insights do you think have come out of wrestling with the prize, or with the challenge? In terms of structuring contests like this? No, no, actually, I mean, in terms of AI, like, you know, what approaches have worked? What have we learned about how to build intelligent systems? So the classic way that people have been trying to beat ARC for four years was with program synthesis, pure program synthesis. The idea is, you know, you build a domain-specific language. Think about it as little Python functions, Python transformations. And you build this DSL typically by a human, like, looking at the puzzles and trying to make sort of guesses about what the transformations could be. And then you have a brute-force search, effectively, over the space of all possible combinations of those transforms to look for ones that match your input and output, and then you apply them on the test. This is kind of the classic way that people tried to beat it up till, you know, this year, basically. And it doesn't really work well. It's just very slow. It's very inefficient. And it's very brittle, right, because all the sort of generality is built into the DSL that the human is sort of going after. And in ARC Prize 2024 there were a couple of brand new approaches that were really interesting to see, that I think the contest popularized. You know, I think the first major one was early on, about using, I guess, an induction-based approach, where you're trying to generate lots and lots of Python programs using language models, and then searching over them, and kind of hinting and sort of informing the language model in how it generates a Python program based on sort of the input from the puzzle, in order to guide the language model's program generation. And this is like Ryan Greenblatt's thing. He got a really, really early score, like around 40% or so, using a technique like this. It was very inefficient. Like, you need to generate hundreds of millions of Python programs to do it, but it kind of showed some promise of this program-induction style method. How do you then pick among, like, you generate those programs, but how do you pick which one looks promising? You run them. It's RL, right? It's symbolically verified. When you generate a program, you can run the program.
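To make the classic DSL-plus-brute-force approach concrete, here is a toy sketch: three hand-written grid primitives and an exhaustive search over their compositions until one fits every training pair. Real ARC DSLs are far larger and the search far smarter; the primitives and depth limit here are purely illustrative.

```python
# Hedged sketch of the classic "hand-built DSL + brute-force search" ARC approach.
from itertools import product

def flip_h(g): return [list(reversed(r)) for r in g]   # mirror left-right
def flip_v(g): return list(reversed(g))                # mirror top-bottom
def transpose(g): return [list(r) for r in zip(*g)]    # swap rows and columns

DSL = [flip_h, flip_v, transpose]

def search_program(train_pairs, max_depth=3):
    # Enumerate every composition of primitives up to max_depth...
    for depth in range(1, max_depth + 1):
        for ops in product(DSL, repeat=depth):
            def program(grid, ops=ops):
                for op in ops:
                    grid = op(grid)
                return grid
            # ...and keep the first one that matches all training examples.
            if all(program(i) == o for i, o in train_pairs):
                return program
    return None  # search grows exponentially with depth; that's the bottleneck

train = [([[1, 0], [0, 0]], [[0, 1], [0, 0]])]  # toy rule: horizontal flip
solver = search_program(train)
print(solver([[2, 0], [0, 0]]))  # [[0, 2], [0, 0]]
```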
But then how do you evaluate if it's, is it easier to evaluate if an answer is right than to generate a response? Yeah, the data has the answer, you know? And so it's like, and this is true of all, you know, math, coding. This is where o1, R1, o3 all really dominate, right? It's in these sort of easy-to-verify domains where you have an answer and you can check it quickly. If a computer can check the answer quickly and with, you know, exacting correctness, then these types of things will work. And the other big one was test-time training. This is the other really big kind of novel approach that we saw come out, where, you know, both of these basically are trying to adapt to novelty. That's what they're trying to get. And the way test-time training kind of works is, the ARC data set has a public data set and there's a private data set. And this is the one in the Kaggle contest that the actual money prize is attached to. It's a private data set that no one has seen. Very few humans in the world have ever seen it. And you're not allowed to see it when your system gets tested on it, in order to have strong guarantees around ability to adapt. We try to reduce the chance of cheating or overfitting on the private data set. And so what people figured out is that they can take the private data set inside Kaggle and use that data as a starting spot to generate lots and lots of similar data using data augmentation. They might change the colors, mirror the grid, things that don't change the semantic rule, but things that generate lots of nearby permutations. They then fine-tune a model using these, you know, several thousand, tens of thousands of locally generated examples, and then they inference it. And that actually was working. I think that's what got one of the top scores, the 50% scores, was this paradigm of test-time training. And I do think that you've kind of got two really broad points of evidence now, even with o3 beating ARC v1, two broad sets of ways that we know about to use language models to adapt to novelty. One is COT search, which is what o1 Pro and o3 do. They're doing lots of sampling, lots of search, per COT step. And then you've got this test-time training paradigm, where you want to take the situation and try to create data augmentation around it in order to feed that back into the manifold and do inference on it. This is a form of knowledge recomposition, I guess, is a way to think about it. And ARC shows that both these approaches are pretty promising in terms of getting computers to not just strictly memorize what they've been trained on before. Do you think if ARC was broader and not just sort of in this domain of pixels and changing colors, that would kind of break that approach? It seems like you have sort of contained the domain of ARC by just the format of it. Like, I remember reading years ago... There's pretty obvious ways you'd... Oh, God. I just, I remember reading a Douglas Hofstadter book as a kid, I think, where he makes these sequences of numbers and you just try to guess what the next number would be in the sequence. That sort of seems like a variant of an ARC-like reasoning challenge, but in a different structure. This is something that's underappreciated about ARC, I think, or a common misconception about ARC, is that it is a visual benchmark.
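A hedged sketch of the test-time-training idea just described: take one task's demonstration pairs, generate rule-preserving augmentations (recolorings, mirrors), and fine-tune on them before predicting. The augmentations are illustrative and `fine_tune` is a placeholder for whatever training step a real solver would use.

```python
# Hedged sketch of test-time training via data augmentation on one ARC task.
import random

def mirror(grid):
    return [list(reversed(row)) for row in grid]

def recolor(grid, mapping):
    return [[mapping.get(c, c) for c in row] for row in grid]

def augment_pair(inp, out):
    # Apply the same transform to input and output so the pair stays consistent.
    variants = [(mirror(inp), mirror(out))]
    colors = list({c for row in inp + out for c in row})
    mapping = dict(zip(colors, random.sample(colors, len(colors))))
    variants.append((recolor(inp, mapping), recolor(out, mapping)))
    return variants

def fine_tune(model, inp, out):
    # Placeholder: one gradient step on (inp -> out) with your model of choice.
    pass

def test_time_train(model, demo_pairs, n_rounds=1000):
    for _ in range(n_rounds):
        inp, out = random.choice(demo_pairs)
        for aug_in, aug_out in augment_pair(inp, out):
            fine_tune(model, aug_in, aug_out)
    return model  # then run inference on the task's test input
```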
You know, I get why this misconception exists, because we render it visually for humans to take it. The intuition is like, ah, you know, our intelligence systems aren't good at ARC because they just aren't good at dealing with visual domains yet. ARC should more be thought of as a program synthesis benchmark than as a visual benchmark. And the intuition here is that classic program synthesis is exactly what you just said: given a sequence of integers in and a sequence out, figure out what's the rule. How do you map the 1D line of numbers into another 1D array of numbers? ARC extends that into 2D for the first data set. It's just a matrix instead of an array, but it's still at its heart a program-synthesis-like challenge. It's like, okay, given a matrix of numbers now, instead of a vector of numbers, what's the transformation rule? The exact same problem statement still surrounds it. And now, to your, I think, kind of curiosity here: isn't this kind of domain-constrained? Could you get into non-formal domains with this kind of stuff? I think this is where pure program synthesis is not going to be able to do that. You are going to have to, in order to kind of solve ARC the hard way and build a system that has a lot of domain generality, you're going to need to merge deep learning and program synthesis together into the system. Pure program synthesis is just going to be, I think, fundamentally too brittle to represent everything there. You're going to need some degree of being able to create model descriptions and be a bit fuzzier on the edges around how to sort of capture that model. But ultimately, yeah, I think ARC is today at least better articulated, if you want, as a program synthesis benchmark that we haven't made progress on. Okay, so that actually makes me think of a question, which is: is the 2D nature of ARC, like, essential to it at all? I think it's an important design principle, because it's trying to, I think one of the magic, and this is all credit to Francois, one of the beautiful things about ARC is its ability to capture your curiosity.
Now, to your curiosity here: isn't this kind of domain-constrained? Could you get into non-formal domains with this kind of stuff? I think this is where pure program synthesis is not going to be able to do that. In order to solve ARC the hard way and build a system that has a lot of domain generality, you're going to need to merge deep learning and program synthesis together into one system. Pure program synthesis is just going to be too fundamentally brittle to represent everything there. You're going to need some ability to create model descriptions and be a bit fuzzier around the edges about how to capture that model. But ultimately, yes, I think ARC today is better articulated as a program synthesis benchmark that we haven't made progress on.

Okay, so that makes me think of a question: is the 2D nature of ARC essential to it at all?

I think it's an important design principle, and this is all credit to Francois. One of the beautiful things about ARC is its ability to capture your curiosity as a human. You look at the puzzles, and I've given them to a lot of people, my friends and family, everyone. You look at the puzzles, and based on all the hype you heard about AI coming up last year, and all of your own intuition from working with these systems, you take a puzzle and think, oh, okay, that was pretty easy. And then you're told, yeah, AI can't do that yet. And you're like, wait, what? Are you sure? What if I just pasted it in? I'll just take a screenshot and throw it at the model. Are you really, really sure? It invites this curiosity moment of, what's going on here? That, I think, is its real standout quality. The reason we render the benchmark visually is not just to make a claim about something intellectually interesting, some capability we don't have yet; its goal is to inspire people to work on it. Benchmarks aren't useful if no one actually cares about them, and I think ARC has done a really good job, Francois did a really good job from a design standpoint, of making it something that captures your imagination, makes you ask why, and pulls you down the funnel of, all right, I have some ideas, let me just go try it.

And its goal is to inspire folks, right? That's the whole point of launching ARC Prize: we launched it to raise awareness of this important unbeaten benchmark, of this definition of AGI and the design principle behind the benchmark, and to inspire more AI researchers to go try new ideas. My macro philosophy at this point in my life is basically: AGI is the most important technology in the history of humanity, and we should be doing everything possible. Anybody who has any unique idea about how to beat this thing, to create AGI, should be trying to do it, myself included. I just got really frustrated, when I met researchers and talked to venture investors, at how much funding, energy, and mindshare was going into this one paradigm. Even if it's the right one, the world deserves a few counterbets, just to increase our overall chance that we're going to get there. So I stand by that. I think we want the strongest global innovation environment we can get, in order to get to AGI as quickly as possible, so we can live in that future and use it to actually accelerate the future we all want.

Okay, so your new organization, is the idea that you're going to make some counterbets?

Yeah, yeah. So Ndea, we launched a few weeks ago, officially announced it. Francois and I started it as, I'll call it, an intelligence science lab, basically. Our view is that the way to get to AGI, one that removes most of the bottlenecks we currently see even in systems like o3, is a combination of deep learning and program synthesis. We've actually been talking about this since we started launching ARC Prize. We went on this big university tour and espoused this viewpoint; you can go find those talks if you're curious to get the gist of the idea. But this is the big idea: deep learning and program synthesis are two completely different paradigms. And program synthesis, by the way, program search, is what's making o3 work.
So I think we have very strong evidence now that this is the new paradigm that's actually unlocking this new set of capabilities. My belief is that, in retrospect, we're going to look at this moment of o3 beating ARC v1 in December as the starting point of another five-to-ten-year scaling journey on program synthesis, in a similar way that we look back at 2012, when AlexNet won the ImageNet contest. Funny enough, that was literally a contest they were running, and it's viewed as one of the starting-gun moments of deep learning. I think we're at that kind of point today, looking ahead at the next five to ten years. What don't we have yet? This is colorful rhetoric, but we don't have the transformer for program synthesis yet. There's a lot of technology that has not been invented. If you just look at the fields: you've probably got a million deep learning experts, or at least engineers, in the world today. In comparison, you've probably got a few hundred folks on the program synthesis side. If this is truly the path to AGI, we're going to need to grow that field. We're right at the starting spot. That's our basic view: we're going to need to merge these two paradigms in order to get extremely efficient AGI technology that doesn't have human bottlenecks in its learning process.

Okay, so what is program synthesis?

The best way I can articulate program synthesis is probably by example. It's actually a really old field, older than deep learning; it goes back to the 70s, 80s, and 90s. Classic program synthesis deals with trying to figure out a program that maps one integer sequence to another integer sequence. That's typically the benchmark researchers have cared a lot about. There's even this really cool thing, the On-Line Encyclopedia of Integer Sequences; researchers have literally catalogued hundreds of thousands of these sequences that you can use for this kind of research. You're given an input sequence and an output sequence, and your goal as an engineer is to create a computer program that can automatically figure out what the rule is. You might be surprised, but this is actually a very hard challenge. The reason is that, depending on the complexity of the program and how much hidden state there is, how many hidden variables there are, programs can be quite long or quite complex. And the longer the program gets, you run into this enumeration problem. This is how people classically tried to do it: brute force. They search over every possible program that could potentially exist and just check each one: plug in the input and see if you get the output. It's extremely inefficient. The scaling is exponential, O(x^n). It's nuts; you're never really going to make that work in a reasonable amount of time. But that's the rough form of the problem: you're trying to create programs.
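A minimal sketch of the brute-force enumeration described above, over a tiny invented DSL of integer-to-integer steps. The exponential blow-up with program depth is exactly the intractability being described:

```python
from itertools import product

# A toy DSL: each primitive maps an int to an int.
PRIMITIVES = {
    "add1": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}

def run(program, x):
    """Apply a sequence of primitive names to an input value."""
    for op in program:
        x = PRIMITIVES[op](x)
    return x

def brute_force_search(examples, max_depth=3):
    """Enumerate every composition of primitives up to max_depth and return
    the first program consistent with all (input, output) examples.
    Cost grows as O(|PRIMITIVES| ** depth) -- the exponential blow-up that
    makes naive program synthesis intractable for realistic programs."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None

print(brute_force_search([(3, 7), (5, 11)]))  # -> ('double', 'add1')
```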
And I'll map this onto o3, because of how o1 and o3 work. You can actually use these systems, or at least you can use o1, so you can get a feel for it. It works with these chains of thought, right? I give o1 a question. It thinks for a while. It builds a big chain of thought. And then it gives you an answer. A way to think about that chain of thought is as a program. It's a natural language program, but it is a program. It's got individual steps, and each step is kind of a transformation of the latent space from the thought before to the step after it. You're constructing this program. And what o1 Pro and o3 do is actually program recombination, where they search over the space of possible programs, very similar to how classic program synthesis works, where you're trying to enumerate through options to find it.

Again, the big challenge with program synthesis, and here I'll contrast deep learning with program synthesis, is that program synthesis can learn out of distribution, or rather, you can find programs that generalize out of distribution. This is in contrast to deep learning, which cannot, right? Deep learning is a paradigm where you need to give it a lot of data embedded on a high-dimensional manifold, and you can make quick, approximate judgment calls, intuitions, when inferencing new data off that manifold, but they're not going to be exact. There are no guarantees of exactness, you need a lot of data, and you're going to get in-distribution accuracy. With program synthesis, you're looking for a program, and you only need a couple of examples. You don't need a million; you don't need a hundred thousand. You need, like, three in order to find the contours of the rule, the program that produces them. And once you find the rule and you have the program, it's going to work for any input, no matter what input you give it. So it requires very little data and it generalizes out of domain, but combinatorial explosion is your problem. It's just intractable to search over all possible programs, or in o1's case, over all possible reasoning chains.

So how do you do this? Well, the insight is to use the pros and cons of both sides and merge them together. You want to use the upside of deep learning, which is making quick approximate sketches, to inform a search process, so the search isn't just brute force. You don't want to brute-force it; that's not how humans work. Humans don't sit here and think through a thousand Python programs to solve an ARC puzzle. We use our intuition to generate a sketch of a couple of possible answers to the puzzle, and then we symbolically verify it in our head. We run through the steps: okay, is this right? Is this right? Is this right? And if it's not, we go back to the deep learning part of our brain and say, okay, give me some more ideas. There's a smooth back-and-forth between the two systems. And we think that's the fundamental substrate we can construct around.
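A rough sketch of that guided-search loop, purely to show the shape of the idea and not Ndea's or anyone's actual system: a learned model proposes a handful of candidate programs (here `propose_candidates` is a hypothetical stand-in for neural guidance such as an LLM), a cheap symbolic verifier checks each candidate exactly against the few examples, and failures loop back as hints for the next round of proposals.

```python
def verify(program, examples):
    """Exact symbolic check: the candidate must reproduce every example."""
    try:
        return all(program(x) == y for x, y in examples)
    except Exception:
        return False

def guided_synthesis(propose_candidates, examples, max_rounds=10, feedback=None):
    """Deep-learning-guided program search (illustrative only).

    `propose_candidates(examples, feedback)` is a placeholder for a neural
    model that returns a few candidate programs as callables -- the quick,
    approximate "sketch" step. The verifier then does the exact check, and
    failures are fed back as hints, instead of brute-forcing the whole
    program space.
    """
    for _ in range(max_rounds):
        for candidate in propose_candidates(examples, feedback):
            if verify(candidate, examples):
                return candidate  # a verified program generalizes to any input
        feedback = "previous candidates failed on the given examples"
    return None
```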
It's funny, you know, my co-founder, Sean, has been doing a lot of work on SWE-bench. He recently got the highest score there. And it's funny, he talks a lot about the same stuff you're saying. It's a very different domain, but he's using o3 and o1 to generate programs, then running them and trying to figure out which is the best one. And I guess the runtimes are expensive here, but...

Efficiency is going to be a big factor. I think this is something the field of AI research has not fully reckoned with yet. When we launched the o3 news, we had to report it on a 2D graph where the x-axis was effectively token cost, or cost per task, because we're now in a paradigm in AI where you can spend more money to get higher accuracy and higher reliability. It's still on a logarithmic curve, so it's not like it scales up linearly, but you can spend more money to get a better answer. And that means you can't report a single benchmark number anymore. In fact, we still need to fix this for ARC for next year; we need our leaderboard to have an efficiency dimension in it somehow. You can't just say, oh, that system got 75% on the benchmark. Okay, well, how long did it take? How much did it cost? These are the types of questions we're going to have to be able to answer, and make assertions around, in order to inspire and guide research attention toward driving efficiency. There are a couple of human bottlenecks remaining in these types of reasoning systems; once we get past those, efficiency is really going to be the main thing we have to figure out.
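In the spirit of that two-dimensional accuracy-versus-cost reporting, a trivial sketch of the bookkeeping involved, with invented field names and numbers purely for illustration:

```python
def score_run(task_results):
    """Summarize a benchmark run as an (accuracy, cost per task) point.

    `task_results` is a list of dicts like
    {"correct": True, "tokens": 120_000, "usd": 3.40} -- one per task.
    Reporting only the accuracy hides the fact that two systems with the
    same score may differ by orders of magnitude in spend.
    """
    n = len(task_results)
    accuracy = sum(r["correct"] for r in task_results) / n
    cost_per_task = sum(r["usd"] for r in task_results) / n
    return {"accuracy": accuracy, "usd_per_task": cost_per_task}

# e.g. {"accuracy": 0.75, "usd_per_task": 18.50} -- the 75% alone says much
# less than the pair does.
```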
Although, if you can use compute to get better answers, and compute cost reliably goes down, right? So the shape of that curve really matters too, doesn't it?

I'm not sure I follow, sorry.

If you can, like, the slope, even in log scale, of the trade-off between cost and performance, I think also really matters. I would love to spend more money to get better answers in lots of domains.

Yeah, there are definitely a lot out there. One of the major ones that's going to start working this year is agents. Coming back to the Zapier anecdote, the number one blocker for deploying automation with agents right now is reliability. Trust isn't high enough. This is an underappreciated point that I don't think most people have figured out quite yet: what does the ability to adapt to novelty actually mean in practice? Sure, it beats a state-of-the-art benchmark. What it really means in practice is that you can more consistently get the same answer, not even necessarily the right answer, just more consistently get to an answer from these systems. That gives humans the ability to steer them and control their behavior more precisely than we could before. This is going to raise the reliability bar. And I think a lot of use cases where people want to use agents, it was never a cost problem; the agents just didn't work well enough. People are willing to pay up to the human labor rate for the system to work. Now we'll start having those use cases get unlocked as a result of plugging things like o1, o1 Pro, o3, and R1 into some of the planning steps of how these agent systems work.

I guess one of the things that's been in the back of my head as I do this interview is that we're sort of taking for granted that this kind of reasoning is important. And I kind of wonder if we're losing people by showing these sort of toy-like problems, in the sense of, what does that actually translate to in the real world? You kind of said it, but do you have an intuition? Where does the ARC Prize failure show up?

Can I give you a concrete example?

Perfect. Fantastic.

Yeah. I'll give you a funny story, which is part of what led me down this benchmarking path a couple of years ago, when we were first building some of our AI agents at Zapier. Zapier has been a long-time, strong partner with OpenAI; we've been part of three major OpenAI launches now, which has been fun. We both use Slack, and we have a shared Slack channel. When we were building some of our early AI agent prototypes, we built a system where we wanted the AI agent to automatically send a message into Slack from some lead-management stuff coming in from HubSpot, or maybe from our sales team doing transcriptions and feeding them into the process. When we were testing it, we said, hey, let's have the agent fill in the two main fields it needs for this automation: it needs to pick a Slack channel to send the customer information to, and it needs to write the body of the Slack message. And we had the agent guessing both from the inbound message. Our system allowed us to, and this is how it works in Zapier, give hints to the agent on how to fill in those two fields. You could say, oh, for the Slack channel, use the #testing channel, as a plain-text description. And for the message: grab the lead's first name, last name, phone number, email, likelihood to buy, and build a little lead widget; that'll be the body of the message.

When we first turned it on, we had a Slack channel called #testing, and the name of the shared channel we had with OpenAI was something like openai-partner-testing. And the agent, a couple of times, just picked that partnership channel to start sending information to. And it was like, oh, shoot, that's not good. This is a production system; it's an important partnership, and we don't want customer information being shared like that.

Totally.

So we immediately turned it off, scrubbed it, and fixed the situation. But it was my first realization that reliability, overrides, and control are going to be extremely important for building trust. Because the first time a user sees that, it's, nope, turn this thing off. Get that thing ten feet away from me. I can't do this; it's putting my business in jeopardy. And it was very real. It made us realize early on, when we started deploying AI agents a year and a half ago, that we needed to offer users hard-coded control over constraining the guessing. So in this case, when you use Zapier's AI agents, you can allow the agent to guess if you want to, but the default is that you just choose a channel. That gives you some amount of certainty and some guarantees that it's not going to go off the rails too much. Or you can say, here are the three channels I want you to guess from, and that lets you build in a hard guarantee. All of this was an early insight that reliability matters a ton. It's probably the number one thing that matters for deploying agents that can automate tasks that businesses and users care about. And that is the type of thing we will now start to see get reliably solved using o1 and o3. You're not going to see as much of this stochastic randomness of doing the wrong thing.
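A minimal sketch of the "constrain the guessing" guardrail in that story. The field names and the `agent_pick_channel` call are hypothetical, not Zapier's actual API; the point is simply that the model's guess is always validated against a hard-coded allowlist with a safe default, so it can never pick a shared partner channel on its own.

```python
def choose_slack_channel(message, config, agent_pick_channel):
    """Resolve which Slack channel an agent is allowed to post to.

    `config` example (illustrative):
        {"allow_guessing": False,
         "default_channel": "#testing",
         "allowed_channels": ["#testing", "#leads", "#sales-handoff"]}

    `agent_pick_channel(message, options)` stands in for the model call.
    """
    if not config["allow_guessing"]:
        # Hard-coded default: no guessing at all.
        return config["default_channel"]

    guess = agent_pick_channel(message, config["allowed_channels"])
    if guess in config["allowed_channels"]:
        return guess
    # Fall back to the safe default rather than trusting an out-of-list guess.
    return config["default_channel"]
```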
If you build an agent and say, send it to the testing channel, and it works the first three times, you're going to expect it to keep working, and that will be true. Whereas in the past, with the very stochastic nature of how these LLM-era systems have worked, you really couldn't make that same guarantee without extra guardrails that developers built on top.

So why did you start Ndea as its own organization, versus working with OpenAI, if you have this close relationship with them? Why did you feel it needed a new organization?

I think the reality is that if you look at the frontier of what most of these companies are working on, I don't think they share our view about the importance of program synthesis. I've talked to a lot of folks at frontier labs, OpenAI included, even showing them, look, you've got these o3 results now. And I think there is still a view that we're in a deep learning paradigm: that this is deep learning with a little bit of special search stuff on top, but that part isn't really that important. There are people in these organizations who get it, by the way. But I think the broad view in industry today is still that deep learning is still scaling, and we fundamentally disagree with that perspective. My view is that program synthesis is at least 50% of the equation. Maybe not from a compute budget standpoint, but if you were to measure where the ideas come from in a future, highly efficient, no-human-bottleneck AGI system, half of that is going to come out of the discipline of program synthesis in some way. And like I said before, I think AGI is such an important technology that anyone who has a unique, differentiated idea should go try it. I think it's important for the world. This is why we launched ARC Prize: we want to inspire more people to go try new stuff, to get back to the AI industry research model we had in 2018 and 2019, when everyone was trying new and different ideas. And I do think ARC Prize has helped shift the Overton window from a narrative standpoint. I'm starting to see it now, and I'm very excited: we're starting to see more fundamental innovation coming out of a lot of the small startups. In fact, one of the biggest things that surprised me about ARC Prize: I expected we'd get a lot of individual researchers and maybe some big-lab attention on the benchmark. The biggest surprise was that about seven or eight startups, AI companies, came up to us at some point during the contest or afterwards and told us that they had pivoted, that they had changed their research roadmap to go work on ARC. I thought that was really exciting to hear, because it means the prize is starting to have the impact we hoped for, which is to get people to explore more and increase the overall probability that we figure this thing out. And we've got exciting long-term ambitions about what we want to do with the technology that I think also differ quite a bit from the other major players and companies out there. But purely on the technology side of why we started Ndea:
I think it's the fact that we have a differentiated view, one that I think has a high chance of success and increases the probability of the world getting to AGI quickly. And we're going to put our full effort and attention into trying to make that happen.

And are you also oriented toward making a product and making money and things like that?

I have a somewhat interesting view here. To answer your question quickly: no, this is a research lab. There's no off-ramp for products in the near term. This is: get to AGI, by our definition, which is to reduce to zero the gap between what's easy for humans and hard for computers. You shouldn't expect to see us make products ahead of that. Now, we may start using some prototypes of the technology to try to advance some frontiers of science. That's actually the other reason I mentioned, besides just building AGI. Building AGI is step one for us. The thing that gets me really excited is leveraging the technology to accelerate the pace of innovation at the frontier of a lot of different fields of science. AGI is going to solve a lot of problems, no doubt about it. AI is already solving problems for Zapier customers, and it's not even AGI yet. People are going to use this technology to solve a lot of problems, and that's great; I'm fully supportive of that. The thing that gets me more excited about the technology is accelerating into this unknown, unknown future. I'll use an illustrative story about the printing press, back in the 1400s when it was first introduced. What was it, 600 years ago now? What was the reaction? Obviously lots of fear and skepticism, but also a lot of excitement around knowledge proliferation: sharing knowledge globally, being able to exchange ideas freely. But if you went to anyone then and asked them to make predictions about what the year 2025 looks like, you'd be pretty hard-pressed to get anyone to imagine a future of Wikipedia, and AI trained on Wikipedia, and "I can talk to computers", what's a computer? We're just so far down the technology tree from those genesis moments that catalyzed everything. And I think that's what really gets me. It's almost more adventure-motivated than problem-solving-motivated. There are going to be really exciting, cool things that come from future technology that I can't even tell you about yet. But in order to get there faster, the main constraint is that we need computers that can work in autonomous ways to do innovation, and that is bottlenecked on creating AGI. That's what I'm really excited to help be a part of and help accelerate if we can.

Do you have a point of view on a timeline for AGI?

To the definition I shared before: you've gotten rid of all the gaps between what's easy for humans and hard for AI. One other thing worth adding to that definition is efficiency. I think we'll probably reach that definition before we reach it at a human level of efficiency, would be my guess.
So let me add a little asterisk on that and say: there are no remaining tasks that are easy for humans and hard for AI, and we're doing them at human-level efficiency.

Can you put a dollar figure on human efficiency? Is that like your salary?

This is literally something we're debating right now: how to measure efficiency. I don't think the industry knows, and ARC Prize is still trying to figure this out too. You want something you can contrast between humans and computers, and FLOPs aren't really great for that. You can maybe use dollars, because you could pay a computer to do X work for X dollars, and you could pay a human to do X work for X dollars. Time is kind of interesting from a wall-clock perspective, but you can make things faster with computers by parallelizing, and that's something they're good at, right? So honestly, I don't know the answer. I'd probably make the argument that dollars are the best option today, based on what I know, for making comparisons, because everything fits into dollars: how much you pay a human for their labor versus how much compute costs. Dollars will also track the price-performance of compute getting more efficient, which is kind of nice. We don't know. So let's say human-level efficiency means we're paying computers about what we'd pay humans to do these tasks that humans find easy. Something like that; I'll add that asterisk on.
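A back-of-the-envelope version of that dollars-as-common-unit comparison, with all numbers invented purely for illustration:

```python
def efficiency_ratio(human_hourly_usd, human_minutes_per_task,
                     model_usd_per_task):
    """Compare human and model cost for the same task, in dollars.

    Returns model cost divided by human cost: a value <= 1.0 roughly
    corresponds to the "human-level efficiency" bar in the discussion,
    i.e. the model is no more expensive than human labor for that task.
    """
    human_usd_per_task = human_hourly_usd * human_minutes_per_task / 60
    return model_usd_per_task / human_usd_per_task

# e.g. a $30/hour human spending 5 minutes vs. a model spending $20 of compute:
# efficiency_ratio(30, 5, 20) == 8.0 -- the model is ~8x more expensive.
```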
My expectation right now, if I had to make some wild guesses: we've got ARC v2, which we're working on for this year. I expect it will probably be durable for, let's call it, 12 to 18 months. That's my best guess today, based on how it's doing against all the frontier systems we've been testing it on. It's not going to be another five-year benchmark like v1 was, because ARC v2 is in the same domain; it's raising the bar of difficulty for computers without raising the bar of difficulty for humans. I think it will still be interesting, still a good gradient, a tool that will point research in an interesting direction. But the thing I'm really excited about is the v3 that we've started designing and prototyping for next year. Our design goal for v3 is that it will be durable for three years. That's our hope; we'll have to see, obviously, once it makes contact with reality and all that. But my anticipation is that we will not have "easy for humans, hard for AI" solved at that level of efficiency, based on what I see today, for at least the next three to four years.

Honestly, I'll tell you this: one of the hardest things about making predictions is that it's very easy to make predictions around smooth scaling curves and extremely hard to make predictions around step-function changes in capability. Curves you can track out and make future guesses around; capability jumps are discontinuous. If you were trying to predict when ARC v1 would reach 85% from a general-purpose system: for five years it went from zero to 4%, and then in two months it went from 4% to 85%. That is an extremely hard thing to predict, for two reasons. One, you don't know if the technology to do it exists in the world yet. Two, you don't know if the ideas are in the world yet, and even if they are, you don't know whether someone has put them together into a system to demonstrate it to you. So you have a lot of unknown variables when trying to guess when these step-function capability jumps will happen. To make my views concrete: I do not believe we're in a pure scaling regime where you can fit a nice smooth curve over system sizes, extrapolate it way out into the future, and say we're going to have all this stuff solved at that point. I treat it a lot more empirically. You can make informed guesses about when you think step functions might happen, but I think it's in the public's interest to understand that this is actually the reality: we're still looking for step-function capability leaps, and we don't know when those are going to show up.

Well, awesome. I think that's a good stopping point for this interview. Is there anything...

You got through a lot of content.

A lot of content. A lot of wide-ranging content. Thanks so much for listening to this episode of Gradient Dissent. Please stay tuned for future episodes.
