

World Models & General Intuition: Khosla's largest bet since LLMs & OpenAI
Latent Space
What You'll Learn
- GI's vision-based agents can predict human-like actions from pixel inputs alone, without any game state information
- The agents display capabilities like navigating environments, interacting with in-game elements, and maintaining consistency even with partial observability
- GI has scaled these models by training on large datasets of gameplay footage and transferring the models to real-world videos
- The world models developed by GI can capture physical properties like camera shake and rapid camera motion, which are not present in the original game footage
- GI has also been able to distill the models into very small sizes while maintaining reasonable performance
- The goal is to push the capabilities of these models beyond just gaming applications
Episode Chapters
Introduction
Overview of the guest, Pim, and the startup General Intuition (GI) that he co-founded
Vision-based Agents
Demonstration of GI's vision-based agents that can predict human-like actions from pixel inputs alone
World Models
Explanation of GI's approach to developing world models, including pre-training from scratch and fine-tuning open-source video models
Model Capabilities
Showcasing the impressive capabilities of GI's world models, such as handling partial observability and capturing physical properties
Model Scaling and Transfer
Discussion of how GI has scaled their models by training on large datasets of gameplay footage and transferring to real-world videos
Future Potential
Insights into GI's goal of pushing the capabilities of their models beyond just gaming applications
AI Summary
This episode of the Latent Space podcast provides an exclusive preview of the world models developed by General Intuition (GI), a startup spun out of the game clipping company Medal. The guest, Pim, demonstrates GI's vision-based agents that can predict human-like actions purely from pixel inputs, without any game state information. The agents display impressive capabilities, including navigating environments, interacting with in-game elements, and maintaining consistency even with partial observability. The episode also covers GI's approach to scaling these models using large datasets of gameplay footage and transferring the models to real-world videos, showcasing their potential for applications beyond gaming.
Key Points
1. GI's vision-based agents can predict human-like actions from pixel inputs alone, without any game state information
2. The agents display capabilities like navigating environments, interacting with in-game elements, and maintaining consistency even with partial observability
3. GI has scaled these models by training on large datasets of gameplay footage and transferring the models to real-world videos
4. The world models developed by GI can capture physical properties like camera shake and rapid camera motion, which are not present in the original game footage
5. GI has also been able to distill the models into very small sizes while maintaining reasonable performance
6. The goal is to push the capabilities of these models beyond just gaming applications
Topics Discussed
- Vision-based agents
- Imitation learning
- World models
- Spatial intelligence
- Model scaling and transfer learning
Frequently Asked Questions
What is "World Models & General Intuition: Khosla's largest bet since LLMs & OpenAI" about?
This episode of the Latent Space podcast provides an exclusive preview of the world models developed by General Intuition (GI), a startup spun out of the game clipping company Medal. The guest, Pim, demonstrates GI's vision-based agents that can predict human-like actions purely from pixel inputs, without any game state information. The agents display impressive capabilities, including navigating environments, interacting with in-game elements, and maintaining consistency even with partial observability. The episode also covers GI's approach to scaling these models using large datasets of gameplay footage and transferring the models to real-world videos, showcasing their potential for applications beyond gaming.
What topics are discussed in this episode?
This episode covers the following topics: Vision-based agents, Imitation learning, World models, Spatial intelligence, Model scaling and transfer learning.
What is key insight #1 from this episode?
GI's vision-based agents can predict human-like actions from pixel inputs alone, without any game state information
What is key insight #2 from this episode?
The agents display capabilities like navigating environments, interacting with in-game elements, and maintaining consistency even with partial observability
What is key insight #3 from this episode?
GI has scaled these models by training on large datasets of gameplay footage and transferring the models to real-world videos
What is key insight #4 from this episode?
The world models developed by GI can capture physical properties like camera shake and rapid camera motion, which are not present in the original game footage
Who should listen to this episode?
This episode is recommended for anyone interested in Vision-based agents, Imitation learning, World models, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
From building Medal into a 12M-user game clipping platform with 3.8B highlight moments to turning down a reported $500M offer from OpenAI (https://www.theinformation.com/articles/openai-offered-pay-500-million-startup-videogame-data) and raising a $134M seed from Khosla (https://techcrunch.com/2025/10/16/general-intuition-lands-134m-seed-to-teach-agents-spatial-reasoning-using-video-game-clips/) to spin out General Intuition, Pim is betting that world models trained on peak human gameplay are the next frontier after LLMs.
We sat down with Pim to dig into why game highlights are "episodic memory for simulation" (and how Medal's privacy-first action labels became a world-model goldmine: https://medal.tv/blog/posts/enabling-state-of-the-art-security-and-protections-on-medals-new-apm-and-controller-overlay-features), what it takes to build fully vision-based agents that just see frames and output actions in real time, how General Intuition transfers from games to real-world video and then into robotics, why world models and LLMs are complementary rather than rivals, what founders with proprietary datasets should know before selling or licensing to labs, and his bet that spatial-temporal foundation models will power 80% of future atoms-to-atoms interactions in both simulation and the real world.
We discuss:
- How Medal's 3.8B action-labeled highlight clips became a privacy-preserving goldmine for world models
- Building fully vision-based agents that only see frames and output actions yet play like (and sometimes better than) humans
- Transferring from arcade-style games to realistic games to real-world video using the same perception–action recipe
- Why world models need actions, memory, and partial observability (smoke, occlusion, camera shake) vs. "just" pretty video generation
- Distilling giant policies into tiny real-time models that still navigate, hide, and peek corners like real players
- Pim's path from RuneScape private servers, Tourette's, and reverse engineering to leading a frontier world-model lab
- How data-rich founders should think about valuing their datasets, negotiating with big labs, and deciding when to go independent
- GI's first customers: replacing brittle behavior trees in games, engines, and controller-based robots with a "frames in, actions out" API
- Using Medal clips as "episodic memory of simulation" to move from imitation learning to RL via world models and negative events
- The 2030 vision: spatial–temporal foundation models that power the majority of atoms-to-atoms interactions in simulation and the real world
Pim
X: https://x.com/PimDeWitte
LinkedIn: https://www.linkedin.com/in/pimdw/
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction and Medal's Gaming Data Advantage
00:02:08 Exclusive Demo: Vision-Based Gaming Agents
00:06:17 Action Prediction and Real-World Video Transfer
00:08:41 World Models: Interactive Video Generation
00:13:42 From RuneScape to AI: Pim's Founder Journey
00:16:45 The Research Foundations: Diamond, Genie, and SIMA
00:33:03 Vinod Khosla's Largest Seed Bet Since OpenAI
00:35:04 Data Moats and Why GI Stayed Independent
00:38:42 Self-Teaching AI Fundamentals: The Francois Fleuret Course
00:40:28 Defining World Models vs Video Generation
00:41:52 Why Simulation Complexity Favors World Models
00:43:30 World Labs, Yann LeCun, and the Spatial Intelligence Race
00:50:08 Business Model: APIs, Agents, and Game Developer Partnerships
00:58:57 From Imitation Learning to RL: Making Clips Playable
01:00:15 Open Research, Academic Partnerships, and Hiring
01:02:09 2030 Vision: 80 Percent of Atoms-to-Atoms AI Interactions
Full Transcript
Hi listeners, as you may know, I recently wrapped up the AIE Code Conference in New York, and while I'm traveling, I do like to visit top AI startups in person to bring you interviews that you won't find on any other podcast that just does a Zoom call. General Intuition, or GI for short, is a spin-out of a 10-year-old game clipping company called Medal, which has 12 million users; in comparison, Twitch only has 7 million monthly active streamers. Medal collects this data by building the best retroactive clipping software in the world. In other words, you don't need to be consciously recording: you just have Medal on in the background while you're playing, and you hit a button to clip the last 30 seconds after something interesting happens. It's very similar to how Tesla does bug reporting for self-driving, if you have ever filed a self-driving bug report in a Tesla. The result is that Medal has accumulated 3.8 billion clips of the best moments and actions in games, resulting in one of the most unique and diverse datasets of peak human behavior, actively mined for the interesting moments. They were also very prescient in navigating privacy and data collection concerns by mapping actions to these visual inputs and game outcomes. As you saw on our Fei-Fei Li and Justin Johnson episode with World Labs, and with the recent departure of Yann LeCun from Meta, there's a lot of interest in world models as the next frontier after LLMs, to improve on spatial intelligence and to work on embodied robotics use cases. DeepMind has been working on this with Genie 1, 2, and 3 and SIMA 1 and 2. And this year OpenAI seems to finally agree, because they have been betting on LLMs a lot, and they made the news by offering $500 million for Medal's video game clip data. Our guest today, Pim, turned down that money and instead chose to build an independent world model lab. Khosla Ventures led the $134 million seed round, which is Vinod Khosla's largest single seed bet since OpenAI. We were able to get an exclusive preview of GI's models, which unfortunately we cannot show you directly. But I can confirm they were incredibly human-like, and we chose to include the first 11 minutes of the demo discussion even though I couldn't show it to you. It may be hard to follow, but I tried to call out what was noteworthy for you to know, as your likely reaction if you were watching along with us. Now enjoy the world's first look at my first look at General Intuition. So what I'm about to show you is a completely vision-based agent that's just seeing pixels and predicting actions the exact same way a human would. And so yeah, what I'll show you here is what this looked like four months ago. So again, this is just an agent that's receiving frames and it's just predicting actions. So you can see it has like a decent sense of being able to navigate around. It tabs the scoreboard, just like gamers always tab the scoreboard. So these are purely, these are pure imitation learning. I see. The Z is slicing the knife. Yeah, exactly. So it's doing everything that humans would. In this case, here was the first interesting part that we saw: it gets stuck, and then it has, they have memory as well. So you see it can get unstuck. How long is the memory? Four seconds. Yeah, four seconds. Okay, so this was four months ago. This was maybe a few weeks after that. So you can see it's still doing the scoreboard thing, but they're still quite like, and these are bots too.
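To make the "frames in, actions out" loop described in the demo concrete (the agent only sees pixels and has roughly four seconds of memory), here is a minimal sketch in Python. The VisionPolicy class, the action dictionary, and the capture_frame/send_action hooks are hypothetical assumptions for illustration, not GI's actual interface.

```python
from collections import deque

import numpy as np


class VisionPolicy:
    """Hypothetical stand-in for a vision-based agent: frames in, actions out."""

    def predict(self, frames: np.ndarray) -> dict:
        # A real model would run a neural network over the frame stack here;
        # this placeholder just returns a no-op action.
        return {"move": "none", "look": (0.0, 0.0), "buttons": []}


def run_agent(capture_frame, send_action, fps: int = 30, memory_seconds: int = 4):
    """Real-time loop: keep a rolling window of recent frames (the agent's only
    'memory'), feed it to the policy, and emit one action per frame."""
    policy = VisionPolicy()
    buffer = deque(maxlen=fps * memory_seconds)    # ~4 seconds of pixel context
    while True:
        frame = capture_frame()                    # raw pixels, no game state
        buffer.append(frame)
        action = policy.predict(np.stack(buffer))  # pixels only -> predicted action
        send_action(action)                        # same inputs a human would send
```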
So you can see that. It's very human, let's just say that. Yeah. And then, right, so this was really like the early days of research, where you can see, right, it does one thing and then goes for another. And then we've been scaling, right, on data and compute, and also we've just been making the models better. And this is where we are now. So what you're seeing is, like I said, pure imitation learning. This is just a base model. There's no RL, no fine-tuning. This model sees no game states. It is purely capable, not sequence, etc. It's purely predicting the actions from the frames. That's it. And this is playing against real humans, just like a human would play. And it's also, it's running completely in real time. So there's absolutely, everything here plays exactly like a human. Do you give it a goal? No. It just figures out its goal because obviously it's trained on the same. Yes. And I picked, right, I picked a sequence where it also doesn't do well initially. So you can see, this is just like a sequence, a random sequence. But this is the, I mean, it looks like it's doing very well. So. Oh, okay. Yeah, watch. Yeah, this would be good. Maybe too good. This is my favorite part. So you can see it does something that, like, here, a human would never do this. Then gets unstuck. Then, after four seconds, realizes, switches. And then in the distance. So you're saying, one, it makes a mistake that a human would never make, but it unsticks itself, and two, what we just saw is it is doing superhuman things. Yeah. Okay, yeah. Um, I mean, there are things that that demons said, obviously, um, but because it is trained on the highlights of things, all the exceptional things, it's inherited. Yeah. So it's not like move 37, where we RL'd our way into something; it is replicating superhuman, or like peak human, behavior. The baseline of our data set is peak human performance. Okay, so that's the agent. So now what I'm going to show you is we then are able to take those action predictions and we're able to label any video on the internet using those actions. So, and so this is, this is just frames in, actions out. Yellow is the model prediction, or sorry, yellow is ground truth, purple is the model prediction, and then bottom left is a compound error over the entire sequence, and then this is reset per prediction. Reset meaning? Yeah, so this just means it resets the baseline. Um, and so basically a single error in the entire sequence compounds here, but it doesn't compound here, if that makes sense. Yeah. Um, so, and again, this is just seeing frames, right? It's not seeing any of the actions. And so what we did, right, is we trained it on less realistic games, and we transferred it over to a more realistic game. And then, and this is where it gets really exciting, we transferred it over to real-world video, which means that you can use any video on the internet as pre-training. What was it predicting? It's predicting it as if you were controlling it using keyboard and mouse. So you're basically playing this sequence as a human. Is there some sense of error? So that's why you transfer to more realistic games first, and then you transfer to real-world video, because you can't get ground truth from the real-world video yet. Let's see. And then, so let me show you here. This one is also, this is the same agent that I just showed you. This is playing against other AIs. This one's playing against bots, yeah. The previous one was against players.
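One way to read the "compound error over the entire sequence" versus "reset per prediction" plots described above: the reset variant scores each predicted action against ground truth independently, while the compounding variant lets every mistake accumulate over the rest of the clip. Below is a toy sketch of that distinction with made-up action vectors; the metric and array shapes are assumptions, not GI's evaluation code.

```python
import numpy as np


def action_errors(ground_truth: np.ndarray, predicted: np.ndarray):
    """ground_truth, predicted: (T, D) arrays of continuous actions
    (e.g. mouse deltas per frame). Returns the per-step (reset) error and the
    compounding error accumulated over the whole sequence."""
    per_step = np.linalg.norm(predicted - ground_truth, axis=1)  # reset each step
    compounding = np.cumsum(per_step)                            # drift accumulates
    return per_step, compounding


# A single large mistake at t=2 dominates the compounding curve
# but barely shows up in the per-step view.
gt = np.zeros((5, 2))
pred = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 1.0], [0.1, 0.0], [0.0, 0.0]])
per_step, compounding = action_errors(gt, pred)
print(per_step.round(2))     # [0.   0.1  2.24 0.1  0.  ]
print(compounding.round(2))  # [0.   0.1  2.34 2.44 2.44]
```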
But with the sniper, it doesn't really matter that much, as you'll say. It's like... So one thing that's really interesting is you notice that it behaves differently as it has, like, different items, right? That makes sense. Yeah. Yeah. I think there's also a question about egocentricity versus, like, so the third person. Yeah. Doesn't matter? The third person I think will be very, very helpful if you're, for instance, trying to control multiple objects in an environment later on. Right now, I think having fully in perception first person is quite helpful. This one's also, this is the policy itself. What do you mean this is the policy? The agent. Yeah, same for the strengths that I just told you about. Yeah. Like this, where it hides, that to me was just incredible. like just from knowing, being able to predict. The appearance is also high when you see it. Yeah, exactly. Yeah, yeah. And it needs the spatial intuition to go, well, this is hiding and that's not hiding. Exactly. And right while it was reloading, yeah. Okay, so that's the policy and this is a completely general recipe, meaning we can scale this to any environment. Is this work closest? Okay, now let's keep going on demos until I was going to go into research. Yeah, yeah, sounds good. Okay, so what I'm about to show you are the world models. There's a few really, really interesting parts about our world models. So the first is we actually made the decision to transfer, sorry, we made the decision to pre-train world models from scratch, but also we've actually been able to fine-tune open-source video models to get a better sense of physical transfer. And so one of the things that you'll notice here is our world models have mouse sensitivity, which is something that gamers absolutely want, right? So you can have these very rapid movements, which you couldn't do in any other world model. And so this is a holdout set. So this clip was never seen before at training time. As you can see, it has a spatial memory. This is about a 22nd-ish generation. And here's what's fascinating. This is an explosion that occurs. and you can see that in the physical world, right, the camera would shake and in the game that would never happen. So you see the world model inherits the physical world camera shake, but the actual game never does that, which is sort of, that to us was quite fascinating, right? Also to the models that I just showed you that we used to transfer over from video, the two of those combined will allow us to like push way beyond games in terms of training. This is another interesting, So this is a world model. This is rapid camera motion. So again, this is stuff that we're literally just taking one second from here in the context and the actions and replaying it here, right? And so you never essentially have, like what we're saying is the skill that you see in the clips, that like the speed and the movement, that also pays off at training time when you're doing world models. This is my favorite example. So this shows that the world model is capable of performing with partial observability. So what you're going to see is, again, you're replaying the actions from here in here, just using one second of video context. Everything after that is completely generated. So what you're going to see is the model is going to encounter, in this case, smoke. Normally now models break down. What you actually see comes out in the same place. And so it's capable of, even with partial observability, still maintaining its position in the world. 
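To make the world-model demos above concrete: the model is conditioned on a short window of real frames (about one second of context) plus a recorded action sequence, and everything after that context is generated frame by frame. Here is a minimal, hypothetical sketch of that rollout loop; the WorldModel class and its next_frame interface are assumptions, not GI's API.

```python
import numpy as np


class WorldModel:
    """Hypothetical action-conditioned generator: (recent frames, action) -> next frame."""

    def next_frame(self, context: np.ndarray, action: dict) -> np.ndarray:
        # A real model would generate a new frame conditioned on the action;
        # this placeholder just repeats the last context frame.
        return context[-1]


def rollout(model: WorldModel, context_frames: np.ndarray, actions: list, fps: int = 30) -> np.ndarray:
    """Seed with ~1 second of real video, then replay a recorded action sequence;
    every frame after the seed is generated and fed back in as new context."""
    frames = list(context_frames[-fps:])           # keep ~1 second of real context
    for action in actions:                         # e.g. actions replayed from a clip
        generated = model.next_frame(np.stack(frames), action)
        frames.append(generated)                   # generated frames become context
    return np.stack(frames)
```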
and then here it is also interesting so this is sniping so this gives you like a reaction time? like the fact that it can do depth and like sequences in completely different views right so this is a completely different view than if you were to be outside of that view right and so it's able to maintain consistency while zooming in yeah exactly and so yeah so you can see so even while this goes out of scope watch and then it comes back and you'll see it's still there yeah and so this is the work that Anthony has been working on I'm just wondering how much game footage you have to watch in order to find these things we can ask Anthony I'm sure he's not going to be too excited to play these games afterwards you're not playing it, you're just watching yeah yeah yeah um great okay so those were the models um see these are interesting so we also were able to distill into like really really tiny models um so this is for instance a um a long sequence on a very very tiny one you can see it makes like a bit more stupid mistakes uh like it does things that are not as optimal um but i haven't seen anything yet At the beginning, it was running into a wall for free. Exactly. I mean, I do that too. Yeah. I mean, it's doing pretty well. Yeah. And again, all these models are running completely in real time. There's no... I was thinking your main model does real time anyway. What's the goal of this setting? Is it cost or... Yeah, parameters. Yeah. Yeah. This is the interesting one. It peaks the corner. That's what we mean by the space and the poor reasoning aspect. is humans actually, they sort of simulate the optical dynamics of their eyes and how to actually space. It's interesting all the data, right? You've seen all this. Yep, exactly. And so even in real, this is kind of interesting. Even in the real world with, for instance, YouTube data, right? You have to first solve for pose estimation. Then once you have pose estimation, maybe you do something like inverse dynamics, right? Where you basically are able to somehow label some of the actions that you're seeing. And then you still have to account for optical dynamics So like, where are your eyes actually looking before the decision? Because there's three levels of information loss. When you're playing video games, you're actually simulating the optical dynamics with your hand, right? And I think that's why I think why games are a better representation of switch support reasoning initially than YouTube videos, for instance. Okay, we're in the GI offices with the CEO. Welcome. Thank you. Thanks for having us in your office. Yeah, it's nice to be here. If I'm in New York and you're one of the hottest races of the year, I have to come and visit in Vicer, take me some time on the weekends. Yeah. So you've raised 133 million C. So general inspiration. Most people don't care about you. I guess because GI is new, but more gamers were there for the middle. Indeed. And before that, you ran probably Waste Week Summer. Yes. The largest Waste Week Summer. um what's your reflection on just that journey of life now you're an AI founder yeah you started off like Rootscape yeah I think um so I grew up with Tourette's uh I uh spent most of my time as a teenager coding and playing video games uh so in that sense it doesn't feel that much difference um but I think for uh so yeah so I started the largest privacy world RuneScape worked at Dr. 
Subwriters for three years first and Ebola and then on like satellite satellite-based map generation for disaster response, which was already very AI-related adjacent. I built some models back then and then started Metal, which became one of the largest social networks and video games. I've always been kind of like AI adjacent. I'm a self-taught engineer. So for me, the modeling itself always felt a little foreign. I actually had to take a ton of classes over the summer and early this year to get better at it because it still felt Like, I was really, really good at the infrastructure side. And I had written, like, our transcoders for Metal myself. So I was very, very familiar with CUDA and, like, the GPU side and all the video infrastructure that we were using for this stuff. But the modeling side itself was still quite foreign. Luckily, obviously, we have really, really good co-founders. But they essentially put a bunch of coursework together for me to go complete, to get really, really good at understanding the fundamentals better. I think for me, I had seen inside of the labs that I had really, really good leadership with fundamentals on top and also the ones that didn't. And I think the ones that did were just like much better. And so for me, yeah, I want it to be more like that. So in that sense, it was a bit it was first very foreign. And then now I feel pretty comfortable with everything. And but yeah like I think for there a lot to be explored starting in video games and also reverse engineering like i think the interesting thing about reverse engineering is it kind of teaches you to look at problems very differently it's like the ultimate form of deductive reasoning in a way uh and so um uh so this is just how i think how i operate and so for me it's been a really really interesting journey uh you know i don't claim to have any of the credentials or or skills that some of the other guests do have had on but hopefully it will make for a good time yeah well your co-founder is definitely bring a lot of that definability and you bring a lot of the I guess gaining inserties mostly with two trees mostly what I bring to the table just a little bit of history of metal let's establish metal for those who don't know the lady Twix yeah the year that's you have more active users concurrent users in Twitch something like that yeah on the creator side I think and the reason is because metal is a lot more like Instagram than it is like Twitch so people so the way the way to think about metal is it's it's a native video recorder like unlike something like twitch where you actually have to use other software to record and stream to twitch um it's not a streaming software it's actually a video recording software and a lot of gamers love to put things like overlays on top of their videos um and as a result of that we have sort of the largest data set of ground truth action labeled video footage on the internet by maybe one or two orders of magnitude yeah what was an example of an overlay play and the only overlay i usually think of is in the case had yeah yeah also um controller overlays for instance if you're playing um like let's say you're playing uh console yeah like flight simulator you get like you know the joystick and all the things so you get the actual actions that people take inside the games as well as the frames of the games themselves which is a loop right because it's essentially you perceive then you act and there's a state update and then you perceive again you act state update 
which is like roughly precisely what you use in order to trace to train these agents yeah it's it's almost perfect training data. You were showing me in the demo and we showed some B-roll here on how you don't love Q. It's very important to use a lot of action. When did you figure this out? Maybe starting a year and a half ago. Yeah. And we realized that figuring out the side of the research for us was we very much never wanted to be in a position where we eroded privacy or something like that. So we never wanted to actually log a W or A or S and a D, which for researchers the fact that we don't do that like often it sounds strange like why wouldn't you do that but i think for us the privacy when we get the data yeah i i think you know a lot a lot of the the um the researchers they did they hadn't quite understood yet that you can actually just get away with just doing the actions um and the reason is like at training time having the actual keys is noise anyways like if there is text on the screen and you would want to in theory uh make that um part of training then like reading text from a frame is like really easy and so for us if we So we convert, basically you hit the input, we convert it to the actual action. So we had thousands of humans label every single action you can take in every single video game over the past year and a half, which is an enormous amount of action labels. Yeah, so when you act, we get the actual action itself. And then it being said at training time, you can, for the general set of that game, convert back into computer inputs if you want to, but you can never do it for any individual person. And so that for us, from like a design perspective, was important. So we figured all that stuff out. Then we actually started pushing, like we already had features as well with this. So for instance, like gamers already love to be able to navigate their clips by like things that happened. So we have an events capture system. And then we also have the overlays where you actually just want to overlay and render the actions on top of your clip. We developed kind of in tandem with the feature set itself. And then obviously when World Bottles became a thing, and it's very, very clear that all the all the data for this was precisely like that sequence yeah we were able to sort of be first to market recruit the best researchers and start a lab yeah that's uh that's terrible uh one more question on metal before i renew photos of the di it's been 10 years yeah what is the i don't even know how you bro something like this you know right i'm just kind of curious and yeah and like the opportunity to ask you what really worked yeah um that you became so so huge because i'm you're I'm not the only one. Yeah, but I'm sure it's performance and everything. A few things that really worked. I think the first was a lot of our competitors were focused on solving the social network and the recorder at the same time. And that never, like our bet was really that we could get so many people to record with us that we could bootstrap the network on top of that. And that worked. So while everyone was sort of distracted trying to bootstrap a social network, we were just focused on building a really, really good capture tool. And then we got tens of millions of people to use that, which then we were able to bootstrap a network on top of the share behaviors. 
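As a small illustration of the privacy-preserving labeling described above, one could imagine mapping raw device inputs to per-game semantic actions at capture time, so the raw keystroke itself is never logged. The binding table and action names below are hypothetical, not Medal's actual schema.

```python
# Hypothetical per-game binding table: raw device inputs -> semantic actions.
# Only the semantic action on the right is ever stored, never the raw key.
GAME_BINDINGS = {
    "KeyW": "move_forward",
    "KeyA": "strafe_left",
    "KeyS": "move_backward",
    "KeyD": "strafe_right",
    "MouseLeft": "primary_fire",
    "Tab": "open_scoreboard",
}


def label_input(raw_event: str, timestamp_ms: int):
    """Convert a raw input event into a stored action label; inputs with no
    mapping (e.g. arbitrary typing) are dropped rather than recorded."""
    action = GAME_BINDINGS.get(raw_event)
    if action is None:
        return None                                 # unknown input: never logged
    return {"t": timestamp_ms, "action": action}


print(label_input("KeyW", 1200))   # {'t': 1200, 'action': 'move_forward'}
print(label_input("KeyX", 1300))   # None
```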
We already had like the profile behaviors and the share behaviors, obviously, but the actual content consumption piece and the sharing piece really only came after we hit critical mass. It was actually early days during COVID when like the network really accelerated. Fortnite happened, which was really important. And I think also the fact that Discord existed made it quite a different time than when other types of networks of these types had launched because Discord essentially was like the connective tissue already between gamers that like never really existed before. and so i think those combinations of things really really made it i think we also built a product that for instance with with most video recorders you have to remember to start and stop the recorder so you have to go into the application then hit start then start your game and then um you know maybe you'll play games for three hours and you'll close the game then you have to close your video application then you well then you have to process like a multi-gigabyte file uh then you have to upload those somewhere and so like this was a pain for people and so what we did is we just ran this kind of recorder when you hit that button it does a retroactive video record so all the recording initially is in memory and then when you hit that button it exports only that sequence to disk and syncs it to your phone and so that that became super popular it also what was interesting about it also means that you're not sort of behaving or acting differently because it's always there and you can just export whatever happens which is also very very helpful for for trading obviously um i think you weren't the first to be there yeah the thing you were explaining just before this was similar to how Tesla does the bug reports. You're driving from the having disengaged autopilot. They're like, well, tell us what happened. Exactly. You're driving. Tesla doesn't want to train on the 10 hours of you driving through a desert where nothing interesting happens. You have the clip button on the steering wheel. Something interesting happens either while FSD is engaged. I'm not sure if you can use it without FSD as well. But you hit the clip button and it basically uses that precise sequence to mark, which is then more helpful for training. because it's more unique, et cetera, anytime. Yeah, yeah. I mean, so one thing we're going to introduce on the agent side, one thing that does pop up is a lot of life is boring. A lot of life is going from a lot of life. A lot of playing games is doing the boring stuff that is not capable. Somehow you see the generalized fight. Yeah. Yeah, yeah. It makes you think, right? It makes you think. Yeah, yeah. It's also quite interesting. Like I showed you the models, like what happens when you increase the size of the context window and how behaviors actually are largely shaped by the size of the context window. That to me was like one of the most interesting parts about the research. Made me think about our own behaviors in a way. Yeah. Let's talk about also the forming a team. On your website, you're 12. I don't know if that's changed now. I meet four, three co-founders. Yeah. And let's talk about how this team comes together because you may not, if you're self-taught, you don't have that at the end of the network where you manage the elements, people. Yeah. I started reading all the research papers. 
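The retroactive capture flow described above (record continuously into memory, and only export the last ~30 seconds when the user hits the clip button, much like Tesla's bug-report button) is essentially a ring buffer. A toy sketch under that assumption, not Medal's recorder:

```python
from collections import deque


class RetroactiveRecorder:
    """Keep only the most recent `seconds` of frames in memory; nothing is
    written to disk until the user decides a moment was worth clipping."""

    def __init__(self, fps: int = 60, seconds: int = 30):
        self.buffer = deque(maxlen=fps * seconds)   # rolling in-memory window

    def on_frame(self, frame) -> None:
        self.buffer.append(frame)                   # old frames fall off automatically

    def clip(self) -> list:
        """Called when the clip button is pressed: export just the last
        `seconds` of gameplay, e.g. to encode and sync to the user's phone."""
        return list(self.buffer)
```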
By that time, I was already pretty deep into having a decent understanding of not world models, in particular LLMs and Transformer-based models. And so there was Genie, there was Sima. Those two were really, really interesting. in Sima in particular was interesting because what they do is they basically take 10 games and then they they have a graphic in Sima uh where you can see kind of the precise actions that are inside of those games that they mapped and I believe they found something like 100 um which are actually actions that also exist in the real world and what they did was they then I believe it was specifically for navigation they did a 9-1 holdout set so they they trained um an agent on the nine games and then they had to play the 10th game, the holdout game. But then they also trained a specialized agent just on a 10th game and they compared how good they did. And if I recall correctly, it did roughly as well playing the 10th game on navigation specifically on the holdout on the nine game agent than it did on the one game agent. And that's what was really interesting because that's precisely the type of data that we had. Right. And so for us, the thinking was, okay, what if we did exactly what LLMs did? What if we used, right, this, right? So LLMs were trained on predicting like text tokens on words on the internet. What if we predict action tokens on essentially what is the equivalent of the common crawl data set, but for interactivity? Vision input? Yeah, action output. Correct. That's it. But what I think, well, actually I'm going to double back a little bit to you. Thanks. A question I had, which is one of the reasons why I thought you would want to prefer keyboard and mouse over actions is the action series is potentially unbounded right you can jump walk left walk right but then also look up look left the bench it's unbounded so it's huge isn't it yeah i think yeah there's there's benefits to the action space being small to start with so i think we're going to start with anything you can control using a game controller but yeah long term we want to actually predict maybe like action embeddings and have models sit inside a general action space to be able to transfer out to other inputs as well yeah okay and then let's see going on the research side so uh genie sima yeah and then the co-founders yeah so there was the diamond paper there was genie and then there was sima the diamond paper for me was really interesting because they had actually managed to get this world model called diamond running on a consumer gpu i believe it was a 4090 at 10 FPS and you could play it. And they did that on like 90 hours of data, like 95 hours. I think it was 87 hours and I think eight in the whole that set or something like that. That was just incredible, right? That they had something playable on that little data. So I actually cold emailed the entire group of students and I told them, hey, I think we have this thing. And then it was pretty interesting. So like right when that happened, a lot of the labs also started understanding what we had. And so we started very aggressively, multiple labs tried to bring us in in various ways. And they were part of that. They basically were seeing that happen. And I think for them, that also kind of solidified how real it was. And then when we chose to do our own thing, initially, we thought that we were going to have to just work on role models. So we thought, okay, the main benefit of this data set is like Gini, is role models. 
What we didn't realize at the time is that we have so much of this data, so we can essentially do these role models in parallel and take the equivalent of the LLM bet, mostly on imitation learning and then use the world bottles after that to get into like our off stage right and so for us and eventually get rid of the world bottles this is something like evening i mean ideally you get rid of the imitation yeah the imitation learning but yeah we essentially realized that that we could get so far on just imitation learning the way to look at it is we essentially like let's let's take the element analogy we essentially have sort of the internet or like common crawl if you will and every single lab is trying to simulate that, right, in order to get similar data in order to train their agents. And so for us, the reason why we say independent, and we just said our own thing was we think we can essentially leap every single company that's forced to either be consumers of world models or build world models and take this foundation model bet for spatial-to-boreal agents and be in a place where, you know, we have a lot of customers years before any of the labs even get there. And maybe the most similar um comparison is like when anthropic did with code right anthropic just focused really really hard on nailing the code use case their models are incredible for a lot of their customers use it for it so we just want to become incredible at this spatial temporal agent use case and likely that starts in like game simulation and then using world models we can then start expanding out to to other um areas so would you show me a little bit of how you take does generalize our yeah things um but although games is kind of the common player yeah games and simulation um i i I would specify it as game engines in Vertikiller. So even if you're, for instance, simulating human behavior in Omniverse because you're trying to create better training data for factory floors, you can use it. Yeah. Maybe Meta has a similar dataset because of the quest. I never really asked them. I never really looked into the Meta quest specifically. So you need a few things. You can't just, like, there's lots of companies that have, like, maybe recorders, but you also need the public graph. Otherwise, you can't train on the data, right? You can't train on people's, like, private videos that they have saved somewhere. right and so i think you you you need the social network graph components um because these videos need to be on the internet to rank no to train on them yeah i i mean i think i think generally people don't like people don't want to train on like because these things they live on your device usually right yeah um and you can't train on anything that lives on your device like you actually need to go and upload and do your thing right for meta specifically i think also vr the scale of VR is still pretty small. The amount of environments in VR that have consumption at scale is probably in the hundreds, whereas on PC, it's probably in the tens of thousands. And so you get a lot less diversity. The three-dimensional input space of VR is pretty interesting. We see some of this too, obviously. 
And so, yeah, I do suspect Meta starts using these types of things, but it's unclear to me whether they can get to a similar scale of data or diversity on the environments as we can yeah there's a lot of challenges there yeah um okay i want to take this in and i'll see the through things but i guess let's let's fill up the papers maybe one more to mention is tire yeah which uh i actually i interviewed the dia offers but that too seems like the particular uh insight that they brought it overseas yeah so so anthony too who led the um research on gaia too is also one of the engineers that joined our team uh so it's all the Diamond, the core contributors for Diamond, and then Anthony. And we just had three more researchers showing this week. It's been a good week. And yes, I think a lot of the approaches in Gaia, too, were heavily inspired by Diamond. And then Vin Sa, who was one of the authors of Diamond, also already was at Wave by the time that I emailed them. Anthony also realized what this was and realized that, you know, you could scale world models to a much larger, like, scale and decided just to make the leap as well. So I think everybody that sees the data set makes the leap. because it's but it takes a while to wrap your head around it because it's like oh it's video games right like intuitively it doesn't make sense and then when you actually understand and you see right how we've been able to transfer it to physical world video and things like that then it makes sense and then everybody tends to jump don't call the video dance follow it around so then yeah if i lived in san francisco maybe i would yeah uh just a quick note because we actually cover all these papers in the latest day club uh sima 2 did not since had as much in technical one and i don't really know why they did it a lot more word g3 had a ton of impact and but i also felt like because you couldn't play with the model or people it just seems they're an extension of all those days i guess like any quick takes on sima 2 gna3 which were both his years yeah i'll talk about sima 2 the steerability of sima 2 was to me the most impressive part because lighting up the action sequences and the text conditioning is is quite hard to do right and so that and the fact that they were like it's also quite interesting that that means that they can sort of use gemini as as part of the flywheel right where um where you can sort of scale the scale this orchestrator as like an independent almost like a puppet master if you will and then like in theory gemini could orchestrate many instances of sima right that to me is the most interesting part is where I tend to agree with this where like I think our models will initially be used as like like you'll have like an orchestrator VLM of sorts that's kind of like managing instances and instructing them. And I think sort of SEMA showing that you can do this was fascinating. Also, the fact that you could, they didn't just have text conditioning, but they also were able to do like drawings and markings of where to go. They really took an interesting end-to-end approach to me that I look forward to seeing a lot more of. Are you talking to them? Is that the one collaborating? Yeah, I think that we're very friendly with DeepMind. We like them a lot. I just saw the team not too long ago. And I think, you know, big fans of their work. The fin line that I shared from Alice Heath's coverage review is you're the biggest bet in Vino Cross-Lazs made since OpenAI. 
yeah how did that conversation start okay so what now it's style and maybe i'll get slapped in the fingers for revealing this or whatever but uh forgive me if i have to do that um is he asked you to like draw a 2030 picture of your company and i think he just picks n plus five years whatever i don't know i did the same to you yeah um he asked you to like walk that back from first principles all the way from today and and and he has he expects you to do that flawlessly where he can challenge any assumption any part of the vision that that and he asks the questions right he has a very technical background he also has a bunch of technical people on fc and he truly backs people that have these like very large visions on that vision and the ability to defend it alone um and that's what he did for us um and i think that's why i made that bad so i think also through this question, he gets to know a lot of things about how technical you are. He gets to know how well you think from first principles, because if that vision is not connected to something real, it's very easy to suss it out by asking good questions. And then he just backs fully, I think. He really gets in your corner if it's the right fit. And yeah, they've been incredible partners. They've opened so many doors for us. I had to ask the question. I think it's a very notable story. Obviously, a lot of work went into it, but it's also worth it. Yeah, for sure. One of the things I also wanted to, I think I asked this question out of sequence, but one of the things that are exciting about talking to you is there are a lot of people like you who are founders of business and businesses that along the way have a ton of data and yours happens to be highly valuable you pursued before deciding to do an independence journey they also talk to other companies about potential licensing or acquisition and stuff like that what is your learnings from those periods also like one version of this is very simply how do you value data yeah i don't think you can value it unless you actually model it yourself and see what the capabilities are that's my that's my real outcome you say model but chain the model yeah but that's obviously like not doable for for everyone um and also i think my general advice would be as model capabilities increase you and models are also like you know these vlms they're very very good at labeling as well generally right what i was afraid of when i was having some of these conversations was okay like you know as the capabilities increase you're just going to need less uh ground-free data and like you can do more model-based data generation or synthetic data generation i would recommend if you're going to do large data deals like just try to get like a large chunk of equity in the company that you're doing it with um if you can now a lot of them won't do this but i think uh that to me would or just go do the research figure out what's actually possible in our case we were quite lucky in the sense that this is actually the foundation data, right? And I think, right, like that's not true for every data set. I think, you know, we just happened to hit a particular gold mine. But you also did the RedCriperity, you did the action thing one or five years ago. Yeah, so you did work. Yeah, that's the thing. Like you have to be grounded, right? And I think a lot of the, and I think that's the hard part. 
And I think a lot of what's interesting is you can also kind of look for if like scaling laws already exist on your data type which like for video there were some but for these like input action labeled uh sets there there really wasn't any the other question is like does it go into lms does it go into uh world models does it go into like what type of model is it going to be used for and i think that's an important thing to know and so i just want to you know if you're having these conversations with labs about data just like make sure that you actually understand like what it's going to be used for because that's a very very good way for you to like make that decision yourself about what are you going to pursue that now a lot of them won't tell you that and I think you know I think in that case you generally just don't want to do it because like I think for our case like we really cared that like for instance there weren't going to be competing products with game developers built right because we didn't want to like bite the hand that feeds us and I think we are part of the games industry so those questions I think are normal and then we eventually decided you know you just have the data we're just going to go do it ourselves and that's when the rest happened yeah and he assembled the team and then uh think about it i feel like that's you've aligned a lot of stars in order to make gi happen yeah that other data founders they're at the beginning of the training yes one data founder founders who happen to have beta but they have a main business right i don't know if you're there there's two sides to this right there it's really easy to be super naive about it and like i had a lot of people tell me initially oh it's not that valuable you're just like making this up and and and so for me like doing the work and actually understanding it myself was a really really big part of building that confidence and go start the company but a lot of times it is true that like model capabilities increase so quickly that like the certain data you just don't need anymore yeah um and so i think it is it's really important to like get people to do the work such that you can make these types of distinctions yeah and and and So my recommendation would be go build miles with your data, see if you can create any sort of capabilities that aren't clearly already there or on path to being there and then figure out where you go. Yeah. I did want to ask this earlier, but you're giving me the opportunity to, we say do the learning, do coursework and all that. And your co-founders gave you some homework. Yeah. Is this like some books? I mean, Coursera. No, this was Francois Fleurais. So he has a little book of deep learning and then he also has a full course that he's published on his website. I went through the entire course over the summer. I believe it's like something like 30 or 40 lectures, which also take home projects and things like that. And I would recommend anybody does this. It goes through, right? History of deep learning, like the topology. It takes you through the linear algebra, the calculus, eventually end up with like chain rule. And by this time you've done like all the more important concepts, it takes you through how do you create neural networks using these concepts that you've learned. Wow, this is super first principles. This guy, and I've had the opportunity to spend some time with him as well. He is one of the most first principles people I've met in my entire life. 
I'm convinced, like I actually asked him, why did you do this course? He said, oh, because I thought all the other courses weren't right. And because he is so first principles and he can only explain things from, like everything you see and how he explains this thing, it's everything is from first principles, including like the history of deep learning itself was part of the course. And yes, he goes through everything. And by the end of it, I now have a pretty good intuitive understanding of how everything works. But obviously still, I like to describe it as I'm like the guy who just got his driver's license. I can drive the car. And my co-founders are like the F1 drivers that have done this for years. They know where all the gaps are. And so I enjoy getting to learn from them. The cool thing is also that world models is just like a very, very new space. And so, you know, I got to bring ideas to the table that I get one thought of and not because I'm great at this. It's because it's such a new space that like people just haven't tried it yet. So let's get a hit on definition. Yeah. What are world models to you? You know, in a video model, you might predict the next likely sequence or the next most entertaining frame. frame um what world models do is they actually have to understand the the full range of possibilities and outcomes um from the current states and based on the action that you take generates the next states right so the next the next frame and so it is it is a much more sort of complex problem than than traditional video models so to me it is it is a world that is accurately generated based on the actions that you take as a result of what's already been generated and just a fact check uh that is it needs to understand physics it needs to understand if i'm building a type of material you need the power it drives with some type of material yeah i think the interactions is the most important part i think the reasons why role models are so fascinating one of the things that i did when i was studying over the summer was i tried to actually build a super rudimentary um pytorch physics engine which i would not recommend writing a physics engine in pytorch for obvious reasons, but I wanted to be able to, because it's differential, so you can generate this. Yeah, exactly. And then you can train. And so I wanted to, you know, I got so many people asked me about, you know, why aren't you just using, why aren't you just simulating or generating this data? And I really wanted to understand from first principles, why? 
And I think the most important thing that I figured out was the compute complexity of simulation goes up really, really rapidly with three variables first the numbers of agents in an environment second uh dare doff so their individuals are freedom yeah and then third the information that each action reveals um so like um for instance if you if you have a if you have a text action or a speech action the environment can change so much based on whether you say right water or fire that the outcomes are going to be completely different of like how a human would behave in that type of situation and so it goes up so quickly with those three variables that at some point you just hit a point where you just want to maximally bet on either video transfer or generation of these environments using world models because that type of stochasticity is just incredibly difficult but it's already very very present in a lot of the video pre-training that goes into into these world models right and so i think for for us it is more so about making a maximal bet on video transfer and interacting with things that are difficult to simulate and the steerability is also really interesting with text than it is on betting against simulation or something like that and so i think there's still a large market for for traditional simulation engines it's specifically in areas where video is really hard to get is this the thing what the big lads are also same when they're talking to that i honestly haven't talked about the big to the big labs like since we started working on them ourselves i think people are more reserved with what they share with us yeah of course it makes sense that's from you yeah question how would you contrast your version of or models less that they need yeah yeah i'm fluent yeah so i don't know exactly what young is doing today my understanding it's based on le fee japa like le japa approach which is so i'll start with feifei lee i think what's really interesting about feifei lee's approach is that you in some way are able to reuse the the um the spots right in game engines and in things that let you stay in verifiable domain um which i think is a really interesting approach However, my understanding is they're currently not interactive, which in my opinion is like the whole point of world models, right? It's environments. They're great environments. And I think from a business perspective, I think they picked a really important part of the tool chain. But to me, that's not really a world model. But my guess is they'll get there, right? They'll start generating. Yeah, they just haven't been reused. Yeah, exactly. Exactly. And I think, right, Fei Fei is one of the founders of the entire space. um uh so i think it's going to be really interesting to me on on on what maybe that interactive piece looks like for me to really judge their approach i i think we interviewed just before we moved to yon uh we interviewed her with justin johnson uh her co-founder he was he was more focused on the physics side of things and the interactivity and he just haven't finished it yet but i i do think that basically that the splats if you just add more dimensions on i guess the forces acting on them then then you get attracted to the out of the box because you are basically these are virtual atoms that then has all the global physics applied to them yeah i'm uh i'm excited to see what that looks like when they actually release it it's really hard really hard for me to comment on anything. 
I really like the frame-based approach because all of our video or all of our training data is in this format. Yes. Yeah, so we actually asked them about this and they were like, yeah, it's possible, but we're choosing the spider form. Yeah, and you can also go from splat to frames, right? I'm sure you can write like at some, it wouldn't be easy. Like you'd have to actually render out the environment. Sure, it's not going to be a simple problem, but like in theory, it has to be something that you can do if you really wanted to. So I could, cause it's almost like having a more sort of grounds for three dimensional representation of the underlying world. Yeah. Right. So I think it's an interesting approach. Um, it might be overkill, right. Uh, uh, you're also dealing with like a much larger, like degrees of freedom on the output space. Right. So, so who knows how well it scales. I like the fact that like, I think these video models also use things like auto encoders, right. You can actually have the world models predict like much smaller, um, uh, maybe like a prison or size. Yeah, exactly. And then you can use diffusion upscaling or methods like this to actually enrich. And so I think that world models just allow a much more, or world models in my sense, for a much more controlled space that we know really well. I'm not suggesting their approach is wrong. I'm just, you know, this is, I think, what we really like about it. Honestly, Jan's podcast that he did, I don't remember which one it was, but a long time ago where he basically proclaimed LLMs to be a dead end. was one of the things that inspired me to do this. I think this is very consensus around world models. Basically, everyone that has this is like stops with their LMs and just goes through to world models. I would say that the main perspective I ask this exact question to Norman Brown from OpenAI and do is like, well, they learn invisible models. So it's basically that we didn't say that we didn't say that we didn't say that. Or what do you want to put down here, everyone? Yes. So, yeah, I'm not one to proclaim LMs or dead ends personally i think um i think they're actually quite useful in particularly as orchestrators like the way i think about is as humans right we had sort of a three-dimensional worlds then we invented text as like a in a way in compression method right so you had we invented text in order to communicate with each other in in a common way uh in a way that actually compresses all this information that we are perceiving in three-dimensional space into just like a single sequence and i think that allowed sciences to emerge right it allowed so many literature like so many uh parts of the world that we that we charge so i think it's a critical part of uh of the whole picture i also agree that that uh it's very very clear that they do build sort of the internal implicit world models inside lms um and so i think they'll be very helpful as things like orchestrators um the problem is when it comes to the generalization I think text as a generalization backbone. 
Honestly, Yann's podcast, I don't remember which one it was, but a long time ago, where he basically proclaimed LLMs to be a dead end, was one of the things that inspired me to do this. I think this is pretty consensus around world models now; basically everyone working on this says, stop with the LLMs and just go straight to world models. I would say the main counter-perspective, and I asked this exact question to Noam Brown from OpenAI, is, well, LLMs learn implicit world models anyway.

Yeah, I'm not one to proclaim LLMs dead ends personally. I think they're actually quite useful, particularly as orchestrators. The way I think about it is, as humans we had a three-dimensional world, and then we invented text as, in a way, a compression method: we invented text in order to communicate with each other in a common way, in a way that compresses all the information we perceive in three-dimensional space into a single sequence. I think that allowed the sciences to emerge, it allowed literature, so many parts of the world that we cherish. So I think it's a critical part of the whole picture. I also agree that it's very clear LLMs do build internal, implicit world models, and so I think they'll be very helpful as things like orchestrators.

The problem is when it comes to generalization. When most of the pre-training is text, or largely text sequences, I think you want that backbone to be more spatiotemporal in nature and then have text as just one part of it. And the other argument against LLMs is, for instance, the autoregressive nature of the prediction itself: the fact that you run the entire output through the transformer in order to predict the next token, when the environment in the real world is continuous, right? It's always changing, and LLMs kind of just forget about that. I think a lot of the argument is in the first point, though: the fact that text doesn't necessarily generalize well to spatiotemporal context, and then the autoregressive nature of the prediction and using text for that. Those are the two main arguments. I think text prediction is just one of the actions that is going to come out of these policies and world models; speech and text generation will just be actions that can be part of that.

I think there will be labs coming at this problem from both sides, and everyone ends up in roughly the same place, and the same place will be whatever people think is cool, right? Whatever the consensus thinks is closest to AGI. So I don't think there's a clear answer. I think it's really interesting to come at it from the world-modeling side, but it's also because we have to, because text is largely commoditized; we can import all the text work. It's sort of like you're taking a step back and starting your own branch of the ML research tree, but you might actually just end up recovering all the other text stuff eventually. Yeah, we can import a lot of that research.
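One way to picture "text is just one of the actions": a single spatiotemporal backbone feeding several action heads, where a text or speech token head sits alongside controller outputs. This is a hypothetical sketch; the names, shapes, and head choices are mine, not GI's.

```python
import torch
import torch.nn as nn

class SpatiotemporalPolicy(nn.Module):
    """Hypothetical sketch: one backbone over video features, many action heads."""

    def __init__(self, feat_dim: int = 512, vocab_size: int = 32000):
        super().__init__()
        # Backbone over a sequence of per-frame features (the frame encoder is omitted).
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Heads: continuous controls, discrete buttons, and text tokens are all just actions.
        self.stick_head = nn.Linear(feat_dim, 4)          # e.g. two analog sticks (x, y each)
        self.button_head = nn.Linear(feat_dim, 12)        # e.g. 12 controller buttons (logits)
        self.text_head = nn.Linear(feat_dim, vocab_size)  # next text/speech token (logits)

    def forward(self, frame_feats: torch.Tensor) -> dict[str, torch.Tensor]:
        h = self.backbone(frame_feats)[:, -1]             # state after the latest frame
        return {
            "sticks": torch.tanh(self.stick_head(h)),
            "buttons": self.button_head(h),
            "text_token": self.text_head(h),
        }

policy = SpatiotemporalPolicy()
out = policy(torch.randn(1, 16, 512))                     # 16 frames of 512-d features
print({k: v.shape for k, v in out.items()})
```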
That's really cool on the research side. Let's talk about the stuff GI is producing, the research and product output. You mentioned the word customers: who are your current customers?

So we're already working with some of the largest game developers in the world, and we're also working with game engines directly. Really, what we're doing at the moment is replacing essentially the player controller inside a game engine. Anything that you're currently doing with, say, behavior trees or things you're deterministically coding, we hope to replace with a single API, which is just: you stream us frames and we predict actions. That can be inside an engine, or eventually even in the real world. Hopefully those are then also steerable; the models you saw weren't text-steerable yet, but we want to get to a point where they're fully text-steerable. Steerable meaning, like, "I want you to go do this or that"? Yeah, it's text conditioning on the generation. We want to get to a point, and that's why it's called General Intuition, where we can mimic the intuition of all these gamers into human-like behaviors in any situation.

As I mentioned, the lab is also named after the Demis Hassabis quote about AlphaFold, which is: wouldn't it be amazing if we could mimic the intuition of these gamers, who are, by the way, only amateur biologists. On his path to AlphaFold he tried to get an AI to play Foldit, to generate a lot of data for AlphaFold. So for us, the north star, what we hope to get to one day, is being able to represent scientific problems in three-dimensional space and then have a spatial agent capable of perceiving that space and, hopefully using the text reasoning capabilities that LLMs have today in addition to the spatiotemporal capabilities, being able to work on the other side of that problem. That's the north star, and that's why we're trying to be hyper-focused on spatial and world-model workloads, the same way Anthropic was hyper-focused on code, and use that to get into organizations and expand from there.

Just as a side note, since you mentioned Anthropic: any idea what they did to solve for that? Honestly, of any lab I probably know Anthropic the least. I admire them, though. The current working theory is that they had a super lucky roll of the dice, and then it compounds from there. That sounds like a nice story; I'm sure it's not just that.

Okay, so why do the game developers want this? If you're a game developer, how well you retain players, if you have a game that's already at scale, is decently dependent on how good your bots are. If you're logging in at an obscure time, let's say 3 a.m. in America, and your player liquidity is low, then you need really, really good bots to keep those players engaged. This is known? Is this a thing? Yeah, for sure, for a lot of games it is. And as a human, do I want to play against bots? Usually it's not just bots, it's players mixed in with bots, because you don't want to play only against bots, but it's better to have a full game than an empty one. So as long as it's part of the environment, I think it's okay. That means you also have to grade that skill level. Yeah, which we can do, because we know exactly how good people are at these games.
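To make the "stream us frames, we predict actions" player-controller replacement described a moment ago concrete, here is a hypothetical client-side sketch. The endpoint, payload fields, and response format are invented for illustration and are not GI's actual API.

```python
import base64
import requests  # any HTTP client would do; assumed available

API_URL = "https://api.example-gi.dev/v1/act"  # hypothetical endpoint

def predict_action(frame_png: bytes, session_id: str, api_key: str) -> dict:
    """Send the latest rendered frame, get back a controller action to apply."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "session_id": session_id,                       # keeps temporal context server-side
            "frame": base64.b64encode(frame_png).decode(),  # raw pixels only, no game state
            "hint": "defend the objective",                 # optional text steering (aspirational)
        },
        timeout=0.1,  # must be near real-time to stand in for a player controller
    )
    resp.raise_for_status()
    return resp.json()  # e.g. {"sticks": [0.2, -0.7, 0.0, 0.1], "buttons": ["jump"]}

# Game loop (sketch): each tick, render a frame -> predict_action(...) -> feed the result
# into the engine's input system where the behavior tree or scripted bot used to be.
```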
I think for us, that's kind of like step one, right? What I was showing you is we're building a general agent that can play any game in real time. But really that extends into all of simulation. In GTA V, for instance, people are generally role-playing real life, so they're actually behaving in ways quite aligned with the goals they set for themselves. You have all these examples represented in video games: you have truck simulators, Power Wash Simulator. Power Wash Simulator? Yeah, Power Wash Simulator, where the behaviors you'd want an agent to be able to perceive are all there.

It's wild how seriously some gamers take the truck simulators. If you haven't seen those clips, you should watch them; they buy the whole truck-driving setup and they're doing the job of a truck driver. Like I mentioned, at any given time we have more people on Metal playing with steering wheels in truck simulators and these types of games than Waymo has cars on the road. It's a ridiculous stat, but it's true. I used to think that, well, to solve self-driving you could kind of just train it to play GTA V. I mean, it's not a bad instinct. Our bet is not that we can zero-shot any of these things; it's that the next self-driving company can maybe collect one percent of the data, because, for instance, clips already self-select into negative events and adversity. A lot of our data set, because it's already highlights, is precisely what a lot of these companies spend their last 20% chasing. I think that's the main argument. If you're another company looking at what we're doing, the thing to understand is that anything you're currently doing in pre-training, as long as your robot can be controlled using a game controller, we hope we can move to post-training for you. So our bet is not that we can create the next self-driving car company; it's that the next self-driving car company hopefully only needs 1%, or maybe 10%, of the data, I don't know, to deliver a really good product.

The term that comes to mind a lot is active learning; I don't know if you identify with that. It got less cool for a while and now it seems to be back on the uptrend, and obviously you have the best data set for the high-intensity events. You said negative, but I feel like it doesn't have to be negative; it could be either end of the distribution. Yeah, for sure.
I think "negative events" is just the most common term people use; if you're Tesla, you want the crashes, right. But in gaming it's both. The model that you saw obviously had really incredible moments, and that was largely because it had a large representation of people at their best. Yes. And at their worst. Amazing. Okay, cool. Anything else on the customer-development side that you want to touch on?

We're also already working with robotics companies, and in manufacturing, but the key is that the robot has to have gaming inputs. Our bet is not that we can transfer over to higher-DOF robots beyond what a game controller or keyboard and mouse can express; it's really just that we can move the hard work of pre-training, hopefully, to post-training. It's kind of like a foundation model that's a very good basis. Right: you're going to give us frames, and likely some text, or you'll license the model because you want to own the post-training. Our business model is initially going to be an API, a bit like the Anthropic API. But you also saw, for instance, some of the video-labeling models we've been able to develop, so the goal is for any company to be able to bring in their video data as well. We can create custom versions of the policy, the agent, for you first, and if that doesn't fit, then, as with a customer we're already working with, we distill a model and they turn that into a product for themselves. So people can engage with us at the agent level, the API level, or at the model level. Can you also buy data?
No, we don't sell data. Okay, cool, so that's the business. And is there a world in which, I mean, I think this is on your landing page, that you are a frontier lab for world models, is there a world in which a more application-layer thing comes out of this, like a ChatGPT for whatever?

Yeah, you're going to see us launch a few things on Metal itself that are going to blow your mind as a result of this agent. I'll leave that to the imagination for now. And on the world-modeling side, I think one thing people underestimate is that Metal is already one of the largest video consumption platforms as well; people watch millions and millions of videos a day. So world-model-based entertainment and things like that, while not a focus for us right now, on the consumer side we have the ability to move very, very quickly there and get it integrated in a way that I don't think anyone else can. You could theoretically do a video-gen feed, like Sora, and theoretically generate clips that nobody actually played. You know, I think for us, the games being so human-centric is a really big part of what makes this special; I actually just don't think that would work. One thing we are really excited about, though, and I'll give you one sneak peek of what we're thinking about: what if you could literally replay any of the clips that you have inside a world model, or your friends could play them? I showed you a model that already took part of your clip as context, so you could replay it and re-enter that world. It's also how we go from imitation learning to RL, because it's part of our research roadmap anyway to make every single clip on Metal playable. So who's to say that doesn't apply to the actual clips that you take?

Can you say more about the RL potential? We describe Metal as the episodic memory of humanity in simulation. When you take a clip, really the way to think about it is you get the highlight of what is maybe three hours of playtime; you maybe get two to three minutes of the things that were the most out of distribution. It is genuinely your episodic memory of that playtime in simulation, the things you most want to remember and share. We want to be able to load, and this is the work that Anthony Hu is doing and the reason why we built world models, every crash that you run into in Euro Truck Simulator or American Truck Simulator or a driving game. And again, these are ground-truth labels, so we know precisely the actions that led up to the negative events. They're also title-labeled: when people upload to the platform, they say, oh, it's a crash. So we can select all these events, and if we can put them inside a world model, we can train reward models that reward based on how you perform in clips that actually contain negative events, for example. So for us it's very much about the fact that we can create this LLM moment on, I think, imitation learning, but actually making every single clip on the platform playable, at billions-of-clips scale, is how we go from imitation learning to RL.
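A minimal sketch of the second half of that pipeline: a reward model trained on clips' negative-event labels, which could later score policy rollouts inside the world-model replay of each clip. All names, shapes, and the training setup are assumptions for illustration, not GI's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical clip representation: (frame features, ground-truth actions, event label e.g. crash=1).
class ClipRewardModel(nn.Module):
    """Scores a rollout's frame features; trained to predict whether a negative event occurs."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, rollout_feats: torch.Tensor) -> torch.Tensor:
        # Mean-pool features over time, output the logit of "negative event happened".
        return self.scorer(rollout_feats.mean(dim=1)).squeeze(-1)

reward_model = ClipRewardModel()
bce = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Stage 1 (imitation): behavior-clone a policy on (frames, ground-truth actions) -- omitted here.
# Stage 2 (toward RL): fit the reward model on the clips' event labels, then use its score
# (e.g. penalize predicted crashes) when rolling the policy out inside the world-model replay.
feats = torch.randn(8, 32, 512)             # batch of 8 rollouts, 32 timesteps of 512-d features
labels = torch.randint(0, 2, (8,)).float()  # 1 = clip contains a negative event
loss = bce(reward_model(feats), labels)
loss.backward()
opt.step()
print(float(loss))
```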
Cool, we covered a lot. Is there anything else you want to cover before we wrap up with the long-term vision stuff?

Yeah, I think for us this is a very, very ambitious long-term bet, and we need the best researchers in the world who want to work on this stuff. It's really exciting not being extremely data-constrained; we get so many learnings every week that we didn't think were possible, and it makes it a joy working here. The other thing is, because we have such a large data moat, we don't have to be as concerned as the LLM companies about publishing, because even if we publish, no one can replicate the models without the data, right? And so we really want to bring back the original culture of open research, which is why we did the partnership with Kyutai in France. Oh, I actually didn't know that. Yeah, we just announced our partnership with Kyutai in France, an open-science lab in Paris, one of the best research labs in the world; Eric Schmidt, I believe, funded it in addition to some French backers. They are essentially acting as the partner currently doing a lot of open research on the data. We also want to partner with universities, because we do believe this is the frontier, but it's so data-constrained that everyone has their hands tied behind their back right now, and we want to help fix that. So, for instance, we want to work with universities to build negative-event prediction models for, say, trucks in India, on all the truck data where these crashes occur. We have all these things we know we can do that we just don't have the time to do. So if you're listening to this at, say, an academic institution and you want access to some of this data in a research or educational fashion, I think we're quite open to that, because we want to educate people. Other than that, we just want to work with the best infrastructure and research engineers on the planet as we go into scaling runs with thousands, eventually hundreds of thousands, of GPUs.

Amazing. I primed you with this as the closing question: what does GI become by 2030? In 2030, we want to be the gold standard of intelligence. Any sequence long enough is fundamentally spatial and temporal, right? Which I think is...
So by nailing spatial and temporal reasoning, you go after the root problem of intelligence itself. What does the world look like then? I group the stages of AI in three, and I credit Andrej Karpathy for this framing: bits-to-bits, atoms-to-bits and bits-to-atoms, and then atoms-to-atoms. In the atoms-to-atoms stage, I want GI models to be responsible for 80% of all the atoms-to-atoms interactions driven by AI models. And the reason is that, because we were able to unlock intelligence so quickly, and in robotics intelligence is the bottleneck, supply chains actually converged on gaming inputs as their primary input method, and converged on essentially simpler systems that let us do a lot more, a lot quicker. So we are essentially the 80% market approach, and then you have lots of companies with specialized, maybe humanoid-robot OS stacks, that are the other 20%. So I want to be responsible for 80% of all the atoms-to-atoms interactions driven by these models, be the gold standard for intelligence, and do maybe 100x more in simulation, because I think simulation will actually be the larger market initially. In simulation you have very few constraints, and from a safety perspective simulation is much easier, so I think a lot of the takeoff initially sits in simulation. The simulation use cases, like the scientific use cases I mentioned, I'm really, really excited about. So, yeah: 80% of atoms-to-atoms interactions coming downstream from these types of spatial world foundation models, and then 100x more in simulation.

It reminds me a lot of what Mark and Priscilla from the Chan Zuckerberg Initiative are doing with virtual biology, because you can do a lot of simulation and you can do it a lot faster with this. Amazing. Thank you for inviting us to your office, and thank you for sharing a little bit of what you're building. Thank you.
Related Episodes
- #228 - GPT 5.2, Scaling Agents, Weird Generalization (Last Week in AI, 1h 26m)
- ⚡️Jailbreaking AGI: Pliny the Liberator & John V on Red Teaming, BT6, and the Future of AI Security (Latent Space)
- Why Physical AI Needed a Completely New Data Stack (Gradient Dissent, 1h 0m)
- AI to AE's: Grit, Glean, and Kleiner Perkins' next Enterprise AI hit — Joubin Mirzadegan, Roadrunner (Latent Space)
- ⚡️ 10x AI Engineers with $1m Salaries — Alex Lieberman & Arman Hezarkhani, Tenex (Latent Space)
- Anthropic, Glean & OpenRouter: How AI Moats Are Built with Deedy Das of Menlo Ventures (Latent Space)