
![DeepMind Genie 3 [World Exclusive] (Jack Parker Holder, Shlomi Fruchter)](https://d3t3ozftmdmh3i.cloudfront.net/staging/podcast_uploaded_episode/4981699/4981699-1754404123587-11ae564331fec.jpg)
DeepMind Genie 3 [World Exclusive] (Jack Parker Holder, Shlomi Fruchter)
Machine Learning Street Talk
What You'll Learn
- ✓ Genie 3 is a generative AI model that can create interactive, photorealistic virtual environments from text prompts
- ✓ It offers significant improvements over previous versions, including longer simulation horizons, diverse environments, and the ability to introduce dynamic events
- ✓ The hosts see potential applications in areas like robotic simulation and interactive entertainment, but note that Genie 3 is still a research prototype with limitations
- ✓ The hosts express concerns about tech giants like Meta/Facebook potentially trying to acquire the Genie technology
- ✓ The hosts discuss the challenge of modeling the full creativity and unpredictability of the real world within a constrained AI system
AI Summary
This episode of Machine Learning Street Talk provides an exclusive look at DeepMind's latest AI model, Genie 3, which can generate interactive, photorealistic virtual environments from text prompts. Genie 3 builds upon previous versions by offering longer simulation horizons, diverse environments, and the ability to introduce dynamic events. The hosts discuss the potential applications of this technology, particularly in areas like robotic simulation and interactive entertainment, while also acknowledging its current limitations as a research prototype.
Key Points
1. Genie 3 is a generative AI model that can create interactive, photorealistic virtual environments from text prompts
2. It offers significant improvements over previous versions, including longer simulation horizons, diverse environments, and the ability to introduce dynamic events
3. The hosts see potential applications in areas like robotic simulation and interactive entertainment, but note that Genie 3 is still a research prototype with limitations
4. The hosts express concerns about tech giants like Meta/Facebook potentially trying to acquire the Genie technology
5. The hosts discuss the challenge of modeling the full creativity and unpredictability of the real world within a constrained AI system
Topics Discussed
- Generative AI
- Interactive virtual environments
- Robotic simulation
- Interactive entertainment
- AI research and development
Frequently Asked Questions
What is "DeepMind Genie 3 [World Exclusive] (Jack Parker Holder, Shlomi Fruchter)" about?
This episode of Machine Learning Street Talk provides an exclusive look at DeepMind's latest AI model, Genie 3, which can generate interactive, photorealistic virtual environments from text prompts. Genie 3 builds upon previous versions by offering longer simulation horizons, diverse environments, and the ability to introduce dynamic events. The hosts discuss the potential applications of this technology, particularly in areas like robotic simulation and interactive entertainment, while also acknowledging its current limitations as a research prototype.
What topics are discussed in this episode?
This episode covers the following topics: Generative AI, Interactive virtual environments, Robotic simulation, Interactive entertainment, AI research and development.
What is key insight #1 from this episode?
Genie 3 is a generative AI model that can create interactive, photorealistic virtual environments from text prompts
What is key insight #2 from this episode?
It offers significant improvements over previous versions, including longer simulation horizons, diverse environments, and the ability to introduce dynamic events
What is key insight #3 from this episode?
The hosts see potential applications in areas like robotic simulation and interactive entertainment, but note that Genie 3 is still a research prototype with limitations
What is key insight #4 from this episode?
The hosts express concerns about tech giants like Meta/Facebook potentially trying to acquire the Genie technology
Who should listen to this episode?
This episode is recommended for anyone interested in Generative AI, Interactive virtual environments, Robotic simulation, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
This episode features Shlomi Fruchter and Jack Parker-Holder from Google DeepMind, who are unveiling a new AI called Genie 3. The host, Tim Scarfe, describes it as the most mind-blowing technology he has ever seen. We were invited to their offices to conduct the interview (not sponsored).

Imagine you could create a video game world just by describing it. That's what Genie 3 does. It's an AI "world model" that learns how the real world works by watching massive amounts of video. Unlike a normal video game engine (like Unreal, or the one for Doom) that needs to be programmed manually, Genie generates a realistic, interactive 3D world from a simple text prompt.

***SPONSOR MESSAGES***
Prolific: Quality data. From real people. For faster breakthroughs.
https://prolific.com/mlst?utm_campaign=98404559-MLST&utm_source=youtube&utm_medium=podcast&utm_content=script-gen
***

Here's a breakdown of what makes it so revolutionary:
- From Text to a Virtual World: You can type "a drone flying by a beautiful lake" or "a ski slope," and Genie 3 creates that world for you in about three seconds. You can then navigate and interact with it in real time.
- It's Consistent: The worlds it creates have a reliable memory. If you look away from an object and then look back, it will still be there, just as it was. The guests explain that this consistency isn't explicitly programmed in; it's a surprising, "emergent" capability of the powerful AI model.
- A Huge Leap Forward: The previous version, Genie 2, was a major step, but it wasn't fast enough for real-time interaction and was much lower resolution. Genie 3 is 720p, interactive, and photorealistic, running smoothly for several minutes at a time.
- The Killer App, Training Robots: Beyond entertainment, the team sees Genie 3 as a game-changer for training AI. Instead of training a self-driving car or a robot in the real world (which is slow and dangerous), you can create infinite simulations. You can even prompt rare events to happen, like a deer running across the road, to teach an AI how to handle unexpected situations safely.
- The Future of Entertainment: This could lead to a "YouTube version 2" or a new form of VR, where users can create and explore endless, interconnected worlds together, like the experience machine from philosophy.

While the technology is still a research prototype and not yet available to the public, it represents a monumental step towards creating true artificial worlds from the ground up.

Jack Parker-Holder [Research Scientist at Google DeepMind in the Open-Endedness Team]
https://jparkerholder.github.io/
Shlomi Fruchter [Research Director, Google DeepMind]
https://shlomifruchter.github.io/

TOC:
[00:00:00] - Introduction: "The Most Mind-Blowing Technology I've Ever Seen"
[00:02:30] - The Evolution from Genie 1 to Genie 2
[00:04:30] - Enter Genie 3: Photorealistic, Interactive Worlds from Text
[00:07:00] - Promptable World Events & Training Self-Driving Cars
[00:14:21] - Guest Introductions: Shlomi Fruchter & Jack Parker-Holder
[00:15:08] - Core Concepts: What is a "World Model"?
[00:19:30] - The Challenge of Consistency in a Generated World
[00:21:15] - Context: The Neural Network Doom Simulation
[00:25:25] - How Do You Measure the Quality of a World Model?
[00:28:09] - The Vision: Using Genie to Train Advanced Robots
[00:32:21] - Open-Endedness: Human Skill and Prompting Creativity
[00:38:15] - The Future: Is This the Next YouTube or VR?
[00:42:18] - The Next Step: Multi-Agent Simulations
[00:52:51] - Limitations: Thinking, Computation, and the Sim-to-Real Gap
[00:58:07] - Conclusion & The Future of Game Engines

REFS:
World Models [David Ha, Jürgen Schmidhuber]
https://arxiv.org/abs/1803.10122
POET
https://arxiv.org/abs/1901.01753
The Fractured Entangled Representation Hypothesis [Akarsh Kumar, Jeff Clune, Joel Lehman, Kenneth O. Stanley]
https://arxiv.org/pdf/2505.11581

TRANSCRIPT:
https://app.rescript.info/public/share/Zk5tZXk6mb06yYOFh6nSja7Lg6_qZkgkuXQ-kl5AJqM
Full Transcript
By the way, look at this dog. This is amazing. This is insane. What was the prompt to create that? Today is a world exclusive of what is, in my opinion, the most mind-blowing technology I've ever seen and the most poggers I've ever been. You're not going to believe what Google DeepMind showed me in an exclusive demo in London last week. This technology might be the next trillion-dollar business and might be the killer use case for virtual reality. Google DeepMind has been slaying so hard recently that even Gemini DeepThink can't count the number of wins in the context window. Let me explain. Today we're going to talk about a new class of AI models, which are called generative interactive environments. They're not quite like traditional game engines or simulators or even generative video models like Veo, but they do have characteristics of all three. They're basically a world model and video generator, which is interactive. You can hook up a game controller or any kind of controller for that matter. DeepMind say that a world model is a system that can simulate the dynamics of an environment. The consistency is emergent. There is nothing explicit. The model doesn't create any explicit 3D representation. How do you square the circle between like a stochastic neural network and yet it has consistency, right? So I look over here, I look back, I look there again. The thing is back. Like, isn't it a bit weird that a sub-symbolic stochastic model can give us apparently consistent, like solid maps of the world? Do you remember the Quake engine in 1996? It required explicit programming of the physics and rules and interactions. But this new generation of AI systems learned real-world dynamics directly from video data. You can control an agent in the world in real time. The move towards generative world models was born from the limitations of hand-coded simulators. Even their most advanced platform, XLand, which was designed for general agent training, it was the frontier for embodied agent training with curriculum learning, but it felt far from the real world. It was almost cartoon-like. It could model 25 billion tasks, but it was still handcrafted, it was constrained to the rules of that particular domain, and it was janky. Imagine if you could just generate any interactive world you wanted to train your agents on with a simple prompt. Now, cast your minds back to last year when I interviewed Ashley Edwards at ICML. This was the first version of Genie, which was trained on 30,000 hours of 2D platformer game recordings. When we're generating next frames, the objects that are further away are moving more slowly than objects that are closer.
And this is a sort of effect that you would often see in games such that you can kind of simulate depth. It's something that we also have, you know, like when we observe things moving, we see things moving slowly when they're further away. So, yeah, the model learned that. Just being able to be that good at understanding the physical world was not something we were expecting it to be that good at that quickly. The core innovation of Genie 1 was a spatiotemporal video tokenizer that converts raw footage into processable tokens, a latent action model that discovered meaningful controls without labeled data, and an autoregressive dynamics model which predicted future states. The latent action model, a form of unsupervised action learning, was the core innovation. Genie discovered eight discrete actions which remained consistent across different environments, purely by analysing frame-to-frame changes in game recordings. This means it knew what jump meant or what move left meant without being explicitly trained on those actions. This was an OMG moment for me. I mean, how was that even possible from training on offline game episodes? Even more surprising was how it seemed to have emergent capabilities, like 2.5D parallax. Just 10 months later, Genie 2 arrived with 3D capabilities and near real-time performance. The visual fidelity was much higher. Now it can simulate realistic lighting, like the Unreal Engine. You know, things like smoke, fire, water, gravity. Pretty much anything you might see in a real game. It even had a reliable memory. You know, you could look away from something and bring it back into view and it would remember the thing. This is GigaChad, Jack Parker-Holder. He's a research scientist at Google DeepMind in the Open-Endedness team, talking about Genie 2 with Demis, no less. This is a photograph taken by someone in our team somewhere in California. And what we then do is ask Genie to convert this into an interactive world. So we prompt the model with this image and Genie converts it into a game-like world that you can then interact in. Every further pixel is generated by a generative AI model. So the AI is making up this scene as it goes along. Exactly, yes. Someone from our team is actually playing this. They're pressing the W key to move forwards. And then from that point onwards, every subsequent frame is generated by the AI. Around the same time last year, you'll probably remember this by the way, DeepMind's Israel team led by Shlomi Fruchter showed diffusion models simulating the Doom engine. The system was called GameNGen. It's almost a meme at this point how Doom runs on calculators and toasters. But here is a neural network confabulating a Doom game frame by frame in real time. Look at how it just knows what the health is. You can shoot characters, you can open doors, navigate around maps. Occasionally it was slightly glitchy, but this is just unreal. You know, you could just simulate Doom at 25 frames a second on a single TPU. The only limitation, of course, was that it could only do Doom and nothing else. So last week, we waltzed our way into London, and Jack and Shlomi gave us a demo of Genie 3. Honestly, I couldn't believe what I was seeing. The resolution is now 720p, which is firmly in the good enough territory to suspend disbelief. It's real-time. It can simulate real-world photorealistic experiences, which can continue for several minutes before running out of context.
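As an aside for readers who want something concrete: the Genie 1 recipe described above (a spatiotemporal video tokenizer, a latent action model, and an autoregressive dynamics model) can be caricatured in a few lines of Python. This is a purely illustrative sketch, not DeepMind's code; the classes are stand-ins with toy logic, and only the eight-action codebook size comes from the published description.

```python
import numpy as np

NUM_LATENT_ACTIONS = 8  # Genie 1 reportedly discovered 8 discrete latent actions

class VideoTokenizer:
    """Stand-in for the spatiotemporal tokenizer: frames -> discrete tokens."""
    def encode(self, frame: np.ndarray) -> np.ndarray:
        # Real model: learned VQ codes; here we just downsample and quantise.
        return (frame[::8, ::8].mean(axis=-1) * 16).astype(np.int64)

class LatentActionModel:
    """Stand-in for unsupervised action discovery from consecutive frames."""
    def infer_action(self, prev_tokens: np.ndarray, next_tokens: np.ndarray) -> int:
        # Real model: learns a small discrete code that explains the change.
        return int(np.abs(next_tokens - prev_tokens).sum()) % NUM_LATENT_ACTIONS

class DynamicsModel:
    """Stand-in for the autoregressive dynamics model over token history."""
    def predict(self, token_history: list, action: int) -> np.ndarray:
        # Real model: a transformer conditioned on past tokens + latent action.
        rng = np.random.default_rng(action)
        return token_history[-1] + rng.integers(-1, 2, token_history[-1].shape)

tokenizer, lam, dynamics = VideoTokenizer(), LatentActionModel(), DynamicsModel()
frame = np.random.rand(64, 64, 3)            # a dummy 64x64 RGB frame
history = [tokenizer.encode(frame)]
for action in [0, 3, 3, 7]:                  # latent actions chosen by a player or agent
    history.append(dynamics.predict(history, action))
print("recovered latent action:", lam.infer_action(history[0], history[1]))
```

The only point the sketch is meant to carry is the division of labour: tokenize video, infer a small discrete action vocabulary without labels, then roll the dynamics forward one step per chosen action.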
Shlomi had his hands all over Veo 3, by the way, and they seem to have combined elements of the Genie architecture with Veo, producing something I can only describe as Veo on steroids. Unlike Genie 1 and 2, the input is now a text prompt, not an image, which they argued is a good thing from a flexibility perspective, but it does mean that you can no longer take a photo of a real place and generate from there. One of the main features of Genie 3 is that it has a diversity of environments, a long horizon and promptable world events. Now on the world events, let's take this ski slope example. We might type in "another skier appears wearing a Genie 3 T-shirt" or "a deer runs down the slope" and there you are, things just happen in the world. They say that this might be very helpful for modeling things like self-driving cars, where you can simulate rare events, but I was left thinking that this is just turtles all the way down. How can we write a process to prompt the potentially infinite number of rare things which could happen in a scene? There was an example they showed of flying around a lake and it was amazing, but I was thinking, where are the birds, mate? Like, can you type the birds into the prompt? The team believes that we haven't yet had the Move 37 moment for embodied agents, you know, where an agent discovers a novel real-world strategy. They see Genie 3 as the key to enabling that, but the real world constantly surprises us because the real world is creative. Creativity simply means that the tree of things which can happen keeps growing. New branches and leaves just keep appearing. Perhaps in the future we might have an outer loop which makes the system more open-ended, but right now, in my opinion, Genie 3, like all AI, gives you exactly what you ask for in the prompts and isn't creative on its own. Currently the system only supports a single-agent experience, but imagine how cool it would be if you could extend that to a multi-agent system. Apparently they are working on that. I mean, personally I'm most excited about a new modality of interactive entertainment. You know, just imagine YouTube version two. DeepMind sees the main use case of being able to train robotic simulations as being the real game changer. This seems plausible to me. I mean, the miracle of human cognition, or of brains, is that we have evolved to simulate the world without direct physical experience, which is expensive. This is basically the same idea, right? Why train in the real world if we can just simulate any possible scenario in a computer, just like that Black Mirror episode? Here's a couple of examples they gave of using simulated environments to train an agent to do some specific language tasks. Now with Genie 2 they said they were happy if it was consistent even for 20 seconds, but now when you notice something inaccurate it's very surprising. The key thing is that it now extends beyond the prediction horizon of the average human and the glitches are getting harder and harder to spot. They said that Genie 2 wasn't actually real-time. You had to wait a few seconds between taking different actions. You know, it was low resolution, had limited memory. You know, I mean, it was superficially really good, but it didn't look particularly photorealistic. Genie 3 changes all of that. So Genie 1 supported around 10 seconds of generation, Genie 2 around 20 seconds. Genie 3 is able to simulate interactive environments for multiple minutes. This time around, they were a little bit more tight-lipped around the architecture.
They wanted to focus on capabilities in the interview. And that's fair enough. I mean, it's understandable, given that this is potentially a trillion-dollar business and Zuck will be sniffing around like a truffle hound. My biggest concern with this is that as soon as Zuck gets wind of this, he is going to be getting out his checkbook. He's going to go straight to Jack and Shlomi and he's going to be like, come on, boys, $100 million, come and work for me. Zuck, mate, seriously, no, don't do it. These guys, they're doing god's work over here. You need to just let them do what they're doing. You can make it yourself if you want. Zuck, leave them alone. I should say I did joke at the end of the interview that if you are learning Unreal Engine right now you might want to pivot to a different career, but the Google guys were quite grounded. They argued that this is a different type of technology, there are pros and cons, you know, which is fair. I should stress that as amazing as this technology is, it's still a neural network and it still has many important limitations. Certainly, though, just imagine how easily you could generate interactive motion graphics with this technology. You know, that's something that Unreal Engine has been leaning hard towards in version 5.6. So do I need to fire my motion graphics designers, Victoria? Will users be able to use this? Not anytime soon. This is still a research prototype. And given the obvious safety concerns, they're going to open this up progressively through their testing program. One question did come up in the press conference yesterday, though: could it generate an ancient battle? And Shlomi said that it's not trained on that kind of data and wouldn't be able to do that yet, so, I mean, certainly not a specific historical battle anyway. So it does sound like there are still some limitations. How can a system like this ever be fully reliable? Well, they did say that with better models the trend is that they get more and more accurate, the glitches become fewer, and they expect to see further improvements. You know, there's this annoying phrase, "this is the worst the model will ever be." But even so, they said they can generate some edge cases using a whole bunch of prompt augmentations, but it might just be turtles all the way down. You know, how do you come up with all of the rare black swan events that might happen? So what data was it trained on? They were quite cagey about this as well. It's probably safe to assume that it's been trained on all of YouTube and lots more besides that. How much compute does this thing need? Well, I asked them that and they were a little bit vague about it. They said that it ran on their TPU network. So I'm inferring from that it needs a crap ton of compute. However, I can say that it was demoed in front of me. It was very responsive. You put a prompt in, it thinks for about three seconds, and then you're just in, and it just works. They also mentioned some cool stuff about how, you know, Genie can be used to train agents, as we said, but the agents themselves could be used to better train Genie 3, creating this virtuous cycle of iterative improvement. If you're in a world walking around and say you go to cross the street, you sort of check the cues of the drivers, for example. Maybe there's not a crosswalk and you need to know when to stop. You can see that they're slowing down. So that's when you would go. And the other agents should be simulated in that fashion.
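To make the "promptable world events" idea a bit more tangible, here is a hedged sketch of how rare-event coverage for, say, a driving simulator might be scripted on top of a world model. Genie 3 has no public API, so the `WorldModelClient` class, its methods, and the event list below are all invented for illustration; a hand-written event list is also exactly the "turtles all the way down" problem raised above.

```python
import random

class WorldModelClient:
    """Hypothetical stand-in for an interactive world-model service."""
    def start_world(self, prompt: str) -> str:
        print(f"[world] {prompt}")
        return "session-0"

    def step(self, session: str, action: str) -> None:
        pass  # a real system would return the next generated frame here

    def inject_event(self, session: str, event: str) -> None:
        print(f"[event] {event}")

# A hand-curated list of rare events: the enumeration problem discussed above.
RARE_EVENTS = [
    "a deer runs across the road",
    "a pedestrian steps out from between parked cars",
    "sudden heavy fog rolls in",
]

def rare_event_rollout(client: WorldModelClient, steps: int = 200, p_event: float = 0.02) -> None:
    """One simulated drive with rare events injected at random timesteps."""
    session = client.start_world("dash-cam view, driving through a small town at dusk")
    for _ in range(steps):
        client.step(session, action="follow lane")
        if random.random() < p_event:
            client.inject_event(session, random.choice(RARE_EVENTS))

rare_event_rollout(WorldModelClient())
```

The value claimed in the episode is the middle line of the loop: rare situations that might take millions of real miles to encounter can be injected on demand, at whatever rate the training curriculum needs.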
Genie 3 and other similar models would be impossible without at least some human feedback in the training loop or the data curation or the evaluation. Prolific is a human data platform and they are sponsoring this video today. My name is Enzo. I work at Prolific. I'm the VP of data and AI. I support everything from AI, data, research and the like. For those unfamiliar, Prolific is a human data platform for working with everything from academic researchers, but also small and large players in the AI industry. Visit prolific.com. Yes, this is a demo where they've got "Genie 3 memory test" on a blackboard. You see there's an apple and a cup. And then you kind of go out, you look out the window, you see there are a few cars. And the purpose of this test is to say they've got such a long context window, you know, a similar concept to a large language model, that it still remembers all of the things that it generated. You know, even if it's minutes ago, we've got the blackboard over here. We look up and there it is. It remembered it. Genie 3 memory test. I've also noticed that this model is even better than Veo 3 at things like text. It's surprising, because you would think that they would have to dumb down the model to make it interactive and to make it this sophisticated, but even as a video generation model it seems almost better than Veo 3 for doing a whole bunch of stuff. All right. So, hi, I'm Shlomi Fruchter. I'm a research director at Google DeepMind, a co-lead of Veo, and I've basically been working at Google for about 11 years, recently on diffusion models in various modalities, image and video, and we'll tell you more about what we're working on right now. Hey, I'm Jack Parker-Holder. I'm a research scientist at Google DeepMind in the Open-Endedness team, originally working on open-ended learning and open-endedness, and more recently working on world models. We are here at Google DeepMind in London, and you guys have just demoed to me something which I think I'm more impressed with than anything I've seen probably ever before. I think it's a paradigm-changing moment. Shlomi, can you tell us a little bit about this new version of Genie? Sure. So Genie 3 is our most capable world model. And by a world model, what we mean is basically a model that is able to predict how an environment would evolve and also how different actions of an agent would affect this environment. So with Genie 3, we're able to basically push the capabilities of a world model to a new frontier. That means higher resolution, much longer horizon, and better consistency, and all that in real time,
basically allowing whoever interacts with the system, whether it's an agent or a person, to walk around it, navigate it, and affect it while the generation happens in real time. Genie 3 is just ridiculous, right? It's just on a completely different level. But maybe we should just contextualize that around Genie 2. So what was Genie 2? That's a great question. So Genie 2 was sort of the culmination of two years of research in what was quite a new area, which is foundation world models, as we called it at the time. So essentially, in the past, world models had modeled a single environment. The canonical World Models paper in 2018 from David Ha and Jürgen Schmidhuber modeled the car racing environment, which is a MuJoCo environment, and it could just model that one environment. It could predict the next states given any actions in that one world. We've seen, with the Dreamer series, also from Google DeepMind, Danijar Hafner, with Atari games and other kinds of environments as well. But no one had ever done something that could create new worlds. So with Genie 1, the real novelty there was that we had a model that for the first time could be prompted to create completely new worlds that didn't previously exist. But that being said, they were fairly rudimentary. They were low resolution, you could only play with it for a couple of seconds, so agents couldn't really learn the long-horizon behaviors that we wanted, the diversity was still fairly constrained, and it required some form of image prompting. With Genie 2, we really pushed that to the next level. So we trained it on a much larger distribution of 3D environments. We moved to 360p from, I think, 90p before. So it was more close to what we see now, but it was still sort of scratching the surface, because we didn't really know that this approach could scale the way we've seen other methods have. So we wanted to really test this from a research standpoint. But then I think for this year, we wanted to really take that to the next level, and that's what we think we've done. Yes, and it's now 720p. It's interactive. So Genie 2 wasn't interactive. It wasn't fast enough. And you know, Steve Jobs said there's something magic about the touchscreen, right? There's something magic about it. And of course, the magic happens when it's interactive. And some of the demos that you showed me were just insane, right? So photorealistic. I mean, it's kind of like a fusion with Veo, I suppose, in that you can now understand the real world and you can build essentially a foundation model for the real world, which is interactive. That's mind-blowing. And just tell me about some of the examples you showed. Yeah, so I think what you said about Veo, or more generally about video models, is right. There is a way we can think about them as somewhat of a world model, but it doesn't allow us to actually navigate or interact with it completely interactively. And I think that's one of the limitations of video models that with Genie 3 we're trying to address. And basically, in the examples that you've seen, because Genie 3 generates the experience and what we see frame by frame, it lets the user or the agent that is using it basically control where it wants to go with very low latency. That allows basically exploring the environment and creating new trajectories that are not predefined like video models.
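Shlomi's definition above (predict how the environment evolves and how an agent's actions affect it) maps onto a very small interface. The sketch below is an illustrative minimum, not Genie's API: a world model is anything that maps the history so far plus an action to the next observation, and an interactive session is just a loop over that function. The `ToyWorldModel` dynamics are fake, there only so the loop runs.

```python
from typing import List, Protocol
import numpy as np

class WorldModel(Protocol):
    """Minimal interface: history of frames + action -> next observation."""
    def next_frame(self, frames: List[np.ndarray], action: int) -> np.ndarray: ...

class ToyWorldModel:
    """Placeholder dynamics so the loop below actually runs."""
    def next_frame(self, frames: List[np.ndarray], action: int) -> np.ndarray:
        drift = np.roll(frames[-1], shift=action, axis=1)  # fake camera pan
        return np.clip(drift + np.random.normal(0, 0.01, drift.shape), 0, 1)

def interactive_session(model: WorldModel, first_frame: np.ndarray,
                        actions: List[int]) -> List[np.ndarray]:
    """The basic real-time loop: a controller picks an action, the model predicts the next frame."""
    frames = [first_frame]
    for a in actions:                      # in Genie 3 this would come from a user or an agent
        frames.append(model.next_frame(frames, a))
    return frames

frames = interactive_session(ToyWorldModel(), np.zeros((72, 128, 3)), actions=[1, 1, -2, 0, 3])
print(len(frames), frames[-1].shape)
```

The contrast with a video model is visible in the signature: here the next frame cannot be generated until the controller has supplied the next action, which is what "interactive" means in the conversation above.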
So in the examples that you've seen, for example, you can see the character or the agent in this video moving around, maybe going back to a place they've already been in before, and everything remains consistent. And I think that's a very remarkable property or capability of the model, the ability to preserve the consistency of the environment along very long trajectories. Yes, and even Genie 2 had some kind of object permanence and consistency, but nowhere near as much as we have now. But we'll come back to that in a second. We can't say too much about the architecture for Genie 3, but in Genie 2, there was an ST transformer, a spatiotemporal transformer, which was conceptually quite similar to a ViT. And there was a latent action model, which means even from non-interactive data you could infer some low-cardinality action space. And then those went into a dynamics model. I think what we can say about the architecture that might be interesting is that, definitely because of the interactive nature of the problem or the setup, the model is autoregressive. What that means is that the model generates frame by frame and has to refer back to everything that happened before, right? So if, for example, we're walking around some auditorium or some other environment, basically, if we revisit a place that we've already been to, the model has to look back and understand that this information has to be consistent with what's happening in the next frame. So I think the interesting point here is that everything here, like the consistency, is emergent. There is nothing explicit. The model doesn't create any explicit 3D representation, unlike other methods like NeRFs and Gaussian splatting. So I think those emergent capabilities are very interesting and surprising for us. Yes, and even Genie 2 had emergent capabilities like parallax, and, you know, it could model certain forms of lighting and so on, but this just blows my mind. Shlomi, you were involved in that Doom simulation last year, and even that just blows my mind. So we all played Doom in 1993, it was one of John Carmack's finest. And now you're saying that, I mean, certainly the work that you folks did last year, you've got a neural network model, which is subsymbolic. So there's no explicit model of the world. You don't know where the doors are. You don't know where the lakes are, where the maps are and so on. You just kind of take a, you know, a sample, a traversal through this space and it just produces the game in pixel space. I mean, that's, yeah. Yeah. You know, this is really, you know, I've been playing games, obviously, including, you know, Doom and others. And I also worked on game engine development at some point very early in my teens. And I think what I really like about this project is that we are now able to run models that actually generate consistent 3D environments, as in, you know, GameNGen and the Doom simulation. And they run on GPUs or TPUs, while in the past we were running, you know, these game engines on the same hardware. So I think it's really something very interesting, and it kind of closed this circle for me. And in particular, in the case of GameNGen, we tried to push on the real-time interactive aspect. So we basically said, okay, would a diffusion model be able to simulate a game environment end-to-end with nothing explicit, no code, nothing except for actually generating the pixels and getting the inputs from the user?
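The point about autoregressive, frame-by-frame generation that has to "look back" at everything generated so far can be illustrated with a toy model of the behaviour rather than the mechanism. In the sketch below, content for a never-visited camera pose is sampled stochastically, but any pose that already appears in the history is reproduced from it, which is the observable effect being described. In Genie 3 itself there is no explicit cache or 3D store; the consistency is learned, not hard-coded, and this example is only an assumption-laden analogy.

```python
import random
from typing import Dict, Tuple

Pose = Tuple[int, int]  # a toy "camera pose": the grid cell the viewer looks at

class ToyConsistentGenerator:
    """Behavioural toy: stochastic for new poses, consistent for revisited ones."""
    def __init__(self) -> None:
        self.history: Dict[Pose, str] = {}   # stands in for conditioning on all past frames

    def render(self, pose: Pose) -> str:
        if pose in self.history:             # "looking back" at what was already generated
            return self.history[pose]
        content = random.choice(["red car", "oak tree", "market stall", "fountain"])
        self.history[pose] = content
        return content

g = ToyConsistentGenerator()
first_look = g.render((3, 1))     # new pose: content is freely sampled
g.render((4, 1))                  # look away...
second_look = g.render((3, 1))    # ...look back: same content as before
assert first_look == second_look
print("both looks saw:", second_look)
```

This is also the shape of the answer Jack gives below about stochasticity: freedom for things that have never been seen, constraint for things already established in the context.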
And we weren't sure if it was going to work. So I think, with this kind of research, we try and it doesn't work, and then all of a sudden something happens and we see that it does work, and that's a very rewarding moment. I mean, I think in this case, once people saw it, even the reception of that was a bit surprising, because there is something about the real-time interactive capability that really sparks the imagination: oh, I can actually walk into this, maybe generated, environment and actually experience it, right? So I think this was a moment that, later, when I think about it, we were kind of excited about the real-time nature of the simulation, and we really wanted to bring it to higher-quality, more general-purpose simulations. So Jack, I mean, one of the million-dollar questions is, you know, even with a language model, it's stochastically sampled, you know, with this temperature parameter. Same thing here. I mean, with Genie 2, the dynamics model is using this MaskGIT approach, and it was run iteratively. And how do you square the circle between like a stochastic neural network and yet it has consistency, right? So I look over here, I look back, I look there again, the thing is back. Like, isn't it a bit weird that a subsymbolic stochastic model can give us apparently consistent, like solid maps of the world? That's a really good question. I think it's probably similar to language models, in that there are some fundamental things about the world that you want to remain consistent. So with a language model, I think even though, as you said, they can be stochastic models, if there are things that are stated as facts in their context, they will still probably recall them correctly, right? Whereas new things are where they maybe have more degrees of freedom to change things like that. So I'd imagine in a world like a Genie 3-generated world, if you were to move around, then maybe new things would have some degree of stochasticity to them, right? But then once they've been seen once, they should be consistent from that point forward, because the model knows when to use this stochasticity. And this is kind of an emergent property from the scale that we train at. Yes, and we'll save the emergent discussion. I was just telling the guys about my conversation with David Krakauer the other day, but maybe we won't go there. But the other really interesting thing is, so, you know, you said David Ha, you know, 2018 with Schmidhuber, the World Models thing.
And Shlomi, in the presentation, you defined a world model as essentially being able to simulate the dynamics of something, right? If a world model simulates the dynamics of a system, how could you, for example, measure that? So I think it's very hard to exactly measure the quality of world models in general. And I think when it comes, especially, to visual generation, whether it's image models or generative models in general, it's very difficult to measure their quality because some of it is very subjective, right? So I think for LLMs, actually, we're in a better place, because we can measure their performance. First, of course, there is perplexity, just the next-token prediction problem. But later on, we actually care about how they operate for the tasks that we care about, right? So we measure, for example, downstream performance on various tasks. But when it comes to world models, today we focus mostly on the visual aspect, right? So it's important to highlight that the world is more than just visuals, right? But again, for Genie 3, we're focusing more on that, because a lot is captured in the visual interaction of the world. So measuring how well a model is doing really depends on the context and also on how we want to use it later. I think that's something we have to keep in mind when we evaluate these models. So we have in mind one particular application that we think is really key, and that's to be able to actually train and let AI agents interact with simulation environments. And that's something that, you know, I'm coming more from this kind of simulation background, not so much from training agents in simulation environments. That wasn't my original background. But, you know, through the interaction with other people in DeepMind who have been exploring that for a long time, over the last few years I came more and more to realize how much potential there is in that. Because if we really think about it, AI would be limited by the ability to perform physical experiments, right? Because imagine that you want to develop a new drug or a new treatment. You cannot really do it in the real world if it takes months for every step of the way, right? And the same, we can think about, you know, if we want to learn how to assemble something, then again, if I have to train the robot in the real world, it might take very long. So that's why the simulation of the real world is really key. And that's what we hope we kind of push a bit further with Genie 3. Yes, very exciting. I spoke to a startup recently and they sketched out this future where we'll have essentially a model platform where people doing robotics can download the policies. You know, so I'm in a factory and I need a policy for doing this particular thing. But of course, they imagined that it's so scarce, it's so difficult to get real-world data, that there would be a marketplace and everyone would train their own policies and they would sell them to other people on the market. This is a slightly different vision.
You're saying that now we have a world foundation model, and essentially I could say, well, in this situation, I need to have a robot policy for doing this particular thing, so I can just spin off a job, I can create the policy, and away we go. So is that roughly correct? I think that is kind of the vision that we have. So I think in robotics in particular, there's a lot of focus on deploying robots in somewhat constrained settings, right? So it might be, for example, in someone's apartment that's very staged, right? Almost as staged as a podcast recording, you know, with all this support staff watching around this robot achieve one goal, right? And from a control perspective, it might be very impressive. But in terms of the stochasticity of the world that it's in, it's very limited, right? And if we look at simulation environments, they might accurately model physics, but they definitely don't model things like weather or other agents or animals, these kinds of things, right? Whereas a model like Genie 3, because it has world knowledge, that world knowledge extends beyond physics, actually, to also the behavior of other agents. And as we showed you in that example at the beginning, with the world events that we can also inject, right? You can actually prompt to have, you know, another agent cross in front of you, or, you know, we had a herd of deer run down the ski slope or something like that. And I think these are the kinds of things that matter for robots to be deployed at large scale in the real world, because the real world is fundamentally populated by people and other agents. And this is something that we can gain from training on this general-purpose world model; we just have no other approach, I think, to scalably get this data, in a safe way as well. Right, because safety is a critical element of this, that we can simulate things in a realistic way without having to actually deploy agents in the real world. Yes, and that was a very important detail. So you can put a prompt event in, and you gave me an example where there's a skier going down a slope and then here's a guy with a Gemini T-shirt. And I guess what I'm thinking about here is, if we did train these robot policies, we would need to do probably some kind of curriculum learning and some kind of diversity. So, you know, we would start off with a simple environment, and then we'd add the guy with the Gemini T-shirt, and then there'd be a car coming along, and maybe in reality there would be some kind of meta process, you know, creating some gradient of complexity and, you know, diversifying environments. I love that Ken Stanley paper, you know, the POET paper, doing something like that. But is that a fairly reasonable intuition? So I think it's still early to say exactly how world models like Genie 3 will actually be used for AI research. I think we can only kind of directionally say. In general, we see it also in other generative models that there are some capabilities that we actually discover, right? And we don't necessarily know that they're there. And then through the interaction and development, we're actually seeing them emerge.
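The curriculum idea the host sketches here (start simple, then keep injecting new events and harder environments) can be written down as a small training loop. Everything below is hypothetical: there is no public Genie 3 training API, the `generate_world`, `inject_event`, and `PolicyStub` pieces are invented stand-ins, and a real pipeline would replace the pretend success score with an actual RL or imitation-learning update.

```python
import random

PROMPTS = ["empty ski slope", "ski slope with other skiers", "ski slope in a snowstorm"]
EVENTS = ["a skier in a Gemini T-shirt crosses your path", "a herd of deer runs down the slope"]

def generate_world(prompt: str) -> dict:
    # Stand-in for asking a world model for a fresh environment from text.
    return {"prompt": prompt, "difficulty": PROMPTS.index(prompt)}

def inject_event(world: dict, event: str) -> None:
    # Stand-in for a promptable world event.
    world.setdefault("events", []).append(event)

class PolicyStub:
    """Pretend policy: just tracks how many episodes it has 'trained' on."""
    def __init__(self) -> None:
        self.episodes = 0
    def update(self, world: dict) -> float:
        self.episodes += 1
        # Pretend success improves with experience and drops when events appear.
        return min(1.0, 0.2 + 0.03 * self.episodes - 0.05 * len(world.get("events", [])))

policy, level = PolicyStub(), 0
for episode in range(30):
    world = generate_world(PROMPTS[level])
    if random.random() < 0.3:                       # occasionally prompt a rare event
        inject_event(world, random.choice(EVENTS))
    success = policy.update(world)
    if success > 0.8 and level < len(PROMPTS) - 1:  # crude curriculum: harder worlds once reliable
        level += 1
print("final curriculum level:", level, "after", policy.episodes, "episodes")
```

The "meta process creating a gradient of complexity" the host alludes to is the last `if` statement; POET-style open-ended methods would replace that hand-written rule with something that invents and selects the environments itself.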
For example, you know, with Veo, we just recently, like a few days ago, shared that you can write some text on a photo and provide it to Veo, and it just reads the text and also follows the spatial instructions, right? And I think that's, for example, something that we didn't necessarily explicitly train the model to do, but it's capable of doing. And I think here as well, with the capabilities of Genie 3 that we're exploring, we're still discovering new things. And that's something that we hope, first by having more testers, external testers that we've already previewed the model to, who give us feedback, that through this kind of engagement with the community we can better see how those models will be useful. And that's something that I expect to take some time as we basically try and understand the best applications. You know, I'm a huge fan of open-endedness, for example. And certainly at the moment when we prompt models, if we're quite generic in what we put in the prompt, then we tend to get quite simplistic answers. So a lot of people doing computer graphics, when they prompt image models, they have so much specificity, and they deliberately take it onto the tail of the distribution so they get something that's novel and interesting and so on. And the real world just always produces a sequence of artefacts which are novel and interesting. You know, like you get random NPCs walking onto the screen and cars going by and so on. And is my intuition correct that at the moment, as good as it is with Genie 3, you tend to get quite a specific scene and you don't have random planes flying over and, you know, just random things happening? Yes, that's a really good intuition, right? So it definitely is the case that the model is very aligned with the text prompts that it's given. So therefore, there is a lot of emphasis placed on the quality of the text prompt to describe the scene. But I actually wouldn't see that as a limitation. I would see it as a strength. So firstly, it means that there's actually a lot of human skill still involved to create really cool worlds. And you see some of the examples we showed you. We have some very talented people that can do amazing things with these models. And there is actually a lot of value add there to do that. So it actually is a tool that can really amplify already creative humans in new ways. I'm definitely not the best at doing this, right? And I can tell you that it is really impressive when someone is able to do that. But on the flip side, from the agent perspective as well, right? So when we're talking about designing environments for agents, and you reference POET, which was, for me, like POET and World Models were the two papers that I just thought were eventually on a collision course, right? And that's basically when I started my research career. And I think POET was fundamentally limited because of the environment encoding being an eight-dimensional vector, but also the fact that there was not really any notion of interestingness as well, right? And in your recent interview with Jeff, he's obviously talked about how this problem is largely now solved with foundation models, right? So these foundation models can not only define what's interesting based on standing on the shoulders of human knowledge, right? But they can also steer the generation of worlds in things like OMNI-EPIC to do this.
And in that case, it's done through code. But here we have text as a substrate as well. So in theory, these kinds of open-ended algorithms that use language could actually be quite strong places to have these kinds of notions of interestingness, and agents could steer tasks through that space as well. Yeah, I think this is the fundamental thing, because certainly with creative models at the moment, weirdly, counterintuitively, you need more skill to make them do something interesting than you did before. Like the average creative process now for someone designing a thumbnail on YouTube is that they mix things together: they might use the contact model, they might use an upscaler, they might then use another image generator model. You get this huge compositional tree of operations, and it's very, very highly skilled, because a lot of the structure for constraining the generation of these models still comes from our own abstract understanding of the world. And this is kind of what Kenneth Stanley was saying, in a sense. He was saying that we have this understanding of the world which is constrained by things like symmetry and, you know, various different rules, and we then hint to the model, we constrain the model in the prompt using those things. Would the models ever be able to do that without the humans needing to prompt them? So I think what's interesting is that eventually what we find, like what humans find interesting and worth, you know, maybe watching or investigating or researching, is eventually being defined by people. And I think in the case, for example, of video, of video generation, we see that people go and find ways that maybe we weren't expecting; they use the tool that we put in front of them to generate new things. So, for example, we have the ASMR videos of people cutting, you know, fruits made of glass, right? Which is not something you can do in the real world. And the novelty comes from the prompt, basically. I think that's what you're alluding to. So I think in this case we're still in a similar place, I would say, when we think about world models, because you have to provide the description of the world that you want to maybe walk into and experience. But some elements would kind of emerge from, and will be inferred from, the prompt that you provide. Right? So you can maybe write a very short prompt, but still the world will have much more richness. So I think there's a question of where does this richness come from? And I think we're at different levels of the ability of models to bring this richness into your experience. But I think over time we see that it becomes higher and higher, and a little information provided by users can actually generate very rich videos or experiences.
So I would say it's a bit of an evolving answer. Over time I expect that there will be more inputs to the model, or you can think about it like the person is providing a seed, and from that seed we can maybe generate more elaborate descriptions and finally an experience. So I don't think about it as a one-step process, but more as a series of creative steps, each one of which can be done by a person or by an AI model, and together they generate maybe something new. And that's what we're seeing play out on Twitter, because the creative process is like, you know, generate, discriminate, generate, discriminate, and we memetically share all of the prompts that work, and that's why we've just created this beautiful phylogeny of creative artefacts that are exploring the space of these models, which is beautiful. And I'm thinking about the future. I mean, I know you probably can't speculate about this, but this could be the next YouTube. It could be a new form of virtual reality. You know, in philosophy, there's this thing called the experience machine, where you plug yourself into this better-than-life matrix simulation, and no one wants to leave the experience machine because it's better than real life. But we could co-create something like that, right? It could be on a phone or a virtual headset, and we could create these worlds and portals between the worlds, and it would just be a never-ending simulation? Yeah, so that's a great question. So, I mean, going back a few steps, I think another really inspiring sort of thought experiment in this space, before the generative models really became capable, was something like Picbreeder, right? And so in that case, it was a very simple idea, right? It was just, you know, evolving some images, basically. And some quite surprisingly creative things emerged from that experiment that I don't think many people would have expected, right? So you had these beautiful, beautiful images basically emerging, to use the word emerging again, from just evolving user preferences over time. Right. And we definitely see modern analogies of this, like you described, with, you know, social media platforms sharing prompts and people generating ideas. And then it emerges in different ways or goes in different ways, like the Veo ones with people generating stand-up, for example. And then suddenly there's tons of exciting content in that space. And I think that it's definitely fair to say that what we've done with Genie 3 is create another form, another platform or type of model, where this kind of creativity could happen. And it could also lead to some unexpected, exciting things. But I don't think we can speculate too much at this point exactly what those will be, other than to say that it should be interesting and humans will likely do cool things with it.
Yes, I was discussing with Kenneth the other day whether, because he's a big fan of, you know, neuroevolution, and I think he's leaning towards, you know, creating an algorithm that represents evolution in and of itself as being the way to explore interesting phylogenies. And for me, Picbreeder was like a kind of supervised human imitation learning, so it was almost like a reflection of the constraints and the cognition that we have. And I lean externalist a little bit, so I think that a lot of semantics is about this embodied physical interaction with the world, and that, you know, just via osmosis perhaps gets represented in our brains. But do you have a position on that? You know, do you think that just pure neural networks simulating the world could understand the world in the same way? So maybe first, about the immersion, or potentially using, you know, these kinds of models for actually being immersed in them: I think we're still very far. I said before that I think the visual aspects are pretty much primary, right? We're generating pixels and, you know, with Veo 3 we also added audio. But our embodied existence is so much more than that. And I think sometimes that gets lost, right? Because eventually, as people, we feel a lot. We walk around. We have other senses. We have this sense of where I am right now. And of course, the physical interaction, which is also applicable to robots, right? So there's still a large gap between where we are right now and building, you know, a real full simulation of the world that can actually provide all of the information to an embodied agent. So I think there is definitely a gap there that is interesting, but it does show that we're still very far, you know, in that regard. But I think, as Jack said, basically building those kinds of experiences, we do see people try to build experiences together and kind of explore worlds together, and I think that's a very interesting direction for us. Yes, so many things to talk about there. I suppose one important step is this multi-agent simulation thing. Quite a few people have spoken about this, certainly David Krakauer. He said that a lot of, you know, emergent intelligence is about coarse-graining, when you have these systems that can, you know, through a variety of tricks, accumulate information over time. So, you know, eventually, we developed a nervous system and culture and language, and that allowed us to accumulate information, sort of transgressing the hardware, the DNA evolution speed. So it's evolution at light speed. And Max Bennett spoke about that in his book, A Brief History of Intelligence: how, you know, a lot of the evolution of the brain and culture was about the propagation of information without needing to have direct physical experience, so we can implicitly share simulations with each other. So, you know, when we start to build these multi-agent simulations, do you think that similar things might emerge, where, you know, almost irrespective of the lifespan of an individual agent, the system could accumulate information and develop forms of agency and dynamics that simpler systems couldn't? That's a really good question. So I think the way I would see it, from the standpoint of Genie 3 where it is right now, is that it's a multi-agent world, but it's only controllable in a single-agent setting, right?
So a lot of the multi-agent-ness of the world is kind of baked into the simulation around you. They're almost like additional characters in the world rather than being controllable agents. You can control them if you wanted to with the world events, right? So you could actually control what the other agents are doing. But otherwise it's always kind of implicit in the weights. And what you see is that there is some sort of natural behavior to them. So if you walk through a crowd, people will move out of the way, for instance. Or if you create a driving world, then when you drive around, the other cars move in a sensible fashion. And to go back to your actual question, I think you're saying almost that the system can bootstrap from itself and learn across the different agents in the system. I think the way I would see it right now is more that the model's sort of knowledge of human behaviors can distill into the egocentric agent, and that's actually something quite powerful that we haven't really got with any other simulation tool, right? Because if the other agents are moving around sort of in the way that we do, then I think it might even be a way of our embodied agents learning things like theory of mind. Because they know, for instance, that if you're in a world walking around and say you go to cross the street, you sort of check the cues of the drivers, for example. Maybe there's not a crosswalk and you need to know when to stop. You can see that they're slowing down, so that's when you would go. And the other agents should be simulated in that fashion. So actually you can learn these kinds of cues that you can't really learn any other way, other than being deployed in the real world, and that obviously has safety risks and probably wouldn't be an advisable thing to do with an agent that's learning from its own experience. So what we think with this kind of model is that agents can really learn these sorts of social cues, things like theory of mind, how to operate around humans and other agents. But it's not the case that the model itself is then learning back from the agent that's collecting experience. That might be a future step, but not something we've really considered in this work yet. Yeah, it's fascinating. I mean, Shlomi, what do you think about that? I mean, certainly we use tools, you know, we have sextants and GPSs and computers and calculators and all these different things. And, I mean, do you think about the locus of intelligence being in our brains? Or do you think, if we built rich multi-agent systems, or maybe even if we look at humans and LLMs now, where do you think the locus of intelligence in that system is? So I think there are different types of intelligence, eventually.
Yeah, it's fascinating. I mean, Shlomi, what do you think about that? Certainly we use tools — we have sextants and GPS units and computers and calculators and all these different things. Do you think of the locus of intelligence as being in our brains? Or, if we built rich multi-agent systems, or even if we just look at humans and LLMs now, where do you think the locus of intelligence in that system sits?

So I think there are different types of intelligence, eventually. As we make progress towards understanding intelligence and building intelligence, we end up initially building separate models that accomplish different tasks along different dimensions of intelligence. As I said before, if you really think about it, generating and simulating a world is not necessarily something a person can do. Some people say, okay, we have a world model, but we definitely don't have the same world model, or the ability, of Veo or Genie 3: if you tell me a sequence of events, I can't output pixels. I can maybe imagine, at a lower level of detail, what would happen if, for example, you got up, or if something happened in the environment, and I can plan accordingly. So it's not completely parallel; we can't just say those models are exactly how we operate. But what we do see is that some capabilities arrived that we wouldn't have expected: a few years ago, if you had come to me and said we'd be able to generate videos from text, I would have said it doesn't make sense to me, I don't think that's going to happen in a few years. But it did happen. And other things that people thought would happen way before, like self-driving cars, made slower progress — we're much further along now, but it didn't happen as fast as people thought. So different types of intelligence made progress in different ways, and what I'm really interested in is seeing how those types of intelligence can work together. For example, if we have a model that can simulate the world at a different level than was possible before, and we have other models, for example Gemini, that can reason about the world in a different, maybe less visual way — when we bring them together, what happens? That's what the examples we've demonstrated of the SIMA agent interacting with Genie 3 show: those are two separate models, trained completely separately, but when they're put together they can accomplish something new. So I'm really excited about that.
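As an illustration of that kind of composition — two separately trained models, one simulating the world and one reasoning about it, wired together only at inference time — here is a minimal, hypothetical sketch. `WorldSimulator`, `Reasoner`, and their methods are invented placeholders, not the real SIMA, Genie 3, or Gemini interfaces.

```python
# Hypothetical sketch of composing two separately trained models at inference time:
# a world simulator that produces the next observation, and a reasoner that chooses
# the next instruction from a description of that observation. Invented interfaces only.

class WorldSimulator:
    """Stand-in for a generative, interactive world model."""
    def start(self, prompt: str) -> str:
        return f"frame 0: {prompt}"
    def advance(self, frame: str, instruction: str) -> str:
        return f"{frame} -> '{instruction}'"

class Reasoner:
    """Stand-in for a separately trained, less visual reasoning model."""
    def next_instruction(self, goal: str, frame_description: str) -> str:
        # A real reasoner would plan from the observation; this placeholder
        # just restates the goal as the next step.
        return f"take one step towards: {goal}"

sim, reasoner = WorldSimulator(), Reasoner()
goal = "find the red door and open it"
frame = sim.start("a stone corridor with several coloured doors")
for _ in range(3):
    instruction = reasoner.next_instruction(goal, frame)   # the reasoning model decides
    frame = sim.advance(frame, instruction)                # the world model simulates the result
print(frame)
```

Neither model needs to know how the other was trained; the loop only requires a shared interface of observations and instructions, which is the appeal of composing specialised models rather than building one monolith.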
Yeah, that's amazing. There's also this notion that, at 720p, Genie 3 was creating these immersive experiences — and I use the word immersive intentionally, because as a video editor I know it's a bit of an illusion. You're trying to create a creative artifact that is just beyond the predictive horizon of the consumer, and then they suspend disbelief. So in a sense we are cognitively bounded as observers: we see the world macroscopically, we see chairs, we don't see particles, and the world can have descriptions at different levels. When you interact with Genie, do you see it traversing those levels? If you zoom in on something, does it have a different description, or is it limited in some way? How do you think about that?

So in some of the examples we showed, there's one where you're controlling a drone by a lake, with some trees and very beautiful scenery. You do notice in that one that when you focus your view on different areas, it definitely hones in on detail more. So I think the model learns that sometimes you don't need all of this detail, and it should focus its effort where the agent's focus is. I think this comes partly from our emphasis with this model on being agent-centric — often egocentric, although you can also do third person — a model that really feels like it's your view of the world. That's in contrast to Veo videos, which are much more cinematic in quality: the whole video is very high quality, whereas Genie 3 often feels much more like your own personal view of the world, which I think is quite a different experience, and it does have these different levels of detail to it. In terms of more abstract representations, I think we're still exploring that, to be honest, but it definitely has a slightly different feel, especially with the first-person view you often get when you're experiencing it.

How do you think about that? It's so difficult for us to know how these inscrutable models work, but is your intuition that it's simulating the world at multiple levels of resolution?

So, yeah, it's a really interesting question and way of thinking about it, because when I first saw video models simulating, for example, fluid dynamics and other aspects of reality, I wondered how it was even possible to do that with so little time and compute compared to actually running the entire simulation. So that's a first surprising aspect of these models. But it does come with limitations. What we basically see is that the models somehow find ways to simulate, as you said, in a way that looks good, looks reasonably realistic. We see it with video models, and as they get better, those approximations become even better. And maybe that's a good opportunity to think about the difference when we simulate the environment in an interactive way, because it becomes much harder. If, for example, you want a video of someone spilling water on some surface, a video model can generate the entire video end to end — past and future can be modified at the same time — and eventually you get a video that maybe looks real. But with Genie 3, because it's an interactive model, the user or the agent that controls it can decide to intervene.
They can maybe look from a different angle, and we have to create the entire simulation frame by frame, in a causal way, and that makes the problem much harder for the model. Basically, it cannot change the past: once the past has happened, you cannot change it, just like in the real world. And I think that's where we hope to see better physical simulation, but it also makes it more challenging. As for your question about different levels of reality — would it work if I just zoomed in and looked at the molecules? — I think it highlights the amount of computation that actually happens in the real world. If we had to simulate it completely, that would probably be impossible, but these models find ways to approximate it to a degree that looks reasonable to the observer, which is us, basically.

Yeah. So another interesting thing is that we do this thing called thinking, and we know that neural networks are roughly computationally limited: they can be trained to do a certain amount of computation in a certain amount of time. That means they can do lots of things, but there might be certain tasks — for example, if you simulated someone solving a Rubik's cube — where, for whatever reason, the model just doesn't have enough computation to do that thing. So would there be an opportunity to create a variable-computation version, where for certain kinds of things it could think more?

Yeah, that's a really interesting question. Some folks on the team were also talking about this. For example, if in the future you want to be able to even write code inside the model, at a certain point maybe that requires different approaches. We already have models that can write very good code, quite widely available now, and models that can win a gold medal at the IMO, for example. Maybe eventually you want to be able to do that inside the simulation, because the next level might be to develop embodied agents that can blend these two kinds of tasks: physical tasks and thinking-based tasks. So at a certain point I think we'll probably need to cross that gap. But for now we're focused much more on visual quality and physical simulation rather than the math-and-code type problems, which typically suit thinking-style models. I definitely think it's an interesting question, though. It's also something where the model clearly has this physical knowledge in it, but I don't know if the model itself could describe it; it probably just has it implicitly in the weights. So another agent could probably learn about the physical world from the model, but the model doesn't necessarily know it and can't tell you it — it just implicitly has it in the weights somewhere. And there's an interesting duality there, which goes back to the agent-environment framing, which I think is much more our belief: right now it's a nice setup to have models that focus on different strengths — simulating the future versus thinking about and understanding the present.
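Stepping back to the contrast Shlomi drew between offline video generation and interactive generation, here is a schematic sketch of the difference (not Genie 3's actual algorithm): an offline generator may keep revising every frame jointly until it finishes, whereas an interactive generator must commit each frame before the next action arrives, so the past is frozen. `denoise_jointly` and `predict_next_frame` are hypothetical placeholders.

```python
# Schematic contrast: offline (non-causal) clip generation vs. interactive,
# frame-by-frame causal generation. The model calls below are toy placeholders.

def denoise_jointly(prompt: str, frames: list) -> list:
    # Toy stand-in: in offline generation, every frame can still be revised together.
    return [f"{prompt}|{f}" for f in frames]

def predict_next_frame(prompt: str, history: list, action: str) -> str:
    # Toy stand-in: the next frame depends only on the frozen history and the latest action.
    return f"{prompt}|t={len(history)}|action={action}"

def generate_video_offline(prompt: str, num_frames: int) -> list:
    frames = [f"noise_{i}" for i in range(num_frames)]
    return denoise_jointly(prompt, frames)          # past and future produced as one block

def generate_interactively(prompt: str, get_action, num_frames: int) -> list:
    history = []
    for _ in range(num_frames):
        action = get_action()                       # a user or agent can intervene at any step
        history.append(predict_next_frame(prompt, history, action))
        # once appended, earlier frames are never revisited: the past cannot change
    return history

clip = generate_video_offline("spilling water on a table", 4)
rollout = generate_interactively("spilling water on a table", lambda: "look_left", 4)
print(clip[-1], rollout[-1])
```

The interactive version is the harder problem described above: each committed frame becomes an irrevocable constraint on everything generated afterwards.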
Yes — what's your philosophy on that, Shlomi? Because in a way you've built something that is even higher resolution than a language model, so in principle all the things a language model could do could, as you were just saying, Jack, kind of emerge from a model like this. Is your philosophy to build one massive model that does everything?

I typically think about this from a more practical point of view. There is definitely a purist approach that says we should have just one model to do everything, but a lot of the challenges in modern machine learning come from actually building these things: there's a lot of engineering and software and hardware design involved in training them and running inference. When we try to design those systems there are many constraints, and those constraints impose ways in which we have to prioritize what we want the model to do. Especially for Genie 3, when we bring in the real-time capability: real-time means we have to generate frames very fast, multiple times per second, for the person or agent interacting with it to feel that they can move around and feel the responsiveness of the model. That sets constraints on how much capacity we actually have. So, to your question — can we have one model that encompasses all the aspects of intelligence we discussed before? — I think it boils down to the set of requirements we have. If we don't care about real-time interaction, maybe we can; likewise if we don't care about how expensive it is to run. But ultimately we're trying to build models that don't end up as a purely theoretical exercise. We hope to actually bring them, like our other models, to people to use and to advance real applications, and that's where we have to make those decisions. Ultimately we pick the capabilities we want to emphasize.

Very cool. And a 20-second answer, Jack: is there a sim-to-real gap?

Well, it depends how you define it. Sim-to-real is actually a bit of a conflated term; what people currently do is more sim-to-lab. I think sim-to-really-real can only be achieved with a photorealistic world simulation tool like Genie 3.

So you think this is actually a big step in that direction?

I think it's the only way to solve it — to actually get into the real world where there are people and other agents moving around, rather than just a very constrained lab-like setting, which has real-world physics but nothing else that's real.

Amazing. Guys, this has been an absolute honor. Thank you so much for coming on. And for folks at home, if you're developing on Unreal Engine, it might be time to, you know... yeah. Anyway, cheers.