TWIML AI Podcast

Genie 3: A New Frontier for World Models with Jack Parker-Holder and Shlomi Fruchter - #743

Tuesday, August 19, 2025 · 1h 1m

What You'll Learn

  • Genie 3 represents a 100x improvement across multiple dimensions like generation quality, resolution, duration, and speed compared to previous versions.
  • World models aim to simulate environments and dynamics to enable planning and policy learning, evolving from early work in reinforcement learning.
  • The researchers see potential in applying world model techniques to visual domains like text-to-image and video generation, beyond just reinforcement learning environments.
  • The definition of world models has expanded from explicitly modeling Markov Decision Processes to more broadly simulating the dynamics of a world that can be acted upon and reasoned about.
  • The researchers' backgrounds include work on projects like Google Duplex, which explored building conversational AI agents, and their interest in the visual domain and simulation.

AI Summary

This episode discusses the Genie 3 world model, a recent advancement in the field of AI world models. The researchers, Jack Parker-Holder and Shlomi Fruchter, explain how Genie 3 has significantly improved in terms of generation quality, resolution, duration, and speed compared to previous versions. They also discuss the broader concept of world models, which aim to simulate environments and dynamics to enable planning and policy learning. The conversation covers the evolution of world models, from early work in reinforcement learning to the potential of applying similar techniques to visual domains like text-to-image and video generation.

Key Points

  1. Genie 3 represents a 100x improvement across multiple dimensions like generation quality, resolution, duration, and speed compared to previous versions.
  2. World models aim to simulate environments and dynamics to enable planning and policy learning, evolving from early work in reinforcement learning.
  3. The researchers see potential in applying world model techniques to visual domains like text-to-image and video generation, beyond just reinforcement learning environments.
  4. The definition of world models has expanded from explicitly modeling Markov Decision Processes to more broadly simulating the dynamics of a world that can be acted upon and reasoned about.
  5. The researchers' backgrounds include work on projects like Google Duplex, which explored building conversational AI agents, and their interest in the visual domain and simulation.

Topics Discussed

#World models · #Genie 3 · #Reinforcement learning · #Text-to-image · #Video generation

Frequently Asked Questions

What is "Genie 3: A New Frontier for World Models with Jack Parker-Holder and Shlomi Fruchter - #743" about?

This episode discusses the Genie 3 world model, a recent advancement in the field of AI world models. The researchers, Jack Parker-Holder and Shlomi Fruchter, explain how Genie 3 has significantly improved in terms of generation quality, resolution, duration, and speed compared to previous versions. They also discuss the broader concept of world models, which aim to simulate environments and dynamics to enable planning and policy learning. The conversation covers the evolution of world models, from early work in reinforcement learning to the potential of applying similar techniques to visual domains like text-to-image and video generation.

What topics are discussed in this episode?

This episode covers the following topics: World models, Genie 3, Reinforcement learning, Text-to-image, Video generation.

What is key insight #1 from this episode?

Genie 3 represents a 100x improvement across multiple dimensions like generation quality, resolution, duration, and speed compared to previous versions.

What is key insight #2 from this episode?

World models aim to simulate environments and dynamics to enable planning and policy learning, evolving from early work in reinforcement learning.

What is key insight #3 from this episode?

The researchers see potential in applying world model techniques to visual domains like text-to-image and video generation, beyond just reinforcement learning environments.

What is key insight #4 from this episode?

The definition of world models has expanded from explicitly modeling Markov Decision Processes to more broadly simulating the dynamics of a world that can be acted upon and reasoned about.

Who should listen to this episode?

This episode is recommended for anyone interested in world models, Genie 3, and reinforcement learning, as well as anyone who wants to stay updated on the latest developments in AI and technology.

Episode Description

Today, we're joined by Jack Parker-Holder and Shlomi Fruchter, researchers at Google DeepMind, to discuss the recent release of Genie 3, a model capable of generating “playable” virtual worlds. We dig into the evolution of the Genie project and review the current model’s scaled-up capabilities, including creating real-time, interactive, and high-resolution environments. Jack and Shlomi share their perspectives on what defines a world model, the model's architecture, and key technical challenges and breakthroughs, including Genie 3’s visual memory and ability to handle “promptable world events.” Jack, Shlomi, and Sam share their favorite Genie 3 demos, and discuss its potential as a dynamic training environment for embodied AI agents. Finally, we will explore future directions for Genie research. The complete show notes for this episode can be found at https://twimlai.com/go/743.

Full Transcript

I think in Genie 3, we really tried to push it to the limit across all of the dimensions, right? So we see that we have models that are more capable in terms of the quality of their generation. We see that, like, if you look at the resolution, the duration of the interaction, how fast the next frame can be generated, you get a very significant, kind of like if you multiply all of those dimensions, you get quite like a 100x improvement. All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Shlomi Fruchter and Jack Parker-Holder, researchers at Google DeepMind, to discuss the recent release of the Genie 3 model, which is an impressive world model that we first introduced to you here on the podcast in our conversation with Ashley Edwards almost exactly a year ago. I'm super excited to dig into Genie 3. Jack and Shlomi, welcome to the podcast. Thank you. Thanks for having us. Excited to be here. This is an interesting interview to try to dig into because we covered Genie relatively recently, but I don't want to assume that people have listened to that interview or even know about Genie at all. So we're going to start a little bit from the beginning and dig into the project and where it comes from and why it's exciting. I'd love to have each of you just introduce yourselves before we dig in, though, and share kind of the highlights of your path to ML research and what you're most excited about, you know, what you research, that kind of thing. Jack, why don't you get us started? Awesome. Thanks. So without, you know, being too much of a suck-up, my path was that I was working in finance just around 10 years ago, and I did a master's part-time in the evenings after work and actually was listening to your podcast quite a lot around 2017. I got into ML research at the time, doing evolutionary methods for reinforcement learning, working with some folks from Google Brain as well, because I was in New York and there was an office there. Then I decided to do a PhD where I focused on open-ended learning, still reinforcement learning, and then got a bit into world models. By the time I finished my PhD, I was increasingly convinced that the combination of these ideas would be the really powerful big thing to do. So after my PhD I joined Google DeepMind and worked a little bit on a project called Adaptive Agents, which was in the XLand environment, and after that pretty much started Genie. So I've been doing that for a few years. I'm in the Open-Endedness team, which encompasses some other areas too, but we've been focusing on this idea of using world models as a path to open-endedness as part of the Genie project. Awesome. Awesome. How about you, Shlomi? So I started actually, my first programming kind of experience was in my teen years developing game engines, actually. So 3D engines, and coming more from this kind of world of trying to simulate effects, like lighting effects, liquid effects, et cetera. So I've been working in this space for some time. And I really like this visual domain. Then I joined Google and I've been on the Google Duplex team. So what we've done in the Duplex project was pretty much very different from visual stuff, because it was mostly getting stuff done over the phone. So the Duplex project, if people remember from Google I/O 2018, it was like people thought, okay, wow, we have AGI.
Yeah, it was one of the "is it AGI" moments, but I don't think, you know, probably not. But I think it was definitely the nice thing. And specifically, this was the project where Google was going to call restaurants and hairdressers to make appointments on your behalf, right? Yeah, so when we started Duplex, the goal was actually a question of can we build, already today, a bot that talks over the phone with people without them feeling this is actually a machine, right? And hitting this kind of, like, maybe over-the-phone Turing test, if you want. And what we found is that although having a completely general conversation over the phone was not achievable until very recently with LLMs, already then, and that was an era of RNNs, LSTMs, and definitely not transformers, we were able to develop something that, at least for this particular task, accomplished a lot. So that was kind of my first touch with machine learning. And it was very interesting because it was very research-oriented, but also had the deployment in the real world. We ended up scaling that to hundreds of millions of calls across, overall, something like 15 countries. So basically, quite a lot of it. Fewer people are aware of it because it was mostly calling businesses to update Google Maps. I also, of course, followed things that happened on the LLM side, from GPT and, internally at Google, Meena and LaMDA and other models very early on, and then we kind of integrated this technology into Google Duplex. But at some point, I just felt that maybe my visual roots kind of came into play again. And I felt like the revolution in image diffusion models was really appealing to me. I just felt like this is a huge opportunity. It's really hitting some kind of a point of maturity. Yeah, since then I've been working on video models. And one incarnation of that was GameNGen, which was a bit of a side project, and we were very excited, me and a few friends. Actually, one of them, who was on Duplex with me, was actually the founder of Duplex, Yaniv Leviathan. And together, we basically asked the question of, is it possible to simulate an existing game, but in real time, completely by a neural network, right? And then I've been also working on Veo, Veo 2, and Veo 3. But around the time of GameNGen, I started talking to Jack. I was very impressed with the Genie line of work, and we can talk more about that, of course. One of the things that people are most excited about in looking at Genie is this concept of a world model. Maybe, Jack, you can kind of dig into what that means for you and how you see a world model fitting into kind of the broader trajectory of AI models, transformer-based models, however you think about that. What all is captured in this idea of a world model for you? Sure, yeah, that's a great question, which is, I guess, on one level quite easy to answer, but also could probably be a whole book of different philosophies. So for myself, the definition of a world model has actually slightly changed recently, I think, but what I would have said until about a year ago is that a world model is essentially a model from the reinforcement learning paradigm that models an MDP. So it takes the state and action and it predicts the next state. And this idea has been around for a while. A world model essentially is the model in model-based reinforcement learning. It's modeling the environment, right?
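To make that definition concrete: in the model-based RL framing Jack describes, a world model is just a learned approximation of the transition function, predicting the next state from the current state and action, and a policy can then be trained entirely inside that learned model. The sketch below is a generic, toy illustration of that interface; it is not the Genie architecture, and every name in it is invented for this example.

```python
# Minimal sketch of a world model in the model-based RL sense: a learned
# transition model p(s_{t+1} | s_t, a_t), plus a policy improved purely
# "in imagination". Illustrative only, not the Genie architecture.
from dataclasses import dataclass
import random

@dataclass
class Transition:
    state: tuple
    action: int
    next_state: tuple

class WorldModel:
    """Predicts the next state given (state, action)."""
    def __init__(self):
        self.table = {}  # toy tabular dynamics; in practice a neural network

    def fit(self, transitions):
        # Learn dynamics from offline experience (here: memorization).
        for t in transitions:
            self.table[(t.state, t.action)] = t.next_state

    def step(self, state, action):
        # Fall back to staying put for unseen (state, action) pairs.
        return self.table.get((state, action), state)

def plan_in_model(model, actions, start, goal, episodes=200, horizon=10):
    """Toy policy search done entirely inside the learned model."""
    best = None
    for _ in range(episodes):
        state, plan = start, []
        for _ in range(horizon):  # short imagined rollout
            a = random.choice(actions)
            plan.append(a)
            state = model.step(state, a)
            if state == goal:
                if best is None or len(plan) < len(best):
                    best = plan
                break
    return best
```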
And so in the early 90s, I think, Jürgen Schmidhuber had a paper about, like, recurrent models. And then also the Dyna paper from Rich Sutton came out around that time as well. And that was kind of the starting point for this direction of research, model-based reinforcement learning. And then for me, it was really the Ha and Schmidhuber work in 2018, which was around the time I mentioned where I was getting interested in this field in general, that really resonated. And I was looking at sort of reinforcement learning tasks, right? And they were at the time these MuJoCo tasks, like the little half-cheetah and these kinds of ones. And looking at the World Models paper, it was basically saying that with a few offline examples, we could actually simulate an environment. So predicting the next state sufficiently well that we could then train policies in that model and then transfer them back to a real environment. And the policy had never trained in the real environment. And obviously that's super cool. But the setting that they had was basically that you had data from the real environment. So although it was really cool that you could do this, you could also just train in the real environment and get the same result in theory, right? Since you needed to collect that data. Exactly. Yes. But you could, right? You had the ability to do it. And that was kind of the paradigm that world models were in for a few years: it was showing that we could do this at all for increasingly complex environments where we actually had the environment. So if we didn't want to do model-based reinforcement learning and didn't want to use world models, we could have just used, like, a distributed RL algorithm and probably got really good performance a different way. And there were still benefits to doing the world model approach. Maybe it was more sample-efficient, but it wasn't solving something you couldn't otherwise solve. But then, kind of going a little bit back to Shlomi's comment about these text-to-image models, I mean, I think we were both excited about that at the same time from different angles, right? And for me that was sort of saying, look, if we can do text-to-image this well, and I think it was what, like four years ago maybe, that Imagen came out, then okay, video will happen at some point, and then after video maybe we'll be able to get to world models, right? And we'll be able to simulate anything with large datasets. And so the idea was basically, can we actually take the same concept and apply it to what I would view as a world model, which is simulating an environment, and therefore we can simulate any environments, right? And so that's why we came up with this idea of foundation world models. So it was like a world model, but rather than just being for one single environment, it was any possible environment or new environments, like a foundation model. And so this was kind of a hard line we stuck to, in that a foundation model is this one thing, a foundation world model is one thing: it predicts the next state given actions. And that's the kind of definition that we took. But then increasingly other kinds of models have been considered to be world models recently. So for example, video models, text-to-video. And at first that didn't really fit in with how I saw world models, but actually I think it kind of does, at a different level of abstraction, right?
So I think now I have a slightly broader view of a world model, which is models that simulate the future given the past and some form of actions, and simulate the dynamics of a world, right? So it's not explicitly modeling every transition of an MDP, which is what I would have said before, but maybe just simulating the dynamics of the world in some way that you can act and intervene in it and then get counterfactual information. And that enables things like planning or learning policies in simulation. And Shlomi, did you have the same kind of shift in the way you thought about world models? So yeah, I'm coming more from the, I think, maybe simulation side, or I think more about the visual. I think what we call world models today are very visual-specific, which I think is also one of the limitations of what we call world models at the moment. But if we go back to the, you know, World Models paper that Jack mentioned, again, the definition was pretty clear. And I think it's a good definition to start from and to kind of anchor to, because, again, the term is definitely used in many, many ways. But I think there is also a bit of an intuition to it, which, of course, can be formalized. But at least from my perspective, it's like when I started seeing video models, but also image models: when you write text, right, you provide a text prompt and then you get, you see some kind of, like, it feels like there is a world behind those pixels, right? I think that's, if you want, the intuitive part. Like, for things to really look like a realistic image or video, then the model probably has to have some internal representation of what's going on, how the world behaves, of physics to some extent, right? And I think that's kind of, maybe in layman's terms, the intuition behind the world model. The model probably has some understanding of the world, and it's visual, and I think that's key. But the world model doesn't have to be visual. It doesn't have to be something that generates pixels. And I think that's the broader, yeah. So I think in some of the literature, again, the world model can be in some latent space. It just should be something that we can use to make decisions and predict what's going to happen next, and maybe perform planning, in the RL sense, and basically learn how to operate in a more optimal way in an environment. But again, going back to the visual domain, which is kind of what I think happened: the visual domain really worked well because diffusion models just happened to work really well for images and videos, and audio as well. And then this intersection between those fields kind of became very obvious. And I think that's where we are right now: we just have models that are capable of generating very realistic environments. And we're trying to push along this thing. Jack, you worked on Genie 1 and 2 as well. I'd love to have you talk a little bit about kind of the trajectory of the project. Maybe for a little bit of context, what was most amazing for me about Genie 3 is I remember the conversation with Ashley, and in my mind I made the assumption that this world was real-time playable. And it was like, oh, no, it takes, I don't remember what the specifics were, like 20 minutes between frames in order to do this. It's not real-time playable. And now I'm looking at this thing that in just a year is real-time playable.
But, you know, talk a little bit about, you know, what the big, you know, both functional and research milestones have been between these iterations of the project. Sure, yeah. I definitely think we did try to emphasize it wasn't real time in both of those works. The challenge was, like, if you're announcing a new piece of research, you're not going to say, welcome to our non-real-time breakthrough, because, I guess, that's not the best way to, like, you want to make it sound exciting, you don't want to just open with the limitation. But then if you put the limitation halfway down and you've got a bunch of cool videos, people don't really get that far, right? So yeah, we definitely had this challenge. But I will say it was slightly less than 20 minutes; I think it was only a few seconds. I totally over-rotated on the conservative side there. I like that, because it's better to set the bar a bit lower and under-promise and over-deliver, but I think we did it the other way, maybe. But essentially, like, text-to-image models, I mean, they would need to operate at like a 20th of a second essentially to be at the same kind of speed as Genie 3. So I think it is quite remarkable what we did achieve with that. But going back in time a little bit, so I thoroughly recommend all your listeners to listen to the Ashley Edwards episode as well. Essentially, Genie 1 was quite a different beast to Genie 3. It was really the first kind of proof of concept of this foundation world model idea, right? A model that could generate new worlds, right? So there'd obviously been amazing progress in world models for single domains, starting with the Ha and Schmidhuber work, and then things like Dreamer, Dreamer v2, Dreamer v3, that could model increasingly complex individual environments and show that you could use them for agents to learn amazing behaviors that could solve really complex tasks. And that's really important work in one axis, right? It's like single-domain complexity. Whereas Genie 1 was basically saying, we don't care about the complexity of the environments per se. Can we train a model that can generate new environments and new worlds at all? And the challenge with that was on the data side, because unlike training from existing environments, you don't have action-labeled data from your target environment if you're trying to generate new things. So we basically collected a dataset of unlabeled videos that had no action labels. And we had this kind of neat approach where we learned latent actions. And the really nice thing about that was that it was an example of, and I've been very fortunate with this a few times in my career, bumping into someone who had almost a perfect skill set for the thing I was excited about doing and the things I didn't know about. So Ashley had worked on latent action learning for maybe more than five years, right? She is basically one of the ones who really pioneered that direction, but in a different context. She was working on it to learn behaviors from videos, right? So you take videos and you want to extract the actions so you can behavior clone or do imitation learning from those videos. You need to learn unsupervised latent actions so you can do that. Whereas we had the opposite kind of setup. Instead of trying to learn behaviors from the videos, we tried to learn a world model from the videos, so then we could use the world model to learn policies, right?
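As a rough illustration of the latent-action idea described here: an encoder looks at two consecutive frames and compresses "what changed" into a small discrete code, and a dynamics model is trained to predict the next frame from the current frame plus that code, so the codes end up behaving like controller inputs even though no action labels exist. This is a heavily simplified, hypothetical sketch in the spirit of the published Genie 1 approach (which uses video tokenizers and spatiotemporal transformers); the module names and shapes below are made up.

```python
# Rough sketch of unsupervised latent actions from unlabeled video,
# loosely in the spirit of Genie 1. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_LATENT_ACTIONS = 8  # tiny discrete "action" vocabulary, learned without labels

class LatentActionEncoder(nn.Module):
    """Infers a discrete action code from a pair of consecutive frames."""
    def __init__(self, frame_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, NUM_LATENT_ACTIONS),
        )

    def forward(self, frame_t, frame_t1):
        logits = self.net(torch.cat([frame_t, frame_t1], dim=-1))
        # Straight-through trick: discrete code forward, soft gradients backward.
        hard = F.one_hot(logits.argmax(-1), NUM_LATENT_ACTIONS).float()
        soft = logits.softmax(-1)
        return hard + soft - soft.detach()

class LatentDynamicsModel(nn.Module):
    """Predicts the next frame from the current frame plus a latent action."""
    def __init__(self, frame_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + NUM_LATENT_ACTIONS, hidden), nn.ReLU(),
            nn.Linear(hidden, frame_dim),
        )

    def forward(self, frame_t, latent_action):
        return self.net(torch.cat([frame_t, latent_action], dim=-1))

def reconstruction_loss(encoder, dynamics, frame_t, frame_t1):
    # Reconstructing frame_t1 forces the small code to carry whatever
    # transition information the dynamics model needs, i.e. the "action".
    a = encoder(frame_t, frame_t1)
    return ((dynamics(frame_t, a) - frame_t1) ** 2).mean()
```

At play time a user (or agent) would pick one of the learned codes directly instead of the encoder inferring it, which is what makes a new image "playable" in this framing.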
So it's kind of a different framing. And there had been a paper at CVPR called Playable Environments, which did a similar thing. It was Menapace et al. And that came out as I was writing my PhD thesis. And I was thinking, like, how am I going to do, how are we going to do these world models from internet videos? Or, like, how are we going to make this possible? And then I saw that paper and I thought, well, that's at least one path. And then I basically got chatting to Ashley, and she was thinking, well, actually, that does make sense. We could definitely go for this approach, and it'd be a really exciting way of getting the latent actions to work at a larger scale than she'd previously tried before. So for Genie 1, it was kind of this quite new idea and could we do it at all, but it was limited in scope. So we actually trained two models, which people don't really notice. We trained one model on this 2D platformer games dataset, and we trained another model on the robotics dataset from, I think, the RT-1 paper, which was this, like, everyday robotics arm. And in both cases you could learn an action space unsupervised, which meant that you could give the model a new image that wasn't in the training data, maybe it was generated by a text-to-image model, maybe it was a photograph you took yourself, and then you could, from that point, play it like a world using the latent actions. Now, that sounds amazing, but there were some caveats. So it was 90p, so it's very small, and it lasted maybe a couple of seconds before it degraded. And then also it didn't really generate anything new when you went off screen, right? So we used this MaskGIT approach, and it tended to have, like, mode collapse when you moved away. So if you look in the platformer examples, when you move to the right it just kind of continued a flat platform, rather than generating exciting new content. So people did actually notice that, but we were definitely aware of it. It was more just, like, this is kind of cool that this remotely works. And it was a very early-stage project, right? Like, essentially when we started this, no one really thought it was worth doing. And luckily we were in a place like Google DeepMind where people encouraged these more exploratory projects. But it wasn't like a heavily resourced effort, right? It was kind of a few of us scrapping together. So that was Genie 1. That one, we actually wrote a paper on it and we submitted it to ICML, which came out last summer. And so that's why I guess it feels very recent, because there's a cycle between finishing the work and the paper being published, right? So it feels more recent than it actually was. For Genie 2, the setting was basically, the Genie 1 paper came out around February, March 2024. We had the results for a while and we'd already kind of started thinking about the future plans and actually got to work on it. But then around that time, there was a lot of progress on video models in general.
And sort of, they had their moment like text-to-image had had a few years before. And so it became very clear to us that for the next phase of Genie we could go to something a bit larger scale and it would probably work, right? Because we'd seen that video models had scaled pretty effectively. So we decided to go from 2D games to an any-3D-world dataset, and we scaled it up to 360p, and it had the ability to generate environments, again from image prompts, that lasted maybe 10 to 20 seconds before they degraded. So it again wasn't real time, but you could play with it for a couple of minutes and wait a few seconds for each frame, and it kind of worked for 3D worlds. And so that was also, at the time, not completely obviously going to work, I think, because it was doing this, like, autoregressive generation, and so it still wasn't clear that this would last even 10 to 20 seconds. So I think it was quite an exciting result. And did Genie 2 overcome the mode collapse issue you mentioned? Yeah, to an extent. So Genie 2 was a diffusion model, so it had slightly different properties, but it still was only supporting image prompting, so it wasn't anywhere near as expressive as video models, right? It still required you to select an image for it that worked, which didn't always work, and it couldn't generate its own worlds, right? It required you to, like, generate an image that was in a certain kind of format, where there's, like, a clear agent in the right location, for example, and then it would simulate from there. Whereas, I think, what I'm hoping we spend a lot of time talking about with Genie 3 is that it can definitely do that. So yeah, I mean, to kind of bookend this part: Genie 2 at the time was sort of a good sign of life for this general approach, but it wasn't real time, and we'd seen with GameNGen that that was really impactful if you get that right. And the visual quality was good, but it still wasn't using text as input, and it was nowhere near the visual quality of state-of-the-art video models like Veo 2, which came out at the same time, which was absolutely mind-blowing for all of us. So I think that's where my co-lead on Genie 3, Shlomi, came in and really had expertise. I like to borrow people's expertise, and at that point it was pretty clear we had an expert in those areas. Yeah, talk a little bit about how you incorporated these other approaches, you know, GameNGen, for example. Like, how is that incorporated into the Genie line of research? And let me think about this, because I think, Jack, you make a good point, which is, like, we've been talking about this trajectory, we've been talking about, you know, Genie 1 and 2. We haven't really talked about, you know, Genie 3 and, like, you know, when you pull up the demo, what's super impressive about it? You know, maybe we'll let you do that as well, Shlomi, and then talk about how, you know, these other projects you've worked on kind of, you know, influenced the overall research. Sure. So basically, I think in Genie 3, we really tried to push it to the limit across all of the dimensions, right? So if we see the progression that Jack pretty much just talked about, right? We see that we have models that are more capable in terms of the quality of their generation.
We see that Genie 1 and Genie 2 are able to basically generate more and more worlds that are more consistent, but still to a limit, right? So we wanted to push on how long this trajectory can remain consistent. So that was one of the dimensions we wanted to improve. And of course, the resolution, right? So I think all of those, like, if you look at the resolution, the duration of the interaction, how fast the next frame can be generated, you get a very significant, kind of like, if you multiply all of those dimensions, you get quite like a 100x improvement, if you just think about it as raw compute, right? And I think it was very clear that this is not something that's obviously going to work, so there is some risk to this project, but we also felt that this is the time to go for it. So I think what happened basically is that after we launched Veo 2, it was very well received. We felt like the quality really improved, but it was definitely not real time and not interactive. And then Genie 2 came out and definitely also pushed the envelope in this different direction. We just said, okay, let's try and combine those vectors of improvement and go to the next level. And that's pretty much what Genie 3 is about, trying to bring the best of all worlds, pun intended maybe. And, you know, when you talk about this, like, bringing the best of these worlds together, is it bringing together ideas? Is it bringing together architectures? Is it bringing together datasets, or some combination of all of the above? First, I think it's all of it. You know, this may be a cliché, but it's definitely about the people as well. So we had people from different teams kind of bringing their experience and their motivation and energy into this project. So I think that was a big thing. And in terms of the technique, there are definitely shared technical challenges, right? Basically, the output is eventually pixels. And we had to take the text as input, and we want to be able to generate something that feels consistent. So even if it's a video of eight seconds, like Veo generates, you still want to feel that it's consistent, right? If the camera moves around, things should look consistent. You need to get a feel that this is really taken, maybe, in the real world, right? And the same goes once it's interactive. We want to be able to generate the next frame based on the user's input, but still this consistency is really important. Because there were some early models that were able to maybe generate the next frame, like, actually, GameNGen in a way was that, like, it didn't have very long context. And the reason it worked is because it kind of learned specific properties of this game of Doom, basically. So it kind of remembered what the level looks like. So it wasn't really generating it as we would want, right? So I think that, actually being able to generate things from text, this is the core capability that you see from image models to video models. I think that was kind of the main innovation and breakthrough for all of this line of research, that you could start with text, and text is such a compressed representation and it's such a strong way to learn concepts. So I think it was obvious that we start from text this time. We want to describe the world, and then we just want to drop you into that and then let the user or agent just go around and explore it. So I think those are the similarities of the projects.
And this is how we, of course, the infrastructure, hardware, we have a lot of, in general, the approach we take at Google DeepMind is to try and understand the core mechanics of how those models scale. So those concepts, we could pretty much leverage them across different modalities, and definitely different trade-offs of latency and memory, for example. Do you have a favorite example from Genie 3? I like the lizard, although it's not photorealistic. I really like the lizard that jumps, you know, the origami lizard. I really like that it splashes a little bit of water once it hits the origami river. And of course, the puddles. I think some of the examples were basically posted on X by team members, and I actually really like those, where people kind of played with the model. And then there is one where you walk around and the user is looking down at their shoes and basically sees them in a puddle. And it's very realistic. I think that's... Jack, your favorite example? There are loads of really cool examples, I think, that show different capabilities, right? But the one that I think was the most surprising was the inception sample, where, essentially, it takes a minute to probably explain this, but essentially we can prompt our model with videos. And this is a really exciting capability, because obviously with an amazing video model like Veo 3, you can generate really cool videos and then actually prompt Genie 3 with the video and then continue from there, right? And that's something that's really exciting that we were playing around with. But then one person in the team, Jakob, by mistake, actually didn't put the right caption. And so essentially what you realized was, if you don't align the caption or text prompt with the video latents, then actually the model kind of makes it work. So basically what happens is, you're facing the world from the video prompt, and then you'll look away and this other kind of magical world will be there. And so what he tried was actually prompting the model with a video of people playing the model demo. So we have this video, that was also posted on social media, of a couple of folks in Google DeepMind playing with the live demo in an office room. And then the prompt was like a jungle with a T-Rex, or whatever it was. And then during the Genie 3 generation, the screen that is showing what they were actually playing switches to this jungle world. And so does the laptop. So Genie 3 knows that it updates both, which I think is pretty incredible. But then it also is the case that when you turn away, you see outside is actually the jungle, just as in the prompt.
And when you go into the jungle and turn back around, you see the office that they're in and you see them playing, and you kind of have to see it to believe it. But I think it quite incredibly represents how the model actually does have some understanding of things, because it understands that when it updates the screens it should do both, right? And then it also understands that if you're in an office and you go outside and you look back, then you should see some kind of building that they should be in. And so I think that's really cool. It's definitely not what the goal of the project was, to be able to do that, but I think sometimes when you pursue interesting objectives, then unexpected things can arise. So I think that's a really nice example of one. I think one, not necessarily a demo, but a capability that I'm very excited about, I think it's one of the demos we have, is the whiteboard that has an apple and "Genie 3" written on it, and the tree. And I think the nice thing is that this really demonstrates, I think, one of the capabilities of memory, right? And to me, this is basically what makes it a world model, that you actually feel you're in a world, right? You look at a whiteboard, you look through the window, you come back and it's there. It looks exactly the same. Everything is in place. And it's just, you know, really, yeah, really strong. I was going to say, for me, for the same reason, the painting roller one is super impressive. Like, you know, maybe it's the simplest world of all of the demos. You're in a room and someone's, like, painting the wall, but you see the viewport pan away from the wall that has been painted with these random strokes and then pan back, and the strokes are, like, perfect, you know, perfect memory of what was painted frames before. For an autoregressive model to capture that so precisely, yeah, super impressive. I think when we saw that, like, generated by someone, there was kind of a bit of disbelief across some people on the team, because it was just like, we didn't even know the model was capable of doing something like that, right? Like, it's not just that the original visual world is maintained, but actually the actions you took in it and the consequences of the actions you took are maintained as well. It's also pretty cool because it shows you could use the model for sort of more vocational things as well. I think that's quite interesting as a use case that we didn't really think of as well. So that's a really, I think you have great taste in Genie 3 samples. Yeah, let's talk a little bit about the model itself. And, like, you know, we mentioned, I think at a high level, you know, challenges like consistency; latency is clearly a challenge. I think, you know, we've talked about kind of resolution or, you know, the richness of the produced visuals. And, you know, we've alluded to the model being autoregressive in nature. We've talked a little bit about transformers and diffusion. Like, how should we think about the, you know, the model architecture, you know, the model, and how you've used aspects of the modeling process to overcome these challenges? Yeah, yeah, no worries. So, you know, I think one of the key aspects of the model is it is basically autoregressive, which means in this context that the next frame is generated based on the long sequence, potentially long sequence, of everything that happened before, right?
So the model has to look at what happened before the particular frame, reason over this kind of past, and decide which information is relevant to the next frame. And the key is that that has to happen very quickly. So that has to happen multiple times per second, because we can never know what the next actions from the user are going to be, right? And I think that this is really what makes it real-time interactive, not just real-time, but real-time interactive. So it responds to what the... So I think this term is really, that's kind of what led our design of the system and architecture: the interactivity while being in real time. And basically everything boils down to that kind of design decision. And the interesting thing is that to get to this very low latency and to be able to look back into what happened before, that's kind of what basically made us look into how we can leverage and pick the right architecture and scale that enables both a very high-quality model, but also leveraging the best-in-class hardware that we have, to actually build something that works. And it's not just kind of like, we don't end up with, you know, a theoretical system or something that might be a paper, but actually something that we hope eventually we'll be able to share with more people in the future. To go in the same kind of direction as Shlomi, I think we had to really set ourselves, as a team, this goal of trying to be ambitious in all these dimensions, right? Which is something that, if you don't commit to it right at the beginning, then it will be very hard to achieve all of it in one go, which is really the challenge and really the magic of the model, right? Is that it can have memory, high resolution, diversity of worlds, and also be real time. And I think in each of those dimensions, we had really amazing people on the team. Obviously, you can see from the acknowledgements it's a slightly bigger team than we obviously had before in the Genie series. I think I would consider it almost like a new model, but with an inherited name. And we really had great people in all these different areas that worked really hard on each individual component, but also with awareness of the other parts. And I think it really is incredible to see the kind of things that people were able to put together. But each one was a challenge, right? So it wasn't the case that any one of those parts was easy to achieve. And when you talked about the challenges, you didn't specifically mention consistency. The blog post does specifically mention consistency as, like, an emergent property, suggesting that it wasn't necessarily something that you were designing towards. Is that the case? I think one way to think of it is, like, that was definitely our goal. And we definitely designed it in a way to achieve this goal. Like, when we listed what we want, you know, the spec of the model, definitely a memory of about one minute was in it, right? That was kind of our goal.
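The "real-time interactive" constraint described here amounts to a loop like the following: read whatever action the user just took, condition on the bounded history of frames and actions generated so far, and emit the next frame within a fixed per-frame latency budget. This is only a schematic of that loop; the frame rate, context handling, and method names are assumptions for illustration, not details of the actual Genie 3 system.

```python
# Schematic of a real-time interactive generation loop (illustrative only).
import time
from collections import deque

TARGET_FPS = 24                   # assumed target frame rate, not an official figure
FRAME_BUDGET = 1.0 / TARGET_FPS   # seconds available to produce each frame
CONTEXT_SECONDS = 60              # the "about one minute" memory discussed above
MAX_CONTEXT = TARGET_FPS * CONTEXT_SECONDS

def interactive_rollout(model, get_user_action, first_frame, text_prompt, steps):
    frames = deque([first_frame], maxlen=MAX_CONTEXT)   # bounded visual memory
    actions = deque(maxlen=MAX_CONTEXT)
    for _ in range(steps):
        t0 = time.monotonic()
        action = get_user_action()        # cannot be known ahead of time
        actions.append(action)
        # Condition on everything generated so far plus the newest action.
        # `generate_next_frame` is a hypothetical method, not a real API.
        next_frame = model.generate_next_frame(
            prompt=text_prompt, frames=list(frames), actions=list(actions)
        )
        frames.append(next_frame)
        yield next_frame
        # Spend any leftover time waiting on the display clock; overrunning
        # the budget is what breaks the feeling of interactivity.
        elapsed = time.monotonic() - t0
        if elapsed < FRAME_BUDGET:
            time.sleep(FRAME_BUDGET - elapsed)
```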
But I think the key thing is that there is no explicit representation of the world. There are a lot of approaches where you can implement, for example, a 3D engine that has a very explicit mesh, and then this gets rendered and you can go anywhere you want, right? It has a lot of limitations but also amazing implications. But if we go more into the machine learning world, then you have NeRFs and Gaussian splats, and that basically derives some representation of the geometry of the world, and based on that, you can just walk around and everything gets rendered, right? So these are all explicit representations, and we didn't want to do that. We think that while they do have a lot of applications, they're also limited. It's much harder to have dynamic environments, and we kind of wanted the model to learn that on its own. So while we did aim for that, we didn't think we should build anything into the system to achieve it. I think we're good students of the bitter lesson. We believe, at least I believe, that many of those things can be learned from data alone, if you just design the system in a way that it's set up to learn those capabilities. Yeah, which also, of course, means that every part has to be done really carefully, right? Because the model will learn what's in the data, right? So you have to have a really capable model that can do that, but you also have to train it on the right data so that it does learn the right things. And so it really has to be a lot of things coming together to get that without adding those other methods in. So one of the big features of the model that we've talked about is its promptability. It starts with text that generates the world. There's also an example in the blog post where you're prompting behavior in the world. Is that Genie, or are we talking now about an agent within the Genie environment? Like, is there a distinction between these two? And, like, you know, both how do you see that today, but also where do you see that whole agentic interaction paradigm going? So maybe just to elaborate a bit on what we call promptable world events, I think that's what you're referring to. So this capability is not directly tied to the agent. So you can think about it as God mode, if you want. You just want to change anything in the world. You want to have a sandstorm coming. You want to drop a box. We tried a bunch of stuff, like dropping objects from the sky or changing anything. So you can just change anything in the world that you want. So there's promptable world events, but there's also "walk to the, you know, the red rack", "walk to this", which is different from, like, the up, down, left, right type. So maybe I can... Talk about both. You can start, you know, continue with the world events, and then we'll get to the other stuff. So, yeah, we do have these promptable world events that allow you to just make changes in the world and inject some new information. So that allows, basically, control of the world beyond just the prompt that's provided in the beginning. Right. So this is more of, like, a temporal thing. Like, you're injecting a prompt in the middle of your generation, in time. Yeah, and it's quite a deep capability, I think, because it's not obvious; the prompt doesn't always make sense, right?
And I think, like, for example, you say, okay, a door opens, and you're in the middle of the desert. What door should open? And the model is like, you know, I don't know. So we see that sometimes it can make weird stuff, because the model is trying to. But when it makes sense, we often see that it does work and we get very nice samples. And we have some, like the dragon that appears out of the sky and lands in the middle of the tunnel. So I think there are definitely some cases where it works really well, and it's a very powerful capability. And now, if I can pause on that. Like, I can think of that as, you know, say the door opens in the desert. I can think of a model where, like, you generate your next frame, your expected next frame with the desert, and then you, like, you know, use that as the input frame to generate a frame that will replace that frame in the continued generation. But I could also imagine, you know, that being, like, you know, a crude way of doing something that's more integrated into the model architecture. Like, can you talk a little bit about how that is done? I think, basically, what we wanted is to think about what an event is, right? So if you think about it, we walk around the world and things happen around us. They're not necessarily done by us. They're not agent-centric, right? So I think that's basically the distinction between this and what you mentioned about the agent in the world, acting in the world, maybe walking somewhere, which in the videos that you've mentioned was done by an external model, the SIMA model, which we can talk about. I think it's very interesting. Maybe Jack can tell us a bit more, because it's, I mean, I think something that was also tried with Genie 2 and worked, so it's really cool. And then we built on top of that. But if we go back to promptable world events, then the ability to, it's not just based on a single frame, right? So it can be that you want to see something in the world, and it doesn't happen immediately, but then you look to the left and you see, for example, a person. So we have some of these examples where you ski down the slopes and then you look to the left and you have a person wearing a Genie 3 t-shirt. So it can just materialize things in the world, but it doesn't mean that it just pops in front of you. We want it to be, ideally, something that's integrated and makes sense in the world, right? Because it's easy to just drop something and it looks very artificial. We want it to actually be integrated, to look real, right? The model eventually wants to make things that look like the training data, which ultimately should be realistic. So somehow, like, additional conditioning information that's, you know, integral to the next-frame generation process, as opposed to, we're just going to drop this thing in the middle of the view. Super interesting. It's like you're telling the model to do it, and it's like, I'll do it when I'm ready, kind of thing. And, that's not the technical term, but yeah, it does it in a way that feels natural. So Jack, talk a little bit about the SIMA agents. Sure, yeah. So we, as I said, going back to the history of the project, designed this to be an environment for agents. And at Google DeepMind, obviously, we have lots of projects working on agents. And the one that's really focused on 3D worlds is the SIMA agent.
And so they're trying to train agents that can achieve language goals in sort of 3D simulated environments. And they have an announcement, or a blog post, from probably around February 2024, where they sort of showed a bit about how they're thinking about this. And what they're doing right now is they're training in existing games. So they've got a really capable agent that can do quite diverse things in different game worlds. But ultimately, it's limited by only having access to those game worlds, right? So it can't train in any imaginable game world or in the real world, because it's got access to just a finite set of environments to train in. And this is kind of the exact problem that Genie is trying to solve, right? It's to generate new environments. But also, the SIMA agent was surprisingly general, right? So even though it's been trained on a smaller set of worlds, you can kind of drop it in one of the Genie environments that it's never seen before, right? So you use text to create a Genie 3 environment or world, so say you describe a scene that could be like a factory floor or something like that. And you could say, in the background there's a forklift truck, and you generate this world. And then you say to the SIMA agent, like, go to the forklift truck, right? Or you could even say to it, like, go to the thing that can lift things, or something like that. And then the SIMA agent, from that point onwards, treats the Genie-generated world as if it's any other environment. It doesn't know that it's a model. It doesn't know anything. It just sees the pixels, and it says, I'm going to press this key to achieve this goal. And then all Genie sees is the key press, right? It doesn't know what the SIMA agent is trying to do, because if it did know that, it might make it happen, right? All it knows is it wants to go forward, right? And so then it simulates the next frame, and then the SIMA agent sees the next frame, and it says, okay, I'm going to keep going forward. And then these kind of happen in tandem, like, back and forth. And then, critically, if the SIMA agent does the wrong actions, it won't achieve the goal, right? If it does the right actions, it will achieve the goal. So then, of course, you can see that the SIMA agent can learn from this experience to achieve the goal more often, right? And there may be some things it can't do yet, but it could learn to do them in these worlds. So we essentially have signs of life that we have one agent interacting with another one, essentially to teach it new skills in these more embodied worlds, and at a scale that hasn't really been done before. And then, to kind of close the loop on this, a really cool thing could also be integrating this with the world events, right? Because even a kind of benign environment, like walking down the street, might become much more interesting if you then injected, say, I don't know, a cat jumps out, or something like that. So actually, you can teach our agents to be robust to all these different kinds of things, even in simple environments, because we have this additional lever to pull on the environment side to make it more interesting and challenging for the agents.
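The agent-in-the-loop setup Jack walks through can be pictured as an ordinary environment loop in which the "environment" is itself a generative model: the agent sees only pixels plus its language goal and emits key presses, while the world model sees only key presses (plus, optionally, an injected world event) and emits the next frame. The sketch below uses hypothetical interfaces to show that separation; it is not a DeepMind API.

```python
# Two learned systems in a closed loop: a goal-conditioned agent and a
# generative world model standing in for the environment.
# Hypothetical interfaces; illustrative only.

def agent_in_world_model(agent, world_model, goal_text, scene_prompt, steps, events=None):
    """events: optional {step_index: "promptable world event text"}."""
    events = events or {}
    frame = world_model.reset(prompt=scene_prompt)   # text prompt -> first frame
    trajectory = []
    for t in range(steps):
        # The agent sees only pixels plus its language goal; it has no idea
        # the environment is a learned model.
        action = agent.act(observation=frame, goal=goal_text)
        # The world model sees only the key press, and optionally an injected
        # world event; it never sees the agent's intent.
        event = events.get(t)
        frame = world_model.step(action=action, world_event=event)
        trajectory.append((frame, action, event))
    # Deciding whether the goal was achieved (for learning from this
    # experience) would happen downstream, e.g. with a separate evaluator.
    return trajectory

# Example usage: make a benign navigation task harder with a surprise event.
# traj = agent_in_world_model(
#     agent, genie_like_model,
#     goal_text="go to the forklift truck",
#     scene_prompt="a factory floor with a forklift in the background",
#     steps=240, events={120: "a cat jumps out from behind a crate"})
```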
Interesting, interesting. Jack, earlier we were talking a little bit about the limitations of Genie 1, and this kind of balance in, you know, communicating the limitations, leading with them versus, you know, having them further down the page. And they are further down the page in your, you know, Genie 3 blog post. But I'd love to have you riff a little bit about the limitations, and then, Shlomi, I'll turn it over to you to talk a little bit about, or at least start us off talking about, kind of next steps and where you see the research going, and then you, Jack. So let's talk about limitations. You know, how do you... So I think what we... I can't remember all of the things we listed, but the one that I think stands out for me is that we talked about simulating other agents, and we don't do this... Any multi-agent interactions within the world. Exactly. Yeah, I think we've mentioned that this is quite limited at this point. Ultimately, the model is predicting the next frames. It's able to get some kind of very basic simulation of other agents. Like, if they're walking and you stand in the way, they might stop. Or if a car's driving and you walk in front of it, it might stop. But it's not the case that this is very complex interaction. And so I think that's definitely a limitation. And it's also something that is maybe much more advanced in something like Veo already, right? So it's something that we don't have, for sure. I think there's clearly the one-minute limitation as well. Like, it's kind of funny to say it, because for Genie 1 and 2 a minute would have been seen as incredible, but I think it's a bit like, the pace of our field is absolutely crazy, right? So this is the kind of thing that will seem embarrassingly short, I'm sure, in the future. So right now we kind of say we have visual memory for a minute, and I guess an individual interaction or play of Genie can span multiple minutes, so it's just the context length, essentially. Yeah, and this is an important distinction. Like, you can play for multiple minutes even, and it doesn't degrade or become very blurry like, you know, previous generations of those models. But again, the memory is around one minute. Yeah, exactly. And then also, I guess, real-world physical accuracy is not perfect. So if you say, like, I want my exact street in London, it won't, it doesn't know my street. So I think that there are some elements of that that could be improved if you wanted to. In text, if you describe sort of a very abstract world, it will almost certainly get it on the money. If you describe a specific geographic location, you might notice that it's not what you hoped it would be in some way. So that's another one I think is a limitation. Shlomi, is there anything that jumps out at you in terms of limitations? Yeah, I think a bit similarly to what you've asked before about the agents being able to take actions. So currently the action space is relatively constrained, right? So while we do, we can navigate, you know, the agent can navigate, we have some actions like, you know, maybe jumping or, like, opening doors, but it's relatively basic in terms of the semantics of the action that the agent is taking. So promptable world events give us control over the world, but they're not necessarily agent-centric actions.
So I think this is definitely something that we hope to improve and expand in the future, because it is a real limitation. And it's quite challenging, because we want the agent to take more complex actions, not just walk around, but actually be able to, for example, pick up things, maybe, I don't know, type in some code, or maybe talk to a different agent. There is a lot of that. Many things can happen in the world. And it's quite a challenging problem, because we operate in the world as people in a very physical way, right? We use our hands, we use our feet to walk. We have an embodied presence. And when we take away all of this and we are left with the visual, well, with pixels only, then it's much harder to define what actions should actually happen, right? When you open the door, for example, it's not just "open the door", right? You go and you grab the knob of the door, and you move it, right? Or you pull it towards you. There's a sequence of micro-actions taking place. And I think there is a challenging question of how to model this space of actions. But it's definitely a limitation, and I think an opportunity to expand the capabilities. So digging into next steps, I get the sense that you're both pretty excited about the agentic aspects of this, no surprise coming from DeepMind. But what's most exciting to you, or most obvious to you, or, you know, most present for you in terms of where the project goes? We could start with you, Shlomi. So to me, the ability to be able to step into a world, right, that you created or someone else created, but then you can actually perceive it, see it, interact with it, I think it's huge. It can really be applied to so many things, from, you know, entertainment, which is very obvious, right? You said, like, you know, street view, interactive street view, for example. So it can be somewhat anchored in the real world, but take you somewhere else. And it actually reminds me of a startup I worked for a long time ago, where we had this kind of a game placed in downtown San Francisco. Of course, we had to model the entire downtown San Francisco, and it was a lot of work, but the key idea of this startup was to actually have games happening in real-world locations, right? So that's, for example, one thing: you could have any interaction, not necessarily a realistic one, placed in a realistic location. Just one example among many. And I think there are other really interesting applications for people, maybe not necessarily entertainment. It can be education, for example. It can be helping people see themselves accomplishing something they wouldn't have expected themselves to. Like, there is something very strong about seeing yourself accomplishing something. And then it kind of makes you, it's a bit of a psychological perspective, but I think it's very powerful. And there is the personalization aspect of being able to go into the environment, walk around, maybe prompt it in a way that looks very similar, maybe, to your house, for example. So if you're afraid of, I don't know, doing something, afraid of spiders, maybe you can see yourself walk next to a spider at your home, and then maybe your brain says, okay, I can do it. So, I mean, I think it's very...
That's a real-world potential application. Yeah, exactly. So it doesn't have to be. My point is that we don't necessarily know how people will use it or where this technology will go. And it's very early days for that. And that's why we had some trusted testers and academics interact with the model. Initially, we wanted to get some feedback, and we hope over time to learn more about the capabilities and about the applications that people are excited about. Jack, next steps for you? Yeah, so, I mean, there are so many exciting ideas already mentioned. I think the one that really excites me is teaching agents to interact in visually realistic, embodied worlds with people in the world. I think that's a really missing capability for any of our current agents, to interact in the physical world with humans as well. And I think that models like Genie 3 could enable that. And I also don't really think there's any other way to achieve that. So I think that's something really exciting, especially with the world events, right? Which could enable generating really diverse scenarios that we wouldn't be able to get data for any other way. So I think this is still fairly early in the journey, but I think this is a big step that really will open up a lot of use cases there. Well, Shlomi, Jack, thank you guys so much for jumping on and updating us on Genie 3 and everything that you're working on. It's been really great to dig into it. Awesome. Thanks so much for your time. Yeah, thanks, Sam. All right. Thank you both. Cheers. Thank you.
