

The Engineering Behind the World’s Most Advanced Video AI
Gradient Dissent
What You'll Learn
- Runway's latest video AI model, Gen 4.5, has achieved the top spot on the video generation leaderboard, a remarkable feat given its limited resources compared to larger tech companies.
- The key to Runway's success is a highly committed team, creative efficiency in training and inference, and a focus on fundamental research and experimentation rather than just chasing the latest hardware.
- Video models are evolving into universal simulation engines that can grasp the physical world and causality in a more consistent way, going beyond just generating entertaining media.
- Potential applications include using video models to train robots, create interactive gaming experiences, and enable non-linear narratives that blur the line between pre-made content and real-time experiences.
- The goal is to move these models beyond language understanding toward a more holistic grasp of reality, which could unlock new frontiers of artificial general intelligence.
Episode Chapters
Introduction
Host Lukas Biewald discusses Runway's achievement of topping the video generation leaderboard and interviews CEO Cristóbal Valenzuela about how the team accomplished it.
Competing with Larger Tech Giants
Cristóbal explains how Runway has been able to outperform companies with significantly more resources, emphasizing the importance of a dedicated team and creative efficiency.
Advances in Video AI Capabilities
Cristóbal highlights the key improvements in Runway's latest model: better physical-world understanding, greater consistency, and the ability to simulate complex scenarios beyond video generation.
Expanding Applications Beyond Media
Cristóbal discusses the potential for video models to be used in robotics, gaming, and interactive experiences, moving toward a more holistic grasp of reality and, possibly, artificial general intelligence.
AI Summary
This episode discusses the engineering behind Runway's latest and most advanced video AI model, Gen 4.5, which recently topped the video generation leaderboard. CEO Cristóbal Valenzuela explains how Runway competes with larger tech giants despite limited resources, emphasizing a dedicated team, creative efficiency, and fundamental research. He also describes how the latest model better understands the physical world and can simulate complex scenarios beyond video generation, opening up applications in robotics, gaming, and interactive experiences.
Topics Discussed
- Video generation models
- Runway ML
- Fundamental AI research
- Simulation and physical world understanding
- Applications beyond media and entertainment
Episode Description
Is video AI a viable path toward AGI? Runway ML founder Cristóbal Valenzuela joins Lukas Biewald just after Gen 4.5 reached the #1 position on the Video Arena leaderboard, according to community voting on Artificial Analysis. Lukas examines how a focused research team at Runway outpaced much larger organizations like Google and Meta in one of the most compute-intensive areas of machine learning.
Cristóbal breaks down the architecture behind Gen 4.5 and explains the role of "taste" in model development. He details the engineering improvements in motion and camera control that solve long-standing issues like the restrictive "tripod look," and shares why video models are starting to function as simulation engines with applications beyond media generation.
Connect with us here:
Cristóbal Valenzuela: https://www.linkedin.com/in/cvalenzuelab
Runway: https://www.linkedin.com/company/runwayml/
Lukas Biewald: https://www.linkedin.com/in/lbiewald/
Weights & Biases: https://www.linkedin.com/company/wandb/
Full Transcript
Cristóbal: The best way to understand video models is to understand them as basically universal simulation engines. They can effectively try, or will try, to simulate hopefully everything. We're simulating entertainment and media first, but you can see how the models will start to simulate way more than just that.

Lukas: You know, I always think, wow, how is Runway going to compete against the resources of Google? Do you have any thoughts on how that's working?

Cristóbal: When we started seven years ago, I think people just never thought it was that interesting to build video models in the first place. I feel now we've created an industry, but I think resources here are still not enough. The only reason we're able to do this is because we have this incredible team that's incredibly committed to achieving the vision that we have, and it's the manifestation of that.

Lukas: What are some things that your latest model can do that a previous generation couldn't?

Cristóbal: Gen 4.5, which we just announced, has a really good margin with respect to everyone else who comes after. And being able to train not on language alone, but on observational data, on real data, allows the model to grasp reality and how the world works in a much more consistent way, to hopefully do much more than just video generation.

Lukas: You're listening to Gradient Dissent, a show about making machine learning work in the real world, and I'm your host, Lukas Biewald. I got a text this morning from Cristóbal, the CEO and founder of Runway ML, that his latest model is on top of the Video Arena leaderboard. So I wanted to record a quick podcast to ask him how he did it and what's working. I hope you enjoy it.

All right. So you're at the top of the Video Arena leaderboard.

Cristóbal: We are. Yeah. Quite a moment.

Lukas: So can you tell me how that leaderboard works, first of all?

Cristóbal: Sure. All arena leaderboards basically ask the internet, people at large, to rate between two different outputs. You go through pairs of examples and people vote, and as models get better and better, you eventually end up with a ranking. Leaderboards for video have been around for the last year or two. And now Gen 4.5, which we just announced, is at the top of the leaderboard, with a really good margin with respect to everyone else who comes after. It's a pretty remarkable feat, to be honest. It's very hard; video is one of those very expensive, hard modalities to be good at. And we managed to get the best video model in the world out there.

Lukas: It's really amazing. I mean, we've been friends for a while and I've obviously been working with you guys for a long time. And I always think, wow, how is Runway going to compete against the resources of Google? And yet you continue to do it. Do you have any thoughts on how that's working?

Cristóbal: Yeah. You know, when we started, I don't know, seven years ago, and I think we met when we were really early in the journey, I don't think there was competition or a leaderboard, because I think people just never thought it was that interesting to build video models in the first place. I feel now we've created an industry; we've helped others come on board. And it's true that now some of the biggest and, I guess, most well-funded companies in the world are building very similar things to Runway. And it's great.
It feels like a great push for the market. It helps models improve consistently across the board. But I think resources here are still not enough. You need a consistent vision and an obsession to build this. I think the underestimated part of a lot of what we've done is how efficient and effective we've been with the resources that we have. It's easier to train models when you have hundreds of billions of dollars to spare, but managing to get to the best model in the world with a small set of resources requires you to be very creative about efficiencies, optimizing both training and inference. And the only reason we're able to do this is because we have this incredible team that's incredibly committed to achieving the vision that we have. It's the manifestation of that.

Lukas: I'm continuously astonished by how well these video models do. What do you think you do particularly better in this version of the model? Where do you look to improve models at this point?

Cristóbal: In all parts of research, taste matters a lot. And taste doesn't just mean aesthetics. It does mean aesthetics at some point, but it's not the only thing. Taste is how you train these models. There are thousands of different knobs and parameters and tips and tricks, things we know have worked that you need to keep in mind. I don't think there's one single element to why models are getting better over time; it's a combination of these elements. I was listening to Ilya's podcast from a couple of days ago, and I really liked the idea of phrasing this era as the era of research, but with bigger computers. So we're back to fundamental research. You need to understand and spend a lot of time doing science, trying experiments. Being really good at running experiments is one of the things that works really well here.

Lukas: Specifically, where is the frontier here? Where do models still underperform? What are some things that your latest model can do that a previous generation couldn't? Or where on the leaderboard do you think you're getting an advantage, and why?

Cristóbal: There's definitely a lot that models can do today that they couldn't do before, but there's still other stuff they're not able to do today that I'm sure we'll be able to solve in the future. Overall, this idea of world understanding has become more evident: the models are really good reasoning systems that understand spatial and temporal consistency, that can understand cause and effect, that can understand the world, and the implications of that are pretty broad. There are ways of customizing or fine-tuning these models so they can also be useful in other domains. And that becomes extremely interesting from a general intelligence perspective, where I think one of the bottlenecks of language models is that language is always constrained by what language actually is, which is a human abstraction of reality. We've created this mechanism for us to communicate with each other and describe the world, but it's not an accurate representation of the real world. It's an abstraction of the world.
And so being able to train not on language alone, but on observational data, on real data, on video data, allows the model to grasp reality and how the world works in a much more consistent way. A lot of the work we're doing now is heading toward what that means and how you scale and extrapolate those abilities, to hopefully do much more than just video generation.

Lukas: Are there any particular prompts or things you tried recently where you went, ooh, that's super cool?

Cristóbal: Yeah, there are a lot. I have internal benchmarks that I always try, to see how models are doing.

Lukas: Ooh, can you tell one or two of those?

Cristóbal: Sure. I have this kangaroo pushing another kangaroo, a baby kangaroo, in a stroller. It's complex because you need the motion of the kangaroo to be consistent with how they actually move, and also just generally how the camera should follow. It's a hard prompt, and now the model has nailed it pretty well. Physics in general the model does pretty well: object permanence, movements of things. Human motion is one of the things that's very hard to accomplish well, and the model does pretty well at that; it's something we spent a lot of time on. And if you ever want to try the model at some point: being able to follow camera movements with precise descriptions. You can pan and zoom and focus and go back to another focus, all in a single sequence. It breaks a little bit of this, I would say, AI feel that a lot of previous models had, which were very consistent with single-shot, tripod-style cameras. A lot of great stories and great videos are more than that. So we've managed to nail, I would say, how camera movements work in the model.

Lukas: I've always felt like you've had this perspective of, hey, we don't want to just make the most flashy, impressive model; we want to make one that's really useful for telling stories. And I feel like you're probably focused on some things outside of what might win a competition like this, like character consistency. That must also be a consideration for you, right?

Cristóbal: Yeah, of course. I think storytelling and media is one of the most compelling use cases of the models in the first place, and it's the obvious one, the first one. But you really need to think about these models as, I would say, the next frontier of intelligence, where the models can reason and be used in other sorts of domains beyond just arts and media. In a way, two or three years ago, the question we had to ask ourselves was: can you actually make great art and great films and great media with models? I think that question has been solved by now, and we have a film festival that we host every year. This year it packed Lincoln Center in New York. We have artists from all over the world come to it and make these amazing films. So in a way, I think that answer is already solved. But now you can see how the models are becoming way more than just for entertainment, and that's something we're looking into.

Lukas: One thing that happened in one of our recent hackathons is someone took a model like this and used it to train a robot to use its arms correctly, by generating synthetic data with video data, which just seemed super cool and evocative.
Is that the kind of thing you're talking about?

Cristóbal: Correct, yeah. There are ways of doing physical or embodied AI with world models, with the models that we have. And I think that idea is, I would say, largely still unexplored, because the consistency of the models wasn't there yet. You can think about how you can manipulate the physical world with this simulation; they're basically simulation systems. Something we've been advocating for a long time is that the best way to understand video models is to understand them as basically universal simulation engines. They can effectively try, or will try, to simulate hopefully everything. We're simulating entertainment and media first, but you can see how the models will start to simulate way more than just that.

Lukas: Well, give me some more examples.

Cristóbal: I think gaming is another one. Narrative, linear content is, I would say, films, videos, things you've seen that are pre-made. But the moment you have really good consistency and permanence, a bit more deterministic world-building, and you have it in real time, then you can have all this non-linear type of content. We maybe even call it just experiences: live experiences in which you can move around, or have conversations with people, or see things play out in real time. That, for me, might resemble a video game, and I think there's a lot we're borrowing from the language of games. But effectively it does feel like a new medium altogether, because the way you interact with it, the way you make it, the way it's developed in the first place, is different. Thinking about it from an artistic perspective, thinking about it as a new medium, is incredibly exciting, because of all the new things you could do with it. But there are also all the adjacent things you can use these models for, like if you want to learn. Pretty much every time I want to learn something, I probably watch a YouTube video these days. Imagine a world where we can have customized learning experiences and videos that are completely generated in real time, just for you. I think we're not that far away from that. It's really exciting.

Lukas: One selfish question. I love using this model with my daughter, who's six, and we play with this stuff all the time. But one of the things that drives her nuts is that when you put kids in the models, it often gets flagged. Have you thought about this at all? I do think kids are kind of the new generation of creators, but there are so many guardrails. I understand why you'd want to have guardrails, but I would say it's my daughter's biggest objection to these media generation models.
Cristóbal: Yeah, I mean, look, that's a hard problem, and everyone has different approaches. I think we've taken a more safe and consistent approach to how we think we should develop the models. Moderation will get better, and the ways of making sure the outputs are safe will get better over time. We've done a really incredible job at that; we have a team dedicated just to trust and safety, to make sure we know how video moderation can be done effectively. But yeah, I also have a daughter. You want to make sure you can make stuff with your kids and it's just a fun experience. There might be a way to do something like what a lot of streaming services do these days, where you have a kids-safe option in which certain rules have been applied, and you as a parent can decide which ones, and if you want to remove them, you can remove them. I think that will probably be a more consistent, or a more interesting, way of working toward solving that.

Lukas: Well, if you ever release that, you'll have a fan in me.

Cristóbal: Yes, totally. We'll work on that.

Lukas: Awesome. Well, thanks so much, and congratulations.

Cristóbal: Of course. Yeah, well, thank you for the support. It's been quite a great journey. And by the way, all the models are trained, always, with Weights & Biases. So a big shout-out to you guys. You've helped us a lot to get here.

Lukas: Thanks so much. Thanks so much for listening to this episode of Gradient Dissent. Please stay tuned for future episodes.
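A quick aside on the arena methodology Cristóbal describes at the top of the interview: pairwise human votes are typically aggregated into a ranking with an Elo-style rating system. Below is a minimal sketch of that idea in Python; the K-factor, the starting rating, and the simulated 65% win rate are illustrative assumptions, not details of the Artificial Analysis leaderboard.

```python
import random
from collections import defaultdict

# Elo-style aggregation of pairwise votes, as used (in spirit) by
# arena-style leaderboards. K and the starting rating are assumptions.
K = 32
INITIAL_RATING = 1000.0

ratings = defaultdict(lambda: INITIAL_RATING)

def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that a model rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(winner: str, loser: str) -> None:
    """Nudge both ratings after one pairwise vote."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_win)
    ratings[loser] -= K * (1.0 - e_win)

# Simulate votes between two hypothetical models, where "model_a"
# wins 65% of head-to-head comparisons.
for _ in range(1000):
    if random.random() < 0.65:
        record_vote("model_a", "model_b")
    else:
        record_vote("model_b", "model_a")

# The final ranking is simply the models sorted by rating.
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

With enough votes, the rating gap between two models converges to reflect their head-to-head win rate, which is why a model can sit at #1 "with a really good margin" rather than merely holding the top slot.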
Related Episodes
Why Physical AI Needed a Completely New Data Stack
Gradient Dissent
1h 0m
The CEO Behind the Fastest-Growing AI Inference Company | Tuhin Srivastava
Gradient Dissent
59m
The Startup Powering The Data Behind AGI
Gradient Dissent
56m
Arvind Jain on Building Glean and the Future of Enterprise AI
Gradient Dissent
43m
How DeepL Built a Translation Powerhouse with AI with CEO Jarek Kutylowski
Gradient Dissent
42m
GitHub CEO Thomas Dohmke on Copilot and the Future of Software Development
Gradient Dissent
1h 9m