
Bringing Robots to Life with AI: The Three Computer Revolution - Ep. 274

The AI Podcast (NVIDIA) • NVIDIA

Wednesday, September 17, 2025 • 52 min

What You'll Learn

  • The three computer concept includes: 1) NVIDIA DGX systems for training large AI models, 2) Omniverse and Cosmos for simulation and world modeling, and 3) Jetson AGX for on-robot AI inference
  • Omniverse and Cosmos enable generating data, experiences, and evaluating robots in simulation before deploying to the real world
  • Recent advancements like the sim-to-real paradigm, transformer models, and ChatGPT have transformed robotics research and capabilities
  • The Seattle Robotics Lab at NVIDIA focuses on fundamental and applied research across the full robotics stack, including perception, planning, control, reinforcement learning, and more
  • The lab has a close relationship with the University of Washington and aims to transfer its research to enable real-world robotic applications

AI Summary

This episode discusses the concept of 'bringing robots to life' through the integration of three key computer components: NVIDIA's DGX systems for training large AI models, Omniverse and Cosmos for simulation and world modeling, and the Jetson AGX platform for running AI inference on the robot itself. The guest, Yash Raj Narang, who leads NVIDIA's Seattle Robotics Lab, explains how these three elements work together to enable more intelligent, adaptive, and robust robotic systems that can learn and adapt to changing conditions.

Key Points

  1. The three computer concept includes: 1) NVIDIA DGX systems for training large AI models, 2) Omniverse and Cosmos for simulation and world modeling, and 3) Jetson AGX for on-robot AI inference
  2. Omniverse and Cosmos enable generating data, experiences, and evaluating robots in simulation before deploying to the real world
  3. Recent advancements like the sim-to-real paradigm, transformer models, and ChatGPT have transformed robotics research and capabilities
  4. The Seattle Robotics Lab at NVIDIA focuses on fundamental and applied research across the full robotics stack, including perception, planning, control, reinforcement learning, and more
  5. The lab has a close relationship with the University of Washington and aims to transfer its research to enable real-world robotic applications

Topics Discussed

#Robotics • #AI/Machine Learning • #Simulation • #World Modeling • #Onboard Inference

Frequently Asked Questions

What is "Bringing Robots to Life with AI: The Three Computer Revolution - Ep. 274" about?

This episode discusses the concept of 'bringing robots to life' through the integration of three key computer components: NVIDIA's DGX systems for training large AI models, Omniverse and Cosmos for simulation and world modeling, and the Jetson AGX platform for running AI inference on the robot itself. The guest, Yash Raj Narang, who leads NVIDIA's Seattle Robotics Lab, explains how these three elements work together to enable more intelligent, adaptive, and robust robotic systems that can learn and adapt to changing conditions.

What topics are discussed in this episode?

This episode covers the following topics: Robotics, AI/Machine Learning, Simulation, World Modeling, Onboard Inference.

What is key insight #1 from this episode?

The three computer concept includes: 1) NVIDIA DGX systems for training large AI models, 2) Omniverse and Cosmos for simulation and world modeling, and 3) Jetson AGX for on-robot AI inference

What is key insight #2 from this episode?

Omniverse and Cosmos enable generating data, experiences, and evaluating robots in simulation before deploying to the real world

What is key insight #3 from this episode?

Recent advancements like the sim-to-real paradigm, transformer models, and ChatGPT have transformed robotics research and capabilities

What is key insight #4 from this episode?

The Seattle Robotics Lab at NVIDIA focuses on fundamental and applied research across the full robotics stack, including perception, planning, control, reinforcement learning, and more

Who should listen to this episode?

This episode is recommended for anyone interested in Robotics, AI/Machine Learning, Simulation, and those who want to stay updated on the latest developments in AI and technology.

Episode Description

Yashraj Narang, head of NVIDIA's Seattle Robotics Lab, reveals how the three computer solution—DGX for training, Omniverse and Cosmos for simulation, and Jetson AGX for real-time inference—is transforming modern robotics. From sim-to-real breakthroughs to humanoid intelligence, discover how NVIDIA's full-stack approach is making robots more adaptive, capable, and ready for real-world deployment.  Learn more at ai-podcast.nvidia.com.

Full Transcript

Hello, and welcome to the NVIDIA AI podcast. I'm your host, Noah Kravitz. Our guest today is Yash Raj Narang. Yash is Senior Research Manager at NVIDIA and the head of the Seattle Robotics Lab, which I'm really excited to learn more about along with you today. Yash's work focuses on the intersection of robotics, AI, and simulation. And his team conducts fundamental and applied research across the full robotics stack, including perception, planning, control, reinforcement learning, imitation learning, simulation, and vision language action models. Full robotics stack, like it says. Prior to joining NVIDIA, Yash completed a PhD in materials science and mechanical engineering from Harvard University and a master's in mechanical engineering from MIT. And he's here now to talk about robots, the field of robotics, robotics learning, all kinds of awesome stuff. I'm so excited to have you here, Yash. So thank you for joining the podcast. Welcome. Thank you so much, Noah. So maybe first things first, and this is a very selfish question I mentioned before we started, but I think the listeners will be into it too. I've never been to the Seattle Robotics Lab. I don't know much about it. Can we start with having you talk a little bit about your own role, your background, if you like, and, you know, give us a little peek into what the Seattle Lab is all about. Yeah, absolutely. So the Seattle Robotics Lab, it started in, I believe, October of 2017. And I actually joined the lab in December of 2018. And the lab was started by Dieter Fox, who's a professor at University of Washington. And, you know, at the time, I believe he had a conversation with Jensen at a conference. Jensen Huang, of course, the CEO of NVIDIA. and Jensen thinks way far out into the future. And at that point, he was getting really excited about robotics. And he said, you know, essentially that we need a research effort in robotics at NVIDIA. And that's really how the lab started. So that was kind of the birth of the lab. And at the beginning, the lab, you know, and it still does a very academic focus. So we consistently have really high engagement at conferences. We publish a lot. We do a lot of fundamental and applied research. And recently, NVIDIA has been developing, especially over the past few years, a really robust product and engineering effort as well. And so we're working more closely and closely with them to try to get some of our research out into the hands of the community. So, you know, fundamental academic mission, but it's really important for us as well to transfer our research and get it out there for everyone to use. Fantastic. And you mentioned Dieter Fox, I believe, at UW, University of Washington. Is the lab, is there a relationship there? Yeah. So when Dieter started the lab, we, you know, over a number of years had a very close relationship with the University of Washington where many students from his lab and others would come do internships at the Seattle Robotics Lab. We still definitely have that kind of relationship. I stepped into the leadership role just a few months ago. Oh, wow. Okay. And, you know, plan to maintain that relationship because it's been so productive for us. Awesome. I have a little bit of bias. Somebody very close to me is a UW alum. Go Huskies. So, you know, I had to ask. All right. Let's talk about robots. We're going to start talking about, well, really, I'll leave it to you and I'll ask at a very high level. How do robots come to life? 
What does that mean when we talk about, you know, a robot coming to life? And I think there's going to get into the three computer concept and stuff like that. But I'll leave it to you. At a high level, what does that mean, bringing robots to life? Yeah, it's a big question. I think it's a real open question too. I think we can even start with what is a robot? I think this is a subject of debate, but generally speaking, a robot is a synthetic system that can perceive the world, can plan out sequences of actions, can make changes in the world, and it can be programmed. And it typically serves some purpose of automation. That's really sort of the essence of a robot. And now there's the question of, if you have a robot, how can it come to life? So I would say that if most people were, for example, to step into a factory today, you know, like an automotive manufacturing plant, they would see lots and lots of robots everywhere. Right. And the motion of these robots and the payloads of these robots and the speed of these robots, it's extremely impressive. But those same people that are walking into these places, they might not feel like these robots are alive because they don't necessarily react to you. In fact, you probably want to get out of their way if they're putting something together, you know, to be safe. So I think part of robots coming alive is really this additional aspect of intelligence so that when conditions change, it can adapt, it can be robust to perturbations, and it can start to learn from experience. Yeah. And I think that's really kind of the essence of coming alive. Got it. And what is the three computer concept and how does it relate to robotics? Yeah, the three computer concept is pretty interesting. I think this was, you know, I don't know the exact history of this, but I think this was inspired by, you know, the three body problem. So the three computer concept, it's really a formula for today's robotics, you know, both on the research side and the industry side. And it has three parts, as the name suggests. So the first computer is the NVIDIA DGX computer. So this includes things like GB200 systems, Grace Blackwell, super chips and systems that are composed of those chips. And these are really ideal for training large AI models and running inference on those models. So getting that fundamental understanding of the world, being able to process, you know, take images as input, language as input, and produce meaningful actions, robot actions as output, for example, training these sorts of models. and then running inference on those. The second computer is Omniverse and Cosmos. It's a combination of these things. So Omniverse is really a developer platform that NVIDIA has built for a number of years with incredible capabilities on rendering, incredible capabilities on simulation, and many, many applications built on top of this platform. So for example, in the Seattle Robotics Lab, we're heavy users of Isaac Sim and Isaac Lab, which are basically robot simulation and robot learning software that is developed on top of Omniverse. And what you can do with Omniverse is essentially train robots to acquire new behaviors, for example, using processes like reinforcement learning, which is sort of intelligent trial and error. You can also use it to evaluate robots. For example, if you have some learned behaviors and you want to see how it performs in different scenarios, you can put it into simulation and kind of see what happens there. 
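To make the train-in-simulation, evaluate-in-simulation loop described above a bit more concrete, here is a minimal sketch using the generic Gymnasium interface. This is an illustration only, not the actual Isaac Sim or Isaac Lab API: the classic Pendulum-v1 task stands in for a simulated robot environment, and a random policy stands in for a learned controller.

```python
# Minimal sketch of "train/evaluate in simulation" using the generic Gymnasium API.
# Pendulum-v1 is a stand-in for a simulated robot task; a real robotics pipeline
# would use an environment built on Isaac Sim / Isaac Lab instead.
import gymnasium as gym
import numpy as np

env = gym.make("Pendulum-v1")

def evaluate(policy, episodes=5):
    """Roll out a policy in simulation and report its average return."""
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action = policy(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))

# Placeholder "policy": random actions, standing in for a behavior learned
# by reinforcement learning (intelligent trial and error) on a DGX system.
random_policy = lambda obs: env.action_space.sample()
print("average return in simulation:", evaluate(random_policy))
```

In a real workflow, the evaluation step above is what lets you see how learned behaviors perform across many scenarios before anything touches physical hardware.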
Cosmos is essentially a world model for robotics. And world model is this kind of big term and many people have different interpretations of it. But just to kind of ground things a bit here, some of the things that Cosmos has done is actually make video generation models. So you could have an initial frame of an image, you can have a language command, and then you can predict sequences of images that come after that. So this is the Cosmos Predict model. There's also the Cosmos Transfer model. And the idea here is that you can take an image and you can again take, let's say, a language prompt and you can transform that image to look like a completely different scene while maintaining, you know, the shape and semantic relationships of different objects in that image. And then there's Cosmos Reason, which is really a VLM, which is a vision language model. So it can take images as input, language as input, and it can basically produce language as output. It can answer questions about images, and it can do a sort of a step-by-step thinking or reasoning process. Now, just stepping back a little bit, you know, second computer again, Omniverse and Cosmos. And what they're really used for is to generate data, to generate experience, and to evaluate robots in simulation. And so in a sense, this can kind of come either before or after the first computer. You know, you can, for example, generate a lot of data and then learn from it using that first computer, these DGX systems. Or you can train a model on that DGX system and then evaluate it using something like Omniverse or Cosmos. And the third computer is the AGX. By the way, I looked this up recently. I was curious. We've been here for a while, but still curious. What does the D in DGX stand for? What does the A in AGX stand for? Oh, yeah. Okay. So, D is apparently for deep learning, and A is apparently for autonomous. So, it's kind of a nice way to remember. Interesting. The more you know. Yeah, exactly. The more you know, right? So, the third computer is the Jetson AGX, specifically the Thor, which has recently been released. And this is all about running inference on models that are located on your robot. So instead of having separate workstations or data centers, this is a chip that actually lives on the robot where you can basically have AI models there and you can run inference on them in real time. Really powerful. So before asking a follow-up, I feel like I have to plug the podcast real quick, because it was really sort of satisfying in a way to listen to you and think, oh, yeah, we did an episode on that. Oh, yeah. Sanja talked about that. Oh, yeah. So I will say, if you would like to know a little more about the feeling of walking through an automotive factory with a lot of robots doing amazing things without worrying about getting out of the way: great episode with Siemens from a few months back. Check that out. I mentioned Sanja Fidler recently from NVIDIA. She spoke around SIGGRAPH, but a lot of stuff related to robots. Of course, from GTC, there are my plugs. Okay, so you got into this a little bit, Yash, but mentioning Thor in particular. But what's changed recently in the field? And what does that mean for where robotics is headed? Yeah, I think there have been many changes in the field. I think, for example, the three-computer solution, three-computer strategy from NVIDIA, that's been definitely a key enabler.
Just the fact that there is access to more and more compute, more and more powerful compute, and tools like Omniverse, for example, for rendering and simulation, and Cosmos for world models, and of course, you know, better and better onboard compute. I think that's really, really empowered robotics. Now, let's say, you know, maybe if we think a little bit about the learning side, I think since joining the lab in December of 2018, I've sort of been lucky to witness different transformations in robotics over time. So, you know, one thing that I witnessed early on was actually, I think this was in 2019, when OpenAI released its Rubik's Cube manipulation work. And so these were basically dexterous hands, human-like hands, that learned to manipulate a Rubik's cube and essentially solve it. But it was learned purely in simulation and then transferred to the real world. So that was kind of a big moment in the rise of the sim-to-real paradigm, training in simulation, deploying in the real world. I think other things came after that. You know, transformers were, of course, invented kind of before, but really starting to see more and more of that model architecture in robotics, I think that was a big moment or a big series of moments. Another specific moment that was pretty powerful was just, of course, as everybody in AI knows, ChatGPT. So I think that was released in late 2022. Most people started to interact with it in early 2023. And then, you know, the world of robotics started thinking about, OK, how do we actually leverage this for what we do? And, you know, many other fields kind of felt the same thing. Sure. So there was really an explosion of papers starting in 2023 about how to use language models for robotics and how to use vision language models for robotics. And I think that was quite interesting. So there are papers that kind of explore this along every dimension. Like, can you, for example, give some sort of long-range task to a robot or, you know, in this case to a language model and have it figure out all the steps you need to accomplish in order to perform that task? Can you, for example, use a language model to construct rewards? So when you do, for example, reinforcement learning, intelligent trial and error, you usually need some sort of signal about how good your attempt was. You know, you're trying all of these different things; how good was that sequence of actions? And that's typically called a reward. So, you know, these are traditionally hand-coded things using a lot of human intuition. And there was some very interesting work, including Eureka from NVIDIA, about how to use language models to sort of generate those rewards. There was also kind of a simultaneous explosion in more general generative AI, for example, generating images and generating 3D assets. A lot of this work came from NVIDIA as well. So on the image generation side, there was work, for example, on generating images that describe the goal of your robotic system. So where do you want your robot to end up? What do you want the final product to look like? Let's generate an image from that and use that to sort of guide the learning process. And then there's also, you know, when it comes to simulation, one of the challenges, and we'll probably get more into this a little bit later, but one of the challenges of simulation is you have to build a scene and you have to build these 3D assets or meshes. And that can take a lot of time and effort and artistic ability and so on.
So there's a lot of work on automatically generating these scenes and generating these assets. And in a sense, you can kind of view this transformation that we've seen over the past few years as kind of taking the human or human ingenuity more and more out of the process or at higher and higher levels, as opposed to absolutely doing everything and sort of hard coding things like rewards and final states and, you know, building meshes and assets manually and describing scenes and so on and so forth. So we're able to automate more and more of that. There's so much in what you just said. And one of the big things for me from this perspective is thinking about how little I understood about Omniverse, let alone Cosmos, before having the chance to have some of these conversations, particularly over the past few months and having to do with robotics, physical AI and simulation and the idea of creating the world and the robot is able to learn and Cosmos. It's all it's it's just. fascinating. It's so cool to, you know, I'm wanting to geek out on my end, but when you're talking about the different types of learning and, you know, I'm sure they go together in the same way that you mix different approaches to anything in solving complex problems. Can you talk a little bit about, I don't know if pros and cons is the right way to describe it, but the difference between imitation and reinforcement learning, not so much in what they are, but in sort of, you know, effectiveness or how you use them together and that sort of thing. Yeah, absolutely. I think these, you know, these are two really popular paradigms for robot learning. And I will, you know, try to kind of ground it in what we do, what we typically do in robotics, the typical implementations of imitation learning and reinforcement learning. So in a typical imitation learning pipeline, you're typically learning from examples. So for example, let's say I define a task. I'm trying to pick up my water bottle with a robot. What I might do if I were using an imitation learning approach is maybe physically move around the robot and pick up the water bottle. Or I might use my keyboard and mouse to sort of teleoperate the robot and pick up the water bottle. Or I might use other interfaces. But the point is that I am collecting a number of demonstrations of this behavior. I do it once in one way, I do it the second time in a different way, and maybe I move the water bottle around and I collect a lot of different demonstrations there. And basically, the purpose of imitation learning is to essentially mimic those demonstrations. The behaviors would ideally look as I have demonstrated it, right? Now, reinforcement learning operates a little bit differently. Reinforcement learning tries to discover the behaviors, you know, or the sequences of actions that achieve the goal. So, you know, in the most extreme case, what you might do if you were to take a reinforcement learning approach, again, intelligent trial and error, is you might just have proposals of different sequences of actions that are being generated. And if they happen to pick up the water bottle, I give a reward signal of one. And if they fail, I might give a reward signal of zero. And the key difference here is that I am not providing very much guidance on this sequence of actions that the robot needs to use in order to accomplish the task. I'm letting the robot explore, try out many different things, and then come up with its own strategy. So, you know, pros and cons. 
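As a rough illustration of the two learning signals just described, the sketch below contrasts a sparse "did it succeed" reward of the kind a reinforcement learning agent might receive with a behavior-cloning loss that pushes a policy toward demonstrated actions. The network sizes, thresholds, and random tensors are made-up placeholders, not anything from the episode.

```python
# Toy contrast of the two learning signals described above (illustrative placeholders only).
import torch
import torch.nn as nn
import torch.nn.functional as F

def sparse_reward(bottle_height: float, lifted_threshold: float = 0.1) -> float:
    """RL-style signal: 1.0 if the bottle ended up lifted, else 0.0.
    No guidance about *how* to lift it -- the agent must discover that itself."""
    return 1.0 if bottle_height > lifted_threshold else 0.0

print("reward for a lifted bottle:", sparse_reward(bottle_height=0.25))

# Imitation-style signal: match demonstrated actions as closely as possible.
policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 7))  # obs -> 7-DoF action
demo_obs = torch.randn(256, 16)     # stand-ins for recorded observations
demo_actions = torch.randn(256, 7)  # stand-ins for teleoperated demonstration actions

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(100):  # behavior cloning: mimic the demonstrations
    loss = F.mse_loss(policy(demo_obs), demo_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("final imitation loss:", loss.item())
```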
So imitation learning, you know, one pro is that you can provide it a lot of guidance. And the behaviors that you learn, for example, if a human, if a person is demonstrating these behaviors, then the behaviors that you learn would generally be human-like. They're trying to essentially mimic those demonstrations. Now, reinforcement learning, on the other hand, again, in the most extreme case, you're not necessarily leveraging any demonstrations. The robot, or agent, as it's often called, has to figure this out on its own. And so it can be less efficient. Of course, you're not giving it that guidance. And so it's trying all of these sequences of actions, and there are principled ways to do that. But essentially, it would be less efficient than if you were to give it some demonstrations and say, learn from that. Now, the pro is that you often have the capability of doing things that can be really hard to demonstrate. So one of the things, one of the topics that I've worked on for some time, for example, is assembly, literally teaching robots to put parts together. And this can actually be really difficult to do via a teleoperation interface. You probably need to be an expert gamer in order to do that. I hear you talk about assembling things and I think of, forget the robot, I think of myself trying to put together like very small parts on something, you know, twisting a screw in. And I can't. That makes me cringe, let alone trying to teleoperate a robot. Yeah, it can be really hard depending on the task. And the second thing is that reinforcement learning generally has the potential to achieve superhuman performance. So there are things, and I think games are a great example. Like, you know, one of the domains of reinforcement learning historically has been in games like Atari games. And that's kind of where people maybe in recent history got super excited about reinforcement learning, because all of a sudden you could have these AI agents that can do better at these games than any human ever. The same capabilities apply to robots. So you can potentially learn, the robot can learn, behaviors that are better than what any person could possibly demonstrate. And maybe like a simple example of this is speed. So maybe there's a tricky problem you're trying to give your robot where it has to go through a really narrow path and has to do this very quickly. And if you were to demonstrate this, you might proceed very slowly, you might collide along the way. But if a reinforcement learning agent is allowed to solve this problem, it could probably learn these behaviors automatically, these smooth behaviors, and it can start to do this really, really fast. And assembling objects is another example. You can start to assemble objects faster than you could possibly demonstrate. And I think that's the power. That's very cool. Thinking about, or listening to you talk about, different approaches to teaching and learning brought something to mind: I was looking at the NVIDIA YouTube channel just the other day for a totally different reason and came across the video of Jensen giving the robot a gift and writing the card that says, you know, dear robot, enjoy your new brain, or something along those lines, right? There's something I only know kind of by name: modular versus end-to-end brain. What is that about? Is that, am I along the right lines or is that something totally different? No, no, that's, it's essentially a way to design robotic intelligence. I would say these are two competing paradigms.
Both of these paradigms can leverage the latest and greatest in hardware. I would say that. Now, the modular approach is an approach that has been developed for a very long time in robotics. And sort of a classic framing for this is that a robot, you know, in order to perform some task or set of tasks, needs to have the ability to perceive the world. So to take in sensing information and then come up with an understanding of the world, like where everything is, for example. And it also needs the ability to plan. So, for example, given some sort of model of the world, like a physics model, for example, or a more abstract model, and maybe some sort of reward signal, you know, can it actually select a sequence of actions that is likely to accomplish a desired goal? Right. And then, you know, a third module in this modular approach would be the action module. And that means that you take in this sequence of actions, maybe these configurations that you'd like the robot to reach in space. And the action module, also called control, would figure out what are the motor commands that you want to generate? Literally, what are the signals you want to send to the robot's motors in order to move along this path in space? So that's kind of the perceive-plan-act framework. It's been called different things over time, but that's kind of the classic framing for a modular approach. And so following that, you would have maybe a perception module and you'd have some group of people working on that. You'd have a planning module. You'd have some group of people working on that. You'd have an action module. And so this is kind of how many robotic systems have been built over time. Now, the end-to-end approach is something that is definitely newer. And the idea is that you don't draw these boundaries, really. You take in your sensor data, like camera data, maybe force-torque data if you're interacting with the world. And then you directly predict the commands that you may send your motors. So you kind of skip these intermediate steps and you go straight from inputs to outputs. And that's the end-to-end approach. And, you know, I would say the modular approaches are extremely powerful. They have their advantages: there's a lot of maturity around developing each of those modules. It can be easy to debug, you know, for teams of engineers, the groups of people I was mentioning earlier. Yeah, it can be easier to certify as well, you know, if it's a safety-critical application. The end-to-end approach, the advantage there is that you're not relying as much on human ingenuity or human engineering to figure out what exactly are the outputs I should be producing for my perception module, what exactly are the outputs I should be producing for my planning module, and so on. That requires a lot of engineering, and if you don't do it right, you may not get the desired outcome. Yeah, I was just going to say, conceptually, it made me think of the difference between doing whatever task I'm used to doing and asking a chatbot just to shoot me the output, you know, and yeah, yeah. Right. And I think just another analogy here would be, I think this has been a really fruitful debate, a really vigorous debate in autonomous driving, actually. So in the 2010s, I would say just about every effort in autonomous driving was focused on the modular paradigm. Again, you know, separate perception, planning, control modules and different teams associated with each of those things.
And then kind of later, let's say, you know, early in the 2020s, there was a real shift to the end-to-end paradigm, which basically said, let's just collect a lot of data and train a model that goes directly from pixels to actions, you know, actions in this case being steering angle, throttle, brakes, and so on. And many things today kind of look, I would say, like a hybrid, you know, different companies and strategies. But most people have converged upon something that has elements of both. I'm speaking with Yash Raj Narang. Yash is a senior research manager at NVIDIA and the head of the Seattle Robotics Lab. And we've been talking about all things robots, AI, simulation, which we'll get back to in a second. But we were just talking about different styles, different approaches to robotics learning. I wanted to go back to earlier in the conversation when you mentioned, you know, going into the factory and seeing all these different robots doing these kinds of things. And even before that, your definition of what a robot is or is not. And thinking about that, I'm getting around to asking you to define sort of the difference between traditional and humanoid robots. And I'm thinking traditional like robot arms in a factory. You know, I have fuzzy images, probably from sci-fi movies when I was a kid and stuff like that, right? And humanoid robots, and I mentioned this earlier, back during GTC I had the chance to sit down with the CEO of 1X Robotics, and we talked all about humanoid robots. So maybe you can talk a little bit about this, traditional robots, humanoid robots, what the difference is, and maybe why we're now starting to see more robots that look like humans and whether or not that has anything to do with functionality. Yeah, absolutely. So one of your earlier questions too was kind of how has robotics changed recently? I think this is just another fantastic example of that. It's been unbelievable over the past few years to see the explosion of interest and progress in humanoid robotics. And, you know, to be fair, actually, companies like Boston Dynamics and Agility Robotics, for example, have been working on this since, you know, probably the mid, maybe even early 2010s. Yeah. And, you know, so they made continuous progress on that. And everybody was always really, you know, excited and inspired to see their demo videos and so on. Can I interrupt you to ask a really silly question? But now I need to know. Is there a word, we say humanoid robots, right? Is there a word for a robot that looks like a dog? Because Boston Dynamics makes me think of those early Atlas, I think, those early videos. Yeah, yeah, yeah. And I think Boston Dynamics used to, you know, they had a dog-like robot, which was called Big Dog, you know, sometime then, which is, you know, maybe why this is compromised. People typically refer to them as quadrupeds, just four legs, right? Got it. Got it. Thank you. No problem. Yeah. So, yeah, where were we? So traditional robots versus humanoids. So there's been an explosion of interest in humanoids, particularly over the past few years. And I think it was just this perfect storm of factors where there was already a lot of excitement being generated by some of the original players in this field. Folks like Tesla got super interested in humanoid robotics, I think 2022, 2023. And it also coincided with this explosion of advancement in intelligence through LLMs, VLMs, and early signals of that in robotics.
And so I think, you know, there's a group of people, you know, forward-thinking people, Jensen very much included, this is near and dear to his heart, that felt that the time is right for this dream of humanoid robotics to finally be realized, right? You know, let's actually go for it. And, you know, this begs the question of why humanoids at all? You know, why have people been so interested in humanoids? Why do people believe in humanoids? And I think that the most common answer you'll get to this, which I believe makes a lot of sense, is that the world has been designed for humans. You know, we have built everything for us, for our form factors, for our hands. And if we want robots to operate alongside us in places that we go to every day, you know, in our home, in the office, and so on, we want these robots to have our form. And in doing so, they can do a lot of things, ideally, that we can. We can go up and down stairs that were really built for the dimensions of our bodies; we can open and close doors that are located at a certain height and have a certain geometry because they're easy for us to grab. Humanoids could manipulate tools like hammers and scissors and screwdrivers and pipettes if you're in a lab, these sorts of things, which were built for our hands. And so that's really the fundamental argument about why humanoids at all. And it's been amazing to see this iterative process where there's advancements in intelligence and advancements in the hardware, so basically the body and the brain, and kind of going back and forth and just seeing, for example, the amount of progress that's been happening over the past couple of years in developing really high quality robotic hand hardware. It's kind of amazing. So that's really kind of, you know, my understanding of the story and kind of the fundamental argument behind humanoid robots. But I definitely see, I would say I see a future where these things actually just coexist, traditional and humanoid. Yeah. So earlier we were talking about the importance of simulation, creating world environments where robots can explore, can learn, all the different approaches to that. And I think we touched on this a little bit, but can you speak specifically to the role of simulated or synthetic data versus real-world data? It's something we touched upon. And again, listeners, the more we're talking, I feel like all these recent episodes are sort of coming together, talking about the increasing role of AI broadly generating tokens for other parts of the system to use and all of that. So when it comes to the world of robotics, simulated data, real-world data, how do they work? How do they coexist? Yeah. So first I'd like to say that in contrast with a number of other areas like language and vision, robotics is widely acknowledged to have a data problem. So there is no internet-scale corpus of robotics data. And so that's really why so many people in robotics are very, very interested in simulation and specifically using it to generate synthetic data. So that's basically the idea: simulation can be used to have high-fidelity renderings of the world. It can be used to do really high quality physics simulations. And it can be used, as a result, to generate a lot of data that would just be totally intractable to collect in the real world. And real-world data is, you know, generally speaking, your source of ground truth. It doesn't have any gap with respect to the real world because it is the real world, but it tends to be much harder to scale.
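One common way to let plentiful synthetic data and scarce real-world data coexist during training, implied by the discussion above, is to mix them with different sampling weights so the real examples are not drowned out. The sketch below shows that idea with standard PyTorch utilities; the dataset sizes and the 10x weighting are arbitrary assumptions for illustration.

```python
# Sketch: combine a large synthetic dataset with a small real-world dataset
# via weighted sampling. Sizes and weights are arbitrary illustrations.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

synthetic = TensorDataset(torch.randn(100_000, 16), torch.randn(100_000, 7))  # cheap, plentiful
real = TensorDataset(torch.randn(2_000, 16), torch.randn(2_000, 7))           # scarce ground truth

combined = ConcatDataset([synthetic, real])
# Give each real sample roughly 10x the sampling probability of a synthetic sample.
weights = torch.cat([torch.full((len(synthetic),), 1.0),
                     torch.full((len(real),), 10.0)])
sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
loader = DataLoader(combined, batch_size=256, sampler=sampler)

obs, actions = next(iter(loader))
print(obs.shape, actions.shape)  # each batch now mixes both data sources
```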
You know, in contrast with autonomous vehicles, for example, robotics doesn't really have a car at the moment. There aren't fleets of robots that everybody has access to. Can't put a dash cam on those little food delivery robots and get the data you need. Even if you could, you know, will it be nearly enough data? The answer is probably no, you know, to train general intelligence. You know, that's kind of why people are really attracted to the idea of using simulation to generate data. In real world, whenever you can get it, it's the ideal source of data, but it's just really, really difficult to scale. So you mentioned, you know, using real world data, there's no gap. We've talked about the sim to real gap in other contexts. How do you close it in robotics? What's the importance of it? Where are we at? And you talked about it a little bit, but get into the gap a little more and what we can do about it. Sure. So sim to real gap. So there are different areas in which simulation is typically different from the real world. So one is, you know, on the perception side, literally, you know, the visual qualities of simulation are very different from the real world. Simulation looks different often from the way the real world does. So that's one source of gap. Another source of gap is really on the physics side. So for example, in the real world, you might be trying to manipulate something, pick up something that is very, very flexible, and your simulator might only be able to model rigid objects or rigid objects connected by joints. And, you know, even if you had a perfect model in your simulator of whatever you're trying to move around or manipulate, you still have to figure out like, what are the parameters of that model? You know, what is the stiffness of this thing that I'm trying to move around? What is the mass? What are the inertia matrices in these properties? So physics is just another gap. And then there are other factors, things like latencies. So in the real world, you might have different sensors that are streaming data at different frequencies. And in simulation, you may not have modeled all of the complexities of different, again, different sensors coming into different frequencies. Your control loop may be running at a particular frequency. And these things may have a certain amount of jitter or delay in the real world, which you may or may not model in simulation. So these are just a few examples of areas where it might be quite different between simulation and the real world. And generally speaking, the ways around this are you either spend a lot of time modeling the real world, really capturing the visual qualities and the physics phenomena and the physics parameters and the latencies and putting that in simulation. But that can take a lot of time and effort. Another approach is called domain randomization or dynamics randomization. And the idea is that you can't possibly identify everything about the real world and put it into simulation. So whenever I'm doing learning on simulated data, let me just randomize a lot of these properties. So I want to train a robot that can pick up a mug or put two parts together. And it should work in any environment. It shouldn't really matter what the background looks like. So let me just take my simulated data and randomize the background in many, many, many different ways. And you can do similar strategies for physics models as well. You can randomize different parameters of physics models. 
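Here is a minimal sketch of the domain randomization idea just described: resample visual, physics, and timing parameters before every simulated episode so the learned policy stops relying on any single setting. The parameter names and ranges below are hypothetical; real pipelines (Isaac Lab, for example) expose their own randomization configurations.

```python
# Sketch of per-episode domain randomization; names and ranges are illustrative.
import random

def randomize_domain():
    return {
        # visual randomization: the background and lighting shouldn't matter
        "background_rgb": [random.random() for _ in range(3)],
        "light_intensity": random.uniform(0.3, 2.0),
        "camera_jitter_deg": random.uniform(-5.0, 5.0),
        # physics randomization: be robust to uncertain physical parameters
        "object_mass_kg": random.uniform(0.2, 1.0),
        "friction_coeff": random.uniform(0.4, 1.2),
        "joint_damping": random.uniform(0.01, 0.1),
        # latency randomization: tolerate real-world delays and jitter
        "control_delay_ms": random.uniform(0.0, 20.0),
    }

for episode in range(3):
    params = randomize_domain()
    # a real pipeline would push these values into the simulator before resetting it
    print(f"episode {episode}: {params}")
```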
And then there's also another approach which is really focused on domain adaptation. So I really care about a particular environment in which I want to deploy my robot. So let me just augment my simulated data to be reflective of that environment. You know, let me make my simulation look like an industrial work cell or let me make it look like my home, because I know I'm going to have my robot operate here. And maybe the final approach is kind of, you know, this thing called domain invariance. So there's randomization, adaptation and invariance, which is basically the idea that I'm going to remove a lot of information that is just not necessary for learning. You know, maybe if I'm picking up certain objects, I only need to know about the edges of these objects. I don't need to know what color they are, for example. So, you know, taking that idea and incorporating it into the learning process and making sure that my networks themselves or my data might be transformed in a way that they're no longer reliant on these things that don't matter. Yeah. I'm thinking about all of the data coming in and, you know, all the things that can be captured by the sensors and using video to train. And earlier you were talking about the problem, and it made me think of reasoning models, the problem of, you know, can you give a robot a task and can it break it down and reason its way and then actually execute and do it? What are reasoning VLA models? They've been talked about a lot recently, or I keep hearing about them anyway. Can you talk a little bit about what they are and how they're used in robotics? Yeah, absolutely. So reasoning itself, you know, just stepping back for a second, reasoning is an interesting term because it means many things to many different people. I think a lot of people think about things like logic and causality and common sense and so on, you know, different types of reasoning. And you can use those to draw conclusions about the world. Reasoning in the context of LLMs and VLMs and now VLAs, so vision language action models that produce actions as outputs, often means, you know, in simple terms, thinking step by step. In fact, if you go to ChatGPT and you say, here's my question, you know, show me your work or think step by step, it will do this form of reasoning. And so the idea is that you can often have better quality answers or better quality training data if you allow these models to actually engage in a multi-step thinking process. And that's kind of the essence of reasoning models. And reasoning VLAs are no exception to that. Okay. So I might give a robot a really hard task like setting a table. And maybe I want my VLA to now identify what are all the subtasks involved in order to do that. And within those subtasks, what are all the smaller-scale trajectories that I need to generate, and so on. So this is kind of the essence of the reasoning VLA. Got it. Right. So to start to wrap up here, I was going to ask, I am going to ask you to, in a way, it's kind of summarizing what we've been talking about, but maybe to put kind of a point on what you think the most important current limitations are to robotic learning that, you know, you and your teams and folks in the community are working to overcome. You mentioning setting the table, though, made me think of, you know, a better way to ask that. How far are we from laundry-folding robots? Like, am I going to, I'm the worst at folding laundry and I always see demos.
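As a toy illustration of the step-by-step reasoning idea described a moment ago, the sketch below asks a vision-language model to break a long-horizon command into subtasks before any motion is generated. The query_vlm function is a hypothetical stand-in that returns canned text, not a real NVIDIA or OpenAI API; a reasoning VLA would produce this decomposition itself and then generate the lower-level trajectories.

```python
# Toy illustration of step-by-step task decomposition with a (hypothetical) VLM.
from typing import List

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a reasoning VLM/VLA; returns canned text here."""
    return "1. Clear the table\n2. Place plates\n3. Place cutlery\n4. Place glasses"

def decompose_task(image_path: str, command: str) -> List[str]:
    prompt = (f"Task: {command}\n"
              "Think step by step and list the subtasks needed to complete it.")
    reply = query_vlm(image_path, prompt)
    return [line.split(". ", 1)[1] for line in reply.splitlines() if ". " in line]

subtasks = decompose_task("kitchen.jpg", "Set the table for four people")
for i, subtask in enumerate(subtasks, 1):
    # each subtask would then be handed to a lower-level policy to produce trajectories
    print(i, subtask)
```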
And I heard at some point that, you know, folding laundry sort of represents conceptually a very difficult task for a robot. Am I going to see it soon, before my kids go off to school? I think you might see it soon. I've seen some really impressive work coming out recently from various companies and demos within NVIDIA on things like laundry folding. Yeah. And the general process that people take is to collect a lot of demonstrations of people actually folding laundry and then use imitation learning paradigms, or variants of those paradigms, to try to learn from those demonstrations. And this ends up actually being, if you have the right kind of data and enough data and the right model architectures, you can actually learn to do these things quite well. Now, the classic question is, how well will it generalize? If I have a robot that can fold my laundry, can it fold your laundry? Right, right. The typical answer to that is you probably need some amount of data that's in the setting that you actually want to deploy the robot in. And then you can fine-tune these models. But I would say we're getting closer and closer, closer than certainly I've ever seen, on tasks like laundry folding. I'm excited. I'm excited. You've got me optimistic. And I thank you for that. So perhaps to get back to the more general conversation of interest, the current limitations. What do you see them as? And, you know, what's the prognosis on getting past them? Sure. I think one big one is, people feel, I would say the community as a whole, is really optimistic about the role of simulation in robotics, or at least most of the community. Simulation can take different forms. It can take kind of the physics simulation approach, or it can take this, you know, video generation approach, like, let me just predict what the world will look like. And these are really, you know, really thriving paradigms. And I think there are two questions around that. One is what we just talked about, which is the sim-to-real gap. So I think on the sim-to-real gap, people have made a lot of progress, something we've worked very hard on at NVIDIA, but there's still a lot more progress to be made, you know, until we can truly generate data and experience in simulation and have it transfer to the real world without having to, you know, put a lot of thought and engineering into truly making it work. And conversely, there's the real-to-sim question. So building simulators is really, really difficult. You again have to, you know, design your scenes and design your 3D assets and so on. Wouldn't it be great if we could just take some images or take some videos of the real world and instantly have a simulation that also has physics properties, that doesn't just have the visual representation of the world, but has realistic masses and friction and these other properties? So sim-to-real and real-to-sim, I think they're two big challenges, and we're just getting closer and closer, you know, every few months, on solving those problems. And then the boundaries between sim and real, I think, will start to be a little bit blurred, which is kind of maybe an interesting possibility. I think that's one big thing. And the second big thing I'd say for now is the data question. Again, robotics, as we're talking about it here, doesn't have the equivalent of a car. There is no fleet of robots that everybody has access to that can be used to collect a ton of data. And until that exists, I think we have to think a lot more about where we're going to get that data from.
And one thing that the group effort at NVIDIA, which is around humanoids, has proposed is this idea of the data pyramid, where you basically have, you know, at the base of the pyramid, things like videos, YouTube videos that you're trying to learn from. And then maybe a little bit higher in the pyramid, you have things like synthetic data that's coming from different types of simulators. And then maybe at the top of the pyramid, you have something like data that's actually collected in the real world. And then the question is, what is the right mixture of these different data sources to give robots this general intelligence? So, Yash, as we're recording this, CoRL is coming up. Let's end on that forward-looking note, and it'll be a good segue for the audience to go check out what CoRL is all about. But tell us what it's about and what your and NVIDIA's participation is going to be like this year. Yeah, absolutely. So CoRL stands for the Conference on Robot Learning. And it started out as a small conference, I think, and 2017 was maybe the first edition of it. And it's grown tremendously. It's one of the hottest conferences in robotics research now, as learning itself as a paradigm has really taken off. This year, it's going to be in Seoul, in Korea, which is extremely exciting. Yeah. And it's going to bring together the robotics community, the learning community, and the intersection of those two communities. And so, you know, I think everybody in robotics is looking forward to this. Our participation, you know, the Seattle Robotics Lab and other research efforts at NVIDIA, for example, the GEAR Lab, which focuses on humanoids, will be presenting a wide range of papers. And so we're going to be giving talks on those papers, presenting posters on those papers, hopefully some demos. And, you know, we're just going to be really excited to talk with researchers and, you know, people who might be interested in joining us in our mission. Fantastic. Any of those posters and papers you're excited about in particular? Maybe you want to share a little teaser with us? Yeah, I'm excited about a number of them. But one that I can just call out for now, that I work closely on, is this project called Neural Robot Dynamics. So that's the name of the paper. And we have abbreviated that to NERD. I was going to ask. I'm glad. So it's just N-E-R-D, also kind of inspired by neural radiance fields. Right, right, of course, yeah. So we had this framework and these models, which we call NERD. And the idea is basically that classical simulation, so typical physics simulators, kind of work in this way where they are performing these explicit computations about, here are the joint torques of the robot, here are some external forces, here are some contact forces, and let's predict the next state of the robot. And the idea behind neural simulation is, can we capture all of that with a neural network? And so, you know, you might be wondering, why would you want to do that? And there are some advantages to this. So one is that, you know, neural networks are inherently differentiable. And what that means is that you can understand, if you slightly change the inputs to your simulator, what would be the change in the outputs? And if you know this, then you can perform optimization. You can figure out how do I optimize my inputs to get the robot to do something interesting. So neural networks are inherently differentiable.
And if you can capture a simulator in this way, you can essentially create a differentiable simulator for free, which is kind of exciting. Another thing which is really exciting to us is fine-tuneability. So it's very difficult, if you're given a simulator and you have some set of real-world data that you collected on that particular robot that you're simulating, to actually figure out how I should modify the simulator to better predict that real-world data. And neural simulators can kind of do this very, very naturally. You can fine-tune them just like any other neural network. So I can train a neural network on some simulated data and then collect some amount of real-world data and then fine-tune it. And this process can be continuous. You know, if my robot changes over time or there's wear and tear, I can continue fine-tuning it and always have this really accurate simulator of that robot, which is pretty exciting. Yeah, that's really cool. Yeah, I think it's really cool. And a third advantage, which we are sort of in the early stages of exploring, is really on the speed side. So a lot of compute today, as many people know, has been really optimized for AI workloads and specific types of mathematical operations, specific types of matrix multiplications for example, that are very common in neural networks. And if you can transform a typical simulator into a neural network, then you can really take advantage of all of these speed benefits that come with the latest compute and with the latest software built on top of that. So that's really exciting to us. And we sort of did this project in a way that allows these neural models to really generalize. So, given a particular robot, if you put it in a new place in the world or you change some aspects of the world, this model can still make accurate predictions, and it can make accurate predictions over a long timescale. Amazing. For listeners who would like to follow the progress at CoRL in particular, the Seattle Robotics Lab in particular, NVIDIA more broadly, where are some online places, some resources you might direct them to? Yeah, I'd say the CoRL website itself is probably, you know, your primary source of information. So you'll find the program for CoRL. You'll find, you know, links to actually watch some of the talks at CoRL. You'll be able to have links to papers and you'll see the range of workshops that are going to be there. And a lot of them, I'm sure, will post recordings of these workshops. That's a great way to get involved. And that's just corl.org for the listeners. Yes. Yes, that's right. You've got your website as well. I'm sure we'll have updates on the website and through NVIDIA social media accounts. Noah, you could probably call out to those. I'm sure there's going to be plenty of updates on CoRL over the next period of time. Can I ask you, as a parting shot here, to predict the future for us. What does the future of robotics look like? You can look out a couple of years, five years, 10 years, whatever timeframe makes the most sense. And we won't hold you to this, but what do you think about when you think about the future of all this? Yeah, I think it comes down to those fundamental questions. So, you know, one is kind of what will the bodies of robots look like? So this is kind of what you touched on with, you know, robot arms in factories versus humanoids. And I think what you'll see is that there'll be a place for both.
So, you know, robot arms and more traditional looking robots will still operate in environments that are really built for them or need an extremely high degree of optimality. and humanoids will really operate in environments where they need to actually be, you know, alongside humans and, you know, in your household and in your office and so on around many, many things that have been built for humans. So I kind of see that as the future of the body side of things. On the brain side of things, there's also these questions of, you know, modular versus end-to-end paradigms. And what I've seen in autonomous vehicles is, of course, as we talked about before, starting with modular, swinging to end to end, starting to converge on something in the middle. And I can imagine that robotics, as we're talking about here, for example, robotic manipulation, will start to follow a similar trajectory where we will explore end to end models and then probably converge on hybrid architectures until we collect enough data that an end to end model is actually all we need. And that's kind of how I see those aspects. There are some other questions, for example, are we going to have specialized models or are we just going to have one big model that solves everything? That one is a little bit hard to predict, but I would say that, again, there's probably a role for both where we're going to have specialized models for very specific domain specific tasks and where, for example, power or energy limits are very significant. and you're going to have sort of these generalist models in other domains where you need to do a lot of different things and you need a lot of common sense reasoning to solve tasks. Yeah, I would say those are some open debates and that would be my prediction. And then maybe one other thing that you touched on was simulation versus the real world. And again, I kind of see this as one of the most exciting things. I'd love to see how this unfolds, but I really feel that the boundaries between simulation and real world will start to be blurred. The sim-to-real problem will be more and more solved, and the real-to-sim problem will also be more and more solved. And so we'll be able to capture the complexity of the real world and make predictions in a very fluid way, perhaps using a combination of physics simulators and these world models that people have been building, like Cosmos. Amazing future. Yash, thank you so much. This has been an absolute pleasure, and I know you have plenty to get back to, so we appreciate you taking the time out to come on the podcast. All the best with everything and enjoy Coral. Can't wait to follow your progress and read all about it. Thank you so much, Noah. It's been a pleasure. Thank you. Thank you.
