TWIML AI Podcast

Inside Nano Banana 🍌 and the Future of Vision-Language Models with Oliver Wang - #748

Tuesday, September 23, 2025 · 1h 3m

What You'll Learn

  • Nano Banana is a new image generation and editing model from Google DeepMind, integrated into the Gemini language model
  • It can handle more open-ended and abstract prompts compared to previous image models, demonstrating improved understanding of user intent
  • The model has seen unexpectedly high adoption and usage, with millions of votes on the LMArena platform
  • The goal was to build a general-purpose image editing and generation model, rather than one focused on specific tasks
  • The integration of image capabilities with the broader language model knowledge has been a key factor in the model's success

AI Summary

The podcast discusses Nano Banana, a new image generation and editing model developed by Google DeepMind. The model, officially called Gemini 2.5 Flash Image, is integrated into the broader Gemini language model, allowing it to leverage extensive world knowledge. Unlike previous image models that required specific prompts, Nano Banana can handle more open-ended and abstract prompts, demonstrating improved understanding of user intent. The model has seen unexpectedly high adoption and usage, with millions of votes on the LMArena platform, indicating its broad utility for creative and everyday tasks.

Key Points

  1. Nano Banana is a new image generation and editing model from Google DeepMind, integrated into the Gemini language model
  2. It can handle more open-ended and abstract prompts compared to previous image models, demonstrating improved understanding of user intent
  3. The model has seen unexpectedly high adoption and usage, with millions of votes on the LMArena platform
  4. The goal was to build a general-purpose image editing and generation model, rather than one focused on specific tasks
  5. The integration of image capabilities with the broader language model knowledge has been a key factor in the model's success

Topics Discussed

#Vision-language models #Image generation #Image editing #Generalist AI agents #Model integration and convergence

Frequently Asked Questions

What is "Inside Nano Banana 🍌 and the Future of Vision-Language Models with Oliver Wang - #748" about?

The podcast discusses Nano Banana, a new image generation and editing model developed by Google DeepMind. The model, officially called Gemini 2.5 Flash Image, is integrated into the broader Gemini language model, allowing it to leverage extensive world knowledge. Unlike previous image models that required specific prompts, Nano Banana can handle more open-ended and abstract prompts, demonstrating improved understanding of user intent. The model has seen unexpectedly high adoption and usage, with millions of votes on the LMArena platform, indicating its broad utility for creative and everyday tasks.

What topics are discussed in this episode?

This episode covers the following topics: Vision-language models, Image generation, Image editing, Generalist AI agents, Model integration and convergence.

What is key insight #1 from this episode?

Nano Banana is a new image generation and editing model from Google DeepMind, integrated into the Gemini language model

What is key insight #2 from this episode?

It can handle more open-ended and abstract prompts compared to previous image models, demonstrating improved understanding of user intent

What is key insight #3 from this episode?

The model has seen unexpectedly high adoption and usage, with millions of votes on the LMArena platform

What is key insight #4 from this episode?

The goal was to build a general-purpose image editing and generation model, rather than one focused on specific tasks

Who should listen to this episode?

This episode is recommended for anyone interested in Vision-language models, Image generation, Image editing, and those who want to stay updated on the latest developments in AI and technology.

Episode Description

Today, we're joined by Oliver Wang, principal scientist at Google DeepMind and tech lead for Gemini 2.5 Flash Image, better known by its code name, "Nano Banana." We dive into the development and capabilities of this newly released frontier vision-language model, beginning with the broader shift from specialized image generators to general-purpose multimodal agents that can use both visual and textual data for a variety of tasks. Oliver explains how Nano Banana can generate and iteratively edit images while maintaining consistency, and how its integration with Gemini's world knowledge expands creative and practical use cases. We discuss the tension between aesthetics and accuracy, the relative maturity of image models compared to text-based LLMs, and scaling as a driver of progress. Oliver also shares surprising emergent behaviors, the challenges of evaluating vision-language models, and the risks of training on AI-generated data. Finally, we look ahead to interactive world models and VLMs that may one day "think" and "reason" in images. The complete show notes for this episode can be found at https://twimlai.com/go/748.

Full Transcript

I'd like to thank our friends at Capital One for sponsoring today's episode. Capital One's tech team isn't just talking about multi-agentic AI, they already deployed one. It's called Chat Concierge and it's simplifying car shopping. Using self-reflection and layered reasoning with live API checks, it doesn't just help buyers find a car they love, it helps schedule a test drive, get pre-approved for financing, and estimate trade-in value. Advanced, intuitive, and deployed. That's how they stack. That's technology at Capital One. What we tried to do this time around was we wanted to approach the problem kind of bottom up and build a really generalist agent. We released this to LMArena under the codename Nano Banana. The ELO scores ended up being really high and we had a big improvement over other models. I think we had a couple million votes on LMArena, which at the time was equal to all of the text model votes up to that point. So there was much more interest than we anticipated. But I think when we saw that, it was really encouraging that, oh, this model is actually really useful for people. All right, everyone, welcome to another episode of the TWIML AI podcast. I am your host, Sam Charrington, and today I'm joined by Oliver Wang. Oliver is Principal Scientist at Google DeepMind and Tech Lead for Gemini 2.5 Flash Image, aka Nano Banana. I'm sure you've all heard of it. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Welcome to the podcast, Oliver. Thanks for having me. I'm looking forward to digging into the conversation and talking with you a little bit about Nano Banana, what you're seeing with the model and the experiences bringing it to our devices. To get us started, though, I'd love to have you share a little bit about your background and how you came to work in the field. So I've been in this area of generative models for image and video generation for probably about five years. I came into this field through the media entertainment side, actually. So I started after I got my PhD. I started as a researcher at Disney Research, and I was there for about six or seven years. After that time, I went to Adobe and worked on professional tools at Adobe, again, for about seven years, and joined Google two and a half years ago to work on bringing generative models, image generation into the AI models that we interact with every day. It's kind of really interesting thinking about the, you know, the arc of generative models and in particular media models and how much of the innovation is coming out of like your frontier model companies, as opposed to these media giants like Disney and Adobe. Like, I'm curious how you think about, you know, what you were working on then, you know, at a place like Disney and, you know, how you're seeing, you know, how that differs from the way you're working at Google? Yeah, so I think for a long time in the field, we would kind of solve problems for creative applications, creative tools, and for content creation on a case-by-case basis. And during this time, actually, some of the best groups of people working on these problems were at the media companies. I think what we've seen happen with AI is there's been more of an emphasis on using the kind of world knowledge and all the other powerful tools that come out of large language models. It turns out that those are actually very useful for creative tasks.
So I think we see the big foundation model and kind of these big AI labs taking more of a role in this space simply because the tools exist in these large language models that can really help with creative tasks. I also think actually that the creative tasks are, it's a little bit counterintuitive. People often thought that the creative realm was going to be the last one that gets touched by AI. But actually, because there's many possible answers you can get from an AI model that are useful for different creative tasks, for ideation or for iteration, it turns out that this is one of the first places that the models are useful. So a lot of things start in the creative realm and I think maybe move later into other sort of factuality spaces. I think we're seeing this as well for images too. Yeah, I think part of what I'm hearing is kind of an ode to this transition from bespoke models, bespoke systems to kind of bitter lesson, collect a lot of data, train a model, and, you know, let the model figure it out. Yeah, and you get a lot of emergent behavior once the model has all the world knowledge in it. And it makes doing things that used to be really hard much easier. So in the intro, I said everybody's probably heard of Nano Banana. Maybe that's not true. But in any case, we should kind of explain what it is. You know, how do you explain what Nano Banana is? Yeah, so Nano Banana, which the official name is Gemini 2.5 Flash Image, is our latest and best image model. It can do generation and editing, importantly, which means you can have a conversation with an agent where you kind of iterate on editing an image to get it to the point that you want. And this is another model in the line of image models that we've released at Google. So we have the Imagen line and we released a Gemini Flash Image 2.0. So this is the next version of that. And I think we really were able to make a lot of improvement for this model. And we've seen the usage kind of take off because of that. Talk a little bit about how it compares and contrasts with the Imagen line and more broadly, like how you think about the arc or trajectory of image generation models. Yeah. So the thing that we're really proud of in the Gemini 2.5 Flash Image model is that we've integrated it into Gemini, which means we can take advantage of all the world knowledge I was talking about earlier. So for Imagen, you know, you can generate very good images, but you have to be very explicit about what you want to generate. And with Nano Banana, it's possible to have prompts that are much more seeking input or information from the AI model itself. So we see people asking kind of like high level abstract prompts and the model is able to do a much better job of deciding what it is the user's trying to ask and then coming up with a reasonable response and image to satisfy this request. So I think it's the main thing that we're seeing, there's a couple of things, I guess. One of the main things is that we have a lot more world knowledge. We can have a lot better understanding of what the user's trying to do. We can operate more autonomously as a result. And the second thing is that we can also use it for editing images or any kind of conversational editing, which has opened a lot of doors as well. I'd like to dig a little bit deeper into that integration between Gemini and Nano Banana, that functionality. And in particular, like, it's clearly its own model, and, you know, you can select it in the model selector as a separate thing,
but yet you're describing it as integrated. I'm thinking about a conversation I had with Logan Kilpatrick there about how DeepMind's kind of broader strategy is focused on a single model as opposed to individual models for different tasks. Talk about how you think about delivering kind of a new model, like a new functionality, but having it be integrated into this, you know, Gemini family and like, you know, do you ultimately see them converging? Like there's so many questions under there. So yeah, when I say integrated, I guess I mean that yes, you select the Gemini 2.5 Flash Image model in a dropdown. But that model, you can then ask text questions to, you can ask coding questions, you can ask image generation questions. So it's really, it's like a fully fledged language model with all the world knowledge of Gemini. It is a separate checkpoint that has the image capabilities in it. But it is a sort of unified Gemini experience. And, you know, this model is new. And we're kind of seeing where the model can be successful, where it can't be successful. And I'm not going to give any predictions on the future. But I think that the general trend of models becoming more integrated and more modalities being integrated into the models is something that's going to persist in the future. Sure. And likewise, some of the things that are currently up to a user to differentiate, the amount of thought, image capability, things like that, we've seen in OpenAI, they've kind of abstracted away from that with the GPT-5 models. It's the same thing in the Gemini app. If you use the Gemini app, then also there's one experience, you can access Nano Banana in that app. So it's not sort of a drop-down choice there. It's a drop-down for developers in AI Studio or in Vertex, for example. Okay. And did that change in the several weeks since it was launched? I remember initially seeing the little banana thing, but I wasn't really sure whether that was a discoverability tool or that was like a functional switch. The banana you see in the app is a discoverability tool. But since we launched at the end of last month, it's been integrated into the Gemini app experience. Talk a little bit about the project as a whole. Did you set out to deliver a model with these capabilities? Or did you kind of continue the progression of building image generation models and you arrived at a set of capabilities that are this checkpoint? We've had editing capabilities in our image models before. And kind of what we've tried to do this time around was we wanted to approach the problem kind of bottom up and build a really generalist agent. And I think that was kind of key for its success in adoption. So instead of being trained to do one or two tasks really well, this model can perform a very wide variety of edits. And some of the edits that we've seen can take off and be used by the public are things that we've never really anticipated. So we're kind of surprised ourselves, I think, in seeing the creative uses that people come up with for the model. So I kind of view this as an attempt to build a general purpose image editing and generation model. I think we were proud of what we had and we released this to LMArena under the code name Nano Banana. This is supposed to be our anonymous codename just for the LMArena release. And we saw it took off like much, much like way steeper than we expected, the adoption rates. So we ended up, the ELO scores ended up being really high and we had a big improvement over other models. And that was great.
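(For context on the arena scores mentioned above: head-to-head voting platforms like LMArena typically summarize pairwise votes with an Elo-style rating update. The sketch below is a minimal, illustrative version of that idea in Python, with made-up model names and constants, not a description of LMArena's actual implementation.)

```python
from collections import defaultdict

def update_elo(r_a: float, r_b: float, winner: str, k: float = 32.0):
    """One Elo-style update for a single pairwise vote between models A and B.

    winner is "A", "B", or "tie"; k is an illustrative update rate.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Hypothetical votes: (model shown as A, model shown as B, winner).
votes = [
    ("nano-banana", "model-x", "A"),
    ("model-y", "nano-banana", "B"),
    ("nano-banana", "model-x", "tie"),
]

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
for a, b, winner in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], winner)

print(dict(ratings))  # higher rating = preferred more often in head-to-head votes
```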
But I think the real indication that this model was going to be somewhat successful is that we had so many users coming to LMArena to use Nano Banana. Even if they're only getting it some random percentage of the time, it still ended up being a huge driver in traffic, and we had to, like, emergency increase our QPS support for LMArena. I think we had a couple million votes on LMArena, which at the time was equal to all of the text model votes up to that point. So this was really like a huge, there was much more interest than we anticipated. But I think when we saw that, it was really encouraging that, oh, this model is actually really useful for people. There's a lot of people that have, like, that can use this in their day-to-day life and for things that we didn't really expect. So I think it was a pleasant surprise, but it's kind of an outcome of designing the model to be as general purpose as possible. So I'm trying to reconcile a couple of things you said. One is, it sounds like, as opposed to other models, the Imagen models for example, you very much set out to focus on image editing as opposed to just generation. But also you're describing it as general purpose. Is what you're saying that when you add editing to generation, that creates the generality that you were going for? Or am I misinterpreting the focus on editing? No, that's right. I think actually it's kind of a misnomer to say that it's a focus on editing. I think really what, so Gemini is a multimodal model and it can handle multimodal inputs and multimodal outputs. And this is the kind of most general form of interaction. And I think that the differentiation between generation and editing is just what are the modalities of input the model can take, right? If you can only take text as input, then you're doing generation. But if you can take images as input as well, you're doing editing. So they're not really separate modalities, at least in my head. It's just sort of like those are the uses that people have for them. And once your model is sort of general in the types of inputs that it can take, it could do either of those things. Got it. So being able to provide the prior image or a base image as input is what kind of sets you up to be able to do editing. But the model also is like, you know, relative to other models that, you know, can manipulate from a base image, you know, based on a text description, it has an astounding ability, you know, to like enforce consistency between the things that you're not trying to edit. Talk a little bit about the challenges in doing that and the way you approach that problem. Yeah. So that's an intentional choice to support this type of high fidelity edits. I think that's a really important part, especially if you consider that most people are interested in, for example, putting themselves into another situation or people they know. And as humans, we are very, very sensitive to the details in people's faces that kind of define the identity. So if you get it wrong just a little bit, it's like instantly noticeable that, oh, that's not me, that's not my kids. So we did put a lot of effort on making sure that the model had kind of good faithfulness to the input, the context. And I think this is, yeah, you know, there's not, I would say there's not one thing that we did as to why this works.
It's really a culmination of many years of experience working in the space and also all the like sort of figuring out over time all the different parts of the recipe and how to tweak it to just make it better and better So the improvement kind of behind the scenes is much sort of more smooth linear ramp And then on the outside, I think what happens is you cross a point where the model becomes more useful. So it looks like these big jumps, but it feels like this is something we've been working on for a long time to get to this point. I'm imagining that a lot of what went into, you know, for example, being able to produce that degree of consistency is just, you know, data collection focused on, you know, those kind of before and afters. Like, is it a lot of data collection or like, are you doing like you featureizing different things? Are you building out the architecture in a particular way? Or is it all data, data, data? Yeah, I think, you know, of course, I won't be able to give you a fully satisfactory answer. unfortunately. But I will say that it's not either or. Both of the things have to be right to work. So you need to have a model that can generate this much detail and preserve detail from the context. And you need to have data to be able to train the model to do that. So it's really the interplay between the two. I think if you just go for one of them, you won't get there. but it's, it's, so it's really everything. And I hear that. And I think that the, the model being able to generate sufficient detail is like a very broad capability. And then on top of that, you want to layer the ability to do a good job of specific things. And that, you know, is like primarily data. I think that that take is accurate. I will say that I think that to get identity preservation working, it's really the same problem as a bunch of other hard problems in the space. So another thing these models struggle with is small text. And the reason why is because the details have to be correct. Otherwise, you can't read the text or otherwise the people's faces don't look right. So really, you're looking for just a general way to improve the quality of the detail across all generations. And then that has these effects of the things that it enables. Are you still doing a lot of human labeling for a model like this? Or have you transitioned to synthetic data primarily? I think that there's a lot of analogies to the text space as well. Like, um, you have kind of different tiers. You have data that is, that's highly curated and, um, and very high quality and you don't have as much of it because it's harder to get that data. Um, and then you also have data that's, that's much noisier and, um, at a high volume. You mentioned that some of the things that people, you know, have done with the model kind of surprised you. Talk a little bit about some of those things that surprised you. and give some specific examples if you can. First of all, there's lots and lots of these cases. I saw some posts on X where people had taken the model and written like a geometry problem for it to solve. So you imagine like a triangle and one of the sides is labeled X and then you can ask the model to solve for X and it will do it in the image space, which is really cool. 
So this is a direction I think is really exciting with these general multi-purpose models with world knowledge. I think if you think about the way that humans learn, or like a textbook, there's always lots of images. Like, I think we need to move past this idea that language models communicate by text, because many things are just explained better in images, or videos even. So I think that seeing signs of life of like useful kind of information-seeking or educational use cases just emerge from this joint world knowledge image generation model is really cool. And of course, there's a lot of people who are also, you know, making their dog wear funny costumes, and these are great too. I have a lot of fun with my kids to, you know, make their stuffed animals come to life. And so I think there's a lot of, there's the whole range of use cases. There's like things that people do just that's funny to share with their friends. And there's also the more kind of information-seeking use cases. So I saw one where people are asking for like advice as to what kind of plants I should plant in my garden or like what kind of art would go well on this wall. Or maybe I saw one that people were saying, how do I improve the curb appeal of this house? And then it's really cool to see what the model kind of comes up with. And these models are still getting better as well, but you already get kind of reasonable answers to a lot of these questions. I've also seen a bunch of interesting kind of crossover use cases where folks were using the Nano Banana capability to generate starter images for Veo and use that for video generation. Is that something that you anticipated or put energy into or just kind of let it loose and, you know, it came out? Yeah, so this is definitely a use case that we tracked. We, like, I think that there's a lot of benefit in kind of storyboarding via image and then people are using the consistency you get from image editing to help, like, guide multi-shot sequences in Veo, and there's a really nice interplay. So I think that image generation and editing becomes an important part of the multimodal video creation workflow. This is especially cool for me to see because I actually spent a lot of my career as a video researcher working in Premiere and After Effects. So it's great to have an impact on the video creation space as well. And of course, it's a sister team that we work with that we know really well. So we collaborate closely with the Veo team too. When you think about the broad set of use cases, how do you think about the most dominant or important ones? Where are you seeing folks beyond creating memes and creating kind of funny things to pass around? I'm like, where are you seeing value creation happen with the new capabilities that the model makes available? Yeah, I mean, I think there's definitely quite a large set of creative professionals who could benefit from using these techniques. You know, I think a lot of what we're trying to do is automate the parts of the process that are hard or tedious and allow people to really be more creative. I think you see that if you look at what people are doing, that creativity is always something that blows me away. And I think it's really not an indication of the model's abilities, but really an indication of the creativity of the users using it. I mean, if you stick me in front of a prompt and say, make something cool, I'm going to be totally lost.
I won't be able to come up with anything interesting. But what I see people doing out there is amazing. So I think there's definitely a lot of immediate use cases there. And then, like I mentioned earlier, I think we were kind of at a stage, like if you remember when GPT-1 and 2 came out, people were like, oh, this is a cool model I can use to write haikus about funny topics. And people are doing a lot of these creative use cases because it's more forgiving and you can kind of riff on things and take them. And I think we're there with the image generation as well. So historically and up until now, probably the majority of use cases have been for creative professionals or casual creative users. But I think we're moving to a point where, if you look at what people are using language models for now, the use case is much, much broader. So people are using it for information-seeking queries and just sort of for like talking to agents about things, working through problems. Like, I think all these use cases have kind of visual components to the communication process and we could end up seeing these models play a role in those areas too. So in my opinion, that's kind of the trajectory we're going on and what we'll see from the next generation of image models as we improve the factuality and the reliability of the images generated, basically. I'm curious how you think about the kind of opportunities for individual models in isolation relative to, and I'm thinking about like, if you've ever seen like the ComfyUI, Stable Diffusion, like these huge, very tailored and elaborate workflows for, you know, doing image generation, you know, to address things like, you know, getting consistency. You know, a lot of the things that, you know, now we're one-shotting with Nano Banana, right? Is it like, do you see these things as coexisting or like the, you know, the one-shot frontier, like continuing to push and obliterate all of the hand crafting as we've seen so much in AI? So I love node-based interfaces, to be honest, like ComfyUI is amazing. I absolutely don't see those going away. I think as we push the model to be more useful and to kind of one-shot more use cases, I think that it'll be more accessible for people using it. But I think there's always going to be those creative people who are really pushing the boundaries and figuring out like, oh, if I combine this with this and this really complicated workflow, then they can do like amazing new things. And I never underestimate the ability for humans to kind of get tired of the thing that they get used to, right? So I think like there's always going to be this need of creative people to push the boundary, to do things that are not just easy to do with models and to really squeeze like more juice out of them and make more impressive stuff. So I don't think that's going away. What I do think we're getting to is the point where more people can use the models to do like, oh, I wanted to make like a birthday card and I can do that really easily. But on the high end, I think that the space is even just opened up even more because people, if you give them more tools, can do more creative, amazing things. And are folks using hosted models like Nano Banana in the context of these node-based interfaces, or is it primarily like open models? We have a pretty big developer community on AI Studio and enterprise community on Vertex. And I've seen both of those different groups be hooked up into interfaces.
I mean, one great example is Adobe has put Nano Banana in some of their products as like an optional tool that users can choose in combination with all the other tools in Photoshop or in Illustrator. So I think it's definitely a piece of a larger puzzle. And I really like to watch what the kind of developer community puts together because it's always very impressive. I get that folks are using it via the API for a variety of use cases. I guess when I think of those node-based interfaces, I think of a lot more granular control, like changing temperatures and swapping out weights and doing just weird funky stuff that I'm assuming you can't really do with a hosted model like that. Yeah, so this is true. I think that there's always this kind of push and pull between how many levers you want to expose. And when you have open source models, then people will hack in everything and you can have full control. With the hosted models, we, you know, we try to give controls so the users can really do a broad variety of things. And I think we may, you know, continue to do this. And I would like to see more ability for hackers to kind of do things that would be cool on the models. But I think the ecosystem is large enough for both the open source models that are fully hackable, and people can run like local inference on, as well as the hosted models that are run by APIs. So I think we'll see both of those. But yeah, but figuring out how to sort of like merge the really complicated node-based interface with thousands of edges and like a pretty simple API interface for the majority of people, I think is an interesting challenge. And it's one we're still working on, yeah. Is there a notion of fine tuning for this model today? I mean, I think it's an interesting question. So I'll make the analogy again to language models here, where I think there was a lot more fine-tuning and then people realized they could do a lot of stuff with putting their fine-tuning data in context and sort of just trying to prompt in context. So bottom line is, I think that professional use cases will probably benefit from fine-tuning and we hope that the majority of use cases can benefit from just kind of having examples in context. But we don't offer a fine-tuning service for the model right now. But then when you think generally about image models and this type of model, like is it, you know, would it conceptually support the ability to fine-tune and what might that look like? Just users uploading, you know, reference images or something like that? Oh yeah, definitely. I think that fine-tuning will always be a useful capability on top of models. I think that if you're, for example, if you're trying to generate scenes from a specific show with specific characters, then you could fine-tune these models on your data and have it do a better job generating these characters. I think we do see quite a few use cases like that. And also, for example, for brands to be able to fine-tune models on, like, their brand looks, I think this makes a lot of sense. So I don't think that fine-tuning is going away, but maybe in the long run we need it less and less. But I think that it'll still be useful, and, yeah, these models I think can be fine-tuned just like any language models or image models out there.
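(To make the "examples in context" pattern above concrete, here is a minimal sketch of passing reference images alongside an edit instruction to a Gemini image model through the google-genai Python SDK. The model id, file names, and response handling are assumptions for illustration; check the current API documentation rather than treating this as the exact interface.)

```python
from google import genai
from PIL import Image

# Assumes a GEMINI_API_KEY is set in the environment.
client = genai.Client()

# Reference images stand in for fine-tuning data by being supplied in context.
style_ref = Image.open("brand_style_reference.png")  # hypothetical file
product = Image.open("product_shot.png")             # hypothetical file

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed id for the Nano Banana model
    contents=[
        "The first image shows the visual style to match.",
        style_ref,
        "Edit the second image in that style: place the product on a marble "
        "countertop and keep the label text unchanged.",
        product,
    ],
)

# Save any image parts returned alongside the text.
for part in response.candidates[0].content.parts:
    if getattr(part, "inline_data", None) is not None:
        with open("edited.png", "wb") as f:
            f.write(part.inline_data.data)
```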
Taking a step back from Nano Banana specifically, when you look at kind of the broad image generation or vision language model space, what else do you see out there that's inspiring? What's most exciting for you in the broader space? Well, I think that all the teams that are working in this space are full of really brilliant people. And every time there's a new model that comes out, I'm always amazed to see what it can do. So I think I still have the wonder that we all felt when Stable Diffusion 3 came out with every future release, maybe especially because I worked on these models for so long, so I kind of know how hard it is to do all that stuff. But I don't think we're anywhere near the ceiling for quality. Like I think just, you know, just improving the quality of these models will make them more and more useful. I mean, I think we get to the point now where, you know, if you know what to do, you can kind of cherry pick a really good image out of most of the top models and it'll be like a very high quality, you know, kind of exactly what you want to get. But if you're just exploring the models, as soon as you leave this space, the comfortable space of things that work really well, I think you often can fall into like, oh yeah, this doesn't work or this doesn't look right or it's not what I was trying to create. So I think that we still have a number of years ahead of us for just making these models better. And every time they're made better, they'll become more useful. And that's like totally ignoring the whole world knowledge aspect, which I think is coming, as I mentioned, with all the information-seeking use cases. The other thing that I think is of course really interesting is thinking. So, you know, all models now think in text traces, and of course, when models can generate images, they can think in images. And I think that we as a community don't really know yet what is the potential of being able to do this, but I think we'll find out really soon, like what happens when these models can think and create images as well as text in their thinking process. Yeah, with kind of this focus on quality being the big challenge, like what do you think is required to get there? What do different teams need to be focused on? Or what would you expect to see different teams focusing on as kind of vectors for driving model quality? Yeah, it's an interesting question. So I think there's, you know, one thing is, of course, scale. So we're seeing what happens at scale. I think that the image community has been behind the language models for scaling up their models and actually figuring out how to do it. I mean, I think a lot of people have tried, but like it's very hard to sort of successfully scale up an image model to be the same size as our language models and sort of still see gains. So I think there's sort of a question of, like, how do we scale the models? And if we do scale them, what's the impact of doing that? So that's one thing that I think is kind of unknown at this time, but it's kind of interesting. And then, I don't know, maybe this is a bit in the weeds, but evaluation of models I think is significantly harder than for text models. A lot of times there's no clear right and wrong answers, so a lot of what the model does, like, how do you know it's getting better? A lot of it is really kind of personal preference, and there's a wide variety of personal preference that different people have.
So something that's getting better for one person may be getting worse for another person. And so it makes it really interesting, I would say, trying to figure out how to hill climb different aspects of your model and still be able to measure, like, is it getting better or not? So this is like something that I think all the people who work on images and media generation really struggle with, but it's something where I'm excited to see what people come up with as sort of proxies or ways that people can predict quality. Can you talk a little bit about how the industry has taken on this evaluation challenge historically? Like, are there any papers that come to mind that have become standard practice or seminal in kind of laying out an approach? Or how do you get from personal preference to some kind of function that you can optimize over? Yeah, I mean, I think from the beginning, there's always been aesthetic models that people have used. And these aesthetic models are typically trained with pairwise human preference scores. But I think that, and this dates all the way back to the beginning of image generation models, we've collected preference data and trained our models on them. But I think it's kind of interesting, because it's not clear that the mean preference, kind of the average overall human preference, is something that you want to get to. Because it's, it's a little bit like, oh, you're in a space where everything kind of works. Yeah. Yeah. It's like you mix all the colors together and you get brown, you know, it's kind of like, you're kind of there. So it's very hard to say that that's like the best model. So if you just optimize for very large scale preference, you end up maybe with something that's generic and doesn't have like character or it doesn't generate interesting images. So I think like, I mean, if you look at like Midjourney, for example, they have spent a lot of effort on having kind of stylistic control and being able to generate images that are sort of personalized to users' stylistic preferences. And this is something that's, I mean, it's very useful for their users in generating these really nice images. But yeah, I think there's a lot of different ways that people can do it. I don't think we actually know what the best way is. Talk a little bit about what you see Nano Banana as being like really good at and what you think it's less good at. That came to mind because you mentioned Midjourney really focusing on style transfer. I don't think I've tried style transfer with Nano Banana. I don't know how well it does, but presumably it does really good at some things and less good at other things. How do you think about the things that it's best at versus not best at? Yeah, I think the things it's best at, I think, are preserving the content and people's identity in images that you edit. I think this is one of the reasons why it took off when we released the first version. I think, yeah, I mean, I think it's definitely not an artistic model, like a Midjourney model. So if you put a random prompt in, you know, you don't expect to get back something that looks amazing. I think we try to make it really expressible and put a lot of effort into making sure that it's following the user instructions as closely as possible. And sometimes for creative tasks, those two things are a little bit contradictory,
because you might want to actually take the user away from the thing they're asking because it's more interesting or cool to look at. So I would say that because we've pushed for a really controllable and high amount of instruction following, it's maybe less suitable for these creative exploration kinds of tasks. Does that degree of controllability and instruction following imply that the prompting bar is a lot higher? Like you have to really, if you want to get amazing output, you have to, you know, give it amazing input, essentially. I think we try to make it so you can give very simple input and get a reasonable output. I do, like, we also have a team who are professional prompters. And every time we give them the model, like I am stunned by the things that they get back. So like you can, if you know what you're doing, you can really direct this model into generating, you know, very impressive things. But I think like for most users, they're not going to get to that point where that's the control. And I think that the goal there is to really have good instruction following with simple prompts and simple instructions as really a primary goal. For a while, it seemed like the direction the space was going was like prompt rewriting. So, you know, the user gives a kind of average prompt, some secondary thing, like, makes that prompt more elaborate and tries to produce a better result. Just based on what you're saying here, it doesn't seem like it's necessarily doing that. Like, can you talk, you know, about, you know, that technique in the broader industry? Yeah, I guess I'll say that I think prompt rewriting, one way to think about prompt rewriting is it's a way for a language model, which has world knowledge, to communicate that world knowledge to an image model that only takes text as input. Got it. So if you've got an integrated model, it's like less of a thing that you want. Yeah, I think that the thing that you want to think about is, like, how does the information get exchanged between world knowledge that you learn from text data and image generation that you learn from images. And, you know, with integrated models, you have a lot more choices there, how you can communicate information between the language models and between image generation models. And I think, so, you know, I think we're probably, everyone who's working in this space probably has a different solution exactly to how that works. But like, that's kind of how I think about the role of prompt rewriters. And prompt rewriters were kind of the first step of doing that when you have like a fixed language model and you can kind of chain them together, right? You can call your language model, you can call your image model and you can pass information through the rewritten prompt. Are there any papers that come to mind? And I feel like I asked this and I don't know that we got the papers. Like, are there any papers that come to mind that like you feel have been super important for the space over the past, whatever? It feels like a year is too long in this space. I'm not going to call out any specific papers. I think I've seen a lot of very good papers. I think that some of the labs actually are still, some of the big labs are publishing papers still, although fewer and fewer. And I think those are always interesting to look at. You know, this is kind of a space where the forefront is being pushed by the big labs that are usually keeping secrets pretty close to themselves.
So I think like, uh, I, I like spaces when the academic community is leading and every paper is like, uh, you know, amazing and going to change the field. Um, and I think, uh, we'll, we'll get back to that point again at some point. Um, but right now, I mean, I think that the, the paper is kind of like all together. Uh, if you take all the papers together to kind of like enable the progress that we've seen, but I don't, I don't think there's one specific paper that's really, um, opened up an entire new area, for example. Given the degree of secrecy in the space, if, you know, for folks that are interested in the space from a technical perspective, but not at a big lab, like what's the best way, you know, from your perspective to try to gain an understanding of what's really happening behind the scenes and what's like pushing the frontier forward? Actually, I think that for sort of enthusiasts in the area, it's completely possible to have a home set up and do interesting work and interesting research. I think that the sort of required boilerplate to get systems like these up in training is pretty well supported with today's tools. The things that you can't do and you're not working at a smaller scale is training models that are like sort of across all axes. But I think that it's definitely possible to say, take some narrow data slice and like do experiments and kind of like, you know, make improvements in this kind of small, constrained way. And those are things that will probably set you up for like knowing how the larger systems work. So I mean, I think that like it's definitely, you know, it's not inaccessible, I would say, just because like we do have these big, well-resourced groups that are pushing the frontier of the quality. I don't think that means that it's an inaccessible technology for people to pick up. And actually, I think it's quite accessible. There's actually not, like I said, to get a baseline running is totally doable by one excited person in their room over a couple months. And is that assuming that you're starting from some open pre-trained baseline or? Not necessarily. I mean, I think a lot of interesting work started in small scale ablation experiments, you know, on small toy data sets. Like for a long time, we were using CIFAR and before that MNIST and ImageNet. And so like you still see, I mean, I still think that's the right place to start is to start on like smaller models, simpler data sets and work from there and make it work there. And along the way, you'll discover things that make it better. And those things you discover are probably really useful in the broader scheme of things. What do you think of as some of the most exciting, like open problems, like, you know, beyond, you know, something general like pushing model quality forward? Like, are there, you know, specific, you know, dimensions or capabilities that you feel like, you know, we've not been able to, you know, crack for these models yet? Yeah, I mean, I mentioned earlier the evaluation problem, and I think that kind of well thought out evaluations and kind of testing domains are still really interesting. I saw this great paper on vision language models about how they're fooled by, it's called the illusion illusion, and it's about how if you give the model something that's not actually an illusion, but is close to an illusion, it'll be fooled into thinking it's an illusion. What's an example of an illusion in this case? 
So an example is like an optical illusion. For example, there's the classic one where you have the checkerboard and the shadow of a cylinder, and these two squares are the same color, right? So we've all seen that image a lot of times, even though our brains interpret them as different colors because one is in the shadow. Well, it turns out if you give it to a vision model and you make them very different colors, it'll still say, oh, these are the same color, because it's seen the illusion before. So I think it's just a really clever kind of the vision analogy of like the crossing-the-river thing, where the model just can't seem to get out of that rut. Right, right. Like I think in the generation space, people have been playing with like trying to get full wine glasses for a while. And it's like the same concept, right? We've seen so many wine glasses that are not quite full that like it's very difficult to get a full wine glass. So I think like there's all kinds of interesting ways to probe the model and understand failures. And this is something that's very easy to do and really can be done distributed. And I do see a lot of interesting work coming out on kind of like evaluating where these models fail, where they don't. So that's something that I think that's always interesting. I also think that we're kind of on the early days of understanding post-training and reinforcement learning for multimodal models and generation models. So I think that's like an interesting area that there's a lot of cool papers and work coming out about. So, yeah, there's so much, you know, like once you start working on these problems, you realize that like, it's not, none of this is solved. Like everything we're doing is really in the early stages. I mean, we've only been doing this for a few years. So like there's a lot more to learn. Along the lines of, you know, taking things that are happening in the text space and bringing them to images, you mentioned RFT, another is this idea of like verifiable data. Like, is there an analog for that in the image space? Does verifiable data mean verifying that it was generated by a model, or what is that actually? No, more like the idea that, you know, we're going to find a data source that, you know, we can verify ourselves, because, like, you know, we get it to generate code and we can evaluate the code. We get it to generate a math problem, we can evaluate the math problem. Like, I think it's related to this idea, this broader idea of evaluation that we've been talking about. But I'm wondering, you know, for example, like if we're talking about the domain of, like, give the model a bunch of Pythagorean theorem problems in an image, like that could be a verification, like there could be a verifiable loop there. And I'm wondering if there are other things like that, if that's a thing that people are thinking about. Yeah, I like to say, which is something I think text people probably get annoyed by, that like text is a subset of images, because you can imagine any text and just render the text in an image. So I think when we talk about kind of verifiable generation accuracy, I think anything that is on the text side, and of course all the mathematics and geometry problems, I think this is like a space where we can verify images, and that's probably a pretty interesting thing to look at. But even more generally, I think you can kind of pick sort of sub-axes of generation that are verifiable by vision models.
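(A rough sketch of how a "verifiable sub-axis" like this can be checked automatically: an object-counting reward compares what a detector or VLM counts in a generated image against the count implied by the prompt. The helper names and the dummy detector below are hypothetical scaffolding, not a description of how any production model is trained.)

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CountingSpec:
    prompt: str          # e.g. "a photo of exactly four red apples on a table"
    label: str           # object class the detector should count
    expected_count: int  # count implied by the prompt

def counting_reward(
    image_path: str,
    spec: CountingSpec,
    count_objects: Callable[[str, str], int],
) -> float:
    """Reward in (0, 1]: 1.0 on an exact match, decaying as the count error grows.

    count_objects is a stand-in for any detector or vision-language model that
    returns how many instances of spec.label it sees in the image.
    """
    detected = count_objects(image_path, spec.label)
    error = abs(detected - spec.expected_count)
    return 1.0 / (1.0 + error)

# Usage sketch with a dummy detector; a real setup would call a vision model.
def dummy_detector(image_path: str, label: str) -> int:
    return 3  # pretend the detector found three apples

spec = CountingSpec("a photo of exactly four red apples on a table", "apple", 4)
print(counting_reward("generated.png", spec, dummy_detector))  # -> 0.5
```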
So, you know, a common one that people look at is like counting, like did the models get counting right? Because that's something you can evaluate pretty well with models out there. So I think for perspective, for example, I'm looking at the dresser behind you and thinking that that could be an interesting way to train a model to like be able to manipulate and enforce perspective or being able to look at a scene from different angles. Yeah, that's right. So you could imagine if you generate multiple camera views of a scene, you could do some 3D reconstruction and see if it actually makes sense. And yeah, and then these maybe can form reward signals. I think this is an interesting space. So the problem I see there is coming up with a way to get a wide enough range of these verifiable domains that you can then generalize to being good at everything else. And I think here we're talking mostly about accuracy, which is definitely a big thing we want to improve. But we also have to consider aesthetics, which is a much harder thing to measure. So like, is the lighting nice on this image? It's like, it's very much more difficult to tell whether we've succeeded in that space. So I think probably we'll see that if you target a few verifiable domains and figure out how to maximize the quality on these spaces, it will be beneficial in other areas, but I haven't seen much of this yet. And do you find like a direct tension between, you know, accuracy, whatever that means in a particular context, image context and aesthetics? Like, are they, you know, I can imagine a world where like you, you know, have sufficient data and evaluation that you train a model that's good at aesthetics and it would, you know, and then you start pushing on accuracy and it will just do things more accurate, do aesthetic things more accurately. But I can also imagine a world where like there's some fundamental tension And, you know, I think we saw, we've seen lots of examples of this, like, you know, when in the like GPT-4 timeframe, when we were trying to like get the model to, you know, do better at following instructions, like the output was less creative. What's your take on that? So I think as a community, we are just starting to get to this point because I think we've been working really hard on just making the images not be full of artifacts. and I feel like these kinds of, you know, six fingers and like, like, bro, I just got to five fingers. Right. So we just got to five fingers. And like, until we got to five fingers, kind of all the improvements are benefiting both of these axes because like you still need an image that's like structurally sound in order to be either correct or aesthetic. I mean, like, or like believable as an image. I mean, maybe. So I don't think we've yet really struggled with this, but I do think we're starting to get to the point where like different, you know, different teams can take different roads where they optimize for one or the other. And I think it's a, you know, in some sense, it's a bit of a data distribution problem. Like if you are able to narrow your data distribution into something that's, that's just the, the good looking images it can actually be easier to generate those because it's a narrower distribution. So it's easier to model. 
so I think like a lot of what we did at the beginning of these image models was figure out how to like restrict the generation to this very very small subset of things that looked reasonable and in the in the process that was also the way to generate the best images now as we get better at modeling and we can model this like wider distribution of things that are maybe look good and things also that don't look good but maybe have interesting information into them it opens up more avenues for us to be able to do kind of both at the same time in the same model. So I think we're basically broadening the distribution of things we can generate well enough that it's not just instantly thrown out. You know, like 99% of the things you generate before, you just look at that and be like, oh, that's unusable image because it's full of problems. But that number is going down and the amount of things we can generate is becoming more broad. So I think we will have to choose in the future more between like, what's the purpose of this model? Is this model supposed to generate things that look good? is supposed to generate things that are that are accurate and believable or like you know what's the purpose it seems really counterintuitive to me that generating things that look good is like that you would characterize that as a narrow domain um you know maybe when i think about it from the perspective of like you know i don't know if we're still saying this but like there's a period of time when like you know that's a mid-journey image like all the mid-journey images kind of look the same. They have this, or AI generated images even more broadly. They all kind of look at this look. I feel like we've pushed that a little bit, but like aesthetic pleasingness doesn't feel like a narrow domain. It feels like kind of a broad quality that, you know, is applied different to different things. Like break that down for me, like how you think about that. So what you said is definitely true. that like the first models, I mean, I think it's a bit of a precision recall thing because if you have a model that can generate only aesthetic things, it doesn't mean that you're generating all the aesthetic things, right? You haven't covered everything that is aesthetic, but you've restricted yourself to only generating things that look good. So that's the kind of like narrow view that was a starting place and was easier to generate. Now that we can have models that generate broader distributions, we probably actually can get a better recall of things that different people would find aesthetically pleasing. Like maybe someone really likes crayon scribbles, like, you know, like this is something that we can do now that we have broader distribution. So I think it's, yeah, so I'm not saying that like, we covered everything that everybody has ever liked. But I think that like, at the beginning, we focus on things that people did like. 
And what's interesting is that uh you know this ai look is is something that was is kind of came out of that first round of trying to trying to narrow the distribution to things that we can generate well and still look good um and then people got um kind of saturated by it right so like now it's like a problem if your model generates only these things in this ai look so the tastes are actually changing a bit as well because like when that first came out i think everyone was just so amazed that you could generate those things that um that it was still really cool and fun to like play into space of like, what is this, this like an AI looking space. But now we have to, you know, the people expect more. So we need to be able to generate more broad images. And how do you think about the challenge of as generated images become, you know, a greater portion of the images on the internet that like, you're like smoking your own exhaust, so to speak. You know, it's clearly a problem on the tech side as well, but like, is it, you know, you know, are there unique aspects to it that you think about on the image side? Yeah. I mean, I think it's definitely a problem. I think, you know, everyone who's, you don't want to add a bunch of AI, bad AI generated images and into your training data because you'll learn to generate bad AI generated images. It's like, I think the same thing exists in the tech space as well. But yes, this concept of I think the concept of AI slop is interesting too, because like there's also good AI images and it's a little bit less clear as like, do you actually want to train on good images or not? I mean, there's been some distillation process by which people have said, oh, that's a really good image. And like, we do know that if you can distill synthetic data to the point where it's like high quality, it can actually be beneficial to train on. So it's not like you never, it's not like all AI generated images are necessarily detrimental, but like the ones that are bad. So I think there's a couple of interesting problems. There's like, how do we handle the fact that there's more AI-generated images out there and AI-generated text that we don't necessarily want to train on because we might learn factual inaccuracies or we might learn unrealistic lighting with the fact that like some of this data is like highly curated and is actually people think it's really good and the quality might be much higher. And how do we tell the difference between those two? These are all interesting problems. And also this idea that we were just referring to, like if you have only produced good images in kind of a narrow band or domain, you know, if you're then training on just that band or domain, that doesn't necessarily help you get a broader ability to generate more diverse aesthetically pleasing images. Yeah. Yeah, that's right. yeah um yeah i think that that's something that has been a topic of discussion in ai for a while is like how much can you eat your own tail you know like um is it possible to train um like models on fully synthetic data and um use self-supervision and there's a lot of a lot of people have been looking at different parts of this and i think like how much human supervision versus synthetic supervision is needed. Yeah, like I don't think we know the answers to these questions. And everything that exists in the tech space has an analogy also in images for this. You just said like everything that has an analogy, everything in the tech space has an analogy. 
You just said that everything in the text space has an analogy in images. One of the things I'm thinking about in the text space is a revisiting of model scale versus data curation. Do you spend a lot of time thinking about that?

Yes, yes. There are always discussions. Like I said earlier, we're just scratching the surface, and it's possible that someone could come up with a much more efficient recipe that does way better and is more compute-efficient. I think there are still a lot of gains to be made by the community in this area. I wouldn't predict which direction it goes: is it all going to be finding exactly the right data points to train a small model on, or is it going to be scaling everything up ten times bigger and hoping it all works out? I'm not going to predict which one is the winner.

If you think about the text side of things from a broad perspective, it's all still kind of early, and we're early on this AI curve generally in terms of its broader impact. But in the text domain you get the sense that a lot of the core techniques are established, and we're operating at a scale where it's hard to get big leaps beyond it; I don't even know if "optimal" or "approaching optimal" is the right way to characterize it. In fact, a lot of the innovation we're seeing is folks doing things to reduce the scale and try to do more with less. What I'm trying to get at is: on the image side of things, is it kind of the same? Or are we much further behind, with so much headroom that all we need to do is convince Sundar to write the check and we could 10x this thing?

Yeah, I think this is interesting. I don't know how far we'd get by scaling naively. If I look at the text side, I feel like we saw a huge boost in accuracy when all the main labs started doing thinking, right? Test-time scaling. That ended up being really important. Those ideas can also land in images. Test-time scaling for images is something we haven't really figured out quite yet, and probably all the labs are trying different ways to solve this problem. So I would say the recipes aren't really settled. I hope we get to the point where everyone starts publishing again and we can see what we've all been doing, because maybe we'll be able to put our heads together and figure out which things actually mattered. Barring that, I think the discoveries will happen independently, but people will figure these things out, and it's a lot easier to do something when you've seen it be done. When other labs make progress and you see that something is possible, you know what to try. So I do think there's a lifting up of the models just from the fact that we have all these different labs working on them, and I do expect more large improvements coming.
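For readers unfamiliar with test-time scaling, one of its simplest forms is best-of-N sampling: spend more inference compute by drawing several candidates and keeping the one a scoring model prefers. This is offered only as an illustration of the general idea mentioned above, not as how Gemini 2.5 Flash Image works; `generate_image` and `score_image` are hypothetical stand-ins.

```python
# Minimal best-of-N sketch of test-time scaling for image generation.
# Both callables are assumed interfaces, not real APIs.
from typing import Any, Callable

def best_of_n(prompt: str,
              generate_image: Callable[[str, int], Any],
              score_image: Callable[[str, Any], float],
              n: int = 8) -> Any:
    """Sample n candidates with different seeds and return the highest-scoring one."""
    candidates = [generate_image(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda img: score_image(prompt, img))
```

The open research question gestured at in the conversation is what plays the role of the scorer for images, and whether more sophisticated strategies than independent resampling (for example, iterative refinement) pay off the way chain-of-thought did for text.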
I think we also saw with Genie 3 that that kind of interactivity is really on the horizon: these world models where you can walk around and interact with components of the scene. All of that is really early and just launching, and very exciting. So I think the space is really open with possibility.

Well, Oliver, thanks so much for jumping on and sharing a bit about the way you think about the space and about Nano Banana image generation. Congratulations; very cool product. It's been amazing to see the uptake and interest out there, and I've certainly had a lot of fun and gotten a lot of value out of using it. Excited to see what comes next.

Thanks so much for having me. And yeah, definitely go out there and play with Nano Banana, and tag me on X if you find failures or things you really liked; I'll check those. We always appreciate feedback. Thanks again for having me on. It's been great to chat.

Awesome. Thanks so much.

Thank you.
