

High-Efficiency Diffusion Models for On-Device Image Generation and Editing with Hung Bui - #753
TWIML AI Podcast
What You'll Learn
- ✓ Bui's early work on personal assistant systems and the connection to Siri
- ✓ The risk and opportunity of setting up an AI research lab in Vietnam, and how Bui built a talented team through a mix of experienced researchers and a residency program
- ✓ The shift in focus to efficient models due to limited computational resources, leading to work on smaller models like a 7 billion parameter Vietnamese language model
- ✓ The goal of developing high-efficiency diffusion models that can run on mobile devices
- ✓ Qualcomm's acquisition of VinAI Research, indicating the shared interest in efficient on-device AI
AI Summary
The podcast episode discusses Hung Bui's background in AI research, including his work at places like SRI International, Adobe Research, and DeepMind. It then focuses on Bui's experience setting up an AI research lab in Vietnam and the challenges he faced in building a talented team. The key focus of the discussion is Bui's work on efficient diffusion models for on-device image generation and editing, which led to Qualcomm's acquisition of his previous company, VinAI Research.
Topics Discussed
- Diffusion models
- Image generation and editing
- On-device AI
- Model efficiency
- AI research in developing countries
Episode Description
In this episode, Hung Bui, Technology Vice President at Qualcomm, joins us to explore the latest high-efficiency techniques for running generative AI, particularly diffusion models, on-device. We dive deep into the technical challenges of deploying these models, which are powerful but computationally expensive due to their iterative sampling process. Hung details his team's work on SwiftBrush and SwiftEdit, which enable high-quality text-to-image generation and editing in a single inference step. He explains their novel distillation framework, where a multi-step teacher model guides the training of an efficient, single-step student model. We explore the architecture and training, including the use of a secondary 'coach' network that aligns the student's denoising function with the teacher's, allowing the model to bypass the iterative process entirely. Finally, we discuss how these efficiency breakthroughs pave the way for personalized on-device agents and the challenges of running reasoning models with techniques like inference-time scaling under a fixed compute budget. The complete show notes for this episode can be found at https://twimlai.com/go/753.
Full Transcript
Thanks so much to our friends at Qualcomm for their continued support and sponsorship of today's episode. Qualcomm AI Research is dedicated to advancing AI to make its core capabilities, perception, reasoning, and action ubiquitous across devices. Their work makes it possible for billions of users around the world to have AI-enhanced experiences on devices powered by Qualcomm technologies. To learn more about what Qualcomm is up to on the research front, visit twimlai.com slash Qualcomm. So we're releasing these open-weight models, 7 billion parameter models. We're still getting complaints from the community in Vietnam that, oh, this model is too big, right? We can't fit it on our GPU, right? And we said, okay, fine, you know, let us go, you know, one more step. Try to, like, reduce the number of parameters. Try to halve the size of the model to less than 4 billion parameters. And with a couple of improvements over the way we train it, we noticed that this model, less than 4 billion parameters, actually performed even better than the 7 billion parameter model. All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Hung Bui. Hung recently joined Qualcomm as VP of Technology through the recent acquisition of VinAI Research, which ranked in the world's top 25 industrial AI labs based on research output in top conferences like ICML and NeurIPS. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Hung, welcome to the podcast. Thank you so much, Sam, and it's my great pleasure to be here. So we've got a bunch of really interesting topics to dig into, including your research into topics like diffusion models, image generation and editing, and more. And of course, how to make all of that efficient on mobile devices. To get us started, I'd love to have you share a little bit about your background, which includes time at places like Google DeepMind and Adobe Research, among others. And tell us a little bit about how you got into AI. Actually, let's see. I did my PhD almost 30 years ago. And the PhD was actually on this topic called multi-agent systems. And back then, it was also a very interesting AI topic. The reason I got into AI is just curiosity. During my undergrad classes in Australia, I learned about things like the Turing test. And I was very curious to myself, you know, how we would be able to program a machine to pass the Turing test. And back then, I have to be honest that I didn't think I was going to live to see a machine passing the Turing test, which, you know, is kind of taken for granted now. Tell us a little bit more about some of the work that you've done in your career. So I think my first job, first real job after academia was at a place called the AI Center at SRI International. SRI is Stanford Research Institute. That's right. Yeah. It's formerly known as Stanford Research Institute. And back then, you know, we're talking about 20 years ago, I got a chance to work on a project called CALO, and back then we tried to build, you know, guess what, a personal assistant. It was running on a desktop and we were trying a lot of things, you know, a lot of machine learning techniques, a lot of probabilistic reasoning techniques to understand the intent of the user and so on. You know, we even had a system that could actually record all the user actions on a desktop screen. And, you know, it was, I think, a very interesting project.
I think back then it was sort of like the largest AI project of the day. And yeah, I think one of the artifacts of that project was a little system known as Siri. Right. Right. And after that, I started to move to various industrial research labs in the Bay Area. I spent some time at Nuance working on natural language understanding. Also some time at Adobe Research, where we started to look into applications of machine learning in various areas of the business. And this is, I think, the time that generative AI started to become popular with models like the variational autoencoder and so on. And it also coincided with my next move, to DeepMind. And I was part of the DeepMind team, also still based in Mountain View, the Bay Area. And I think that was a really interesting time looking back. But yeah, I think in 2019 I actually got an opportunity to move back to Vietnam, which is the place I was born, and had a chance to set up, you know, the first AI research lab in the country. Tell us more about that. Like, you're coming from some of the top industrial research labs in the Bay Area, you know, the place where it's all happening. What prompts you at that point to leave all that and go back home? I mean, I could tell you that, you know, it was a very exciting opportunity. I think a lot of us, when we're working for big companies in the US or so, I think we all have a little dream to return to the home country and make an impact. I think that's kind of like a big opportunity presenting itself for me. But I think in hindsight, I should admit that I also took a big risk as well, because it wasn't known whether it was actually possible to run a proper, world-class AI research lab in a place like Vietnam, right? Because it is known that AI research is something that requires huge investment, especially around talent. So yeah, big risk, but I'm glad that I did it and it paid off. And how did you craft a research focus or direction for the lab? I must say that, you know, I must credit this to the really interesting research that I was doing back in the days when I was at DeepMind, right? So I think it is quite natural that, you know, we'll continue to follow the same direction in what was called back then deep generative models, which is kind of like the precursor of generative AI today. Okay. So, you know, from day one at VinAI, we started off working on things like, you know, variational autoencoders, generative adversarial networks, and of course, you know, like autoregressive models, a BERT model for Vietnamese, for example, a BERT model for tweets. And those are kind of like, you know, just the natural research topics that people would latch on to back in those days. It just turned out that all of those topics, you know, are super important. And it really prepared us well for the, you know, the evolution of generative AI that followed until today. You mentioned that talent was a big challenge in starting a lab there in Vietnam. How did you address that? I remember myself that during one of the first few weeks when I was here in Vietnam, I stood in the Hanoi office. I looked out of the window and, you know, I was kind of, you know, just checking the view of the streets of Hanoi. And that's when it hit me that, okay, I'm no longer in the Bay Area. And then, right, okay, so how do you, you know, how do you actually build a team here?
And, but I think, you know, one thing that I knew for sure is that, okay, I have to be able to convince people to move to Vietnam and work in Vietnam, right, for this effort to be successful. And then I have to be able to have a good balance between the experienced guys, right, who kind of like, you know, been there, done that, but are still willing to move back to Vietnam, with the, I would say, the young talents, the people in the country, right, the talents that are still in the country. Who, you know, are just really, really smart young guys, might not be that experienced in AI or, you know, things like deep generative models, but, you know, smart enough so that you can actually train them quickly. So, you know, that was the strategy that, you know, we went about in building a team. So we were able to hire a couple of, you know, I think really strong research scientists with many years of experience working in the U.S. and also other countries. They were willing to, you know, move back to Vietnam and work with me to build up the team. And then for attracting the young talents, we actually started the first AI residency program, not only in Vietnam, but in Southeast Asia, actually. And that has been a great way for us to attract and recruit the best young talents in this part of the world. And they come and work with us, become a full-time employee with the company for two full years. And yeah, we've had, I think, a really fantastic opportunity to work together with those young talents. They're just so smart and they're learning things so quickly. So the lab eventually became known for its work on efficiency. So mobile devices, getting models to work on mobile devices. And that ultimately led to the acquisition by Qualcomm. I think that's a clear shared interest. From your initial research focus on deep generative models, how did this focus on efficiency come about? Going back to 2019, this is also the time when people started to look into how to scale up these models to work with larger and larger quantities of data, and how to get models of larger and larger size, the number of parameters keeps on increasing, and demonstrate that the bigger the model, the more capabilities. And, you know, in Vietnam, I think we tried to follow the trend, right? But very quickly, we were limited by our access to computational resources. We had, you know, the investment to have access to a small GPU cluster locally in Vietnam, right? But very quickly, we cannot compete with the giants in big tech. And because of this, so knowing that technologies like deep generative models are important, or generative AI is important, but yet we know that we are being constrained by the access to computational resources to train these models. And that led us down the only path that you could possibly go, which is how you'll be able to make these models more efficient, which means that we have to figure out a clever way to make the smaller models work, get more juice out of the smaller models. And I think that's the reason why efficiency is a very natural focus. Talk a little bit about some of the research that that path led you to. Was it primarily focused on topics like quantization and related ideas? Or how did you approach it? I think the first thing that we were looking at is what we can actually get with smaller models.
Right, and let me give you an example. ChatGPT was released, this is roughly November 2022, and this, you know, obviously caught all of us by surprise. But we were already working on models like BERT, BERT for Vietnamese. And then it was very natural for us to figure out, okay, so can we actually pre-train something like ChatGPT, but using only Vietnamese data? And we know that this is at a time when ChatGPT, I think GPT-3, is close to 200 billion parameters. And we know that there's no way we'd be able to reach that scale in terms of size with the training resources that we have. So we kind of ran an experiment to see what we could actually do with a very small model, something of only 7 billion parameters back then. And this is actually going against the trend. When other companies were trying to get larger and larger models, we kind of, okay, we asked ourselves, okay, what is it you can do with like a 7 billion parameter model pre-trained completely from scratch, right? Using Vietnamese-only data, right? So this is kind of like, you know, first it started as an experiment. Was this based on the GPT-2 architecture, or what was the model architecture for it? Did you come up with that independently as well? The model itself is well understood. It's just like we're not entirely sure, you know, 7 billion parameters and the Vietnamese-only data, like whether we can actually produce anything interesting, right? And so we, you know, obviously we grabbed all the Vietnamese data that we could from the internet crawl, right? And then we, you know, fed it into the model architecture. And I think it took us, you know, a couple of months, right, using our compute. And the end result was something that actually surprised us, because we thought, oh, you know, we actually have a model that can speak Vietnamese really well. You know, it can answer questions in Vietnamese. It can, you know, write letters in Vietnamese. It can, you know, like, write poems and all of that, right? So things that you saw with the earlier version of ChatGPT, we actually saw in Vietnamese with only a 7 billion parameter model. Then the next thing that we did is that we asked the question, what if we reduced the number of parameters for this model even more? So it's kind of counterintuitive back then, right? So we were asking ourselves questions like, can we get more for less? Here it's just fewer parameters. Or can we get the same performance with fewer parameters? And all of this is because, you know, of the focus on efficiency, fewer parameters. I remember that back then, even with the 7 billion parameter model, right, the other people in Vietnam, right, the other teams in Vietnam, they were still complaining, right? So we're releasing this open-weight model, the 7 billion parameter model, and we're still getting complaints from the community in Vietnam that, oh, this model is too big, right? We can't fit it on our GPU, right? And we said, okay, fine, you know, let us go, you know, one more step, try to, like, reduce the number of parameters, try to halve the size of the model to less than 4 billion parameters. And with a couple of improvements over the way we train it, we noticed that this model, less than 4 billion parameters, actually performed, you know, even better than the 7 billion parameter model. So we already knew that there was a lot of room to optimize the model, the way we train it, and get more performance out of a model that is even smaller.
Can you talk a little bit about some of those optimizations and how you tweaked that training recipe in order to get decent performance out of the smaller models? So one thing is, how do you get even more data? Right, now we're not going to get more data, obviously, right? But we can iterate through the same data multiple times. And back then we were already thinking, okay, you know, can you actually use synthetic data to do even more pre-training, right? You've found everything you could find, okay? So later on, of course, you know, other people found the same thing for the internet. But I think back then, because we limited ourselves to Vietnamese, we already hit that boundary. Okay, so we figured out, okay, if you iterate over the same dataset more, perplexity keeps going down and the model keeps getting better, right? And also, there are a couple of minor adjustments on the model optimization side that we did. But yeah, I think at the end, we have a sub-4 billion parameter model and it starts to fit onto just the basic GPUs that people in the labs, people in the universities, even now, they could actually use, right? And so that was a very welcome addition, that we support the local community here by just giving these models, giving them out as open weights. I'm just thinking about how so much of the innovation in models is driven by data sets. And here you are trying to create models that perform well on Vietnamese language. You collected kind of the, you know, just the raw corpus of internet information in the Vietnamese language. But, you know, when I think about, you know, traditional kind of NLP and, you know, model research, there are so many benchmarks that folks are trying to optimize models against that teach the models, you know, new and different things. And you didn't have all that for Vietnamese. Did you create any of that? Or did you find that just, you know, kind of the raw training on the data that you scraped was enough? We didn't have data for Vietnamese for evaluation and so on, right? So we had to create some of that ourselves, of course, right? But I think lucky for us, a lot of the corpora for English can be auto-translated into Vietnamese, right? And in particular, we were actually also working on machine translation between English and Vietnamese, right? That was working well enough. So that lets you bootstrap, take an English QA data set and translate that into Vietnamese. Yeah, exactly, exactly.
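To make the bootstrapping idea concrete, here is a minimal sketch of auto-translating an English QA pair into Vietnamese with an off-the-shelf machine translation model. The model name (Helsinki-NLP/opus-mt-en-vi) and the helper function are illustrative assumptions, not details from the episode, and not the system VinAI actually used.

```python
# Hypothetical sketch: bootstrap a Vietnamese QA set by machine-translating
# an English one. The model name is an assumption, not from the episode.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-vi")

def translate_qa(example: dict) -> dict:
    """Translate the question and answer fields of one English QA pair."""
    return {
        "question": translator(example["question"])[0]["translation_text"],
        "answer": translator(example["answer"])[0]["translation_text"],
    }

english_qa = [{"question": "What is the capital of Vietnam?",
               "answer": "The capital of Vietnam is Hanoi."}]
vietnamese_qa = [translate_qa(ex) for ex in english_qa]
print(vietnamese_qa)
```

In practice one would run this over an entire benchmark and then have native speakers spot-check a sample, since translation errors propagate into the evaluation set.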
And so how did you get into the image generation side of things? So text and language isn't the only focus in the lab. Right at the beginning, we had people working on, for example, GAN models for image generation. Then, of course, the thing that came around is the diffusion-based approach. The quality of image generation that methods like diffusion models could actually produce, at the time, was getting more and more impressive. And so, of course, because we were working on image generation using GANs, a lot of people in the lab started experimenting with denoising diffusion. And it's a very natural thing for us. Regarding efficiency, it's a little bit different. I mean, the goal is the same. How are you able to do text generation, and also here, how you'll be able to generate an image, but in a more efficient way. So interestingly enough, for image generation, the model size isn't that big, right? Compared to, sort of, you know, a large language model. But for image generation, especially the denoising diffusion approach, the main thing was that you need to run these denoising steps many, many times, right? So it's, you know, it's a sizable network, but you have to repeat it many times, sometimes between 50 and 100 time steps. What that means is that if you have big compute, you still have to wait. Back then, I think if you run this with ChatGPT, you still have to wait. It's not real time. This is noticeable. And if you have a small compute cluster, you have to wait even more. So for us to be able to do experiments, we had to look at ways to just be able to generate an image in many fewer time steps, with significantly reduced latency. And that drove us down the path of asking ourselves the question, can you actually get a model that can generate an image with as few steps as possible, or even with just one step, right? So imagine that if you can just do this in just one step, your latency all of a sudden improves by almost two orders of magnitude. Is it still diffusion at that point, or are you needing to create an entirely new approach? We are working on the assumption that you already have access to the denoising diffusion model, right? So you already have access to this model that can actually produce beautiful images, but it'll need 50 or 100 steps. Yeah, exactly. So you assume that you have that. And you would treat that as a teacher, right? And you use kind of like a distillation approach, right? Maybe I should explain a little bit about the intuition here. So distillation is usually thought of as, you know, you have a teacher that is a big model, and this big model encapsulates a lot of the knowledge about the task at hand, and you want to distill that knowledge into a smaller student, right? But I think here, the problem is kind of like a step distillation, because you don't have a big model. It's a model of the same size. You need to run it for multiple steps, right? And here you kind of have to distill that multi-step knowledge into something that has just one step. Meaning, because your diffusion model is already relatively compact, it's not a huge model, you're not, in this distillation process, necessarily trying to shrink it, you're just trying to eliminate the steps? Exactly, yeah, yeah. In fact, the architecture of the student and the teacher, you know, it could be almost the same, right? It's just that the teacher you have to run multi-step, but the student just one step. And so another way you can kind of like think about it is that, so first of all, the teacher is given. So this is already a really strong denoising diffusion model that has already been trained with lots of data, right? And it can actually generate beautiful images with, you know, like, let's say 100 steps, and it is given, right? But the way it is given to us is through a denoising function, right? Because you're going to have to repeat this denoising function like 100 times, right? So that's fixed, right? That network is fixed. We're not going to change that. Okay? We would have to learn a student network. Okay, so that's distillation. That's standard distillation. For any student network, you can also estimate a denoising network for that student. This is a function of the student network. And now you're forcing this denoising function for the student network to be the same as the denoising function for the teacher network. By forcing them to be the same, I mean we're going to minimize the loss. So that's your constraint.
You're going to minimize the loss to get them to be as close as possible. So yeah, the teacher network is fixed and constant. But this secondary teacher network, or the coach network, needs to be learned. What's the intuition for why you need the secondary coach network? Like, if the teacher has all of the knowledge about how to generate the images, what is the secondary teacher actually doing? Great, great. Very good question. Very good question. So the answer actually gets to the heart of why you need that additional instrument, right? So remember that in the first approach, right, we are asking that the denoising function for the teacher, right, has to work really well on distributions that are generated by the student. Which would be true if the student would generate the same distribution as the teacher would. So this is what we'd achieve if we manage to successfully learn the distribution generated by the teacher. So I think there's like an agreement that, yeah, okay, this thing should be zero, you know, at convergence. Okay. But during the beginning of the process, right, the distribution generated by the student is wildly different from the distribution generated by the teacher. Okay. Okay. Because the student's just not very good yet. Yeah, it's not very good at it, right? So basically the signal from the denoising function from the teacher might not be a very strong, or even a good, signal to follow, right? And that is the intuition, right? So the secondary network is kind of acting as a bridge between the early student distributions and the teacher distribution? Yeah, yeah, yeah. So that's exactly right. So early on in the process, right, we would want to explicitly estimate the denoising signal for the student. And then we minimize the difference between that denoising function versus the denoising function of the teacher. And that's kind of like providing the bridge, guiding the process during the early stage of the optimization process. And so that initial work, again, this is SwiftBrush, and that is focused on, um, you know, we've got image generation diffusion, it works great, takes a long time, you know, 100 steps, so that's a lot of inference. So how do we make that more efficient? Well, let's skip the 100 and do like one shot from noise to the desired output image. You know, even with all of the explanation of how that works and the intuition, like, it's hard to believe that it actually works, and works well. You know, talk a little bit about qualitatively what kind of results you saw. Again, you know, we're getting very good quality in terms of image quality, right? And also quantitatively as well, right? With all the benchmarks that we measure, especially with some additional improvements that we later on did for the second version of SwiftBrush. We call it, just simply, a slightly improved version of SwiftBrush. The scores that we're getting are almost as good as, sometimes even better than, the score of the original teacher. And which benchmarks are we talking about? It's a bunch of benchmarks. Very standard benchmarks on image quality and also diversity. And yeah, these are kind of like the standard benchmarks that you will see in any papers in this area. And you look at the score for this one-step model, and we noticed that we're getting scores sometimes as good as the teacher itself. The challenge is to get this thing to converge. The challenge is to be able to get this training pipeline into a stable condition, the initialization and also the training pipeline.
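To make the setup above concrete, here is a toy PyTorch sketch of the training loop: a frozen multi-step teacher, a one-step student generator, and a learned coach denoiser that tracks the student's distribution so that the teacher-minus-coach gap can guide the student. All module names, the noise schedule, and the surrogate loss are illustrative assumptions, not the actual SwiftBrush implementation.

```python
# Toy sketch of one-step distillation with a learned "coach" denoiser,
# in the spirit of the discussion above. Names and schedule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in for a text-conditioned U-Net denoiser."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 4, 3, padding=1)
    def forward(self, x_t, t, prompt):   # t and prompt are ignored in this toy
        return self.net(x_t)

class TinyGenerator(nn.Module):
    """Stand-in for the one-step noise-to-latent generator (the student)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 4, 3, padding=1)
    def forward(self, z, prompt):
        return self.net(z)

teacher = TinyDenoiser().eval()        # multi-step teacher: pretrained, frozen
for p in teacher.parameters():
    p.requires_grad_(False)
student = TinyGenerator()              # one-step student: trained
coach = TinyDenoiser()                 # coach: tracks the student's distribution
opt_s = torch.optim.AdamW(student.parameters(), lr=1e-4)
opt_c = torch.optim.AdamW(coach.parameters(), lr=1e-4)

def add_noise(x0, noise, t, T=1000):
    """Toy forward-diffusion re-noising with a linear schedule."""
    alpha = 1.0 - t.float().view(-1, 1, 1, 1) / T
    return alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise

for step in range(2):                  # a couple of toy training steps
    prompt = None                      # placeholder for text embeddings
    z = torch.randn(2, 4, 8, 8)
    x0 = student(z, prompt)            # single-step generation from pure noise
    t = torch.randint(0, 1000, (2,))
    noise = torch.randn_like(x0)

    # Coach update: learn to denoise samples from the *student's* distribution.
    x_t = add_noise(x0.detach(), noise, t)
    loss_c = F.mse_loss(coach(x_t, t, prompt), noise)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # Student update: the gap between the teacher's and the coach's denoising
    # functions is the signal that pulls the student toward the teacher.
    x_t = add_noise(x0, noise, t)      # re-noised, gradient flows to student
    with torch.no_grad():
        direction = teacher(x_t, t, prompt) - coach(x_t, t, prompt)
    loss_s = (direction * x_t).mean()  # score-distillation surrogate loss
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```

Early in training the coach's predictions differ sharply from the teacher's, which is exactly the bridging behavior described above; at convergence the gap, and hence the student's gradient, goes to zero.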
So it actually converged. But once it converged, we noticed that the results are really strong. Yeah. So then, you know, like, okay, you can ask the question that, okay, now that you can do text-to-image generation in just one step, it is really efficient and so on, right? Can you also do text-based image editing in a similarly efficient manner? And that gets us into this topic of image editing, which is, I think, the topic of a recent paper that we just published at CVPR this year. That was the SwiftEdit paper. Yeah, so SwiftBrush, which is just generating the image, but then SwiftEdit, right, is to be able to also edit images quickly as well, in just like one step, right? So that's the key. It's SwiftBrush, one-step image generation; SwiftEdit, one-step image editing. That is the goal. So, you know, do you mind walking me through how it's working at kind of like a high level? Yeah. Assuming that you already have a very efficient one-step text-to-image model, getting to text-based image editing, you know, is a lot simpler, right? If you don't have that, then forget it. But if you already have a one-step text-to-image model, then for image editing, you can kind of start to conceptualize things in a much simpler fashion. So the way we architect this model is as follows. So you have an original image. Of course, you have a text prompt, and you want this text prompt to kind of like, you know, operate on the image, right, and get to the final image here. Okay. But the way we want to do it is that first, let's get this image into noise, right? And then we already know how to get from noise to the image, right? Given the text, right? So then the challenge here is how to get from image to noise, like, conditioned on the text, okay? Yeah. So to get from the image to noise, right? It's kind of like, okay, you know, we could set up various losses to do this, right? If you have real images as training data, right, you can set up what is called an inversion network, or inversion model. And again, this is a one-step model. So this is like a neural network architecture that would take you from the image, right, through, of course, the encoder, into that image latent space, right? And then from the representation of that image to noise. But you should do it in such a way that if you apply SwiftBrush again, remember, we have access to SwiftBrush, right? If you apply SwiftBrush again, it would take you from noise to a latent representation of the same image, right? So, you know, like, the latent representation of that image, Z, goes to noise, apply SwiftBrush, and you get another latent representation. And you can simply say that, okay, these two latent representations have to be the same. And that gives you a very natural loss. You can also go to the image itself. You can go like, okay, from this Z latent representation, you can use the decoder, of course, all of this coming from SwiftBrush, to generate the actual image. And you can say that, okay, the original image here and the image that you generate by applying SwiftBrush to this intermediate noise, right, have to be the same, right? And this is if you have access to only real training data, right? But you can actually do more, right? Because you have SwiftBrush, right? So you can synthesize any images you want.
So you can do this without access to any real data as well. So you can go from noise, applying SwiftBrush, going to a particular image, or the latent representation of the image, and from that, apply this inversion network going back to noise. And you can say that, okay, the noise you're starting from and the noise that you ended up with as a result of applying the inversion network have to be the same. So these two, epsilon and epsilon hat here, have to be the same. So that's just yet another loss function that you can have, right? That's another signal for training this inversion network. So having access to an efficient one-step model, right, allows us to kind of like train this inversion network to invert from the latent representation Z to noise, and do that quite efficiently, right? Because you can differentiate through this network quickly, right? It's just one step, so you can differentiate through this network really quickly. And you can also combine the signal from real data with the signal from completely synthetic data, because the synthetic data can be generated really quickly with this one-step network, right? So all of a sudden, you have a combination of, really, you know, loss functions that are highly intuitive. And most importantly, you know, you can implement it and train it very efficiently. All thanks to, you know, already having access to this one-step image generation model, which is SwiftBrush, which is, I think, the core to making this SwiftEdit possible.
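Here is a similarly toy sketch of the two training signals just described for the inversion network: a reconstruction loss on latents of real images, and a noise consistency loss on purely synthetic samples. The stand-in modules and names are assumptions for illustration, not the actual SwiftEdit code.

```python
# Toy sketch of the two training signals for a one-step inversion network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Stand-in for both the one-step generator and the inversion network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 4, 3, padding=1)
    def forward(self, x, prompt=None):
        return self.net(x)

generator = TinyNet().eval()   # frozen one-step text-to-image model (SwiftBrush role)
for p in generator.parameters():
    p.requires_grad_(False)
inverter = TinyNet()           # one-step image-latent -> noise network; trained
opt = torch.optim.AdamW(inverter.parameters(), lr=1e-4)
prompt = None                  # placeholder for text conditioning

# Signal 1 (real data): latent -> predicted noise -> regenerated latent
# should reconstruct the original latent.
z_real = torch.randn(2, 4, 8, 8)            # latent of a real image (via encoder)
eps_hat = inverter(z_real, prompt)
z_rec = generator(eps_hat, prompt)
loss_recon = F.mse_loss(z_rec, z_real)

# Signal 2 (synthetic data): noise -> generated latent -> predicted noise
# should recover the starting noise (the epsilon vs. epsilon hat above).
eps = torch.randn(2, 4, 8, 8)
z_fake = generator(eps, prompt)
loss_noise = F.mse_loss(inverter(z_fake, prompt), eps)

loss = loss_recon + loss_noise              # each path is one step, so cheap
opt.zero_grad(); loss.backward(); opt.step()
```

Because both the generator and the inverter are single-step, backpropagating through the full round trip is cheap, which is the efficiency point made above.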
We'll link to these papers in the show notes, and I encourage folks to pull them up, in particular the SwiftEdit paper. The images are super high quality, and it's surprising how high quality they are, given that process. Yeah, yeah, yeah. It also works really fast, right? So we can actually run this on one single standard GPU. And, you know, it took us, I think, a fraction of a second, I think a quarter of a second, to do one image edit, which is real time, right? I think even before you finish typing, you already see the resulting image. So that work is all focused on efficiency. I know you're also working on kind of agents and making that type of model efficient enough to run on mobile devices. Can you talk a little bit about the way you're thinking about that space? The way I think about this is that I think we all want to have agents, assistants, that are personalized, right? And what that means is that these agents or assistants, they need to have access to our private information. Otherwise, how can it actually be personalized, right? And where does the private information reside today, right? Well, you can argue that a lot of that private information actually resides on our personal devices, right? But that also means that privacy suddenly becomes really important. And so the way we think about this is that we want these agents or assistants to be able to do as much of the compute, the workload, on device as possible. Why? Because, you know, first of all, it's very close to the personal data that's sitting there. And second, and more importantly, right, if you can actually process all of that information on device, right, then you don't have the risk of exposing this private information to third parties and so on. And of course, you know, if there are tasks that actually require information, you know, from the internet, right, then these agents can also, you know, collaborate with, you know, a bigger agent, right, a more sophisticated agent with access to broader knowledge from the internet, right, that can actually reside on the cloud, right? And so how does that broad direction or vision translate into specific research projects? So first of all, I want to say that this relates very closely to all the stuff that we talked about on efficiency. To be able to run the models to support this agentic behavior entirely on device, it means that, again, you have to look for ways to have not only large language models, but large multimodal models that can actually take in both text and also multimedia as well. Those are the things that you actually have access to on a device today. And you want to run all of that on a local device. So that means a lot of experimentation with smaller models, models ranging from four billion parameters or even less. And various ways to get these models to perform efficiently in terms of the rate at which you can actually ingest tokens, the prefill token rate, and also the decoding token rate. Another thing is that I think we've started to look into what are the sources of information, right, that are really important for this on-device agent, right? So I mentioned access to private information on a device, right, and making sure that you process the information securely on the device itself. But then I think, you know, very soon, right, you would need to look at the information that you're getting not only from your phone, right, but from other wearable devices, for example, smart glasses, right? And that's very rich in terms of multimodality, and that just opens up a really interesting space of research, right? How do you enable this model so it can actually, you know, understand video that's coming through your smart glasses? So it can actually understand the content of the screen, right, that the user is looking at on the phone, and so on. And yeah, to be able to kind of, you know, distill all that information into a compact representation, to be able to do that in a way that's efficient, so that, you know, you kind of, you know, don't consume a lot of battery, right? And then store that information somewhere so that it can actually be retrieved, right? And so all of that is just a lot of work that needs to be worked out, both in terms of, again, the logging of information, compressing this information, bringing it to a form that can actually be ready to be retrieved later. And for retrieval, you can think about this as, okay, well, yeah, something like RAG, but you do RAG with access not only to information on the internet, but also information on a device. But the information on devices is represented in a different way. The data is different. So you're going to have to make it work for this new distribution of data as well.
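As a toy illustration of "RAG, but over on-device data," here is a sketch that embeds a few local snippets once and then retrieves by cosine similarity. The embedding model name and the example snippets are assumptions for illustration; nothing here reflects Qualcomm's actual retrieval stack.

```python
# Hypothetical sketch of retrieval over on-device data: embed local snippets
# once, then retrieve the most relevant ones for a query.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small enough to run locally

# Snippets distilled from on-device sources (messages, screen content, etc.).
snippets = [
    "Dinner with An at 7pm on Friday at the pho place.",
    "Flight VN123 to Da Nang departs Saturday 9:40am.",
    "Landlord: rent is due on the 5th of each month.",
]
index = model.encode(snippets, normalize_embeddings=True)  # (n, d) matrix

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k snippets most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    sims = index @ q                       # cosine similarity via dot product
    return [snippets[i] for i in np.argsort(-sims)[:k]]

print(retrieve("When is my flight?"))
```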
When I think about agents and the kind of broader innovation that's been happening there, the rise of reasoning models has really changed that game and what's possible. That's very inference and compute heavy. I have to imagine that poses a big challenge to running those kinds of models locally on the device. Are you thinking about this idea of inference-time scaling and what that's going to mean for on-device models? Thanks for bringing that up. I mean, inference-time scaling is a really important topic. You mentioned that, yeah, it's a challenge to run this kind of technique, inference-time scaling, on device. But the way we should think about it is that it's both a challenge and an opportunity, right? Why I say it's an opportunity is because there have been multiple works in the literature showing that if you take a small model, in terms of number of parameters, and you apply test-time scaling, right, and you measure on a particular kind of task, let's just say math, right, with test-time scaling the small models actually, you know, perform a lot better than models that are significantly larger, right? And so test-time scaling is a way that you can actually make the small model, right, even beat a much larger model on a specific task, right? And I think this is really important, because all of a sudden it enriches the capability of the small models. It's an interesting give and take. So, like, test-time scaling implies that you're exploding inference, and that's a constraint on a mobile device, but it also inherently allows smaller models to match the capabilities of much larger models, which is a tailwind for you trying to get this running on a mobile device. Yeah, exactly, exactly, right. And of course, I mean, regarding the challenge of how to make this work efficiently on a mobile device, this is kind of more compute bound, right? It's no longer memory bound, it's compute bound. This is a topic that we are looking at very closely, right? How to, again, make this test-time scaling work more efficiently. For example, how do you do it assuming that you have an upper bound in terms of access to compute, right? It's almost like a fixed compute budget. So you have a fixed compute budget, and what are the strategies? How do you allocate your resources across? Exactly, exactly. It's interesting, because I was going to ask the degree to which test-time scaling and getting these reasoning models working on constrained devices is different than just the inference problem of, you know, having any LLM inference happen efficiently. And so it raises these kind of, you know, meta issues. And a good example of that is, if you've got a fixed overall, you know, budget, you know, whether that's, you know, compute or latency or whatever, and you're able to do some kind of planning that this is going to require some number of, uh, inference requests, like, how do you optimize where you spend your compute is an interesting way to formulate a research question there. Yeah, I think you said it pretty well. And in a sense, this test-time compute kind of combines the probability that has been learned with an LLM, which is almost like forward prediction, with kind of an estimation of what the future reward is going to be, right? So it's kind of combining probability and utility, in a sense. So it's kind of, you know, already gone beyond what the original LLM was designed to do, right? So in that way, I think it just, you know, allows us to, you know, move into a much, I think, richer problem, right? Which is, you know, how are we able to find a particular answer path that maximizes expected utility. And that's a very rich framework, right? So things like compute resources, resource constraints, you can see that it's possible that you can actually formulate this under the framework of optimizing for future expected reward.
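To make the fixed-compute-budget idea concrete, here is a toy sketch of one simple allocation strategy: best-of-N sampling under a token budget. The generate and score functions are hypothetical stand-ins for a small on-device model and a verifier or reward model; the episode does not prescribe this particular strategy.

```python
# Toy sketch of test-time scaling under a fixed compute budget: spend a token
# budget on several sampled answers and keep the highest-scoring one.
import random

def generate(prompt: str, max_tokens: int) -> tuple[str, int]:
    """Stand-in sampler: returns (answer, tokens_actually_used)."""
    used = random.randint(max_tokens // 2, max_tokens)
    return f"answer-{random.randint(0, 9)}", used

def score(prompt: str, answer: str) -> float:
    """Stand-in verifier: higher is better."""
    return random.random()

def best_of_n_with_budget(prompt: str, token_budget: int, per_sample: int) -> str:
    """Draw samples until the budget is exhausted, return the best-scoring one."""
    best_answer, best_score = "", float("-inf")
    while token_budget >= per_sample:
        answer, used = generate(prompt, per_sample)
        token_budget -= used                  # charge the tokens actually consumed
        s = score(prompt, answer)
        if s > best_score:
            best_answer, best_score = answer, s
    return best_answer

print(best_of_n_with_budget("What is 17 * 24?", token_budget=2048, per_sample=256))
```

More sophisticated allocators might plan how many samples a given question deserves, which is the resource-allocation research question raised in the conversation.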
And so this idea of agents clearly kind of opens up, you know, many different research directions, and it kind of serves as maybe a grounding kind of application area. Are there others that are high priorities for you? I would say that, yeah, in general, efficiency is important for many broad topics. And I think this is something that people have also realized. You know, I think for some time, you know, people probably didn't focus too much on this issue of efficiency. But recently, I think, you know, there's been a lot of focus on this particular topic. So I probably don't need to say anything more. So it's been, I think, six months since the acquisition. Just what are your biggest kind of lessons learned through that process, and what are you most looking forward to as you look ahead? I must say that we were pretty lucky, because issues like efficiency, right, and on-device models are something that Qualcomm and Qualcomm AI Research, those folks, have been working on, similar problems, for quite a while as well. And there's a great depth and breadth of expertise within Qualcomm AI Research itself. So, because of that, I think integration has been a lot smoother. We don't have to change the objectives in terms of the way that we've been working. It's more or less the same focus. So for us, I think it's more a matter of learning about the capability within Qualcomm AI Research and seeing how we can actually best help and enhance the capability of the group. So yeah, I think we were pretty fortunate, and also having access to the talents and the resources of Qualcomm AI Research is making us even more excited. And talent was one of the big challenges that you mentioned when you got started. Are you continuing the residency program? Ah, yes, absolutely. I think this is, yeah, so the AI residency program today is something that we'll continue to reinforce. And, you know, a little bit of history of the residency program: we started this, hiring the first batch, in 2019. That's six years ago. And, yeah, at any point in time, we have between 40 and 50 residents in the lab, right? 40 and 50? Yeah, between 40 and 50 residents. Wow, that's a lot bigger than I imagined. Right, yeah. And how many researchers total in the lab? The total number of people in the lab is about 90 people, right? So the number of residents is almost half, or even a little bit more than half, of all the researchers and engineers. Yeah, wow. Because they stay with us for two years, right? They have enough time to contribute significantly to our research and also engineering process. And even now, right, the people who have gone through the residency program, it's almost close to 100 of them. And a lot of them are actually in top AI PhD programs in the US or Europe, Australia, and so on. And I think that has been a really nice tradition. And we want to continue to keep it that way. The program now will continue under the new branding of the Qualcomm AI Residency Program. And I think we just hired the first batch of research residents. And we continue to look for ways to improve and expand the program as well. And we are about to recruit the first batch of engineering residents, so AI engineering residents. And I think this will also provide opportunities for the young local talents to have this unique experience of being part of an AI research lab of a big tech company like Qualcomm.
I don't know that we have a huge listener base in Vietnam, but in case there are folks listening that might be interested in the program, is there a page that they can go to to learn more? Yeah, we have a landing page for the Qualcomm AI Residency Program. You can find out about it from Qualcomm's website itself. And from that, we have a link to recruitment. Well, we'll find the landing page and stick it in the show notes. All right. Okay, awesome. Awesome. Well, Hung, it's been great catching up with you and hearing a bit about your journey and the projects that you have worked on and are embarking on as part of Qualcomm. Thank you, Sam. And again, thanks a lot for the opportunity to share my thoughts and opinions here. Thank you. Yes, thanks so much. Thank you.
Related Episodes

Rethinking Pre-Training for Agentic AI with Aakanksha Chowdhery - #759
TWIML AI Podcast
52m

Why Vision Language Models Ignore What They See with Munawar Hayat - #758
TWIML AI Podcast
57m

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757
TWIML AI Podcast
48m

Proactive Agents for the Web with Devi Parikh - #756
TWIML AI Podcast
56m

Apple to Pay Google $1B for Siri's AI Upgrade
AI Applied
12m

AI Orchestration for Smart Cities and the Enterprise with Robin Braun and Luke Norris - #755
TWIML AI Podcast
54m