
Why Vision Language Models Ignore What They See with Munawar Hayat - #758

TWIML AI Podcast • Sam Charrington

Tuesday, December 9, 2025 • 57m

What You'll Learn

  • VLMs can generate detailed images but struggle with simple physical tasks that require understanding the visual scene
  • VLMs tend to ignore the visual input and rely more on the language model's parametric memory
  • This is due to the imbalance between the training data for vision and language models, as well as the difficulty in aligning the two modalities
  • Expanding the descriptions of images during training to include physical properties can help, but more research is needed to address this challenge
  • Recent studies have shown that vision models perform well on spatial tasks, but their performance drops when combined with language models

Episode Chapters

1. Introduction

The guest researcher, Munawar Hayat, introduces himself and his work at Qualcomm AI Research, focusing on multimodal generative AI and visual understanding.

2. Challenges with Vision-Language Models

Hayat discusses the limitations of current VLMs in understanding and generating physical scenes, such as unstacking boxes or opening a drawer.

3. Reasons for VLM Failures

Hayat explains that the issue is that VLMs tend to ignore the visual input and rely more on the language model's parametric memory, rather than truly understanding the visual scene. This is due to the imbalance between the training data for vision and language models, as well as the difficulty in aligning the two modalities.

4. Potential Solutions

Hayat suggests that expanding the descriptions of images during training to include physical properties can help, but more research is needed to address this challenge. He also discusses recent studies that have shown the limitations of combining vision and language models.

AI Summary

This episode discusses the challenges faced by vision-language models (VLMs) in understanding and generating visual content. The guest researcher, Munawar Hayat, explains that while VLMs can generate detailed images, they often fail at simple tasks that require physical understanding, such as unstacking boxes or opening a drawer. The issue is that VLMs tend to ignore the visual input and rely more on the language model's parametric memory, rather than truly understanding the visual scene. Hayat suggests that this is due to the imbalance between the training data for vision and language models, as well as the difficulty in aligning the two modalities effectively.

Key Points

  • 1. VLMs can generate detailed images but struggle with simple physical tasks that require understanding the visual scene
  • 2. VLMs tend to ignore the visual input and rely more on the language model's parametric memory
  • 3. This is due to the imbalance between the training data for vision and language models, as well as the difficulty in aligning the two modalities
  • 4. Expanding the descriptions of images during training to include physical properties can help, but more research is needed to address this challenge
  • 5. Recent studies have shown that vision models perform well on spatial tasks, but their performance drops when combined with language models

Topics Discussed

#Vision-language models • #Physical understanding • #Multimodal AI • #Alignment between vision and language • #Training data imbalance

Frequently Asked Questions

What is "Why Vision Language Models Ignore What They See with Munawar Hayat - #758" about?

This episode discusses the challenges faced by vision-language models (VLMs) in understanding and generating visual content. The guest researcher, Munawar Hayat, explains that while VLMs can generate detailed images, they often fail at simple tasks that require physical understanding, such as unstacking boxes or opening a drawer. The issue is that VLMs tend to ignore the visual input and rely more on the language model's parametric memory, rather than truly understanding the visual scene. Hayat suggests that this is due to the imbalance between the training data for vision and language models, as well as the difficulty in aligning the two modalities effectively.

What topics are discussed in this episode?

This episode covers the following topics: Vision-language models, Physical understanding, Multimodal AI, Alignment between vision and language, Training data imbalance.

What is key insight #1 from this episode?

VLMs can generate detailed images but struggle with simple physical tasks that require understanding the visual scene

What is key insight #2 from this episode?

VLMs tend to ignore the visual input and rely more on the language model's parametric memory

What is key insight #3 from this episode?

This is due to the imbalance between the training data for vision and language models, as well as the difficulty in aligning the two modalities

What is key insight #4 from this episode?

Expanding the descriptions of images during training to include physical properties can help, but more research is needed to address this challenge

Who should listen to this episode?

This episode is recommended for anyone interested in Vision-language models, Physical understanding, Multimodal AI, and those who want to stay updated on the latest developments in AI and technology.

Episode Description

In this episode, we’re joined by Munawar Hayat, researcher at Qualcomm AI Research, to discuss a series of papers presented at NeurIPS 2025 focusing on multimodal and generative AI. We dive into the persistent challenge of object hallucination in Vision-Language Models (VLMs), why models often discard visual information in favor of pre-trained language priors, and how his team used attention-guided alignment to enforce better visual grounding. We also explore a novel approach to generalized contrastive learning designed to solve complex, composed retrieval tasks—such as searching via combined text and image queries—without increasing inference costs. Finally, we cover the difficulties generative models face when rendering multiple human subjects, and the new "MultiHuman Testbench" his team created to measure and mitigate issues like identity leakage and attribute blending. Throughout the discussion, we examine how these innovations align with the need for efficient, on-device AI deployment. The complete show notes for this episode can be found at https://twimlai.com/go/758.

Full Transcript

Thanks so much to our friends at Qualcomm for their continued support and sponsorship of today's episode. Qualcomm AI Research is dedicated to advancing AI to make its core capabilities, perception, reasoning, and action, ubiquitous across devices. Their work makes it possible for billions of users around the world to have AI-enhanced experiences on devices powered by Qualcomm technologies. To learn more about what Qualcomm is up to on the research front, visit twimlai.com slash Qualcomm. All right, everyone. Welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Munawar Hayat. Munawar is a researcher at Qualcomm AI Research. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Munawar, welcome to the podcast. Hey, Sam. Thanks a lot. Thanks for having me. I'm really looking forward to digging into our conversation. We're going to be talking about some of Qualcomm's papers on multimodal AI and visual AI from the recent NeurIPS conference. To get us started, I'd love to have you share a little bit about your background. How did you get into the field? So I did my PhD in 2015 in Australia. I got interested in computer vision after taking some courses in my undergrad on image processing. That's what fascinated me. During my PhD, I worked on visual analysis of facial data, and afterwards I have been contributing to different subfields of computer vision. After my PhD, I stayed in academia, I had a faculty position, and then I moved to Qualcomm in 2023. What's your research focus at Qualcomm? At Qualcomm, I have been mostly working on multimodal generative AI, on vision-language models for understanding, generation, and retrieval. Those are the three areas where I have my papers at NeurIPS that I believe we are going to discuss deeply. When I joined Qualcomm, the first project I worked on was making diffusion models run efficiently on Qualcomm hardware, that is, on a mobile phone, and we were generating images in under half a second. Then we also worked on visual understanding, that's visual question answering: given an image and a text question, we want to respond to the question. And for visual question answering, we also had a model running on Qualcomm hardware. So basically, here at Qualcomm I have been working primarily on multimodal generative AI: generating visual content, understanding visual content, and also retrieving cross-modal information. When you think about how the field has evolved over the past couple of years since you started there, what's most exciting for you about this moment right now? I guess we have come a long way. For decades, we were focused on solving niche, very specific problems. Think of how much time we spent just classifying different objects in visual images. We have moved on quite a bit, but there is a lot to do, I guess, especially in visual understanding and also in visual generation. In visual generation, physics-based generation is something we largely lack. And in visual understanding, once we marry a vision model with a language model, there are lots of works that show that vision largely gets ignored and the language model takes over. We are responding basically from the parametric memory of the language model and not by looking at the actual visual content.
When you talk about physics-based generation, what do you mean by that? So physics-aware generation, meaning we had a submission recently in which we observed that if you take these foundation models, proprietary models, and you give them very simple images, like an image of two cardboard boxes, very plain, one bigger, one smaller, and you ask them to generate an image where we unstack these boxes. While these models can generate very fancy images with intricate visual details, a simple task like this, with very plain objects that we want unstacked, trips them up. Once it unstacks them, the physical properties of the boxes change. They are not exactly the same boxes. Their shapes might be deformed. Their sizes might be different. And this is problematic. In the future, if you want robots to work within human environments and perform normal tasks, they have to have an understanding of the physical world. Simple tasks like, how would an image look if I open a drawer? Current models fail. They would hallucinate. They would maybe generate new objects in the air. Or once they open the drawer, its physical properties are going to be different. Humans are very much accustomed to that kind of thing. We get by those things very easily. We can create a mental simulation once we're opening a drawer, for example: the drawer is going to open this way, so I don't have to stand in its way; I have to position myself at a certain position so when the drawer opens, it doesn't hit me, and so on and so forth. And how much it's going to open, and we can predict affordances, like where do we grab to open the drawer, and so on. Those kinds of things, while we take them for granted, are a major limitation for current models. They can't tackle them reliably and robustly. Now, some of those challenges, when I hear them, strike me as being fairly solvable with more training data, like affordances, where a box is, where a drawer is movable from, what the handles are, that kind of thing. Others strike me as more foundational, where we're lacking foundations in VLMs, like memory, for example, as a more robust structure. How do you think about what's lacking, or where the opportunity lies, in terms of overcoming the challenges that you're outlining? Exactly. I guess training data has a big role to play. For this task, it's not as simple as crawling the web and collecting large-scale training data. We can collect image-text pairs. We can even generate descriptions of images. But even those descriptions don't carry the physics information, the physical properties of the objects, so the models can't really learn those implicitly. So that remains a challenge. One thing we observed was that during training of these models, which is generally done with paired image-text data, like an image and its description, if we expand the descriptions of the images, we do prompt expansion, and during that expansion we try to describe the physics of the objects in the scene, it really helps. So, for example, if your task is to unstack boxes, you describe it: unstack the boxes, keep their structure intact, keep the lids closed if they are closed, and make sure their physical sizes stay the same. It seems like here we're taking advantage of the fact that the L is a lot stronger in VLMs than the V. That's exactly true.
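To make the prompt-expansion idea concrete, here is a minimal sketch of appending physics constraints to a training caption. The constraint phrases and the function name are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of physics-aware prompt expansion (illustrative, not the paper's pipeline).

PHYSICS_HINTS = [
    "keep each object's structure intact",
    "keep lids closed if they are closed",
    "keep the physical size of every object unchanged",
]

def expand_prompt(instruction: str, hints=PHYSICS_HINTS) -> str:
    """Append explicit physical-property constraints to a training caption."""
    return instruction.rstrip(".") + "; " + "; ".join(hints) + "."

print(expand_prompt("Unstack the two cardboard boxes."))
# Unstack the two cardboard boxes; keep each object's structure intact; ...
```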
So that in itself, I guess, is a big challenge. From Trevor Darrell's group, there was a recent very interesting study where they show that the vision foundation models, like DINO or CLIP or SAM, by themselves have very strong vision capabilities on simple tasks like spatial correspondence. So if we have two images, both taken from slightly different viewpoints of the same scene, and you ask which point in this image corresponds to which point in that image, the standard vision models can solve this task fairly reliably. But then they show that once you combine the vision model with the language model, the performance of vision-language models falls below chance, which is very interesting. This is what we observed as well, and this is one of our papers at NeurIPS that tries to tackle that to some extent. I don't claim it's a solved problem; we have a long way to go. And which of the papers is this? So in this paper, we basically provide a systematic analysis of why vision-language models hallucinate. What we do is take the intermediate representations of the vision tokens as they get recontextualized while going through the language model. Then we have paired data of matching image-text pairs, and we have data of mismatched image-text pairs. If you look at the distribution of the similarity between the image embeddings and the text embeddings in the embedding space of the language model, then for the distributions of matching versus non-matching pairs, we would expect those to be disentangled. But instead, what we saw was that they were highly entangled; there was a huge overlap. This kind of intrigued us, and we did some further analysis. What we observed was, remember, in vision-language models we have vision tokens and text tokens; we concatenate them, and then the vision and text tokens go through the language model. While predicting the next token, we would expect the language model to attend to the vision tokens. For example, if I ask what's the color of this box, we would expect the visual tokens corresponding to the box to be highly attended to. Or in other words, if we visualize the attention scores, which we can easily do in the language model, we would expect the attention scores corresponding to the visual tokens of the object of interest to be higher. But what we observed instead was that that was not the case. And this kind of intrigued us, and we wanted to dig deep and try to tackle this problem. That's basically what our work at NeurIPS does. And this specific paper is Attention-Guided Alignment in Efficient Vision Language Models. That's exactly right. So the thing is, how do we solve that, right? We made the observation that vision is not being attended to, vision is not being paid attention to. A question that jumps out at me is: when we think about how language models are trained, they're trained on lots and lots of textual data. Is part of the challenge that these vision tokens somehow don't exhibit the same patterns and distribution as text tokens, and the language model doesn't know what to do with them as easily as it does with text? That's right. So if you see, we have a pre-trained language model, as you said. It's trained on large-scale text-only data. And that scale of the training data is much bigger.
Think of trillions of tokens, 15 or 20 trillion tokens, even for the smaller models. And then we have a pre-trained vision encoder, which is pre-trained on, think of, a billion image-text pairs. The embedding spaces of the vision encoder and the language model are different because they were pre-trained entirely differently. We want to combine them together, and we do some alignment of the vision and the text. This alignment is generally done on very small-scale data. I think that also has a role to play. You want to combine these two foundation models, pre-trained entirely differently, and then the scale of the data that you're combining them on, the quality of the data, and what kind of descriptions are in there, that has a huge role to play. And that data is generally much less. So we observe that on some of the benchmarks that exist for vision-language model analysis, if we ignore the vision and only ask the question to the language model, the responses we get just by using the language model and on purpose ignoring the vision part are comparable to feeding the vision-language model with both the image and the text. So there is a problem with the benchmarks that we have as a community, and then I guess there are recent vision-centric benchmarks coming in. For example, if you ask what's the color of an elephant, the language model probably knows what the color of an elephant is. It doesn't really need to look at the image. So, yeah, the training data and the way they're trained have a huge role to play. And to tackle it, what we observed was that we want to inject the visual information at different hierarchical levels of the language model, meaning the language model has different layers or different blocks; it's a transformer with different blocks. So we interleave cross-attention modules after every fourth block, for example. And this way, right from the beginning till the end, every fourth block, we have a cross-attention module through which we inject the visual information, the visual tokens. So we make sure that the visual tokens are there. And then we also change the loss formulation, so we make sure that the salient region within the vision is being attended to, through an additional auxiliary loss apart from the standard next-token prediction loss that's used to train these models. And how do you articulate the intuition behind this loss? So think of it: if my question is, what is the color of the shirt of the person sitting next to, maybe, a dog, right? Now we can get a segmentation mask of the person; that's easy to get with off-the-shelf segmentation models. And the intuition is that the attention scores must be highest for the visual tokens corresponding to the person, because that's what the question is about, right? If we asked a human to answer that question, they would focus on that particular region. So we get these segmentation masks, and we can come up with a loss formulation. We know where the visual tokens are, and we know which visual tokens correspond to the question. And then we know what the attention scores were while responding to this question. So we can maximize the attention scores corresponding to the relevant visual tokens. And we basically do it via segmentation masks through a SAM-like model.
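A rough sketch of what such an attention-guided auxiliary loss could look like in PyTorch follows. The tensor shapes, the mask handling, and the negative-log formulation are assumptions for illustration; the paper's exact loss may differ.

```python
import torch

def attention_guided_loss(attn, vis_mask, answer_positions):
    """
    Sketch of an auxiliary loss that rewards attention on the visual tokens the
    question is about (e.g., tokens inside a SAM-derived segmentation mask).

    attn:             [batch, heads, seq, seq] attention weights from one LM layer
    vis_mask:         [batch, seq] bool, True for visual tokens inside the relevant mask
    answer_positions: [batch, seq] bool, True for the text positions being predicted
    """
    attn = attn.mean(dim=1)                                  # average heads -> [B, seq, seq]
    # Attention mass each token places on the relevant visual tokens.
    relevant = (attn * vis_mask.unsqueeze(1)).sum(dim=-1)    # [B, seq]
    relevant = relevant[answer_positions]                    # keep answer rows only
    # Maximizing that mass == minimizing its negative log.
    return -torch.log(relevant.clamp_min(1e-6)).mean()

# total_loss = next_token_loss + lambda_attn * attention_guided_loss(...)
```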
Is that implemented as an auxiliary model, or is segmentation formulated as a loss component itself? It's formulated as a loss component. The segmentation masks are computed offline because this is just part of the training. At inference, you don't need that; you can live without it. But during training, we have these segmentation masks computed offline, and then we can use them to compute the auxiliary loss. How does that impact your computational performance at training time? You said at inference time it doesn't impact it. That's a great question. Because we are injecting the visual tokens through cross-attention modules and not concatenating them at the beginning, if you think of M as your number of vision tokens and N as your number of text tokens, with the standard design your compute complexity through the self-attention modules would be (M plus N) squared. But now, because the text tokens are being self-attended to and the vision tokens are being cross-attended to, our compute becomes much less: if the text is N, it's N squared plus M times N, which is much less than (M plus N) squared. So we observed that compared with the standard approach, the training is actually faster this way. Yes, there is an additional cost to compute these masks offline, but that's just run once; that's just inference. And that N-squared relationship in terms of complexity is one of the challenges in achieving large context for language models and VLMs. Does reducing that complexity allow you to do things in terms of longer context as well? Is that something that you've explored? It does. We have explored it. It's not part of the paper, but it's continuing work. This allows us to reason over longer videos, especially for retrieval purposes. There are other techniques as well, but with the standard way of concatenating visual tokens with text tokens for longer videos, we do have a challenge in terms of compute and squared complexity, as you mentioned. And via cross-attention, we find this is a good solution. You mentioned that the benchmarks are lacking in their vision-centricity. Can you talk a little bit about the key benchmarks for this problem that you use? And did you explore any of the newer vision-centric benchmarks? Benchmarks like ScienceQA, which has an image and high-school science-related questions, or AI2D, generally, for most of the questions, even if you don't look at the images, you can respond to them. More lately, people have come up with new benchmarks like CVBench, which makes sure that the question cannot be answered without looking at the image. For example: which object is closer, the person or the table? For that kind of information, just relying on the language model's parametric memory, you can't respond. You really have to look at the image and then respond to the question. So these kinds of questions, where your answer might be short and crisp, where you don't need to articulate the answer in terms of language but you have to look at the visual data, those are the benchmarks that really matter. Or spatial correspondences, like we are given paired images: on one image we have four points, A, B, C, D, and on another image we have one point, and we ask which one of the four points this point in image B corresponds to. For that, you have to really look at the visual data. So these are the benchmarks, basically, that we need to pay attention to, to make sure that vision is not being ignored.
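The compute comparison Hayat walks through can be made concrete with a toy calculation; the token counts below are illustrative, not taken from the paper.

```python
def attn_cost_concat(m_vis: int, n_text: int) -> int:
    """Self-attention over concatenated vision + text tokens: (M + N)^2 token pairs."""
    return (m_vis + n_text) ** 2

def attn_cost_cross(m_vis: int, n_text: int) -> int:
    """Text self-attention plus text-to-vision cross-attention: N^2 + M*N token pairs."""
    return n_text ** 2 + m_vis * n_text

m, n = 576, 128   # e.g., 576 visual tokens, 128 text tokens (illustrative numbers)
print(attn_cost_concat(m, n))   # 495,616 token pairs
print(attn_cost_cross(m, n))    #  90,112 token pairs
```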
We are really solving the vision-centric problems. Earlier, in introducing this challenge, you talked about object hallucination in VLMs. Do you think of object hallucination as a general failure mode that contributes to all of the problems that VLMs have, or is it a particular failure mode, and are there particular benchmarks that you looked at to assess hallucination performance? So, of course, we looked at a particular hallucination benchmark. But I guess it's a problem in the sense that if you ask a question, for example, how many cats are there in the image, and the cat is sitting in front of a mirror, it would hallucinate and respond with two. Or there is a scenario which is very common: you ask a counterintuitive question. We know elephants are a certain color, but if you provide a counterfactual example, like an elephant painted pink, and you ask, what's the color of the elephant, it would hallucinate. So those kinds of scenarios are where we can see and study whether the model is hallucinating or really paying attention to the vision information. Yeah, okay. The way you answered that kind of gets at a thought that was underlying my question, and that is: when there's a cat looking in a mirror and the model says two, is that hallucination or is that some other kind of error? Or when it says the elephant is gray, but it's a pink elephant in the image? I guess maybe I'm probing at the definition of hallucination in a VLM context. You kind of suggested that maybe one way to define hallucination is if it's not attending to the visual tokens and it's attending to the textual tokens only. But it suggests to me that there's maybe a more granular way to talk about error modes in VLMs, where hallucination is not necessarily the same as relying on what it thinks it knows from its text parameters, and maybe there are other ways to chop up these errors that might be insightful. That's right. So think of scenarios like: which object is closer to the camera, who is on the left of a person, or, from a third person's perspective, if there is an image with a car, a motorcycle, and a person, and we ask the question, if you are the motorcycle and you're looking to your left, what do you see? So there are different granular levels to which we can break down the problem. And maybe hallucination is not the right word to club them all together; we are borrowing that term from the language community. But we are identifying challenges related to them and then seeing if current VLMs can tackle them. Counting is another one. Counting is an integrative reasoning problem, and we see that current foundation models struggle at that. And also iterative reasoning or multi-step reasoning. For example, if you took an image of the price board of a gas station and it says Supreme is $3.40 per gallon, and you ask, I want to fill up, how much fuel am I going to get for $50? That's more of an iterative, multi-step reasoning problem. It first needs to OCR, to read the visual data, then extract that information, go back, reflect on your question, and then do the compute and come up with a response.
So currently, we just do inference in a single step, in an end-to-end manner. We are given an image and a question, and we just start spitting out tokens one at a time. So you're right, with a granular definition of these different problems, they might not all club under hallucination as such, but yeah. Yeah, and as you're responding, I'm also thinking about it from the text perspective, and it's probably true that there's a whole set of reasoning failures and other kinds of things that we lump under hallucination, in the sense that somewhere down there the model is making up something that isn't grounded in the reality of the prompt or the input, and that's the foundational cause of the problem. So from that perspective, the term makes sense here as well. Yeah, that's true. That's right. And so basically we try to identify all those problems. A common approach has been: people identify a limitation, go back, collect data around that, include that data as part of visual chain-of-thought reasoning, and that helps tackle that problem. But yeah, you're basically not grounded in the visual input, or it's just outside the reality. Awesome. Let's switch gears to another paper that you have at NeurIPS on generalized contrastive learning, better search across text and images. Talk a little bit about the setting and the motivation for this paper. Yeah, sure. As you know, Qualcomm has a strong interest in mobile. One of the key use cases there is that we have a gallery of images and we want to retrieve from it. Generally, we do retrieval based upon a text query. But there are scenarios where a query can be composed, or fused; by that I mean it can be composed of both text and visual information. Show me all the images of a person looking like this, for example. So our query is both image and text. What we see is that for scenarios where either the query or the key is composed, not a single modality, either text or image, the current vision-language contrastively pre-trained models, CLIP-like models, struggle. And there's a clear reason for that. So this paper basically tries to tackle that problem; it tries to tackle composed multimodal retrieval. And you said there's a clear reason why they struggle. Why do they struggle? If you see the way they are trained: we have a bunch of images within a batch, and then we have corresponding text descriptions. Images go through the image encoder; text descriptions are encoded via the text encoder. Then we have a shared embedding space where both image and text speak to each other. In this space, we compute the similarity of image and text within a batch, and we make sure that image-text pairs corresponding to the same sample are pulled closer together, and if they're different, they're pushed apart. That's how we train these models. So there's no component of training where you're composing text and images together on the same side of the comparison. Exactly. So the general way people try to tackle this is they collect training data of triplets, where, if you think of query and key, the query can be image plus text and the key can be an image, or the query can be an image and the key can be text plus image. We can think of different possible combinations of that.
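The batch-contrastive training described here is the standard CLIP-style objective; a minimal PyTorch sketch follows, with illustrative names and a nominal temperature.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """
    Standard CLIP-style batch contrastive loss: matching image-text pairs
    (the diagonal of the similarity matrix) are pulled together, all other
    pairings in the batch are pushed apart.
    """
    img_emb = F.normalize(img_emb, dim=-1)          # [N, D]
    txt_emb = F.normalize(txt_emb, dim=-1)          # [N, D]
    logits = img_emb @ txt_emb.t() / temperature    # [N, N] similarity matrix
    targets = torch.arange(len(img_emb), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```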
So a general way people have been tackling this problem is collecting this triplet data of different possible combinations and then training the model with that newly collected data. But this has its limitations: it's laborious, it takes time, and it takes lots of manual effort. So what we came up with in this paper is: let's not collect any data; let's see if we can do something in the loss formulation. The loss is basically applied so that on one axis we have all the images in a batch, let's say N images, and on the other axis we have N text descriptions. Then we compute the similarity between these N texts and N images, and we apply the loss on this N-by-N matrix. Now, instead of just this N-by-N matrix, we try to change the possible combinations. We say we have an image embedding, we have a text embedding, and we can get a fused image-text embedding by just doing simple fusion. So now we have three embeddings, image, text, and image plus text, and we can come up with multiple combinations: we have three and we want to select two. We can compute the similarity between those combinations and reformulate our loss based upon those similarities that come from choosing two out of the three different options, image, text, and image plus text. This helps us generalize to scenarios where our query and key can be any modality, single or fused. And so, in the same system, you mentioned this triplet and you've got different permutations of this triplet. Are you training only on a particular permutation, or are you training on all of the permutations of this triplet? We are training on all possible combinations of the triplets, because for us it's very easy to get those. We don't have to collect any data. And we gain in terms of generalization to unseen scenarios. One of the things we observed, as kind of an emergent characteristic of this loss formulation, was that it even generalized to videos. There is a benchmark called Composed Video Retrieval, CoVR, and we do evaluations on that. Even though we did not train on the video data, since our queries were images, or image plus text, or text, and the same goes for keys, we saw that we were able to generalize to those scenarios as well. And is a way to think about this, in the context of the traditional CLIP approach, that it's like adding dimensionality to the embedding space? That's right. It's just enriching the embedding space more. Instead of just vision and text, it's enriching it with other modalities. But I guess this problem is still not solved, especially as we see that in the case of CLIP, or the follow-up variants of CLIP, the text encoder is kind of weak. It has seen a specific distribution of text, which is just descriptions of images. Language is much richer, so if we have information like verbs, actions, and so on, we see CLIP does a poor job on those. So I guess a principled way to tackle this is to move away from that and maybe do retrieval in the embedding space of a multimodal model, where, as we were discussing earlier, you have a language model, which is much richer in terms of text, and you also have a vision model, and you just try to align these different modality combinations in that embedding space. Basically, your text encoder is now a better text encoder, which is a language model.
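A simplified sketch of the generalized-contrastive idea, pairing up two out of the three embedding views (image, text, fused image plus text), is below. The mean-based fusion, the pairing scheme, and all names are assumptions for illustration; the paper's formulation may differ.

```python
import itertools
import torch
import torch.nn.functional as F

def fuse(img_emb, txt_emb):
    """Simple fusion of image and text embeddings (illustrative: just the mean)."""
    return F.normalize((img_emb + txt_emb) / 2, dim=-1)

def generalized_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """
    Sketch: build three embedding views per sample (image, text, fused image+text),
    then apply a CLIP-style loss to every pair of views, so the model also learns
    to match composed queries and keys without collecting extra triplet data.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    views = {"image": img_emb, "text": txt_emb, "fused": fuse(img_emb, txt_emb)}
    targets = torch.arange(len(img_emb), device=img_emb.device)

    loss = 0.0
    pairs = list(itertools.combinations(views.values(), 2))  # choose 2 of the 3 views
    for a, b in pairs:
        logits = a @ b.t() / temperature
        loss = loss + (F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.t(), targets)) / 2
    return loss / len(pairs)
```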
Do you see that, or are you raising that as a future direction? Is it something that you've tried, or is it something that you're foreseeing as being difficult because of some of the challenges that we raised? I'm proposing it as a future direction, especially if you want to retrieve cross-modal information in the form of both text and image. Then I guess the embedding space of a multimodal model is promising compared to the embedding space of a CLIP-like model. A CLIP-like model is also a multimodal model, but my definition of a multimodal model here is the embedding space of a vision-language model, where the language model is a true LLM; it's not a text encoder like CLIP. Now, when you started off describing the problem, you referenced wanting to do this in a mobile environment. I'd like you to talk a little bit about some of the results you saw and how they enable running in a mobile environment. But also, in talking about going from a CLIP type of model to a VLM, you're adding a lot more complexity, and that kind of works against your odds of running on mobile. So talk a little bit about what you see there. You're right. CLIP has its benefits in being efficient. It's like a 300 to 400 million parameter visual encoder and a 100 million, 125 million to be precise, parameter text encoder, which is very efficient. On mobile, we can easily implement it, and by consuming very little power it can enable this composed retrieval scenario. For a VLM, a vision-language model, where we have a language model and a vision encoder, we are increasing the memory. But interestingly, what we found is that even the small language models, like SmolVLM from Hugging Face, they have trained a variety of small language models, sub one billion parameters, 0.125, 0.25, 0.375 billion, very small models, even their embedding space is good enough for retrieval. At least our initial experiments suggest that. We are mindful of the additional memory and compute that it brings in, but I guess the challenge is going to be how we tackle that going forward. And so, with regard to the approach proposed by this paper, this generalized contrastive learning or GCL approach, what is the computational impact of that? I would say it doesn't bring in any additional compute. We just need a single forward inference pass, both once we're building the database and once we're querying: one forward pass of the query through the relevant encoder, image or text. And as such, we are not changing the architecture of the model. It's basically a better way to train, with a better loss formulation. So in terms of inference, there's no additional compute. And in terms of benchmark performance, what are the key benchmarks and what kind of performance results have you seen? We have seen significant gains on the massive multimodal embedding benchmark, MMEB, for example. And then there is another benchmark which is composed of multiple other benchmarks. We have seen key improvements on those compared with other SOTA approaches, in some cases quite significant. But another interesting aspect is that if you look at the results for some of the different permutations we were discussing earlier, the results are not at a point where we can really use these in real life. My point being, this is still an open problem and we need to push research on that front. Another paper that we wanted to dig into is called MultiHuman Testbench: Raising the Bar for Multi-Person Image Generation.
And, you know, when we talked a little bit about the motivation here, I don't know that I realized that generating images with multiple people was such a challenge for these models. Talk a little bit about that issue. Right. So even if we consider a single person, I think a common use case is: I have my images, my portrait images, my selfies, and so on, and I want to personalize them, right? A challenge we see is that once we personalize those images, the model might generate an image in the scene I have asked for, but it loses my facial identity. It might generate a person who doesn't look like me. And humans are very good at identifying those kinds of artifacts: it doesn't really look like the person you're trying to generate. But an even bigger challenge is if we do conditional generation based upon our condition being two, three, or four people, and we want to generate them doing an action or in a specific scene. There we see that the facial identity information of the subjects that we want to generate, the conditional subjects, is lost quite a bit. And once we benchmarked the proprietary foundation models, we saw that they can't get beyond a certain number: beyond three or four, they are unable to generate the exact number of subjects you have asked for. If there are five conditional images you want to generate from, it might generate three or four; it might not generate all five in there. So we realized these kinds of limitations, and I think there is a need for us to push research in this direction. A low-hanging fruit for us is to come up with a benchmark that can help the community drive and track progress on this open research problem. So this is where this benchmark comes in. To be clear, what you are providing first and foremost is a benchmark. I'm imagining this has a lot of images or text prompts that aim to generate multi-person images. We have the benchmark, and we also propose a preliminary solution to tackle this problem. But primarily it's a benchmark. The benchmark contains, for testing, around 5,500 different individuals used to generate 1,500 test samples in different scenarios, with different actions and so on, and we are working on extending it beyond that. We come up with complex prompts where humans are doing different actions, and then we also propose different quantifiable evaluation metrics, count accuracy being one. How do we get those metrics? We detect faces and then we count them. And if we want to preserve facial identity information, what's a quantifiable metric for that? We detect faces, we pass them through a facial discriminator model, which has been trained to discriminate faces, and then we compute the embeddings of the generated face versus the conditional face, and their similarity should be above a threshold for us to declare that it's the same person. Then we also have another metric which uses an MLLM as a judge, and another metric which quantifies the HPSv2 score, the human preference score, which is a generic score to suggest how well the generated image adheres to the prompt and what the visual fidelity of the generated image is. So basically, we provide test samples, we come up with challenging scenarios, and then we also provide quantifiable metrics to gauge the performance of a model on the task. And that's one side of it. The other side is we also looked into how we can solve this problem.
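A minimal sketch of the two metrics described, count accuracy and identity preservation via face-embedding similarity, is below. The face detector and face-recognition embeddings are assumed to come from models elsewhere in the pipeline; the threshold value and function names are illustrative, not the benchmark's calibrated settings.

```python
import numpy as np

def count_accuracy(num_faces_detected: int, num_requested: int) -> float:
    """1.0 if the generated image contains exactly the requested number of faces."""
    return float(num_faces_detected == num_requested)

def identity_preserved(gen_face_emb: np.ndarray,
                       ref_face_emb: np.ndarray,
                       threshold: float = 0.5) -> bool:
    """
    Declare 'same person' when the cosine similarity between the embedding of a
    generated face and the conditioning (reference) face exceeds a threshold.
    Embeddings are assumed to come from a pretrained face-recognition model;
    the 0.5 threshold is illustrative only.
    """
    a = gen_face_emb / np.linalg.norm(gen_face_emb)
    b = ref_face_emb / np.linalg.norm(ref_face_emb)
    return float(a @ b) > threshold
```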
So I'm imagining that some of your solution is taking your metrics and building them into your loss function when you're training. That's an obvious thing to do. But another thing that we came up with, and I think it's interesting, concerns how attention is applied once you're generating multiple people. We have multiple tokens, and the way attention is applied is either causal, where future tokens only look at the past tokens, or it's bidirectional. But in this case, if I'm generating an image of two people, I don't want tokens of one person to attend to tokens of another person. That's a very simple intuition. And why? Because there would be some kind of identity leakage in doing so. So we basically define these kinds of islands, or masks, where tokens corresponding to one person only attend to that person. While generating an image, tokens corresponding to each person just attend to that particular person, and they don't cross-attend to other people being generated. The models are able to do a good job of tracking that? We can define it. It seems like a finer-grained constraint than is usually placed on the model. That's right. So we have to define that, and we can do that fairly well. We show some quantifiable results that this does improve things, and it helps us tackle the challenge to some extent, but largely it's still unsolved. And I guess the recent version of Gemini and Nano Banana, which was released just... Nano Banana Pro, just last week or so? Mm-hmm. One of the things they seem to tackle, in the blog post I was reading, is this particular problem. Multi-person in particular, or facial continuity in general? I think they portray it as multi-person, multi-subject, or multi-object personalization. So once you want to generate an image conditioned upon multiple different subjects and objects, how do you preserve their identity? It's more of a generic version of the problem. And going back to this theme of efficiency, how does efficiency play into either your solution or the benchmark more broadly? Right. I guess in the solution, we are not involving any additional compute. It's just that we are redefining the attention masks, which is a trivial operation. And it kind of helps: instead of attending to all tokens, we can skip quite a bit. But it doesn't introduce any additional compute. The other part is that we could do personalization by just fine-tuning the parameters of the model with a low-rank adapter or something like that. But that's compute-prohibitive, especially in the case of mobile. So this personalization that we are talking about is basically inference-only. You do training, but once you deploy the model, you're not optimizing it anymore. The model should be able to generalize to any person's face and be able to personalize to any facial images. So at inference, we don't need to learn any adaptation parameters. That's one key aspect of this approach. If you see DreamBooth-like approaches, you have to learn a dedicated adapter network to personalize. In this approach that we are proposing, it's inference-only. Awesome. As is usually the case, Qualcomm had a ton of papers as well as demos at NeurIPS. Are there any other papers that you would want to highlight? Qualcomm has a very strong presence at NeurIPS this year. We have, I guess, 17 papers.
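A toy sketch of the "attention islands" masking idea is below; the subject-id convention and the treatment of background tokens are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def per_subject_attention_mask(token_subject_ids: torch.Tensor) -> torch.Tensor:
    """
    Build a boolean mask where image tokens assigned to one person may only
    attend to tokens of that same person (and to shared/background tokens,
    marked with id -1), never to tokens of another person, to limit identity
    leakage between subjects.

    token_subject_ids: [seq] int tensor, -1 for background, 0..K-1 per person.
    Returns: [seq, seq] bool mask, True = attention allowed.
    """
    ids = token_subject_ids
    same_subject = ids.unsqueeze(0) == ids.unsqueeze(1)          # same person id
    background = (ids == -1)
    shared = background.unsqueeze(0) | background.unsqueeze(1)   # background can mix freely
    return same_subject | shared

# Example: two background tokens, then tokens for person 0 and person 1.
mask = per_subject_attention_mask(torch.tensor([-1, -1, 0, 0, 1, 1]))
```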
And if I'm not wrong, something like nine different demos, which make up close to 50% of the expo demos at NeurIPS this year, which I think is quite nice. So we have quite a few interesting papers. One that comes to mind is on LLM inference from long context; I think it's called KeyDiff. They have a very interesting intuition that if you diversify your keys, this corresponds to the keys that are attended to most during inference, and this can enable us to do KV cache eviction. It can enable us to do that eviction continuously, as new data is being prefilled into the KV cache. So I guess that was a nice work to enable inference from long context. Then there was another work on LLMs called OmniDraft. It basically enables you to transfer a draft model trained for speculative decoding across different families of networks, and I guess they show it on Llama 3, Vicuna, and Qwen models. They come up with an interesting approach, an n-gram cache, where the main difference is that across families of models we have different tokenizers. How do you make sure that one tokenizer corresponds to another one? The n-gram cache basically bridges that gap. And then, apart from GCL and the MultiHuman benchmark, there is a work on streaming multimodal reasoning. Current vision-language models basically don't know when to stay silent and when to speak. Think of it like this: if I'm cooking or doing an exercise and I want my model to be my coach or to help me, and if I make a mistake, it should tell me. If I'm doing okay, it should tell me. And if there's nothing interesting happening, it should just stay silent and observe. Currently, the way these models are trained, we upload an image and a question and it has to respond; then we ask another question and it has to respond. So can it basically stay silent when it has to and interject when it has to? This kind of interactive reasoning, I think it's called the Qualcomm interactive cooking benchmark, something like that. So that's an interesting work. And on demos, I guess one of my favorites is video generation. We have a mobile diffusion transformer generating 48 frames in under eight seconds on a mobile phone, which is quite nice. And then there is a single-step diffusion-based image editing demo. It sounds like SwiftEdit. SwiftEdit, that's right. Which I covered in a conversation with Hung Bui on the podcast recently; it was developed by his group. That's right, and we will show it at NeurIPS this year. And then I guess there is another demo on retrieval from long videos. There are computer vision demos, traditional computer vision, like open-vocabulary detection and segmentation demos. And I guess there is a demo on 3D Gaussian splatting as well. So we have an interesting lineup of different demos at NeurIPS this year. Awesome. Well, enjoy the event. It is coming up very quickly, and maybe enjoy your Thanksgiving as well. It was great to connect with you, Munawar. Thanks so much for catching us up on Qualcomm's work at NeurIPS this year. Thanks. Thanks, Sam. It was nice talking to you, and I look forward to our next conversation, hopefully. Same here. Thank you. Have a great Thanksgiving. Thank you.
