

Multimodal AI Models on Apple Silicon with MLX with Prince Canuma - #744
TWIML AI Podcast
What You'll Learn
- ✓Prince Canuma is an open-source developer focused on optimizing AI inference on Apple Silicon devices through projects like MLX Audio, MLX VLM, and MLX Embeddings.
- ✓He started contributing to MLX after being inspired by the project's potential for local inference and the involvement of the Apple team in its development.
- ✓Prince ported over 1,000 models to MLX, often doing so within 30 minutes of a new model being released, earning him a reputation as a 'superhero' in the community.
- ✓He faced challenges in running large language models like Cohere's 100 billion parameter model on his M1 MacBook, leading him to invest in a more powerful M3 Max machine to further his work.
- ✓MLX is an official Apple project, with the company's employees contributing to the project and the code containing Apple's copyright.
- ✓The MLX community was supportive of Prince's work, even suggesting that Apple should provide him with hardware, but he declined, preferring to invest his own resources into the project.
AI Summary
The podcast episode discusses Prince Canuma's work on optimizing AI inference on Apple Silicon devices through the open-source MLX project. Prince shares his journey of discovering MLX, the challenges he faced in porting large language models to run on M1 Macs, and his commitment to making MLX the best platform for running AI models on Apple hardware. He also talks about the community support he received and his decision to invest in a powerful M3 Max MacBook to further his work on MLX.
Key Points
- 1Prince Canuma is an open-source developer focused on optimizing AI inference on Apple Silicon devices through projects like MLX Audio, MLX VLM, and MLX Embeddings.
- 2He started contributing to MLX after being inspired by the project's potential for local inference and the involvement of the Apple team in its development.
- 3Prince ported over 1,000 models to MLX, often doing so within 30 minutes of a new model being released, earning him a reputation as a 'superhero' in the community.
- 4He faced challenges in running large language models like Cohere's 100 billion parameter model on his M1 MacBook, leading him to invest in a more powerful M3 Max machine to further his work.
- 5MLX is an official Apple project, with the company's employees contributing to the project and the code containing Apple's copyright.
- 6The MLX community was supportive of Prince's work, even suggesting that Apple should provide him with hardware, but he declined, preferring to invest his own resources into the project.
Topics Discussed
Apple Silicon, MLX, AI inference optimization, Large language models, Open-source development
Frequently Asked Questions
What is "Multimodal AI Models on Apple Silicon with MLX with Prince Canuma - #744" about?
The podcast episode discusses Prince Canuma's work on optimizing AI inference on Apple Silicon devices through the open-source MLX project. Prince shares his journey of discovering MLX, the challenges he faced in porting large language models to run on M1 Macs, and his commitment to making MLX the best platform for running AI models on Apple hardware. He also talks about the community support he received and his decision to invest in a powerful M3 Max MacBook to further his work on MLX.
What topics are discussed in this episode?
This episode covers the following topics: Apple Silicon, MLX, AI inference optimization, Large language models, Open-source development.
What is key insight #1 from this episode?
Prince Canuma is an open-source developer focused on optimizing AI inference on Apple Silicon devices through projects like MLX Audio, MLX VLM, and MLX Embeddings.
What is key insight #2 from this episode?
He started contributing to MLX after being inspired by the project's potential for local inference and the involvement of the Apple team in its development.
What is key insight #3 from this episode?
Prince ported over 1,000 models to MLX, often doing so within 30 minutes of a new model being released, earning him a reputation as a 'superhero' in the community.
What is key insight #4 from this episode?
He faced challenges in running large language models like Cohere's 100 billion parameter model on his M1 MacBook, leading him to invest in a more powerful M3 Max machine to further his work.
Who should listen to this episode?
This episode is recommended for anyone interested in Apple Silicon, MLX, and AI inference optimization, and for those who want to stay updated on the latest developments in AI and technology.
Episode Description
Today, we're joined by Prince Canuma, an ML engineer and open-source developer focused on optimizing AI inference on Apple Silicon devices. Prince shares his journey to becoming one of the most prolific contributors to Apple’s MLX ecosystem, having published over 1,000 models and libraries that make open, multimodal AI accessible and performant on Apple devices. We explore his workflow for adapting new models in MLX, the trade-offs between the GPU and Neural Engine, and how optimization methods like pruning and quantization enhance performance. We also cover his work on "Fusion," a weight-space method for combining model behaviors without retraining, and his popular packages—MLX-Audio, MLX-Embeddings, and MLX-VLM—which streamline the use of MLX across different modalities. Finally, Prince introduces Marvis, a real-time speech-to-speech voice agent, and shares his vision for the future of AI, emphasizing the move towards "media models" that can handle multiple modalities, and more. The complete show notes for this episode can be found at https://twimlai.com/go/744.
Full Transcript
I remember one day going to my partner and saying, hey, there's this thing called MLX. It's open source. We will not make money from it, but I want to try. Like, I really want to try. She looked at me and said, OK, if you want, if you're going to do this MLX thing, you have to be the best, the best of the best. And I looked at her like, there are people that are building this. How do you want me to be the best? So it's a lot of pressure. And I said, OK, I will take that challenge on. I really want to do this MLX thing. I made my first PR. All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Prince Canuma. Prince is an open source developer focused on optimizing AI inference on Apple Silicon devices through projects like MLX Audio, MLX VLM, and MLX Embeddings. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Prince, welcome to the podcast. Thank you for having me. This is so surreal. I went from watching, from being a spectator of this fantastic podcast, especially back in the day with MLOps, and coming today to talk about what I am excited about and working on currently, which is surreal as well. Yeah, I am really looking forward to digging into our conversation. You know, I consider myself a beneficiary of your work, bringing lots of interesting models to Apple Silicon devices, you know, one of which is sitting on my desk here. And, you know, I'm really curious about like how you got into it and, you know, how you approach it, where you see things going. How did you begin working on Apple Silicon? So I'm originally a machine learning research engineer. So I've been working on optimizing ML inference as well as training pipelines. So you can think about the models that are state of the art in open source. I contributed to some of them as well at Arcee and helped many, many companies establish their MLOps pipeline for training and inference. So I've always been focused on optimization and making sure that we can get the most out of the hardware that we have, as well as software. I contributed a lot to projects like PyTorch Lightning and many others in the space. I got started with MLX at the end of 2023. I had just left Neptune and I was just experimenting with different things. I tried some projects, and then I saw MLX for the first time. At the time, I had like a MacBook M1 Air. I only got it because I wanted the lightest laptop and the most durable laptop. And like that was a gamble, Apple's first Apple Silicon version of the Mac. And I was like, OK, let me try it out. We already use iPhones every day. Let me see how this goes. So I bought it and I was just trying it out. I installed MLX. I tried inference the first time. It was pretty slow. It was very slow. It was really, really slow. So I looked at it. I said, okay, I will not judge it from day one. I think the promise is here. We can quantize models locally pretty fast. And when I saw how fast the quantization process was, that's what hinted to me that this will get better. And I started investigating more. I got to know the people behind it. So Awni and the MLX team; Angelos is one of the people that I know. So I reached out and I started looking at what they were doing. Their GitHub was super, super active. And I remember one day going to my partner and saying, hey, there's this thing called MLX. It's open source. We will not make money from it, but I want to try. I really want to try.
She looked at me and said, okay, if you want, if you're going to do this MLX thing, you have to be the best, the best of the best. And I looked at her like, there are people that are building this. How do you want me to be the best? But that's, that's kind of the standard I set always. And my dad set it for me as well. So we never, we were never allowed to have second place. Like in my family, that's not a reality. You're either the best or not, and that's why he gave me the name Prince, and my sister's name is Queen. It's like, it's a lot of pressure. Yeah, so it's a lot of pressure. And I said, okay, I will take that challenge on. I really want to do this MLX thing. I made my first PR. This was porting Hugging Face StarCoder to MLX. I said, okay, I'm going to do it during the weekend. I get to the weekend, there was already a PR, but when I looked at the PR, I saw that there were a lot of things missing. I said, okay, let me try and do this. So I make a parallel PR. My PR gets working, but the person that sent the original PR was already almost finished, but was just missing one tiny part. So they looked at my PR and said, okay, we're going to close this PR, but we're going to take the thing that works and put it in another PR. Okay? And we will add you as a contributor. I said, great. So that was the first contribution to MLX. And after that, I just got addicted. I said, okay, MLX has the promise of local inference that PyTorch and TensorFlow and many others tried in the past, but didn't really succeed. It's the fastest way. Of course, there is llama.cpp, and llama.cpp is great as well. But when I looked at MLX, I said, okay, this is optimizing only for Apple Silicon. So there is untapped potential that we'll be able to reach compared to a more general approach. And we have the people that are designing the chips helping us do this. So I said, okay, what's missing in MLX right now? First, it was the models. We just didn't have enough models. Second, we didn't have enough operations, so enough modules to be able to build new models. I set out the mission to establish the developer tooling as well as port as many models as possible. So I ended up porting 1,000 models last year to MLX. Well, these are mostly quantizations. If you count just models, it's around 20 to 30 individual models, but quantized into different versions and sizes they are more than a thousand. So I started contributing. It was week over week, day over day. And the more I did, the faster I got. So it got to a point where any new model released on the internet, it would only take me like 30 minutes to port it to MLX. So people started getting this idea that I am some kind of a superhero because I would be on Twitter just waiting for a model to drop. And then the moment that model drops, I would port it as fast as possible. And then there was this situation in particular in the beginning that made me even more committed to MLX, which was there was a model, I think from Cohere, there was like 100 billion parameters. And I only had the MacBook M1 with 16 gigabytes of memory. It's impossible to run that model there. There's no way you would be able to run it at a sufficient speed. I'm trying to remember the name of that model. Command R, I think. The Command R, okay. I was thinking of an earlier one. Command R, okay. Probably. Probably is an earlier one. I don't know if it's exactly that because they released so many now. But it was one of the early Cohere models, 100 billion. And I said, okay, this is not fun.
Most models are small, but this is too much. So I literally coded that model blind. So I just coded the model and I asked Awni, one of the creators of MLX, to run it on his MacBook Ultra, Mac Studio Ultra, and he would give me feedback whether the code ran or not. So that was a very interesting experience. So I was just totally blind. I had no idea whether it would run or not. He just ran it. He was like, oh, it's working or it's not working. And based on whatever error he sent me, I would, you know, pretty much change my code and send a new commit. And around that time, I started thinking, okay, if I really want to take this seriously, I need a more powerful machine. At the time, I was working with a fund. They fund pretty much exceptional individuals to run projects. I was funded to create a series of models, open language models, including data. So I created the misuse. It's a 2 billion parameter model that punches way above its weight, or at the time it did. And I just was working on that whilst doing MLX. So I used pretty much, like, six months of rent for the year. And I put that on the MacBook that I have right now, which is an M3 Max. And then the community... I thought you were going to say studio. No, a studio is coming, actually. I just ordered it last month. So the studio is coming. So I ordered the M3 Max. The community was outraged. How come Prince does not have like a big machine? I was doing this really out of love. And when I really care about something, I go all in. I don't know how to go halfway. People said, okay, Apple should give you something. I said, no, I'm not really doing it because I want something from them. And we ended up talking and I was like, oh, this is not really something that we can do. And I said, of course, don't worry about it. I do this out of love and I will find a way to make it work. So I gambled six months of rent on the M3 Max. And it panned out. So I did some of my best work in the past year because of that. I managed to build MLX VLM, which is basically running all of the best open source vision language models on your Mac, as well as on your iPhone now with the Swift implementation. And I also happened to create MLX Embeddings as well as MLX Audio, thanks to a much bigger and more powerful machine. Um, let's take a step back, and I'm trying to remind myself of the, you know, that early history. Like, I also had an M1 MacBook Pro, and I remember, kind of as you describe, like in the early days, like it was clear it was an amazing machine, just, you know, relative to the Intel Macs in terms of like the everyday user experience. It was snappier. You had a lot less of this whole spinning beach ball thing that we used to remember from the Intel days. But for running ML models and local inference, for a long time, if I'm remembering correctly, there really wasn't anything there. And then we had llama.cpp, and that became kind of like a general framework in a sense that I remember when Whisper came out. We were waiting anxiously for the whisper.cpp version. I've not talked to Georgi. Do you know Georgi who does those models? I know him. I don't know if he knows me, but I follow him. I follow him and I interact sometimes, but I... Well, same here. Yeah. So, you know, he like, you know, is one of the first folks, at least that I know of, that was like trying to bring these models to Mac.
And if I'm, I guess where I'm asking for you to chime in is like, my memory is that it was after that that MLX came out. Like, MLX didn't come out with the M1 devices; there was a bit of a lag, and llama.cpp was first. But then it was like, this is the thing, and oh my god, it's fast. Like, am I remembering that history right? No, you're remembering it correctly. So I think, if I'm not mistaken, llama.cpp came early with the first open source large language models. So you could run, I think, Mistral, the early Mistral models and so on. And even Llama. I think the name llama.cpp also comes from the Llama series of models. MLX only came out, if memory doesn't escape me, at the end of 2023. So around December 2023, that's when it was originally released. The first M1, I remember getting my M1 in 2020, which was around the time it was released. So around a three-year lag between MLX and the first series of Macs with Apple Silicon. So yes. MLX is an official Apple product? Well, there are questions about that. They did get the recognition this year with WWDC. So it was in the forefront of WWDC. But before that, there wasn't really a lot of indication, but it is an official Apple product because there is an actual Apple team of employees working on this. Is it like on an Apple GitHub account or is it a separate GitHub? Yes, so they have a GitHub org that is, I think, ml-explore, and that is a repo that belongs to Apple. If you look at any MLX code, MLX LM in particular, you'll see at the top the Apple copyright. Okay. So as far as I know it is. How central to their strategy it is, I don't know. But I think it should be. Do you have a sense for whether MLX is used in Apple products or is it, you know, an external library, but like they're writing directly to lower level hardware or doing something different when they're, you know, building out their apps? As far as I know, at the moment, it's not being used in any official product. They're using mostly Core ML and Swift, and also they have their own package, the Foundation... I think Foundation Models or something, that was released recently. So those are the main products that are being used to run their AI strategy. MLX, I'm not sure, but I think it's not yet there. But honestly speaking, I think it should be. The discussion here would be around, because we are not using the NPUs, or no, not the NPUs, how do you call it? We are using GPUs. MLX uses mostly GPUs, and we are not using the specialized part of the chip that Apple designed for this. Got it, and Core ML does use those? Yes. Does MLX, is it a not yet thing or is it, you know, not possible? Or does the MLX team not have access to something that doesn't allow them to use, you know, these AI focused components on the chips? I think it's mostly because MLX is mostly right now focused on GPUs. And GPUs, as you know, especially on the phone, won't be as efficient as the neural engine. So the neural engine is specialized for this. It consumes less energy than if you were running the GPU. So especially for iPhones, I think this might be a concern. But when it comes to your Mac... And my question is, like, why doesn't MLX use the neural engine? Do the MLX engineers like not have access to it or is it incompatible ideologically with what MLX is trying to do? Like it would seem like... That's a really good question. I think I will ask Awni about it. Like, I tried to think about it, but I didn't go too deep, because for me personally as a researcher, GPU is what you mostly use, especially if you want to run some fine tuning, some training.
And it's just a much more direct approach to building this. I think building a strategy where you use the neural engine and the GPU might be a bit more complicated and harder to do than it is for you to use the GPU and the RAM. So the unified memory architecture between CPU, GPU, and the RAM makes it much easier for you to move around data. But if you want to do that and also put it into the neural engine, I think it might be a bit more complicated to do. But I will definitely ask him and get that answer for you. And I'll post it on Twitter when I get it. Okay. Do you have a sense for how the neural engine is used relative to the GPUs? Like, are there, you know, are you able to do the same things in terms of local inference with the neural engine? Or do you have, I'm imagining that you likely have maybe memory bandwidth constraints or other types of constraints that make it like it's for smaller models than the ones that you're starting to deal with nowadays with MLX? Exactly. So there was a discussion around this on Twitter, and I think there's a guy called Annie ML on Twitter, and he's one of the main persons focused on Core ML as well. And I saw the discussion, and the discussion was exactly about this. So on the neural engine, you cannot fit as big of a model compared to the GPU. So you would be very limited in terms of the models you fit and how you fit them. So for MLX, you can just take any Hugging Face safetensors model, download it, and it will run. For the neural engine, you need to make all those optimizations and trace the model a certain way so they will be able to run. And also the size of the memory or the size of the model will be constrained because of that. Does the neural engine exist on the desktop Apple Silicon devices or only on the mobile devices? No, it exists on all. Okay. So you have access to the neural engine. It's just that right now you can't necessarily choose that, because we don't have a platform that would give you that performance tradeoff. So it's mostly GPU. I think it's the most pervasive across all of them. So the same architecture can be used for iPad, iPhone, even Apple TV, if they do put Apple Silicon there, as well as Mac. So it's just much easier that way. And you can load really, really large models. You can, for example, load a 20-something billion parameter model quantized down to 4-bit on an M1 with 16 gigabytes of RAM. You wouldn't be able to load that big of a model into the neural engine. Yeah, most recently we've seen folks, you know, I've seen a lot of chatter on Twitter about the new OpenAI open source models and folks trying to get these running on, you know, very old and, you know, much smaller, like these 16 gig M1 Macs, and apparently they do work. Performance isn't amazing. It is possible. I think we could optimize inference further, especially for a mixture of experts. I have a counterintuitive argument, which is we have those experts and they are active at different times. You're not loading... You don't need all those experts in memory. If we could find a way to quickly load only the experts needed during that particular inference step that you're doing, we would be able to run it at a faster pace. But that's something that's not yet been done. I know this because last year I ran inference on Mixtral, which is the mixture-of-experts model from Mistral, across two Apple Silicon devices.
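To make the expert-offloading idea concrete, here is a toy sketch, written against MLX's Python API, of routing each token to its top-k experts and materializing only those experts on demand (an LRU cache stands in for paging expert weights from disk). The shapes, the router, and the caching policy are illustrative assumptions, not how MLX or any production mixture-of-experts runtime actually does this.

```python
from functools import lru_cache
import mlx.core as mx
import mlx.nn as nn

NUM_EXPERTS, TOP_K, DIM, HIDDEN = 8, 2, 64, 256

@lru_cache(maxsize=2)  # keep only a couple of experts resident at a time
def get_expert(idx: int) -> nn.Sequential:
    # A real system would memory-map this expert's weights from disk;
    # a freshly built random MLP stands in for "loading" here.
    return nn.Sequential(nn.Linear(DIM, HIDDEN), nn.GELU(), nn.Linear(HIDDEN, DIM))

def moe_forward(x: mx.array, router: nn.Linear) -> mx.array:
    scores = mx.softmax(router(x), axis=-1)        # (tokens, NUM_EXPERTS)
    rows = []
    for t in range(x.shape[0]):
        row = scores[t]
        top = mx.argsort(row)[-TOP_K:].tolist()    # this token's top-k experts
        y = mx.zeros((DIM,))
        for e in top:                              # only these experts are touched
            y = y + row[e] * get_expert(e)(mx.expand_dims(x[t], 0))[0]
        rows.append(y)
    return mx.stack(rows)

router = nn.Linear(DIM, NUM_EXPERTS)
tokens = mx.random.normal((4, DIM))
print(moe_forward(tokens, router).shape)  # (4, 64)
```

The point of the sketch is only the memory argument: per token, just TOP_K of the NUM_EXPERTS feed-forward blocks ever need to be resident at once.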
I was capped, of course, by the bandwidth of the smaller one, which is the M1, but I could get around four tokens per second. And my idea was we could offload the expert layers to a more powerful machine and quickly do this hot swapping between GPU, CPU, between the experts. And you could reach decent speed. So I got about reading speeds, which is around four tokens per second, using Thunderbolt 4, 5. Yeah, Thunderbolt 5. So we could definitely optimize it further. And who knows, potentially in the future, be able to run this on a smaller device by pruning some of the experts away and getting a much more compact model that is just decent for the particular task. So the pruning that you're talking about is like a real time adaptive pruning, like you wouldn't be eliminating experts altogether, you would be swapping them in and out on the fly, like trying to anticipate which expert is being invoked for a given inference? This is a very interesting question because we were, when I was at Arcee, it was one of the things that I was working on as well, which is I created, me and a colleague of mine created something called Fusion. And the idea of Fusion came because I was working on pruning. So when I joined Arcee, I had released the world's first Llama 3 with 6 billion parameters by pruning the 8 billion parameter model and then doing continuous pre-training on that model. So I could actually recover more than 75% of the performance of the 8 billion parameter model in the 6 billion. So my idea when I joined was, okay, let's explore and see if we can prune the width of the model, also the neurons of the model or the layers. And within that research, I found something very strange. When I was trying to do pruning of neurons, I found out that if I apply pruning on particular neurons, you could keep the language of the model, so the model could still speak English. And then when you apply it to others, the model will completely collapse. And I was like, okay, this is very interesting. I talked to this colleague of mine, Fernando. And I said, hey, I'm seeing this really weird behavior. Every single time I prune the model, since it's at random, I get different behaviors. I expected a somewhat similar behavior. But even if you do somewhat closer or less random approaches, the model will still behave completely differently. And we theorized that there is a subnetwork within each model, whether dense or expert, that is responsible for certain behaviors of the model. So meaning it's responsible for cognitive behavior, meaning being able to do math, or linguistic behavior, which is being able to understand English and output English. And then we ended up making Fusion, which is basically an algorithm that scans your model and is able to find the most important characteristics of the model, or the strongest characteristics. And you can actually take those and apply them to a different model, and then the resulting model will have characteristics of both parents without any training. This brings me back. This is based only on access to the weights? It's not based on like querying the model? No, no. There are zero queries. I'll give you a precise example of this. So we tested out something. He fine-tuned Phi-3, I think it's Phi-3.5 or 3, to do function calling. And then we took that function-calling model and merged it with the original model, right? Fused, not necessarily merged, but merging is just a simple word. Through this fusion process. Through the fusion process with the original model.
The original model initially could not do parallel function calling. But after fusing with this checkpoint, we could now do, how do you call it, parallel function calling, and the resulting model had better performance than both models combined. Oh, not combined, but better than both models on average. And this is with zero training, just access to the weights. That doesn't sound surprising if we're talking about the same model. Like you have one, if I'm understanding the scenario, you took Phi-3 and you trained it, and then you took another Phi-3 and you like, that seems like you could just do a diff on the weights and like move over things that change and then you would get the new behavior and the... Not necessarily. It's not that simple. So merging already existed for a very long time. You have SLERP, you have all other techniques, but Fusion, it actually surpasses merging by a mile. Like in terms of performance, it's reliably better overall. And we also did another experiment, where we were fine-tuning a model and by fusing intermediate checkpoints until the end, you end up with a much better model at the end in terms of eval performance across the board compared to just taking all models or even doing traditional merging. Are you saying that you're... I guess I'm trying to get at like what... you know, maybe where I'm oversimplifying it, you know, or where I don't have the right mental model of this, like, what would be more compelling to me is if either, like, you're able to do this fusion across model families, which just sounds like really hard and messy. Okay. Okay. You know, or like, you know, you've got, you know, a version of a model and, you know, you, okay, you've post-trained both models and this model is post-trained to do X and this model is post-trained to do Y and you are able to fuse them and this model can still do X and now it can do Y. That is exactly what fusion does. So it's not only about those examples I gave, but it does this reliably. There's a paper from a Chinese university, T something, I don't want to botch the name. There are papers from Cohere now talking about this. So it is a pretty great new technique. So it's capable of doing this. And because of that research that I was doing, it brings back the idea that, especially with a mixture of experts, we could do a lot of things to optimize them. We could do early exit. So instead of processing the token all the way down, we could find a mechanism or ways to train the model to skip and just go to the end. That would improve performance. Another one would be offloading the experts to the CPU. You don't need all the experts loaded at all times. And there are ways of doing this without having a query to test this out. And there are ways to do it. It just depends on what you want to do and what are your constraints. If you're on mobile, you definitely don't want to be doing this on-the-fly optimization. You probably will just create a submodel that has those properties that you want to run on mobile. But if you have more resources like on a PC, you should be able to just do this on the fly. Okay. Interesting, interesting. And so you've, you know, taking a step back and talking a little bit about like the, you talked about how you've kind of moved a thousand, you know, models over to MLX. I'd like to understand when you're approaching a new model, like what that process looks like. You mentioned quantization, you mentioned pruning. These are things that we've talked about on the podcast quite a bit.
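The episode does not spell out how Fusion selects a model's "strongest characteristics," so the sketch below shows only the simpler baseline that weight-space methods like SLERP, task-vector merging, and (reportedly) Fusion are measured against: a plain linear interpolation between two checkpoints of the same architecture, done purely on the weights with zero queries and zero training. The state-dict keys and shapes are made up for illustration.

```python
import mlx.core as mx

def linear_merge(base: dict, donor: dict, alpha: float = 0.3) -> dict:
    """Pull the base weights toward the donor's, key by key, with no training."""
    merged = {}
    for name, w in base.items():
        d = donor.get(name)
        if d is not None and d.shape == w.shape:
            merged[name] = w + alpha * (d - w)   # interpolate toward the donor
        else:
            merged[name] = w                     # missing or mismatched: keep base
    return merged

# Toy "state dicts" standing in for two checkpoints of the same architecture.
base  = {"layers.0.mlp.w": mx.random.normal((8, 8)), "lm_head.w": mx.random.normal((8, 32))}
donor = {"layers.0.mlp.w": mx.random.normal((8, 8)), "lm_head.w": mx.random.normal((8, 32))}
fused = linear_merge(base, donor, alpha=0.3)
print({k: v.shape for k, v in fused.items()})
```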
Like, but how are you thinking about the process? So I actually have a YouTube video on this where I walk people through the entire process. But basically, my process is quite simple. I go to Hugging Face. I go look at the configuration of the model. Usually, most models have a similar architecture. So the model type matters a lot. You should be able to see the model type and know whether it already exists in MLX or not. If it doesn't, I go ahead and look at that. In this case, is that something like transformer-based or something else? Or is it higher level? It's a higher-level argument in the configuration of the model that you can use to see whether... Usually in MLX right now, for all of the models, the name of the file, the Python file, or the folder of the Python files is actually the model type. So when we are talking about like safetensors versus another kind of artifact or something else, what are examples of types? Okay, so basically let's take Cohere. So they have Command R. Let's say the model type is command R. In MLX LM, if you go to MLX LM and then you look at the models folder, you should be able to see command R.py. That file contains all the code necessary to run inference using that model, right? So that's what I mean by model type. Hugging Face uses model type to load the tokenizer, to load, I think, model weights as well. And we just do it a more direct way, by calling the model type as the model file. And what I mean by that is like, usually if it already exists, you just run the model. You just download the model from Hugging Face and run it through the CLI or through your own Python script. Usually it's just MLX, name of the project, dot generate. And that should be enough to generate an initial response. If that does not exist, then I go look at the transformers implementation, which they usually have a transformers implementation. I look at the code and I just convert one by one. MLX is very similar to PyTorch in terms of the API design. It's more inspired by NumPy and JAX, but overall it's pretty close. If you put them side by side, you'll be able to see where the commonalities are. So you just look at the Python class. Let's say it's defining the vision part of the model. You just take that code and look at the MLX syntax, and you should be able to convert it pretty easily. There are a lot of footguns because MLX does things differently from PyTorch. You have to have experience with that, but it's pretty easy because we already have a lot of examples. So if you see a particular component on, let's say, Command R, and then they release Command A. If you go to Command A and you see that it has a similar class that already exists in MLX, you can just copy that in and you should be okay. I'm talking in terms of like beginners. Advanced people, I would just say, spend time in the MLX documentation, spend time looking at the code that already exists, spend time looking at the PRs adding support for new models, especially the family of models that you want to add. And then you'll be able to see the pattern and you'll be able to see what you need to do. In very rare cases, you will need to pretty much write something from scratch. And those are the hardest ones. So with MLX Audio, for example, I had to write a lot of layers from scratch that just don't exist in MLX right now.
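As a deliberately generic illustration of the translation step described above, here is what porting a typical transformer MLP block from a transformers-style PyTorch implementation to MLX might look like. The block is hypothetical rather than taken from any specific model; the mechanical differences are mostly the mlx.nn layers and __call__ in place of forward.

```python
import mlx.core as mx
import mlx.nn as nn

# PyTorch original (roughly what a transformers implementation would contain):
#
#   class MLP(torch.nn.Module):
#       def __init__(self, dim, hidden):
#           super().__init__()
#           self.gate_proj = torch.nn.Linear(dim, hidden, bias=False)
#           self.down_proj = torch.nn.Linear(hidden, dim, bias=False)
#       def forward(self, x):
#           return self.down_proj(torch.nn.functional.silu(self.gate_proj(x)))

# MLX port: same structure, mlx.nn layers, __call__ instead of forward.
class MLP(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def __call__(self, x: mx.array) -> mx.array:
        return self.down_proj(nn.silu(self.gate_proj(x)))

x = mx.random.normal((2, 16, 512))
print(MLP(512, 2048)(x).shape)  # (2, 16, 512)
```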
Then I sent some PRs, and some people helped me send some PRs to MLX to add those layers in, but those are very rare cases. It only happened to me because audio wasn't really as front and center for MLX initially, but now it's kind of taking up that space, and they are more aware that audio is growing, as well as video. So usually I just look at the config file, identify whether it exists or not. Second step, I just implement it if it doesn't exist. And after that, I just make a new release and the models are on MLX. After that, after the release is done, I usually upload all of the model weights to MLX in particular setups. So the first one is quantized to 3 bits, quantized to 4 bits, quantized to 5 bits, 6, 8, and BFloat16, which is usually half precision from the original models. Some models are float32. I just do half of that. We still keep the same quality and speed compared to the original. And people can just choose. So if you have a smaller machine, let's say an M1, M1 Pro, you can go for the 3-bit, 4-bit models or even 5, depending on your RAM configuration. And then if you have more performance, you have a kind of middle range MacBook to high range, you can go for the 6-bit, 8-bit, and BFloat16. Are you just like programmatically producing the various quantized versions of the models and uploading them? Or are you doing like testing to determine like where that performance quality sweet spot is? And okay, for this one, you know, four bit works, but you know, two bit doesn't work, you know, for this one, we can get it down to two bit. Or is it like you just do it and it's up to the user to determine what works for them and where they want to make that trade-off? Wow, Sam, the questions are really good. So initially I used to do this by hand. I would do one by one manually and test them. But nowadays I have to port an average of like two to three models a day sometimes. And that is just... And sometimes even more, it depends. In the morning it could be OpenAI, in the afternoon it could be Anthropic, and in the evening it could be Mistral. So I never really know who's releasing. So what I do, I test out the 4-bit and 3-bit. Those are the two where I see the results, and then the rest I don't see the results, because I expect the performance to be good. I expect the performance of those ones to be good out of the box. But the 3-bit, some models are just too small or too fragile or too sensitive to quantization that 3-bit will not give you any good outputs. So I watch out for those. 4-bit is usually standard. It just works. Might not give you the best performance when it comes to, for example, vision language models. But it works pretty well for text and audio. Because even when you think about audio, especially the more recent audio models, they are language models. So the tokens they are outputting, you could imagine them as just normal text tokens. So they are a bit more resilient to quantization. But vision models, they are quite sensitive. So three bits, you have to see how big the model is. If the model, let's say, is one billion parameters, it probably doesn't make sense to make a three-bit version of that model because it's too small to... you're compressing the model too much. Yes. So I usually just test out those two and then the rest I upload and I wait for community feedback.
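A hedged sketch of how the bit-width variants described above could be produced with mlx-lm's convert utility. The repo id is a placeholder, and exact keyword names can differ between mlx-lm versions, so treat this as a starting point rather than a recipe.

```python
from mlx_lm import convert

for bits in (3, 4, 6, 8):
    convert(
        hf_path="org/some-new-model",            # placeholder upstream Hugging Face checkpoint
        mlx_path=f"some-new-model-{bits}bit",    # local output directory for this variant
        quantize=True,
        q_bits=bits,
        q_group_size=64,
    )

# Roughly equivalent CLI form (one variant at a time):
#   mlx_lm.convert --hf-path org/some-new-model -q --q-bits 4
```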
Right now on Hugging Face, I have, let me just quote this correctly. On Hugging Face, I have over 10,000 notifications. These are people that want you to port their model to MLX? Sorry, correction. Not 10,000, but 1,700. These are people giving me feedback on particular models, asking for ports of new models, and usually these also end up on GitHub, and I just have lots of notifications. I think 10,000 is because I also have email, and overall, I was counting this week, I have over 10,000 notifications across all platforms. So yeah. I'm imagining you building like some LLM-based thing to like go through all these notifications. I talk about that a lot. I actually released a video recently talking about Claude Code because my expectation, I had this dream I actually posted, I want a system that could be a great intern for me, because I want someone to check out the PRs. I want someone to see the GitHub issues across all my projects and just notify me if there's anything. I already have a body of work, so the model could just see whether the new PR has the standard that I want. Or just, you know, give me an initial breakdown of what is there. Because it's like too many things to manage. But we're not there yet. It's still a long way to go before we get to an agent that you can reliably say, here are the keys to my life, go wild. Plus, especially with open source, I really want people to know that I care, know that I am there. I'm reading every single notification; I just usually keep them as unread because it's easier to go back to some of these notifications. On Hugging Face I don't check, because it's too much, like there's a lot of things going on. GitHub, for sure, it's me. It's always me. I'm there all the time. You've talked about quantization so far. Pruning you mentioned as well. Is that an extraordinary thing that you do, a regular thing that you do? When do you incorporate pruning into the process of porting a model over? Okay, so I didn't mention this. So with quantization, there are a couple of new techniques, especially the ones developed by Awni and the MLX team. These are called DWQ and AWQ. AWQ, everyone already knows, which is activation-aware quantization. Basically, at inference time, you figure out which neurons fire the best and you quantize the rest. And then DWQ is more of a quantize the model down and then use a higher precision model, either an 8-bit, for example, if you quantize the model to 4-bit, or you use the full precision model, and you do this sort of distillation. So basically you're taking the model, quantizing it, and recovering some of that performance through distillation. I am quite familiar with this. At Arcee, we worked on DistillKit. It was one of the things that we were pretty much the best at doing, and we created this toolkit where people can do it as well. That technique is pretty good. Right now, I just don't do it because it requires a lot of compute. I'm waiting on the M3 Ultra that the community got me to be able to do that. So once the M3 Ultra gets here, I'll be doing more of these specialized fine-tunings of quantized models. Additionally, with pruning, I think it also falls into this bucket. It requires some compute to recover the performance of the original model, which means you have to take some data set and train the model again a little bit or have a LoRA to be able to recover that performance. Because what you're doing at the end of the day is cutting the model dry. You're just going in and saying, hey, you are out, you're out, you're out, you're out.
Even if you have some sort of metric that tells you which neurons to prune, which is something that I do, I already created the algorithm to do this, you still need some training. Usually you would need an A100 cluster of GPUs to do this. And I'm just waiting to get a more powerful machine to be able to test out some of these new techniques that can improve the performance of quantized models. But I'll give a shout out, actually, to DeepMind. They have been creating a lot of quantization-aware models that they train beforehand towards being optimized for 4-bit or even lower quantizations. So if more companies did that, I think that would actually save me time and save a lot of people in the community time as well. So thus far, we've primarily been talking about models that other people create and you help port over to MLX. But in the intro, and you've mentioned MLX Audio, I believe, there's also MLX Embeddings, MLX VLM. Like these are models that you've created. So MLX Audio and Embeddings and MLX VLM are mostly packages for doing inference, fine tuning of these models, and other optimizations there as well. Now, we are actually working on a model. Before we get to that model then, so these are all similar to, when you mentioned audio, like MLX is the foundation, but in order to support audio models, there's a core set of components that are required. And that is MLX Audio. You've like built out this kind of audio adaptation layer or, you know, you tell me what you call it, but that's the way I'm thinking of it. Think of it as a package or Python library, and even Swift. We also have a Swift package that allows you to run inference of all the models that we port. So just like MLX LM, when we see a new model in audio, we port it there, and we provide you the way to do inference. And this can sit on top of your applications as your local inference engine. We have servers, so you can start a server that runs on your MacBook and serve it to your network, so you'll be able to generate audio or process audio on the fly in your network, or use the Swift implementation where you can build an application with MLX Audio components built within. So you'll be able to generate audio, do audio understanding, and more. Got it. So the way to think about this is I've got a model that's a file, that's a bunch of weights, that's structured in a particular way. MLX knows how to take those weights and like do the math that's required to constitute a model, but then you still have this problem of efficiently invoking or using MLX to do an actual inference, and that's what these libraries are, you know, for audio and embeddings and VLM. Exactly. So it's quite interesting. I ended up creating more libraries than I wanted to. MLX VLM now is basically an Omni package. So we do inference of more than vision language models. We now do inference of models that support audio as well. So I wish all of this could be like one package, but unfortunately, just for the sake of my sanity, I needed to separate them and be able to think about them individually. But maybe MLX VLM comes to supersede all these other packages and becomes like a single package for MLX inference? I think this is going that way, especially with the advent of what I call media models. So models nowadays not only support text or vision as input, they support audio. That is something we support at the moment. Also, we support video; we could consider that as well. But soon we will support even more types of modalities.
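To make the "inference layer" framing concrete: with mlx-lm, the whole loop is a load call that pulls a converted checkpoint plus its tokenizer and a generate call that runs it, and mlx-vlm and mlx-audio expose analogous load/generate entry points for their modalities. The repo id below is a placeholder and keyword arguments may vary across versions.

```python
from mlx_lm import load, generate

# Load a quantized checkpoint (placeholder repo id) and its tokenizer.
model, tokenizer = load("mlx-community/some-model-4bit")

# Run a single generation; the library handles the Metal and unified-memory details.
text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory on Apple Silicon in one sentence.",
    max_tokens=128,
)
print(text)
```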
And eventually I see all of that kind of falling into MLX VLM. Then I would have the problem of renaming it to something. Either MLX O, because Omni is taken, so I have to think about a new name. But eventually I think that is the direction the industry is going overall. I'll talk more about that later, but right now, MLX Audio is focused only on audio models. So if you want to do text-to-speech, speech-to-text, and speech-to-speech, you can use MLX Audio. A great example of this, I don't know if you heard of Unmute from Kyutai. We have a similar pipeline that we released to open source weeks prior to their launch. Basically what our pipeline does, it's called a modular speech-to-speech pipeline, it allows you to take any language model or vision language model and turn that into a speech-to-speech conversation. So you can turn that model into a ChatGPT with voice, for example. So you'll be able to speak to it and have it speak back to you. And you can use whatever language model or vision language model you are most interested in. Interesting. And so how is that working? I got the impression that in order to have a, you know, let's say useful experience for speech-to-speech, like you have to very tightly optimize that loop. And so like being able to say that you have like an abstracted one where you can just plug in any model, like, that sounds hard. No, it actually works. So the beauty of it is the entire pipeline runs on MLX. So all of the models. So the speech-to-text model is an MLX model. So right now we support Whisper and Parakeet, which are the top automatic speech recognition models. Parakeet's the NVIDIA model. Yeah, and we also support Gemma 3N through MLX VLM. So Gemma 3N also supports audio, vision, and text as input. So you could also count that as one model there. Then we have the language model component. You can take any language model or vision language model supported by MLX and use it there. So you can use a model quantized to 3-bit, 4-bit, whatever quantization you want, and whatever model we support. And then we have the text-to-speech part, which is also on MLX, and it was how MLX Audio was created. So you can choose Kokoro, which is an 82 million parameter model. It's a tiny, tiny, tiny model. You can use all of the existing models we have, including Sesame, etc., and our upcoming model that is going to bridge that gap, because right now, all of the existing models don't do streaming of responses. And if you were to stream the audio back, it will be super choppy. We are solving that, and our model is able to do pretty much real-time on an M3 Max. Full precision. So this is full precision. If you quantize it down, we should be able to do real-time or close to real-time even on an M1 Pro and potentially M1. And this is the new model you're working on? Yes, it's called Marvis. So it should be out, yes, my really awesome real-time intelligence system. As opposed to just another. As opposed to Jarvis. So this is where we are super excited. And I will share more details about this later today. Okay. Okay. Later today in our conversation or later today on Twitter? In our conversation as well as on Twitter. Let's do it. Share more details. Tell us about Marvis. With Marvis, we wanted to solve the problem that we also had with the modular speech pipeline. It's pretty fast, but if you want quality and you want to be able to stream, let's say you have a really large text that was outputted by your language model.
With previous models, you'd have to divide the text every hundred characters, probably, or every hundred words, and create audio for those individual parts or paragraphs. And that is not good because, one, the audio will be super choppy. Second, you have interruptions that are unnecessary, so the audio will not feel natural. You'll see, for example, you might have to have a break mid-sentence, because if you generate any longer than that, it's going to take a long time for the user to hear the feedback. We found out about a particular decoder that is able to do streaming. So we have around 80 milliseconds of response time for the first audio. And I was like, okay, this is great. But there's no model within the category that we are aiming for, which is just a few hundred million parameters or even just a few million parameters. So we set out to build this model, and we have, like, I'm very happy to announce that we succeeded in that. We have our first model. It is about 250 million parameters and it can run real time on Apple Silicon. And you will be able to pretty much hear the voice as it comes, with no breaks or even like lag detected. Talk to me about like what the model is. Is it a port of some other model? Is it a post-trained version of an open model? Is it a ground up model? So we built this model from scratch. We are training it from scratch. It is based on an existing model, which is called Sesame. We kind of were inspired by the data they shared, which is not a lot. We only have the inference code and some blog articles to go from, but we managed to reverse engineer most of it to be able to have a great, great model. And we trained it from scratch using a few NVIDIA cards. Initially, it was a cluster, so I burned $700 initially training the first version. It turns out we needed to optimize a few things. We optimized those further, and now we are training using a single GPU card. And we just finished training of the first version. Now we are training an even smaller version of the previous one so that people with even less capable machines will be able to run Marvis close to real-time. So it's basically, the architecture is basically a language model. It's a language model that gets your text in and outputs audio tokens, and those audio tokens are decoded into audio, at the high level. But we will have the data, we will have the article, the code, everything will be open sourced. What are some of the things that you're seeing folks do with kind of these speech-to-speech models? Primarily like audio assistant types of experiences, or have you seen anything that really blew your mind? To answer that question, I need to go back a little bit in time. Late last year, I was making a demo for Data Science Summit. I think it's the shirt I'm wearing right now. I've been speaking there for pretty much the past three years. And I was thinking about an audio solution because I had one problem. I created a computer use system using VLMs. So I had an MLX VLM model analyze my screen and be able to click stuff and achieve a certain task based on my input. But every single time I had to start a task, I had to walk to the computer, and that became annoying. And I said, why do I have a computer agent but I have to sit in front of the computer and have to watch it do everything I told it to do? I don't want that. If it's an agent, it's supposed to be kind of in the background doing this, and I should be able to command it in any way, shape, or form.
And on top of that, I think the major reason why I created MLX Audio is because my father is actually blind. And he became blind in the past five years, and he loves to read. And I told him, hey, I will help you read. I think in 2020, I promised him that I'll help him read and navigate the world again. It's part of the reason why I created MLX VLM, to help him see. So he has an iPhone. He'll be able to like navigate the world and be able to detect what's in front of him. I actually made some glasses for him, but those are just not Ray-Ban quality. They would be super bulky for him. At a hackathon, I made some glasses that would help him see the world, but it didn't work out. I just saw there's a group that's doing like an open platform for glasses. Have you seen that? Well, if I get my hands on those, my dad will be able to see the world again. So I promised him I would help him see the world. And ever since, I've just been focused on vision and natural language. And now with audio, the task was he loves to read books, but he can't really read physical books anymore. And not every book is available as audio. So I told him, okay, I'm going to generate the audio of the books that you want. So he would give me some books that he wants. I would generate the audio and send it to him. He could listen to those books using MLX Audio. And that is like part of the reason. But then going back to the computer use case, I thought, well, what if I could just sit on the sofa the way I'm seated right now? And I could just say, hey, Marvis, go ahead and order me this or find me this, or send an email to Sam saying, hey, this was amazing. That should trigger the agent. The agent should go on and do the task and just give me feedback as it's going. Now I open Chrome. Oh, now I'm actually writing the email. I wrote this to Sam. What do you think? Like I want that level of interactivity. So I set out to build MLX Audio to solve that problem. And I made a demo, I think, in the first month that I released MLX Audio doing exactly that. So we now have computer agents that run locally that can be able to speak, which is great. So with Marvis, this is going to go to the next level, which is actual real-time voice agents that can have a full-blown natural conversation with you, and it should be close to a conversation you would have with someone that's next to you. And it can also clone voices, so you can use the voice of a loved one if you want to. So that could be really cool. For my dad, he hasn't seen me in close to 10 years. So I could make his voice agents sound exactly like me. And I already have that running locally on my machine. Oh, wow. Wow. Very cool. I know you did a video recently on Claude Code. I think you mentioned it in our conversation. And also, that would be a great use of speech-to-speech agents. So you can code your way through a Claude Code project as opposed to using Whisperflow or something like that to paste into the thing. Absolutely. It could even be built into Claude Code. I just don't have the bandwidth. I have so many ideas, but I just don't have the bandwidth to do them. If I were to give in to my cravings to create, I would create a million things. That is one use case right there that you can build in. Claude Code could be able to speak all the thoughts and everything. Because I don't know about you, but personally, when I am coding with a model, I'm actually more interested in its logic than the answer.
I find myself looking at the thought process, opening that tag, like, thought for three minutes, click, let me see what it thought. Ah, this is wrong. This thought is wrong. Then I can already deduce whether the answer is correct or not. So if it could just speak through, now I'm thinking about this, or just, I don't know, find a way to find the most interesting things or summarize the thought process and just speak it out. I think that would make collaboration much, much easier. Well, especially if you can like reduce the latency of like, you know, being able to just very casually interrupt it and say, well, I don't know about that. Maybe think about it like this, and then have it, you know, continue. Like you can do all that with the keyboard, but it's not fluid. Like you're working with the intern that's sitting next to you or you're pair programming with this machine. You should try our modular speech-to-speech pipeline. It does exactly that. Like one of the things I wanted is the ability to interrupt it. So I could interrupt it at any time and just say whatever I wanted to do and it will do it. Now, the question is, I want to be able to empower developers and create the best SDK so that people can just integrate voice into their products and pretty much all of their local solutions, be it open source or enterprise if they want to. And this is what I'm working on, and Marvis is a step towards that. So is that the next big thing in your world, Marvis? Or do you have like big plans beyond that about how it gets like integrated into other things? I have big plans about how it integrates with everything else, but I'm the type of person that is focused on what I can offer and then going step by step. I already have, like, probably the next three to five years of Marvis in my mind. But I think the most important thing right now is getting it out of the door, getting people to use it. And then we'll be able to do really, really cool things. We already, I don't want to say too much, but we already have really great examples of what Marvis could do. You should be able to, for example, manipulate Marvis to have an actual agent that speaks the way you want it to. Whether you want a Jarvis, more relaxed way, or you want it a more active way, or I could just take all of your audience and say, hey, sound like Sam in terms of cadence, prosody, everything. Just have an agent that sounds exactly like you. I think that is important for me in particular. I think I would love that. I haven't seen my family in a long time. And I think having agents with some of my loved ones' voices would be pretty cool. Or them having me, which would be nice. Then I would annoy them everywhere. That's funny. Awesome. So we will keep our eyes peeled for Marvis updates and get a link from you as to where we can go to look for that. Of course, it'll be on X as well. Very cool. And tell me what else, like, you know, beyond the things that you're working on personally, like, what are you most excited about in the world of AI? I think the most interesting thing right now is the direction that we are taking. Initially, it was all AGI, AGI, AGI. But now people are realizing that, yes, that is great. It would be a great world, but I want a system that is capable enough or competent enough to have an actual intellectual discussion with me and tell me when I'm wrong in a very reliable way. I can trust it. Like, I want a system that I can trust. And I feel like we are moving towards systems that we somewhat can trust.
There's so much backlash to "You're absolutely right." This came up in an interview I just did. People are realizing that actually we want something that's going to challenge us and like help, you know, work with us to like produce better things or, you know, think better about the thing that we're asking about. Exactly. I have a tweet where I say, I said 22 hours ago, I'm waiting for the day it will say: you are wrong, here's why, and delivers facts with sources. I want that. You're absolutely wrong. I think I want that. I don't want a system that is just like, oh, you're right. Or, oh, it's my bad. My bad what? You're actually right the majority of the time. You just have this tiny bug here. And I want you to just fix that one thing. So that is one direction that I'm loving to see companies take on. So they are trying to build new architectures and new training strategies to be able to address that. Second thing is media models. I think over time, the same way last year when I created MLX VLM there were literally very few vision language models, but nowadays vision language models and including audio models can do function calling. So you can do function calling with your voice, which enables experiences like Marvis. You'll be able to do way more with new types of media. So I'm seeing a lot of interest in the audio space. I'm seeing a lot of interest in the video and lip syncing space, as well as like generation of images. I think we are going towards a world where newer models can do all of those modalities in one, and they can do it in a very reliable way, and efficient in the sense that you could get a 1 billion parameter model or 100 million parameter model that could do a few modalities, and you can embed these directly into your products and make your life simpler. To kind of illustrate this, before Gemma 3N, for the modular speech-to-speech pipeline, you needed ASR and you needed a language model. Now, with Gemma 3N, you could replace the first component and the second component with one. And actually the third component, which is multimodality with vision. So you now have one model that does audio understanding, does text and image. Now imagine if you could also do other types of processing all in one or even output audio as well. So now you can simplify your pipeline, and your entire system can just run much faster because you're not loading all of these individual components and trying to mesh them together. So I see media models becoming something pretty great. And I intend to pioneer that, at least when it comes to MLX. I really want to have people do this in the best way. To add to this, I actually have a challenge with someone. Julian challenged me in Germany last month to create a package that's able to generate images and video in under 20 seconds with a few gigabytes. And I'm taking on that challenge. So look out for that as well. And the goal there is something along the lines of Sora or Veo? Yeah. Okay. But it runs completely offline. Completely locally? Yes, and locally. So I really want to do that. Yeah, I want to do that. I think the world is going to benefit a lot from it. And who knows, perhaps the whole equipment here for the audio, because it's for the studio, might be replaced. In some videos, I might just take a copy of me and make a video, a generated video, and say, hey, guys, at least I thought it through. I made the script and you guys can watch something pretty new. That's awesome. Awesome. Well, Prince, it was great chatting with you.
Thanks so much for jumping on and talking a little bit about what you've been up to and how the MLX ecosystem is evolving. Yeah, absolutely. Thank you very much for having me. Thank you. Thank you.
Related Episodes

Rethinking Pre-Training for Agentic AI with Aakanksha Chowdhery - #759
TWIML AI Podcast
52m

Why Vision Language Models Ignore What They See with Munawar Hayat - #758
TWIML AI Podcast
57m

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757
TWIML AI Podcast
48m

Proactive Agents for the Web with Devi Parikh - #756
TWIML AI Podcast
56m

AI Orchestration for Smart Cities and the Enterprise with Robin Braun and Luke Norris - #755
TWIML AI Podcast
54m

Building an AI Mathematician with Carina Hong - #754
TWIML AI Podcast
55m