

A Technical History of Generative Media
Latent Space
What You'll Learn
- ✓ Fal started as a Python runtime in the cloud, then evolved into an inference system and a generative media platform that optimizes models for developers.
- ✓ The company's pivot towards hosting and optimizing Stable Diffusion 1.5 was a key turning point, as they recognized the need for a scalable, API-driven way to run these models.
- ✓ Fal has seen significant growth, reaching over $100 million in revenue and serving 2 million developers with 350 models on its platform.
- ✓ The company has strategically focused on the generative media market, which it sees as a fast-growing niche with less competition from tech giants than language models.
- ✓ The technical challenges centered on optimizing CUDA kernels and inference performance, a space that has become increasingly competitive as larger players like NVIDIA have entered it.
- ✓ Releases of models like SDXL, Flux, and Veo 3 have been major milestones, driving significant revenue growth and expanding the capabilities of the generative media ecosystem.
AI Summary
The podcast discusses the technical history and evolution of generative media platforms, focusing on the journey of Fal, a company that specializes in optimizing inference for image, video, and audio models. The founders share insights into their decision to pivot the company towards this niche market, the key milestones in the growth of their platform, and the competitive landscape in the generative media space.
Key Points
1. Fal started as a Python runtime in the cloud, then evolved into an inference system and a generative media platform that optimizes models for developers.
2. The company's pivot towards hosting and optimizing Stable Diffusion 1.5 was a key turning point, as they recognized the need for a scalable, API-driven way to run these models.
3. Fal has seen significant growth, reaching over $100 million in revenue and serving 2 million developers with 350 models on its platform.
4. The company has strategically focused on the generative media market, which it sees as a fast-growing niche with less competition from tech giants than language models.
5. The technical challenges centered on optimizing CUDA kernels and inference performance, a space that has become increasingly competitive as larger players like NVIDIA have entered it.
6. Releases of models like SDXL, Flux, and Veo 3 have been major milestones, driving significant revenue growth and expanding the capabilities of the generative media ecosystem.
Topics Discussed
- Generative media
- Inference optimization
- Stable Diffusion
- CUDA kernels
- Startup growth and strategy
Frequently Asked Questions
What is "A Technical History of Generative Media" about?
The podcast discusses the technical history and evolution of generative media platforms, focusing on the journey of Fal, a company that specializes in optimizing inference for image, video, and audio models. The founders share insights into their decision to pivot the company towards this niche market, the key milestones in the growth of their platform, and the competitive landscape in the generative media space.
What topics are discussed in this episode?
This episode covers the following topics: Generative media, Inference optimization, Stable Diffusion, CUDA kernels, Startup growth and strategy.
What is key insight #1 from this episode?
Fal started as a Python runtime in the cloud, then evolved into an inference system and a generative media platform that optimizes models for developers.
What is key insight #2 from this episode?
The company's pivot towards hosting and optimizing Stable Diffusion 1.5 was a key turning point, as they recognized the need for a scalable, API-driven solution for running these models.
What is key insight #3 from this episode?
Fal has seen significant growth, reaching over $100 million in revenue and serving 2 million developers with 350 models on its platform.
What is key insight #4 from this episode?
The company has strategically focused on the generative media market, which they see as a fast-growing niche with less competition from tech giants compared to language models.
Who should listen to this episode?
This episode is recommended for anyone interested in generative media, inference optimization, and Stable Diffusion, and for anyone who wants to stay updated on the latest developments in AI and technology.
Episode Description
Today we are joined by Gorkem and Batuhan from Fal.ai, the fastest growing generative media inference provider. They recently raised a $125M Series C and crossed $100M ARR. We covered how they pivoted from dbt pipelines to diffusion model inference, the models that really changed the trajectory of image generation, and the future of AI video. Enjoy!

Timestamps:
00:00 - Introductions
04:58 - History of Major AI Models and Their Impact on Fal.ai
07:06 - Pivoting to Generative Media and Strategic Business Decisions
10:46 - Technical Discussion on CUDA Optimization and Kernel Development
12:42 - Inference Engine Architecture and Kernel Reusability
14:59 - Performance Gains and Latency Trade-offs
15:50 - Discussion of Model Latency Importance and Performance Optimization
17:56 - Importance of Latency and User Engagement
18:46 - Impact of Open Source Model Releases and Competitive Advantage
19:00 - Partnerships with Closed-Source Model Developers
20:06 - Collaborations with Closed-Source Model Providers
21:28 - Serving Audio Models and Infrastructure Scalability
22:29 - Serverless GPU Infrastructure and Technical Stack
23:52 - GPU Prioritization: H100s and Blackwell Optimization
25:00 - Discussion on ASICs vs. General Purpose GPUs
26:10 - Architectural Trends: MMDiTs and Model Innovation
27:35 - Rise and Decline of Distillation and Consistency Models
28:15 - Draft Mode and Streaming in Image Generation Workflows
29:46 - Generative Video Models and the Role of Latency
30:14 - Auto-Regressive Image Models and Industry Reactions
31:35 - Discussion of OpenAI's Sora and Competition in Video Generation
34:44 - World Models and Creative Applications in Games and Movies
35:27 - Video Models' Revenue Share and Open-Source Contributions
36:40 - Rise of Chinese Labs and Partnerships
38:03 - Top Trending Models on Hugging Face and ByteDance's Role
39:29 - Monetization Strategies for Open Models
40:48 - Usage Distribution and Model Turnover on FAL
42:11 - Revenue Share vs. Open Model Usage Optimization
42:47 - Moderation and NSFW Content on the Platform
44:03 - Advertising as a Key Use Case for Generative Media
45:37 - Generative Video in Startup Marketing and Virality
46:56 - LoRA Usage and Fine-Tuning Popularity
47:17 - LoRA Ecosystem and Fine-Tuning Discussion
49:25 - Post-Training of Video Models and Future of Fine-Tuning
50:21 - ComfyUI Pipelines and Workflow Complexity
52:31 - Requests for Startups and Future Opportunities in the Space
53:33 - Data Collection and RedPajama-Style Initiatives for Media Models
53:46 - RL for Image and Video Models: Unknown Potential
55:11 - Requests for Models: Editing and Conversational Video Models
57:12 - Veo 3 Capabilities: Lip Sync, TTS, and Timing
58:23 - Bitter Lesson and the Future of Model Workflows
58:44 - FAL's Hiring Approach and Team Structure
59:29 - Team Structure and Scaling Applied ML and Performance Teams
1:01:41 - Developer Experience Tools and Low-Code/No-Code Integration
1:03:04 - Improving Hiring Process with Public Challenges and Benchmarks
1:04:02 - Closing Remarks and Culture at FAL
Full Transcript
Hey everyone, welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, founder of Smol AI. Hello, hello. Today we're so excited to be in the studio with Gorkem and Batuhan of fal. Welcome. Yeah, thanks for having us. Long time listener, first time caller. Gorkem, you and I actually go back a long way, to when it was still "features and labels" and you were just coming out of Amazon. I don't even remember the pitch. I should look at my own notes, but you were optimizing runtimes. Yeah, at first we were building a feature store, and then we took a step back and decided to build a Python runtime in the cloud. And that evolved into an inference system, which evolved into what fal is today, which is a generative media platform. So we optimize inference for image and video models and audio models, but we do a lot more. We try to own this whole generative media space for developers, basically. Yeah, amazing. And we can talk about that journey. I wanted to also introduce Batuhan. We're newer to each other, but you've come to some of my meetups before. You're head of engineering. Yeah, I lead engineering here at FAL. You know, glad to be here. And what's your journey? I met Burkay in 2021 when they were just starting the company, just before the seed round. Burkay and Gorkem and I met online. We're both Turkish, so I think that was the connection. We just met and then they said, oh, why don't you join us? I was one of the core developers of the Python language, so I had really good experience with developer tools around Python. So I started coming here to build the Python cloud, which evolved into this inference engine and the generative media cloud that we're building today. And now you spend less time with Python and more time with, I don't know, CUDA. Custom kernels. Exactly. Yeah, I remember dbt-fal, when the modern data stack was out. Can you guys maybe just give a quick sense of the scale of fal? You just raised $125 million. Seriously? We can talk about how I passed on one of your early rounds. We can go through that. How many developers, how many models do you serve, and maybe any other cool numbers? Yeah, we have around 2 million developers on the platform. And for the longest time we required GitHub login. It recently changed, so I'm assuming everyone who has a GitHub account is a developer. And we have around 350 models on the platform. These are mostly image, video and audio models. It used to be only image, and then we added audio, and the space evolved into video as well. And yeah, that's pretty much the scale. We just announced our Series C round and we've been growing a lot in the past year, and it's still continuing. Yeah, you had a very nice Series C party. Yeah, thank you. And you guys are over $100 million in revenue, right? This is not just developers kind of kicking the tires. That's correct. Yeah, that's great. When you say 350 models, what percentage of all the models that you could serve is that? There is an infinite amount of fine-tuned, post-trained versions of these models. We are trying to serve the models that fill a gap in the stack. So we don't add a model that's significantly worse in any aspect compared to other models that we have. We are trying to bring unique models that solve a customer's needs.
So these are 350 models; there are like 20, 30 text-to-image models, but one of them excels in logo generation, another one excels in human face generation. So every model has a unique personality, but if a model is significantly worse in all aspects, we don't add it to the platform. So there's an infinite amount of models that we could add. And do you rely on your own evals or just what the community tells you? We mainly rely on our own evals, as well as, you know, we are in the community. So we also follow the community very closely to see what is going to be the thing that's going to be in the next generation of apps. So if we think something, like we have a good intuition, if we think something is going to pop up, we just add it. Yeah. To my knowledge, you haven't published your own evals, right? No, we don't publish. It's internal. And then the community is Reddit, Twitter? Twitter, Reddit, you know, Hugging Face, seeing how popular the models are on Hugging Face and other demos. Okay. I just want to give people a sense of where to get this info. The best part of the job is the day of a model release, the adrenaline rush that comes with it, the whole team trying to scramble something together and release it. And it happens every week. Every week is exciting. Can we do maybe a brief history of the models that were the biggest spikes in usage? You know, I think everybody knows Stable Diffusion, and then you have maybe the Flux models, and then you have Black Forest Labs. You have all these different models. History-wise, I think the biggest, the initial hit was Stable Diffusion 1.5, which is when we actually pivoted into this new paradigm of fal, the generative media cloud. We started hosting it. We had the serverless runtime and we noticed everyone was running Stable Diffusion 1.5 by themselves. And we noticed it's terrible for utilization and they are not optimizing it. So let's just offer an optimized version of this that's ready as an API, that can be scaled, and that doesn't require people to deploy Python code, because we want product engineers to start using it. We want mobile engineers to start using it. So we started offering Stable Diffusion 1.5. It was very popular. The fine-tunes around it were very popular. Stable Diffusion 2.1 came. It was a bit of a flop, so it didn't get that much attention. And then SDXL came, which was the first major model that brought our first million in revenue, if you consider that. And with SDXL, obviously the fine-tuning ecosystem also, like, exploded. People started fine-tuning their faces, their objects, whatever. And generations with these LoRAs started to become very popular. And then after Stable Diffusion XL, there was a bit of a quietness around it. You know, SD3, there was some drama around it. And the team at Stability left to start Black Forest Labs, which released the Flux models. And that was the first model to cross the bar of commercially usable, you know, enterprise-ready, great models. In the first month of the Flux models, we went from like $2 million to $10 million in revenue. It was a big jump. Next month, we were at $20 million. It just started going from there. And then video models came around. You know, we partnered with Luma Labs. We partnered with other video model companies in China. We partnered with Kling from Kuaishou, with MiniMax.
And with these models, it created another market segment. That was a big jump. And the final biggest thing was Veo 3, where it actually created this usable text-to-video component, where before, text-to-video was a very boring, soundless video that you wouldn't get much enjoyment out of. Whereas now it's such a great experience. You can create all these memes that you're seeing online, all these ads. So that was another big jump for us, partnering with Google DeepMind on Veo 3. Yeah, actually, that's a really good history-of-generative-media sound bite. So I wanted to double-click on that, because obviously we can dive in. I think everyone's interested in video, but there's a whole history on the image side that I wanted to cover first. I definitely wanted to start with the decision to pivot. I just want to double-click on that. It's not a trivial decision, but obviously the right one. At the time, I would say a lot of people were hosting Stable Diffusion, right? So it wasn't obvious that you could just build an entire company around effectively specializing in diffusion inference. What gave you the confidence? What were the debates back and forth? Yeah, a couple of decisions we had to make there. We could have evolved the company more towards GPU orchestration. Essentially we had this Python runtime, we were running it on top of GPUs, and that could have been the company. But we saw every single person, every single company who was using what we had, like a little SDK to run Python code on GPUs, they were doing the same thing. They were deploying a Stable Diffusion application, maybe using some LoRAs on top of it, different versions of it, in-painting, out-painting, things like that. I mean, it was very wasteful. We decided, okay, this needs to be an API where we actually optimize the inference process and everyone benefits from it. And you can run it multi-tenant, you know, the utilization is much higher then. So that was decision number one. And then obviously after Stable Diffusion, I think like four or five months later, Llama 2 came out, and there was a decision point again. You could do language models. Exactly. And a lot of the inference providers at the time, there were maybe a couple of them, and they all went all in on language models. And we decided hosting language models is not a good business. At the time, we thought, okay, we are going to be competing against OpenAI and Anthropic and all these labs. Turned out that it was even worse, because the killer application of language models is search and you are competing against Google in the end. And Google can basically give this away for free if they can, because it's so important for them. And, you know, it threatens their business right away. And with image and video models, it was a net new market. We weren't going against any incumbent. We weren't trying to get market share from someone much bigger than us. And we liked that aspect of it. We thought we could be a leader here. It was a niche market, but it was very fast growing. So we chose to be a leader, or play to be a leader, in this fast-growing niche market rather than trying to go against Google or OpenAI or Anthropic. So that was the decision we made. And it turns out it's a good one, because we are able to define the market we are in and educate people and grow with it. And so far, it's been growing fast enough that we are able to build a whole company around it. Yeah.
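To make the "optimized model behind an API" idea above concrete from the caller's side, here is a minimal sketch assuming fal's Python client package (fal_client) and FAL_KEY-based auth; the endpoint ID, argument names, and response shape are representative examples and should be checked against fal's current docs rather than taken as confirmed.

```python
# pip install fal-client, then export FAL_KEY=<your key>
import fal_client

# One call to a hosted, pre-optimized diffusion endpoint: no GPU provisioning,
# no Python deployment, no per-customer utilization problem.
result = fal_client.subscribe(
    "fal-ai/flux/dev",  # representative endpoint ID
    arguments={
        "prompt": "an isometric illustration of a GPU data center, soft lighting",
    },
)

# Generated images typically come back as hosted URLs.
print(result["images"][0]["url"])
```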
And I think you noted at AIE that, you know, now there's a generative media track. Generative media specialist investors. Thank you for calling it generative media, by the way. Yeah. I mean, obviously, it's a thing and people care about it. And I do think it's going to change the economy. And as a creative person, I also wonder what it's going to do for us. So I want to keep it technical and keep thinking about the pivot, because I think it's still one of the most interesting pivots I've seen in the AI era. You were not a CUDA kernel specialist at the time, right? I come from a compilers background. So my job was optimizing the Python bytecode interpreter to make stuff faster, which is performance engineering. And yes, I don't think at the time there were that many CUDA kernel specialists either. So we were at the right time. Actually, the space was so, so much worse than what we have today, where the baseline of running Stable Diffusion 1.5 was a UNet with convolutions, and the convolution performance on A100s was such that you're getting maybe 30% of the GPU power if you just use raw PyTorch, because no one cared about it. So there was so much low-hanging fruit that we started to pick up and started optimizing, and it kind of evolved, evolved, evolved. Right now, it's a much more competitive space, where NVIDIA has a 50-person, 100-person kernel team that's writing kernels, and you're competing against that. At the time, no one really cared about it, so it was a good new field for us to go thrive in. And there was no community effort, like a vLLM? Not exactly. When these models were first released, no one in the world had run them in production. It just didn't exist. It's like a research output. Exactly, yeah, it was Stability. You had maybe your local GPU, maybe a single GPU that you rented from the cloud, and basically this was a research interest rather than a product interest. No one at Meta, no one at Google had run this in production. So we also thought this is a good time to start a company around this and actually spend time optimizing it as much as we can, because if we can get millions of people to use this, there's a lot of economic value to be created there. Can you talk a bit about how much of a performance boost you got? Because I know when I met you guys, you were about a million in revenue. You were like, well, we're writing all these custom kernels. And maybe part of it is like, okay, how many kernels can you actually write as you support all these different models? What's the breadth of them? Are you writing kernels that you can reuse across models? How much work do you have to do on a per-model basis? It really evolved in the past three years. When we first started, there was a single model, Stable Diffusion 1.5. So all of our kernel efforts were: how do we make Stable Diffusion 1.5 as fast as possible? You go from like 10 seconds with PyTorch. At the time, there was not even a Torch Compile, Torch Inductor, whatever. So you were going from 10 seconds to maybe two seconds on the same GPU. And we started with that. The next thing, with adding more models: Stable Diffusion XL was a different architecture, PixArt was a different architecture. All these different architectures started coming around.
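For a sense of what the "10 seconds down to about two seconds" gap looks like with today's public tooling (not fal's inference engine), here is a minimal sketch using diffusers and torch.compile; the checkpoint ID reflects where SD 1.5 is currently mirrored on the Hugging Face Hub, and real speedups depend heavily on GPU, resolution, and step count.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load SD 1.5 in half precision on the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor fox in a misty forest"

# Baseline: eager-mode PyTorch, the slow regime described above.
baseline_image = pipe(prompt, num_inference_steps=30).images[0]

# torch.compile traces the UNet with TorchDynamo and emits fused Triton kernels
# via TorchInductor. The first call pays a one-time compile cost; subsequent
# generations run noticeably faster than eager mode.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
compiled_image = pipe(prompt, num_inference_steps=30).images[0]
```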
We said, let's build an inference engine, which is what we call a collection of kernels, parallelization utilities, diffusion caching methods, quantization, all that stuff combined into one package. And so we built this inference engine. At the same time, PyTorch 2.0 was released with Torch Inductor and Torch Dynamo to do Torch Compile, which is essentially a way to trace the execution of your neural net and generate Triton kernels that are fused, that are more efficient. And I'm a big sucker for just-in-time compilers. I used to work on PyPy, a just-in-time compiler for Python. And we said, this is a great idea. Let's apply this, but in a more specialized, more vertical way, for diffusion models. At the time, it was UNets. Now it's diffusion transformers, which are significantly different from your autoregressive transformers in terms of their profiles: how compute-bound they are, what sort of kernels are taking the majority of the time, whether they're doing bidirectional attention or causal attention. So we started doing that, and what we have today is an inference engine that's broadly applicable. That gets you 70-80% of the way for the majority of the models on diffusion transformers, and we still have a lot of custom kernels for a lot of models to squeeze out more, because these models are still small. Every model wants to make an architectural difference. You see this even for stuff like Qwen, DeepSeek, whatever. Even if we know an architecture is the best, people want to tweak it a little bit just to make sure, oh, we're releasing something cool. So we saw this, and for that we have to write custom kernels for the custom RMSNorms that people are doing, or whatever, stuff like that. So we have a decent amount of kernels, over 100 custom kernels. This doesn't include the auto-generated ones; we have templates of kernels that generate for thousands of different shapes, problem spaces, whatever. If you consider those, we have tens of thousands of kernels at runtime that we are running and dispatching. But that's pretty much the depth and breadth of it. And on average, a model on fal runs 10x faster than if I self-host it? Like if I just take Stable Diffusion, right, and I put it... I know that this might be a bigger discussion point. Do we consider speed as a moat? It comes down to that. The existing open source industry moves so fast where, you know... this might have been true three years ago. Now PyTorch is already very, very good for H100s, right? What about B200s? When you use PyTorch with B200s, Blackwell chips, you're not getting the best performance. So our main objective and our main goal is: for whatever GPU type you're using these diffusion models on, we're going to extract the best performance at any point in time. It could be 1.5x, it could be 3x, it could be 5x, for certain models it could be 10x. It would be a bit of an unfair thing to say, oh, we're going to make everything magically 10x faster. No one in the world can do that. We are lucky that this is a moving target, and the open source community, everyone, catches up, but at the same time new chips come out, new architectures are released.
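As a concrete illustration of the "custom kernels for custom RMSNorms" point, below is a minimal Triton RMSNorm kernel of the kind a performance team might start from. It is a generic sketch, not fal's code, and it assumes a contiguous 2D input whose rows each fit in a single block.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    # One program instance normalizes one row of a contiguous (n_rows, n_cols) tensor.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: y = x / sqrt(mean(x^2) + eps) * weight
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=1.0).to(tl.float32)
    y = (x / rms) * w
    tl.store(out_ptr + row * n_cols + cols, y.to(out_ptr.dtype.element_ty), mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    assert x.is_cuda and x.ndim == 2
    x = x.contiguous()
    out = torch.empty_like(x)
    # BLOCK_SIZE must be a power of two and cover the whole row in this simple version.
    block_size = triton.next_power_of_2(x.shape[1])
    rmsnorm_kernel[(x.shape[0],)](x, weight, out, x.shape[1], eps, BLOCK_SIZE=block_size)
    return out
```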
So we are always ahead of what's possible, but then they catch up, and we have to stay ahead of it. And that's how we can create differentiation, because it's a moving target, because there's so much going on. Whenever something new comes up, we are the first ones to optimize it, the first ones to adapt our inference engine to it. So at that time we're the fastest place to run it, and that helps with margins, things like that. But eventually people do catch up. I think it's very hard to create this differentiation over the long term if there are no new architectures, if there are no new chips. But luckily there are, all the time. Yeah, and I think with image specifically, you cannot stream a response, so to speak. So when you have a language model, you're kind of bound by how quickly you can read. So even with Groq, it's impressive to show a thousand tokens a second, but I'm not reading that fast, right? So it can go slower, versus with images, you just need to see it. That's why Midjourney now has the draft mode, for example. It just gives you this very low-quality, low-resolution version. Yeah, but at least you can see whether or not it's going in the right direction. How much of that is actually true for your customers? What do they care about the most? Is it latency? What's the range of latency that matters? Yeah, latency is really important. One of our customers actually did a very extensive A/B test where they purposely slowed down latency on fal to see how it impacts their metrics, and it had a huge impact. It's almost like page load time: when the page loads slower, you know, you make less money. I think Amazon famously did a very big test on this. It's very similar. When the user asks for an image and is iterating on it, if it's slower to create, they are less engaged. They create a fewer number of images, and things like that. Yeah. It's the same learning that Amazon has, like every 10% improvement in speed. Yeah, exactly. The elasticity is high. The other thing I wanted to dive into, putting a little bit of the investor hat on: one of the reasons for fal's success is kind of not within your control, which is when and how people release open models for diffusion. At the time it was just Stability, and there was no Chinese output. I mean, we did have other image models, but they were not great. Yeah. And so you made a bet when it wasn't super obvious. But then the other thing, which is what you're touching on, is that the diffusion workload is very different from the language workload, and the language workload is being super optimized where diffusion is not. So you just had kind of no competition for a while, which is fantastic for you. 100%. And the open source, we benefit a lot from it, obviously. But in the past six months, a year, we started working with some of the closed source model developers as well, like behind the scenes, helping them. But they're not sending you their weights. They do. They are. Wow. Yeah. What do you have to give, security? I mean, we are like any cloud provider. What do they do with AWS or Google Cloud or these neoclouds? There are like 50 neoclouds, right? We are not that different from any other cloud provider. And this is why we package the inference engine in a way that they can self-service and get 80%, 90% of the performance.
So they don't even have to show us their code. They deploy to our... we have our own cloud platform where our inference engine is available, only on that platform. So they can tap into that when they deploy their code and their model weights to us. And we don't really have to look at it. If they want to collaborate with us, which some companies did in the past, we essentially have performance engineers acting as forward deployed engineers on their behalf and writing custom kernels for them. Okay. Have you disclosed who you're doing this for? We disclosed PlayHT, PlayAI. That was one of those. We have four different companies, four major video companies that we are doing this with. And one image company that I don't think we disclosed. Yeah. As you can imagine, it's a little sensitive for them. So, yeah, I would say, like, Replicate started serving Veo 3 models and we were like, okay, are you just wrapping their APIs or something? And I think so. It's not obvious how much integration there is going on and how much it's on your infra or your tech. Just to be honest, some of that is happening with Veo 3. Like, I think it's just API. Yeah, it is. Yeah. You have a dedicated pool that you can serve to your customers with different speed SLA guarantees, whatever. That's how it would work for something like Veo 3. But your objective is to be a one-stop shop, but then also you can do inference better than some of these others, like PlayHT. Google is obviously hard, but with other vendors, our goal is helping them run inference, because these are research labs that don't necessarily invest heavily in inference optimization, scaling up infrastructure. That's another challenge that we can talk about. On launch days, some of these models, on their own website they just explode, and the fal API is working fine because they deploy with us. We can scale up to thousands of GPUs instantly. So there is that aspect, too. When we pitch this value prop, as well as the distribution that we bring to them, it's a no-brainer for them to just deploy their model to fal and use it for both the fal marketplace as well as their own distribution channels. Yeah, a couple of follow-up questions. Just on PlayHT, since you mentioned it: music, audio, is that a different workload than normal diffusion, or is that the same? I can't really comment on their architecture, but in the open-source world, some of them are autoregressive models, some of them are diffusion-based. There are notorious ones known for diffusion, as you guys can guess, like one of the biggest companies. So it's similar workloads, but at the end of the day, our inference engine is very versatile and our performance team is very versatile. With PlayHT, we had very deep collaborations where we had three engineers at some point, you know, helping them optimize their inference process as well as infrastructure, to get them to around 80 milliseconds end-to-end time to first audio chunk, which is a very impressive thing for real-time text-to-speech workloads. And then the other known hard problem is serverless GPUs, which is a thing that everyone has chased a lot and many people have failed at. What can you say about what you've done there to make it happen? So for example, Modal has been talking a lot about their GPU snapshotting.
But I imagine it's like a stack of technologies in order to achieve the scaling. It is a stack of technologies. The biggest problem with serverless GPUs is: are you just wrapping another... if you have a Kubernetes deployment, are you just wrapping it and giving people access? Or are you actually multi-cloud? Do you manage your own orchestration chain? Do you manage your own container runtime? Do you manage all of this stack? In our case, we started with a Kubernetes version when we were just doing it for ourselves. And the Kubernetes version at Google Cloud was fine in 2022, when we wanted to get eight A100s. But when we wanted to go to thousands of A100s, it was not going to work. It's a terrible position to be bound to a single cloud. So right now we work with six cloud providers and we have 24 different data centers in four different countries. And we now do long-term data center leases as well, to manage some of the hardware chain ourselves. In this world, we had to build our own orchestration layer, we had to build our own distributed file system, we had to build our own container runtimes, all the stack to make sure that the cold starts are extremely, extremely fast, which is one of the things when you're scaling up, as well as handle actual scale, where we are managing over 10,000-plus H100s alone today. Yeah, and a CDN for caching. CDN, that's outside of the serverless infrastructure, but CDN, content moderation systems, all of these, the platform consists of all of them. There are so many. You also do moderation. We also offer content moderation services to the foundation model companies, for them to moderate their inputs and outputs. Yeah, I see. I see. As a separate product. Yes. From a GPU perspective, do you always need to be on the latest? You keep mentioning H100s. The majority of our workloads are on H100s because, price-to-performance-wise, it makes sense. But Blackwell is obvious. We have five people dedicated to writing Blackwell kernels right now to make sure we can... Because theoretically, it looks good, right? FLOPS-per-dollar-wise, it makes sense. But can you reach the actual FLOPS? No. So we have a dedicated team that's working with NVIDIA directly to write custom kernels for Blackwell for diffusion transformers, to get to the point where it makes sense per dollar. And then we would start with our own workloads as well as some of our foundational companies. We would ask them, oh, if you want to migrate to Blackwell, here's an inference stack that already works. We are at that point where we should be the ones pushing the boundaries on Blackwells, because no one else is doing this work. And maybe it doesn't make sense economically right now, price-perf-wise, but we know it can. So we are working towards it, maybe a couple of months away from that point. And then whenever it does, we'll probably switch as many workloads to Blackwells as possible. Just to be super crazy, when does it make sense to just work on an ASIC? I don't think it does. That's my honest opinion. This is one of the most controversial topics, right?
Are all these ASICs a great idea? If you're memory-bandwidth bound and you can put everything in SRAM, is that even economically viable at that point? I don't know. But the arguments are around these chip designs. So you look at, okay, what is the overhead of an NVIDIA GEMM instruction? It's like 16%. So you're essentially buying a matrix multiplication machine, so it doesn't really make sense to specialize it that much. And some of the, like, B300s are going to have a better softmax instruction that gets like 1.5x, whatever. And that might be one way where NVIDIA gets better performance out of the majority, for the majority of workloads, which is attention-heavy stuff. I think it might make sense for NVIDIA to add more specialized stuff. But for us, I don't think it will ever make sense to build an ASIC. Just thinking about it from first principles, the diffusion workload is very different. But also, obviously, there are still a lot of changes in the architecture, so you need general purpose. We don't have a single model that we are trying to optimize. We are trying to do it for the newest, the best, always. The flexibility is therefore really important. I was going to pull up the Qwen MMDiT, where there's this dual-stream thing, which I think SD3 had. Yes, SD3, Flux. Is that the standard model now? MMDiT, it's also a controversial topic. The Scaling Rectified Flow Transformers paper, the SD3 paper, came up with this architecture. And then one of our research team, Simo Ryu, he's our head of research, he found out that just using MMDiTs is inefficient. You need to mix them. And now there are controversial opinions, you know, like the Movie Gen paper was saying, oh, MMDiT is completely unnecessary, you can just use a single-stream DiT, whatever. So there are controversial opinions happening in terms of architecture changes, which I understand, because everyone wants to do a different architecture. No one wants to do the same architecture, because it's lame. Otherwise, it's just a matter of compute and data, and these researchers don't feel proud that their model is an output of data and compute. They want to make a novel research change. So I think the architecture is going to keep changing until this paradigm of researchers changing stuff for the sake of change ends. I'll talk about a couple other architectural things just to keep it bounded within this topic. Distillation was a thing for a while. SDXL Lightning; you guys did fantastic demos with tldraw, which we've also had on the podcast, fantastic episode. But what happened to those things? How come they're not popular anymore? I think it makes for a good demo. You know, you could build real-time applications. You could build these drawing applications, things like that. But I don't think people could build applications that have user retention long-term. People couldn't really build useful things with it. Let me play out what I thought was going to happen, and then you tell me why it didn't happen. Which is consistency models for drafting.
You use your hand to draw things and it creates the draft, then you upscale it with a real model. But that's it: why can't it be a two-stage process instead of one stage? Yeah, and I think one thing that happened is Flux, like that generation of models, was not good at image-to-image when it first came out. So you need a good image-to-image model to be able to draw, and maybe it needs to be revisited around this time with some of the image editing models, maybe. Like image-to-image and control. That's Flux. Like SDXL and ControlNets, that's where they were very popular, where people used to do this stuff, like sketch-to-image, whatever. And with Flux, I think people cared less about it. One thing that I keep thinking about is: is this still true for LLMs? You know, I always default to Claude 4.1 Opus, right? Even if it's slower than Sonnet, it's just, I know I'm going to get the best quality. Exactly. That's what's happening here. Yeah. It seems like that's what's happening here as well. Okay. Anyway, as a creator, I want fast, quick drafts and then I can refine, right? So I don't know. I don't know why it didn't happen. It's more true for video models, right? It used to be like five minutes, four minutes for a single five-second generation. Now it's mostly under a minute, but you want a ten-second, five-second generation. And then, because of the workflows of creatives when they're working with it, they generate a ton of videos and then pick one and then create a story around it. So when you watch these people actually generate videos, they generate hundreds at a time and they have to kind of sit around and wait and then iterate on it. The faster speeds mean a lot for creators. Yeah, it does. The other thing I wanted to briefly touch on before we go back to the main topics is the autoregressive models, which you mentioned, right? Honestly, I still think Gemini is underrated because they were first. But then obviously OpenAI did the 4o image gen and that was a huge thing. I actually even wonder if there was a panic for you guys, because obviously it's like, this is SOTA image gen and no one else has it, it's not open source. We have passed through those eras so many times, you know, we stopped worrying about it. Yeah, you probably have good stories around DALL-E. Yeah, I mean, I talk about this a lot. When DALL-E 2 first came out it was like, okay, OpenAI is so far ahead of anyone else, it's impossible for... And Midjourney. And then people caught up within months, and then Stable Diffusion was maybe even better or just as good as DALL-E a couple of months later, and it was open source. So a year later, the same thing happened with Sora. They put out those videos.
And that time around, I think we were excited, because now that people see that it's possible, that this is actually doable, researchers get motivated. They see the hype, they see that this is possible, so they work on it, and within a couple of months we had maybe not Sora-level but much better video models. Now we have video models that are much better than Sora. So whenever we see someone actually pushing the frontier, it's a reason for excitement, because now that it's possible, other people are just going to do it within a couple of months. So we don't panic anymore. Does the fact that Anthropic doesn't have an image generation model tell you anything about what the larger labs care about? It tells more about Anthropic's own personality than about the labs in general, because if you look at xAI, if you look at Meta, if you look at OpenAI, if you look at Google, they all have really good image models. Yeah, like Google, in their last announcement, they used the phrase generative media, by the way, which was a proud moment for us. And, you know, they focused on generative media as much as their new LLM models. So some labs definitely care about it, and for some labs it's not a priority. Look at xAI. They keep pushing images. You like AI slop. Yeah, I know. I know, it's crazy. And waifus. And levels of interactivity. You have images, you have video. Now you have Genie, this kind of more world-model thing. You have gaming applications of that. How far are we from fal getting a lot of traffic on those models? Is it mostly experimental today in open source? Obviously Genie is impressive, but it's a Google model, you know. I have a very optimistic take on this, and then maybe a more normal outcome. I think at worst, we are going to have very capable video models that come out of world models, right? It's going to be a very controllable video model, and the use cases will be similar to what video models are. You're going to create content, but you're able to control the camera angles, you're able to control the video model a lot better than what you can do today. At worst, we are going to get that from world models. And at best, I think it's very hard for anyone to predict what's going to happen. Yeah, movies and games, it's going to be something in the middle, where you can be part of a whole movie universe that's going to be playable. So it's boundless possibilities.
What's going to happen at best, and how affordable is it going to be, is this ever going to reach mainstream adoption? We'll see all that. But it's definitely technically incredibly exciting and impressive, what's coming out of these labs. Yeah, I need to find the paper again, but there was this study on video models and image generation understanding physics. It could predict the orbit of a planet, but then when they actually had it draw out the gravitational forces, it was completely wrong. And so I think that's my thing with world models: I understand the creator application, which is that you can create a consistent world, but I don't know about the other side, the people who say, hey, these are the best way to simulate the world and get intelligence and things like that. I know that optimism around it too, because whenever you talk to someone who's working on robotics, they're bottlenecked by the amount of data they have, and from these past three years of AI innovation we've seen that whenever there is an abundance of data, that type of model actually improves a lot. So for robotics we expect something similar: whenever they figure out this data problem, those models are going to get better as well. So that's why people are so optimistic: okay, maybe this solves the robotics data problem, and yeah, there are boundless opportunities there. And regarding the example you mentioned about gravitational forces, I think this is still the same problem as, oh, LLMs can't do 9.9 plus 9.8. Yes, it can; you just need to train it with more data, you need to have a better tokenizer, whatever the reason is. It's just a matter of data scale and the underlying fundamental architectures, but I don't think it's going to change that much. We're just going to put a thousand x more data, a thousand x more compute, and we'll get the best physics simulators. And I think this should be possible with the existing signals coming from the data. Just to double-click on the video stuff as well: you had a great slide at AIE where you said currently 18% of fal's revenue comes from video models. And it might be... that was February, so now it's probably over... 80? 50-50? Okay, it's like over 50%. Yeah, yeah, 100%. Wow, okay. I guess editing models brought some life into the image side as well, so both of them grew, but yeah, video grew faster. It was pretty significant. And one of the main drivers is open source models, where in February there was Hunyuan Video, I think that was pretty good, and there was Mochi from Genmo, but the quality still wasn't there. And Wan from Alibaba was an insanely good model, and they released a newer version of it about a month ago, I think, or a couple of weeks ago, and now it's getting so, so popular. And we can run this model, for 480p, the draft-mode version, in under five seconds, so people can have an instant feedback loop. And then when they want to go to 720p, full resolution, it's just like 20 seconds, and we're planning to bring it down to 10 seconds. Yeah, that's amazing. And I want to double-click on that. For a while I was kind of bearish on Alibaba, because they kept releasing papers with very cherry-picked examples. And it was like, okay, we're on GitHub, and then you go to the GitHub and it's a README. Yeah. I mean, you can see something really changed.
They've been releasing new image models, new video models. No, no, we haven't talked to them. But it seems like... and now there are competing teams inside Alibaba. Wan is a really good image model, but they released Qwen Image as a competing image model. We think Wan's image model is actually very, very good. If you run Wan with a single frame instead of 81 frames or whatever, you get a really good text-to-image model out of it. And this is just because of the pure amount of data that you put in from the videos. So now Alibaba has two of the really good models from their labs. And then there are smaller labs in China that you might not hear about: StepFun released an image editing model, HiDream, Vivago. There are all these small labs releasing models, because I don't think training these image or editing models is that expensive, and video models might be slightly more expensive. My guess is training these costs a couple million dollars, which is not that much, especially since they're probably backed by some sort of entity. Other than Alibaba, you know, there's StepFun, whatever; they probably raised a really good amount of money. So training these models will bring you a lot of attention, and it's more attention than you would get releasing a subpar LLM, because the LLM space has so much more competition. So just training a video model for like a million dollars and then releasing it, I think that brings you a lot of attention. It is a hack. When you look at Hugging Face, let's look right now, I'm sure the top models are image models. Like Qwen Image Edit probably is. Probably up there. Probably up there. Number one. Number one. Hunyuan GameCraft. Is that number three? Number four. Then you get Gemma 270M. ByteDance had some image stuff. ByteDance has less open source, but they have a really good team. Seed, that's their new lab. They're working on Seedream, Seedance, OmniHuman, stuff like that. We have a good partnership going with them, to hopefully have their models hosted in the US as well. And the idea is, I think, that the team that they were able to assemble is very good. And it's coming from their previous research, whatever. ByteDance was doing really good open source stuff, like SDXL Lightning. They released the SDXL Lightning paper, AnimateDiff Lightning. So I'm pretty, pretty hopeful about them. Yeah. First of all, you know, hopefully they reach out to you when they launch, and don't just drop it so you have to rush. At this point, people reach out to us, because we are the market leader. So they just reach out to us to get distribution. Yeah. There's a Chinese platform that they always launch on first, which I forget the name of, but you have to... We also get day-zero launches with the majority of these models. So basically, I think the question is always, you know, you're the ones making money. Stability did not make money from Stable Diffusion. I think the thing that Black Forest Labs did was very, very interesting in this aspect. They released three different models. An Apache 2 licensed, extremely distilled model, which is good for... Dev? Schnell. This is the Schnell version. This is for four-step generations, for lower quality stuff. They released a Dev model with a non-commercial license, where their inference partners are, you know, paying a revenue share, and this is a very good way.
And then there's a Pro version where you can collaborate on hosting it, and the revenue share is obviously different for that as well. This is, I think, a very smart choice for labs whose whole premise is releasing models. But if you're a company that is doing a product on the side, you don't necessarily need to make money from the open source models. You're doing it for getting researchers, you know, hiring people, getting distribution, whatever. So it really depends on the company's goals. In Alibaba's case, they don't care if the Wan model is hosted in their API. It doesn't touch Alibaba's top-line revenue, whatever Wan makes. So for them, it's a no-brainer to release it and get attention and maybe get some leads to their Alibaba Cloud offerings. But in general, for Black Forest Labs or companies like that, I think it's a smart move to release a distilled version as fully open source and the less distilled or the actual model as non-commercial, and then partner with inference companies and stuff like that. What's the distribution of usage? Is 80% of your revenue like five models? Or are people really using the long tail of all these open models outside of the initial launch? I think there is some power law, but not as much as you would think. And it keeps changing. That's the other part. It's not like only a single model is being used a lot. Month to month, it changes a lot. This summer has been crazy. There have been countless new video models, new image editing models. The leader kept changing week over week, even. But if you take a step back and look at which models are being used, people want to use either the best, most expensive video model, or they want to use a cost-efficient, good-but-cheap-enough video model. So those two models are usually used a lot. And whatever those models are, it changes week over week. And yeah. One good example is, Flux Kontext was released in late May, and Qwen Image Edit was released like two weeks ago, and now it's topping out Kontext Dev. You know, it's insane how quickly this stuff transitions, just because there's a better quality model. And that's the value prop. You don't have to set up the infrastructure to manage Flux Kontext Dev. As soon as Qwen Image Edit is available, you can just switch to that with fal. I mean, it seems to me that if some models are open and some models you have to pay a revenue share on, you ideally want to move people off the revenue share models onto the open models, right? What's that dynamic? It's all pricing stuff. I'm also thinking, okay, we'll do whatever our customers are going to be successful with. We are still early enough that these small calculations, I don't think they matter. I'd rather people actually go to production and build products with it and be successful, rather than, okay, 20% here, 10% there. I mean, you're doing a hundred million in revenue. Cool. I'll just ask a few more questions we had around how people really use this. Okay, I'll ask this super obvious question. Yeah. How much is not-safe-for-work? Almost none. Negligible. Yeah. You don't moderate everything, and moderation is optional, right? Moderation is optional to a level: illegal content is moderated. And we also track the non-illegal NSFW content moderation. And we haven't seen more than 1%. The models themselves are actually not generating that type of content.
Some model providers, especially if you look at Black Forest Labs models, the models are incapable of generating that, because they're annealed in a way that prevents it. And the majority of our customer base, if you look revenue-wise, is enterprises, more on the higher level of stuff, where some of them might be user-facing mobile applications. But for the last six months, nine months, we've been transitioning more and more to enterprise, where there's less of a need for that. So what are those enterprises doing? Apart from building a general-purpose chatbot that can generate images; maybe Canva, you know, would be a good use case. But my imagination is a bit limited beyond that. Advertising seems to be absolutely growing. And if you think about it, it fits very well. Let's talk about video advertising. I keep repeating this, but some companies talk about, oh, we are going to change Hollywood. Filmmaking is going to be revolutionized. I don't think it's that interesting. How many movies do you watch a year? Maybe 20, 25 movies. How many movies do you watch in the theater? Three, four at most. So if there are thousands of movies a year, people won't be able to watch all of these movies. There's just not enough time. It's a max-quality game. Exactly. Exactly. And with advertising, it's the exact opposite. The more content there is, the more different ways you can create ads, there is always economic value attached to it. So you can create an unlimited number of ads, unlimited different versions of them. And the more personalized it is, the more economic value there is behind it. So ads fit really well with this type of technology, because there's no limit to what you can create. I'll tell you a side comment about a Silicon Valley trend I'm seeing, which I cannot explain, which is that all these YC startups and all these, they're spending between $10,000 and $70,000 per launch video. Yeah. In the age of generative video. They're hiring actual creative directors, hiring a studio, hiring actors. I was in one of them. And do you need all that when you have generative video? I think Cluely's Roy started talking about generative video as well. I don't know if you guys know PJ Ace. I think he's the absolute killer for this stuff. He launched, is it a Super Bowl ad or something? Like a basketball playoff ad. Yeah, NBA Finals, right? Yeah, NBA Finals. He also did our Series B announcement video. We're pretty close with him. And it's insane what he's able to come up with and how viral it goes, versus these videos where you spend hundreds of thousands of dollars, right? You just need to create viral content, and these generative media models are the best way to do it. And we are still in the infancy of this, right? Obviously it might not be professional quality. I still think human-in-the-loop, mixed content is the way for today. But in six months, who knows? In 12 months, I think 80% of it is going to be generated. We were watching the Super Bowl and we were saying, oh, how much of this video is AI generated? It looks AI generated. It could be, right? It could be. You can't tell. So I think at some point, we're going to have 80, 90% AI generated, all that. It reminds me of, I think, who's the guy, fofr from Replicate?
He's obviously the best inspiration for all these workflows. He overlaid some kind of NBA-realistic LoRA on top of game footage, so you could play something like NBA 2K but it looks like real video.
Yeah. I was like, what the hell?
It's pretty cool. So maybe that's the other part of my question, and I wanted to get into ComfyUI: how much LoRA serving is going on? How much is custom?
A lot.
Okay. Is it the majority? And does everyone train their own LoRAs, or do you pick them off a LoRA marketplace?
That's one of the reasons open source works very well with image and video models: you tap into this big LoRA ecosystem. I've never seen a closed source model that can create a good LoRA ecosystem; it basically doesn't exist. Maybe there are Midjourney srefs, but I don't know if you can consider those LoRAs.
Srefs are just seeds, right?
Conditioning; let's call it another condition, like a prompt. Only the open source models have these rich LoRA ecosystems, and it's extremely popular. Even for the oldest models, it brings new life. We still have a lot of people using STXL with their own LoRAs because they're happy with the quality; it's fast enough, it's cheap enough, it's amazing. These models are not single-shottable like the language models. Even the editing models, GPT Image 1 or Flux Kontext or Qwen Image, whatever: if you put in your face, or multiple people, you can't get the quality. It's going to be 90% there. But if you train for about 1,000 steps with six to 20 images, you get 99% accuracy. We worked a lot on finding the right hyperparameters and writing distributed trainers and distributed optimizers, and with those, people can now train their LoRAs in under 30 seconds on the platform, run inference with them in the same job, and get 99% accuracy for the same face or character. That's one of the biggest challenges, maybe more on the enterprise side than the consumer side. If you're creating AI slop, you don't really care who it looks like. But if you're actually doing a product ad, you want it to look exactly like the product: every single pixel of the product, the banner, whatever. So you train a LaCroix LoRA with 20 images, and after that you have an almost pixel-perfect model.
All right, we'll have to train a LoRA for every guest; then we can make thumbnails.
I actually think that's a very good application, because it's a nice way to inject brand without a strict style. And we are just entering post-training on video models. We didn't have a good base video model where it made sense, but now we have companies really investing in post-training on Wan 2.2 or Hunyuan, creating lip sync models on top of them, creating different video effects and camera angles. There seem to be a lot of possibilities with creative data sets. I think in the next six months to a year, we are going to have a lot of companies built purely on post-training of open source video models.
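As a rough illustration of the LoRA flow described above, training on six to 20 images for around 1,000 steps and then running inference with the fresh weights in the same job, here is a hedged sketch. The host, endpoints, field names, and trigger-word convention are hypothetical placeholders, not the platform's real API.

```python
# Hedged sketch of "train a subject LoRA, then generate with it in the same job".
# All endpoints and fields are assumptions for illustration only.
import requests

API_BASE = "https://api.example-inference.ai"   # hypothetical host
HEADERS = {"Authorization": "Key YOUR_KEY"}

# 1) Train a subject LoRA from a handful of reference images (6-20 is the range
#    mentioned above), on the order of ~1,000 steps.
train = requests.post(
    f"{API_BASE}/lora-trainer",
    headers=HEADERS,
    json={
        "image_urls": [f"https://example.com/product_{i}.jpg" for i in range(12)],
        "trigger_word": "LACROIX_CAN",   # hypothetical token to bind the subject
        "steps": 1000,
    },
    timeout=600,
).json()
lora_url = train["lora_weights_url"]

# 2) Immediately generate with the fresh LoRA applied to a base image model,
#    so the product (or face) stays consistent across generations.
image_url = requests.post(
    f"{API_BASE}/sdxl",
    headers=HEADERS,
    json={
        "prompt": "LACROIX_CAN on a beach at sunset, product photography",
        "loras": [{"url": lora_url, "scale": 0.9}],
    },
    timeout=120,
).json()["image_url"]
print(image_url)
```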
Wow. Let's talk about pipelines. We've had comfyanonymous on the podcast. ComfyUI is kind of this community where, if you're into it, you love it, and if you don't know about it, you kind of underestimate it, but people create all kinds of crazy workflows. One question: have you thought about doing pipelines? Obviously you host all the models.
We do have a pipeline product called Phal Workflows where you can chain models together, but it's obviously less flexible than Comfy: you can only chain different models' outputs, not the intermediate stuff. In ComfyUI you can access the latents from one model and pass them to a latent upscaler or whatever; in our case it's more limited. But we have the Workflows product, and we have a serverless Comfy product where people can bring their own ComfyUI workflow and run it as an API just by posting the workflow and the inputs.
And let the models be served by you?
Yes.
So is that a bullish thing? Is that going to be commoditized by bigger models?
What we saw is that as the models get better... ComfyUI was a relatively much bigger thing a year or two ago, when the models were weaker. One of the biggest ComfyUI use cases was: you generated a Stable Diffusion 1.5 or STXL image, and you were fixing the six-finger situation, fixing the resolution, upscaling. Now that the models are actually so good, the ComfyUI workflows are getting simpler on the image side. For video it's still very crazy; some video workflows have 50 nodes or whatever that you're processing. So I think it's still a matter of how good the models are and how much extra stuff you need to do around them for the majority of use cases. For artistic use cases you're still doing a lot of stuff, and that's something we want to support. But we don't see that happening at super scale; there aren't companies spending $10 million plus on running this as an API. That doesn't seem to be happening yet, because it's a bit inefficient, and it's more reliable to use an existing model than to patch together 50 different things when you don't know when it's going to break.
Yeah, but it feels like for things like ads, you want one step that generates the backdrop and one step that adds the copy. Are you saying the models are so good? Chaining of models is happening for sure. But what ComfyUI did very well is that you can also play with the pieces of the model.
Yeah, that's basically what I'm saying. Chaining of models is what the Phal Workflows product does: it basically calls many different APIs back to back or in parallel and then produces a result at the end. And it's very popular; we have enterprise adoption of it, from very big names.
Yeah, amazing.
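For readers who haven't used a serverless ComfyUI product, the shape of the integration is roughly "post the exported workflow graph plus your inputs, get output URLs back." The sketch below assumes a hypothetical endpoint and field names; any real product's schema will differ.

```python
# Minimal sketch of running a ComfyUI workflow as an API call. The endpoint and
# field names are assumptions for illustration, not a real product's schema.
import json
import requests

API_BASE = "https://api.example-inference.ai"   # hypothetical host
HEADERS = {"Authorization": "Key YOUR_KEY"}

# The node graph exported from the ComfyUI editor as JSON.
with open("video_upscale_workflow.json") as f:
    workflow = json.load(f)

resp = requests.post(
    f"{API_BASE}/comfy/run",
    headers=HEADERS,
    json={
        "workflow": workflow,                                      # the graph itself
        "inputs": {"source_video": "https://example.com/clip.mp4"},
    },
    timeout=900,
)
resp.raise_for_status()
print(resp.json()["outputs"])   # e.g. URLs of the generated assets
```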
I was just going to go into the broader topics. The first thing that comes to mind is a request for startups. If you weren't working on Phal, but you see a lot of things in the ecosystem, what's the most obvious thing people should be working on?
More model companies. Go raise more money and train models. That's obviously good for Phal; host them on Phal. If you're not interested in training models and other people are training them, that's amazing, go raise more money, there's so much money. Or a Scale AI for image and video models: data collection, more prepared data sets for video models, effects, different camera angles. Everyone seems to be reinventing the wheel when it comes to collecting that data. I think it's a great opportunity for someone to come in and do this at scale.
That's really interesting, because I think this is what Together AI did with RedPajama: they built a data set for language models to help people create more open language models that they could then serve. So at some point it might actually make sense for you guys to do that.
Image data is a bit more of a finicky situation in terms of copyright and so on, but it's an interesting area.
Do it in Japan.
I think it requires focus; it needs to be its own thing. One connected thing to what Gorkem said is image slash video RL. That's an unknown unknown for us.
Say more.
I can't. What does it even look like? Can you RL a video model into being a world model? You can, right? If you think about it, world models are essentially RL'd video models conditioned on moving around. So what are the use cases for RLing image and video models? I don't know. But if I wasn't working at Phal, that would be something fun to explore.
And is this specifically for editing? Is the reward for the RL the edit, or?
That's the thing, that's what you should look for: what is the reward function? What are the interesting reward functions you can apply on top of these base models?
I see. Interesting. Okay, got it. Actually, I was really asking about building a Phal wrapper startup on top of that. You guys are very low level, which is fantastic, but I also want to give our listeners some ideas if they're not going to work at that level.
I'm going to say it again: advertising. There's so much opportunity there, and everyone's still trying to create these horizontal applications where any creative can come in and do something. Something a lot more targeted to specific industries, a lot more targeted to different kinds of ad networks, there's a lot of potential there.
And then, requests for models. Obviously you want more open models, that's good for you. But any specialization in the models... I think image editing was a huge unlock, which I didn't foresee until this year. Obviously we were going to edit images, but even OpenAI didn't guess it was going to get this big. It's insane how popular it became, and then everyone started catching up.
I was part of a group that meets at NeurIPS every year, and they were talking about this at the last NeurIPS. So it's in the air, but you have to be at the researcher level. Everyone moved to video and left image behind a little bit, so there was a little vacuum of research on image. But luckily people saw that it works very well, and they went back to it. It's so much cheaper to train image models right now. If you want to train a SOTA image model, I don't think it's going to cost more than a million dollars. It's extremely cheap.
It's a matter of data, engineering effort, cleaning. I think it's a function of the data set; image models are really, really affected by the data set you use.
One obvious gap in the market: VO3 is very expensive, and the reason people like it is conversation, right? If you can create a smaller, cheaper video model that is less capable but does conversation and sound very well, I think there's definitely room. One open source example we saw was MultiTalk, a post-trained version of Wan. It's really, really good for conversations, but it lost the ability to generalize; at some point it's only talking faces, whereas VO3 can generalize and do full scenes. So I think there needs to be a middle ground between talking faces and extremely generalized video models, something much cheaper to run that still gives you conversation. It's very memetic: there's an infinite number of memes you can post with this, an infinite number of ads you can make with this.
But you don't see a world in which you have a video model and then a separate audio-only model that generates the audio for it? That's the real question, right? Do you stitch together a whole bunch of things, or...
People did that before VO3, but what VO3 gets very right is the timing. You ask for a joke, and the delivery, the timing, the laugh, waiting right before the punchline drops, all of that is so perfectly timed. I don't think you get that when you do it separately. It also matches the accent and the sound to the face that is talking, which is an unsolved challenge for other text-to-speech models; it feels very natural.
Is VO3 the best text-to-speech model?
It's also one of the best. It is so good; I don't think any model can match the emotion.
I would say the counterargument is that we dub movies, so obviously you can do it separately.
It is also the best lip sync model. VO3 has the most accurate lip sync because it's generating natively. There are really good lip sync models, I think they're 95% there, but VO3 is 99, 100% there.
To me this is the single most bearish thing about workflows, and ComfyUI and all this stuff: just wait for a bigger model. It's pure bitter lesson.
Yeah. We love ComfyUI, but obviously when the technology doesn't exist yet, you have to stitch things together.
Yeah. So, a request for engineers. I'm sure you're hiring, right? You just raised $125 million.
We just recently crossed 40 people, but three months ago we were 20, so over the last three months we've actually been accelerating. The best kernel engineers, the best infrastructure engineers, the best product engineers, the best ML engineers: if you're the best at what you do, just come join us. It doesn't really matter what you do; we're just hiring the best people right now. Even on the go-to-market side, we are hiring account executives and customer success managers. Because we work with very large enterprises, we've got to grow that side of the company as well.
On the engineering side specifically, how do you think about how many people you need? There's this whole question of lean AI and coding agents.
Our performance team is about seven people focusing on performance. There's always some overlap with our applied ML team, which takes these models, productionizes them, exposes new capabilities, builds fine-tuners, and then helps customers adapt these models. That team, I think, we can scale to double or triple the size, because there's an infinite number of models, and we're going to have more customers with more proprietary models, so helping them optimize is just a really good function. That team scales very well because there's always independent work to be done: these three people are optimizing this new model, and it's completely independent from optimizing this other model. So we've been hiring a lot for that applied ML team. The performance team we're probably going to keep lean, in contrast to the applied ML team and the product team. On the product side, we want to build more higher-level components that people can integrate directly into their applications. It's not just SDKs now; think components. Imagine you're building an e-commerce website and you're not really the best component designer: here's a virtual try-on component you can drop into your app. Stuff like that, more higher-level components. This is also coming from the fact that vibe coding has been very, very insane. Revenue-wise it's very small, but we see a significant amount of user adoption coming from it; just from looking at our support tickets, maybe they need more support, but there are a lot of people building these applications without that much expertise in product building. So we want to give them more guardrailed experiences where they can integrate much more easily without messing with all the lower-level pieces.
That's really nice care for developer experience. So, crack low-level engineers.
And crack high-level engineers. Crack engineers, crack go-to-market people, crack whatever: just join Phal.
I'm always trying to refine the definition of crack. Both of you lead the technical side of Phal. What's a really hard technical problem where, if someone has the solution, they should talk to you immediately? Maybe that's the way to frame it.
Write a sparse attention kernel with FP8 on Blackwell. If you can do that, come join us. We already have a good base.
Hired on the spot.
Hired on the spot, stuff like that. Some of these applied ML people, we just picked them from Discords, people who were already working on this sort of generative media, who were already interested. We also have a really high culture bar, where everyone on the team loves generative media.
They're obsessed with it; they would have done this even if it wasn't their job. We have this great composition. It's not a prerequisite, but it just naturally happened that we hire these people from Discords, Twitter, Hugging Face. One of our applied ML engineers had the number one Hugging Face space, with creative workflows and so on. We hired a person who was training LoRAs on Phal, just because they were training and posting cool LoRAs. Just do cool stuff and we'll find you, or you can reach out to us.
The master builder, that's what I've been calling this kind of person. Why not make it more explicit? If I go on your careers website, it says apply to ML engineer, and it just looks like any job description. I feel like there's this question of... that's why we have to do a podcast.
I think it's not just about Phal; in general it's more, if you know, you know, which I know is not the best way. People know about Phal already, so we haven't really cared that much. But you're absolutely right, we should make it more explicit.
If I look at George Hotz with tinygrad, there's this bounty model: hey, if you can solve this, you should probably work here. I'm adding the bounty angle, right? It's like, hey, look, if you can write this kernel, you'll just get hired.
One thing we saw, though, is that there are a lot of people who are just vibe coding kernels, and there's a limited number of people who can review those. How can you tell a shitty kernel from a good one?
But then you're spending the time interviewing anyway, right?
Yeah, so we have a first line of defense with our recruiters and so on, so there are trade-offs. But I absolutely agree; maybe we should have a KernelBench-style setup where you can upload your kernels, we automatically evaluate stability and performance, and if you pass, you get our email unlocked, a special email just for you. But yeah, great ideas. Come join us.
Awesome, guys. Anything else, parting thoughts?
I loved your rants, so this was great.
I'm happy to rant, but when does the podcast start?
No, congrats on all your success. I should also say it's fun to do karaoke with you guys.
Yes, let's do it again.
You're both extremely talented, but also a fun crew, which I think is pretty hard and rare to find. So thank you; it's great to see the good guys win.
Awesome, guys. Thank you.