

Sovereign AI in Poland: Language Adaptation, Local Control & Cost Advantages with Marek Kozlowski
The Cognitive Revolution
What You'll Learn
- ✓ Leading AI models today are heavily biased towards English and Chinese, with 10% or less of the training data coming from other languages such as Polish.
- ✓ This produces models that perform well on tasks like answering multiple-choice questions but struggle to generate fluent, culturally appropriate text in languages other than English and Chinese.
- ✓ Poland's project PLLuM aims to develop smaller, localized language models that can match the performance of larger models in their specific language and cultural context.
- ✓ The goal is to achieve high-quality language understanding and generation while maintaining transparency, control, and cost advantages over relying on global AI platforms.
- ✓ Regulatory constraints in the EU, such as restrictions on web scraping, force a more curated data approach for building these localized models.
- ✓ Government agencies and national companies are often better served by smaller, specialized models than by massive generalist models served from the cloud.
Episode Chapters
Introduction
Overview of the discussion on sovereign AI projects, with a focus on Poland's project PLLuM to develop localized language models.
Limitations of Leading AI Models
Explanation of how the dominant AI models today are heavily biased towards English and Chinese, leaving other languages underserved.
Strategy for Localized Language Models
Discussion of the goals and approach behind Poland's project PLLuM, including achieving high-quality performance while maintaining transparency, control, and cost advantages.
Technical Challenges and Regulatory Constraints
Exploration of the technical challenges and regulatory environment that shape the development of these localized language models.
Benefits of Specialized AI Models
Explanation of why government agencies and national companies are often better served by smaller, specialized AI models rather than massive generalist models.
AI Summary
This episode explores the concept of 'sovereign AI' and how countries like Poland are developing their own localized language models to compete with global AI leaders. The guest, Marek Kozlowski, discusses how the dominant AI models today are heavily biased towards English and Chinese, leaving other languages underserved. He explains the strategy behind Poland's project PLLuM to create smaller, more specialized models that can match the performance of larger models in their local language and cultural context, while maintaining transparency and control. The conversation covers the technical challenges, regulatory constraints, and strategic motivations behind this approach to AI sovereignty.
Key Points
1. Leading AI models today are heavily biased towards English and Chinese, with 10% or less of the training data coming from other languages such as Polish.
2. This produces models that perform well on tasks like answering multiple-choice questions but struggle to generate fluent, culturally appropriate text in languages other than English and Chinese.
3. Poland's project PLLuM aims to develop smaller, localized language models that can match the performance of larger models in their specific language and cultural context.
4. The goal is to achieve high-quality language understanding and generation while maintaining transparency, control, and cost advantages over relying on global AI platforms.
5. Regulatory constraints in the EU, such as restrictions on web scraping, force a more curated data approach for building these localized models.
6. Government agencies and national companies are often better served by smaller, specialized models than by massive generalist models served from the cloud.
Topics Discussed
- Sovereign AI
- Localized Language Models
- AI Bias towards English and Chinese
- AI Regulation in the EU
- Cost and Control Advantages of Specialized AI Models
Frequently Asked Questions
What is "Sovereign AI in Poland: Language Adaptation, Local Control & Cost Advantages with Marek Kozlowski" about?
This episode explores the concept of 'sovereign AI' and how countries like Poland are developing their own localized language models to compete with global AI leaders. The guest, Marek Kozlowski, discusses how the dominant AI models today are heavily biased towards English and Chinese, leaving other languages underserved. He explains the strategy behind Poland's project PLLuM to create smaller, more specialized models that can match the performance of larger models in their local language and cultural context, while maintaining transparency and control. The conversation covers the technical challenges, regulatory constraints, and strategic motivations behind this approach to AI sovereignty.
What topics are discussed in this episode?
This episode covers the following topics: Sovereign AI, Localized Language Models, AI Bias towards English and Chinese, AI Regulation in the EU, Cost and Control Advantages of Specialized AI Models.
What is key insight #1 from this episode?
Leading AI models today are heavily biased towards English and Chinese, with only 10% or less of the training data coming from other languages like Polish.
What is key insight #2 from this episode?
This leads to models that can perform well on tasks like answering multiple-choice questions, but struggle to generate fluent, culturally-appropriate text in non-English/Chinese languages.
What is key insight #3 from this episode?
Poland's project PLLuM aims to develop smaller, localized language models that can match the performance of larger models in their specific language and cultural context.
What is key insight #4 from this episode?
The goal is to achieve high-quality language understanding and generation, while maintaining transparency, control, and cost advantages over relying on global AI platforms.
Who should listen to this episode?
This episode is recommended for anyone interested in Sovereign AI, Localized Language Models, AI Bias towards English and Chinese, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
Marek Kozlowski, Head of the AI Lab at Poland's National Information Processing Institute, discusses project PLLuM (Polish Large Language Models). PSA for AI builders: Interested in alignment, governance, or AI safety? Learn more about the MATS Summer 2026 Fellowship and submit your name to be notified when applications open: https://matsprogram.org/s26-tcr. He shares how countries like Poland can achieve AI sovereignty by training small, locally-adapted models for specific languages and cultures, ensuring control, privacy, and cost advantages. The conversation delves into challenges like frontier models' English bias, EU regulations, and technical strategies like "Language Adaptation" on base models. Discover how transparently created, locally-controlled AI offers a viable path for nations to maintain their technological destiny. LINKS: National Information Processing Institute Show notes source with images PLLuM open chat service Sponsors: Google AI Studio: Google AI Studio features a revamped coding experience to turn your ideas into reality faster than ever. Describe your app and Gemini will automatically wire up the right models and APIs for you at https://ai.studio/build Agents of Scale: Agents of Scale is a podcast from Zapier CEO Wade Foster, featuring conversations with C-suite leaders who are leading AI transformation. Subscribe to the show wherever you get your podcasts Framer: Framer is the all-in-one platform that unifies design, content management, and publishing on a single canvas, now enhanced with powerful AI features. Start creating for free and get a free month of Framer Pro with code COGNITIVE at https://framer.com/design Tasklet: Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai Shopify: Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. 
With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive PRODUCED BY: https://aipodcast.ing CHAPTERS: (00:00) Sponsor: Google AI Studio (00:31) About the Episode (03:17) Sovereign AI in Poland (04:41) The Case for Localization (13:38) The PLLuM Project's Mission (Part 1) (20:25) Sponsors: Agents of Scale | Framer (22:47) The PLLuM Project's Mission (Part 2) (22:47) Defining Polish AI Values (35:32) Sourcing and Curating Data (Part 1) (35:38) Sponsors: Tasklet | Shopify (38:46) Sourcing and Curating Data (Part 2) (44:40) Small Models, Big Advantage (58:21) Training and Domain Adaptation (01:12:22) Compute, Talent, and Geopolitics (01:22:50) Forming International AI Alliances (01:27:41) Decentralized AI and Conclusion (01:31:47) Outro
Full Transcript
This podcast is sponsored by Google. Hey folks, I'm Ammar, product and design lead at Google DeepMind. We just launched a revamped Vibe Coding Experience in AI Studio that lets you mix and match AI capabilities to turn your ideas into reality faster than ever. Just describe your app and Gemini will automatically wire up the right models and APIs for you. And if you need a spark, hit "I'm feeling lucky" and we'll help you get started. Head to ai.studio slash build to create your first app. Hello, and welcome back to the Cognitive Revolution. While we often discuss sovereign AI in the Silicon Valley AI bubble, we rarely hear directly from the technical leaders who are actually leading national AI projects. And so today, I'm very glad to share my conversation with Marek Kozlowski, who's leading project PLLuM, which stands for Polish Large Language Models, in his role as head of the AI Lab at the National Information Processing Institute of Poland. Poland, with a population of 38 million and a GDP of roughly $1 trillion, roughly 10% and 3% of the United States, respectively, is an interesting and in some ways representative case study. It clearly doesn't have the resources required to compete with the US and China at the AI frontier, but it does have strong technical talent, a real sense of pride in its language and culture, and a deep desire to control its own technological destiny and avoid domination by global superpowers. So what does that mean in practice? As you'll hear, Marek's strategy relies on the core belief that by training small models for a particular local language and cultural context, countries like Poland and projects like PLLuM can compete with the latest frontier models, all while retaining control, preserving data privacy, and achieving a major cost advantage.
In this conversation, we dig into the strategic realities that motivate projects like PLLuM and the technical challenges they have to overcome to succeed, including how today's frontier models, which are trained on overwhelmingly English and Chinese data, fall short in other languages. Why this problem is actually getting worse from one generation to the next, as frontier model developers prioritize things like coding performance above support for niche languages. How EU regulation prevents European AI builders from conducting massive web scrapes and instead forces them to rely on more focused data curation projects. How the Polish government is thinking about investing its finite resources across data, compute, and talent. The language adaptation techniques that Marek's team layers on top of Llama and Mistral base models so as to inject local knowledge without needing to start from scratch. Why they haven't yet had to worry about developing a constitution or other explicit articulation of values for Polish AI systems. And why government agencies and national champion companies are often better served by smaller models, fine-tuned for specific tasks and served locally, than by massive generalist models served from the cloud. Overall, Marek's mix of realism about the challenges of competing with global leaders and his positive vision for transparently created, locally controlled AI is a great window into what AI leaders around the world are thinking and doing to maintain AI sovereignty. So with that, I hope you enjoy this deep dive into the meaning and training of Polish AI with Marek Kozlowski. Marek Kozlowski, head of the AI Lab at the National Information Processing Institute of Poland. Welcome to the Cognitive Revolution. Welcome, everyone. I'm excited for this conversation, too. We met not too long ago at an AI event in Las Vegas, the Enterprise Technology Leadership Summit.
And I thought it was really interesting to double-click on everything that you're doing, because in the United States, and in the sort of Silicon Valley AI circles that I spend most of my time in, there is this ongoing conversation about sovereign AI. And I think it's funny that a lot of this conversation happens in the Silicon Valley bubble and sort of makes a bunch of assumptions about what other countries feel the need to have, aspire to create, and what's driving those decisions. And I don't hear too much from primary sources, people who are actually doing the sovereign AI projects around the world. So I was excited to meet you and learn more about what it is that you're doing in Poland. Poland, obviously, I think, is a country with a lot of technical skill, a very distinct culture, obviously its own language, and a proud tradition. And so I'm really interested to get into it and figure out what sovereign AI means in the context of Poland. Once again, thank you for the introduction and for presenting me and the idea. I would frame the idea slightly more broadly: not only sovereignty, but also creating localized LLMs. Localized models can be national LLMs, but also domain-oriented LLMs. I am promoting the idea of localized LLMs, meaning LLMs adapted to a language or a domain, because in that language or domain they have higher-quality understanding of text, and they are able to create higher-quality text in the generation step. Building localized LLMs, which can of course be adapted to a language or a domain, has two goals: first, to improve understanding in that domain or language, and second, to give the possibility to generate higher-quality texts in their linguistic and cultural aspects.
That is the idea. And our goal is to create models that are an order of magnitude smaller than the popular closed LLMs, but that, in the aspects of the language and the culture or the domain, have the same quality as models ten times bigger. And they are open source, transparent, secure, and as organic as we can make them. Yeah. Okay. Great. That's a great start. Can we take one step back and just talk about why this is needed, first of all, from a capabilities perspective? Famously, I think it was, gosh, it's been a minute, but I think it was the GPT Instruct series. I think the model originally was text-davinci-002, if I recall correctly, one of the first models that OpenAI trained to follow instructions. They reported, basically, we just trained this thing to follow instructions in English, and lo and behold, it seemed to be able to follow instructions in other languages too, which was obviously a strong example of emergent capabilities and transfer learning, positive generalization, all these sorts of phenomena that had been kind of elusive, but I think in many ways characterize the phase change that we've gone through from earlier AI systems to these more general AI systems, positive transfer being obviously a huge one. But that's where they started: just English, and, oh my god, it works in other languages. Since then, of course, they've gone and done a lot of work to try to collect data in other languages to try to even things out, and my sense from just kind of benchmark data is that they have made pretty good progress, but still performance is best in English, and then you can kind of think of performance getting worse the farther a language is from English in the language tree, and also correspondingly with how many resources it has, right? Low-resource languages are obviously going to be a bigger challenge than higher-resource languages. That's my sense of it...
Yeah, and I can corroborate your insights. First of all, 90% of the training data is English and Chinese. Even if you look at the biggest open-source or the biggest closed LLMs, 90% plus of the data is English and Chinese; only 10% or less is other languages. It varies, but for example, in some models the Polish language, I use this example, is about 1% of the corpora or even smaller. And this means that the vast majority of the skills and competences are gained from the English and Chinese instructions. And of course, if you have a large model, it has huge transfer-learning competency. You can say it extrapolates; it can very easily extrapolate between tasks. For example, even if I have lots of mathematical calculations prompted or commanded in English, and I ask the model to do them in Spanish, the very large model can, in the intermediate steps, translate the commands from Spanish to English and somehow map the knowledge from English to resolve the solutions, even if it never saw Spanish examples of how to calculate some mathematical formulas. But what is most important is that this works very well, and it works the same way as we or our kids learn a language. First we learn how to understand, to hear and listen; next, how to write and how to speak. And of course, when we learn a new language and we get some command in our minds, we try to map these commands to what we know from our primary, native language. And the same is going on inside the LLMs. For example, models that were not trained on a huge volume of Polish texts are still able to be communicative and create text that is understandable. But there are some statements and some phrases by which it is very easy to identify that they are not natives.
For example, I'll give you the example of writing emails in Polish. The models use a formula typical for English: "I hope you stay in good health." It's typical for English, but not typical for Polish. And even if you translate this word for word, it's communicative and understandable, but not typical for our language and culture. So is there more to say about how the leading commercial models are underserving the Polish market than that? I mean, I have the sense that there is a little bit more to it than just cultural idiosyncrasy, because even when I look at an MMLU benchmark, it does seem like performance degrades across the language spectrum. The highest MMLU score is in English. It does seem to get worse in other... I know. But for example, when you look at the benchmarks, we live in a world where we are biased by the benchmarks. The MMLU benchmark is mostly choosing solutions A, B, C, D, multiple-choice questions. It is not testing the ability to communicate fluently in the language. Most of the benchmarks don't test how good the model is at producing longer forms, longer writings or longer sentences. They usually test understanding, extractive competencies, summarizing competences, and knowledge about facts in the world. But there are very few benchmarks that test how good the model is at generating longer forms of text in languages other than English and Chinese, for example the niche languages, because it's much harder. In Poland we created the benchmark PLCC, the Polish Linguistic and Cultural Competency benchmark, and this benchmark enables us to evaluate how good the model is in different subcategories. For example, there are not only categories for grammar and vocabulary, but also for our culture, tradition, and history, and many, many others.
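Marek's point about multiple-choice benchmarks can be made concrete: MMLU-style evaluation reduces to exact-match accuracy over answer letters, a signal that says nothing about fluent long-form generation. A minimal sketch, where the gold answers and model answers are hypothetical, not taken from any real benchmark run:

```python
# Multiple-choice benchmarks score a model by exact-match accuracy over
# single letters, so fluency and cultural appropriateness never enter the score.

def multiple_choice_accuracy(predictions, gold):
    """Exact-match accuracy over choices like 'A'/'B'/'C'/'D'."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

gold = ["A", "C", "B", "D"]          # hypothetical answer key
model_answers = ["A", "C", "B", "B"]  # hypothetical model outputs
print(multiple_choice_accuracy(model_answers, gold))  # 0.75
```

A model could score well here while still writing Polish emails with calqued English phrasing, which is exactly the gap benchmarks like PLCC try to probe.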
But we would like not only to evaluate how good the model's wordings are, how good the model is with typical traditional wordings and phrases, and about history as well; we are also trying to check how good the models are across the general spectrum of ambiguous word usage in Polish. But this benchmark still doesn't validate how good longer sentences are in Polish, how good the model is at producing longer structures in the Polish language. Yeah, interesting. So is it fair to say that the primary focus of your work in creating Polish-native models is on these sort of softer skills? It doesn't sound like you're focused on closing the benchmark gap or the sort of reasoning gap that exists between English and Polish. It's more about, as you said, culture, values, tradition, history, cultural competence. Because I think that the language is not only the wordings. The models can have a very broad vocabulary, but they should be able to use it properly in context, and sometimes the language is not only the words; there are the culture, tradition, and history, everything is mixed in it. And in order to create a model that behaves like a native, you have to inject not only the knowledge of how to create grammatically correct sentences, but also how to use special idioms or phrases in special contexts, or what places are typical for Polish history, or maybe what places are viral now. Generally, you have to mix the history, grammar, vocabulary, art, entertainment, culture, and tradition, everything, into the mix to create a language ability that is somehow similar to the natives'. But ask me why we are doing that. First of all, because, as I mentioned, we believe in the idea of localized LLMs, LLMs adapted to the language, that are as similar to natives as possible.
The second issue is the competency gap. For example, we believe that we have to develop our people, our engineers, to have the skills to build our own models, because maybe in a few years the market will change. Maybe the models will be closed, or maybe some models will be forbidden. There are plenty of models currently that we are not able to use in the European Union because of the AI Act. Even in the licenses, Llama 3 and 4, Kimi, and many other models have a statement in their license that they are prohibited from use in the European Union. Maybe we will be forced to use this knowledge to build our own models. Maybe they will be a little bit worse than the Chinese or USA ones, but they will be our own. And sometimes it's better to have the competences to build even something a little bit worse, but have the ability to do it, than not to have this. We can. Sometimes it means more than you think. But also, in this approach, in the PLLuM family, because we create a family of models, we also believe in transparency, because we show how we built it from scratch. We released, a few weeks ago, or two weeks ago, sorry, a publication of almost 100 pages on how we built these models. And we not only released the publication, the recipe book, the cookbook; we also publish on Hugging Face samples of our datasets, the instructions and preferences, because we would like to show more than just the open weights, because open source is not only open weights. There are also samples of open data and the cookbook, how we do it step by step, in a very detailed manner. And what is important for us: I think even now the most popular open-source models are the Chinese ones, but they are only open weight. There are no samples of the instructions or preferences they used to train the models.
And we would like to go a step further, to be as transparent as possible. And also we invest a lot in organic data, because we believe, and we have also shown, the following: there are three stages when you train the models. First is the pre-training. It's somehow similar to teaching kids a new language: they identify the words, how to create structures from these words, and some pieces of information. But children after this type of learning are not able to resolve mathematical calculations or write an essay. It's like you learn the language but you don't learn the competencies. The next stage is SFT, supervised fine-tuning. You learn how to resolve some tasks, downstream tasks: write a poem, write an essay, summarize this article, perform some calculations. You learn the competencies, like children in school. You have math, geography, chemistry, and many others. And after all that, you have the alignment, the preference learning: you mark what the children have done during the test, for example, and this information, these marks, tells them what should be corrected or not. It is the same as what we do with children, with the kids: we teach the language, the pieces of information, the wordings, and the structure; next we teach the competencies and evaluate them, and through the feedback loop we try to improve their abilities. The same things are done with the LLMs.
And for example, when you're doing the pre-training, you show the model hundreds of billions of tokens to learn the language, and after that, in the SFT, the supervised fine-tuning stage, you show the model the synthetic instructions. Synthetic means they are produced by other LLMs, and usually, if they are linguistically poor, they also degrade the model. Because if, in any stage of the learning, the model sees poorer data, it will be degraded; the quality of its linguistic creation of sentences goes down. So we focus mainly on creating organic datasets, organic instructions and preferences, and even if we use LLMs to produce such instructions, we check them with humans to improve their structure and quality. And I think that is the novelty, in the end. First of all, it is open source, open data, and an open cookbook. The second is that it's transparent, because it's written in the cookbook what we have done step by step, and we show the samples. We also focus on organic data, organic instructions, organic preferences. And I think this is the reason why the GPTs and others are so good: they also have plenty of manual instructions, and they don't show them, because it's the intellectual property of these companies. And we also secured our models on our own, because we discovered that when these models are secured for English speakers, they can be much more easily hacked than when they're secured for Polish speakers. I think those are the novelties, maybe briefly speaking. Hey, we'll continue our interview in a moment after a word from our sponsors. If you're finding value in the Cognitive Revolution, I think you'd also enjoy Agents of Scale, a new podcast about AI transformation hosted by Zapier CEO Wade Foster.
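To make the preference-learning stage Marek describes concrete, here is a minimal sketch of the DPO (Direct Preference Optimization) loss, one of the methods he names, for a single preference pair. The log-probability values below are made up for illustration; a real implementation would sum token log-probs of each response under the policy model and a frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))).
    Each argument is a summed response log-probability under the policy
    or the frozen reference model; beta scales the implicit reward."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen answer more than the reference does -> small loss.
print(round(dpo_loss(-10.0, -30.0, -12.0, -25.0), 3))  # 0.403
# Policy prefers the rejected answer -> large loss, pushing an update.
print(round(dpo_loss(-30.0, -10.0, -25.0, -12.0), 3))  # 1.103
```

The appeal of DPO over classic RLHF is that it needs no separate reward model: the preference pairs (which PLLuM curates organically) train the policy directly.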
Each episode features a candid conversation with a C-suite leader from companies including Intercom, Replit, Superhuman, Airtable, and Box, who's leading AI across their organization, turning early experiments into lasting change. We recently cross-posted an episode that Wade did with 1Mind founder and CEO Amanda Kahlow about AI-led sales. And I also particularly enjoyed his conversation with Jon Noronha, chief product officer of AI product pioneer and recently minted double unicorn Gamma. From mindset shifts to automation breakthroughs, Agents of Scale tells the stories behind the enterprise AI wave. Subscribe to Agents of Scale wherever you get your podcasts. Are you still jumping between multiple tools just to update your website? Framer unifies design, content management, and publishing on one canvas. No handoffs, no hassle, just everything you need to design and publish in one place. Framer already built the fastest way to publish beautiful, production-ready websites, and it's now redefining how we design for the web. With the recent launch of Design Pages, a free canvas-based design tool, Framer is more than a site builder. It's a true all-in-one design platform. From social assets, to campaign visuals, to vectors and icons, all the way to a live site. Framer is where ideas go live, start to finish. And now they've added a Framer AI layer to make it all faster and easier than ever. With Wireframer, you can skip the blank canvas and get a responsive page with structure and starter content ready to edit. With Workshop, you can create new visual effects, cookie banners, tabs, and more. No coding needed. And with AI plugins, you can connect top models from OpenAI, Anthropic, and Google to generate images, rewrite text, generate alt text, and more. Ready to design, iterate, and publish all in one tool? Start creating for free at framer.com slash design and use code COGNITIVE for a free month of Framer Pro. That's framer.com slash design. Use promo code COGNITIVE.
Framer.com slash design, promo code COGNITIVE. Rules and restrictions may apply. I have like seven follow-up questions I want to ask about various parts of that, and maybe we can kind of break it down by inputs to AI. For one thing, obviously the big inputs are data, compute, and talent, and you touched on certainly data and talent there. I also do want to come back to the safety training, because that's always a keen interest of mine. But maybe let's start with the goal. You've spoken about it somewhat, but I think one big challenge that we have, certainly in the United States, and we have all this talk, especially in the context of the geopolitical competition in AI, there's a lot of talk about, well, we want to have AI with democratic values win. We don't want to have Chinese values, or maybe we're even bold enough to say we want American values to be the values that the AIs embody and kind of propagate through the world. That obviously brings up a big question, which is, well, what are those American values? And I can certainly say that there's no single agreed-upon answer for that. What American values are is hotly contested on an ongoing basis. And that leaves basically the AI companies to try to come up with their own best guess of what that should be. And that, too, is often sharply criticized, because it's too woke, or it's not woke enough, or it's right-wing extreme, or it's describing itself as Hitler. In some sense, the LLMs are a compressed representation of what we have on the internet. What topics are the most important, what information is the most popular, is somehow reflected by the LLMs. If you have problems, political problems, religious problems, everything that is there is also reflected somehow in the compressed LLMs, because LLMs are somehow compressed memory repositories. They are compressed stores of the memory of the internet. Certainly, all that stuff is baked in.
Sometimes, I don't know how far American leading companies have come today in terms of filtering the training data. I know that there are some techniques that are like, we're going to get rid of all the bad pre-training data and just try to show this. Yeah, there is a typical step for that. Even in our project there is a step called data curation, because, as I mentioned, in the pre-training stage 90% of the data is web data, and plenty of it is crappy. You are not able to use it, because your model will not be stable. And in this data curation step there are two sub-stages. The first is deduplication: you remove the same information, which is repeated very often on the internet. Sometimes the corpus shrinks by half, because there are plenty of duplicates on the internet. And there is also filtering out: we filter out the data that is very crappy, meaning low quality. For example, there are plenty of special characters, plenty of punctuation marks, plenty of words not recorded in our vocabulary. There is plenty of such disturbed data that should be removed, because it will have an impact on the stability and quality of the models. And I think you mentioned that the big companies have tools that not only eliminate some poor-quality data, but even eliminate, for example, certain theories, certain points of view, a much broader selection, not only the linguistic aspects of the data. It's the same for censorship. The Chinese models, if you ask about what happened in Tiananmen Square, are not able to give you any information. The general idea is that the people who build the models are able to isolate or, how is it called, black out some important information, so that for people who are not aware of that, the model will be a reflection of the world without some part of it. Yeah. But there are at least two layers to this, right?
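The two curation sub-stages described above, deduplication and filtering out low-quality text, can be sketched as follows. The thresholds and the tiny corpus are illustrative assumptions, not the project's actual pipeline:

```python
# Hedged sketch of web-corpus curation: exact deduplication followed by a
# heuristic quality filter that rejects text dominated by special characters.

def deduplicate(docs):
    """Drop exact duplicates, keeping the first occurrence."""
    seen, unique = set(), []
    for doc in docs:
        key = doc.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def is_clean(doc, max_nonalpha_ratio=0.3, min_words=3):
    """Reject documents that are too short or full of special characters."""
    if len(doc) == 0:
        return False
    nonalpha = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    if nonalpha / len(doc) > max_nonalpha_ratio:
        return False
    return len(doc.split()) >= min_words

corpus = [
    "Warszawa jest stolicą Polski.",
    "Warszawa jest stolicą Polski.",   # exact duplicate, removed
    "@@## $$%% !!!! ???? ^^^^",        # special-character noise, filtered out
    "Model uczy się języka z tekstu.",
]
curated = [d for d in deduplicate(corpus) if is_clean(d)]
print(len(curated))  # 2
```

Production pipelines use near-duplicate detection (e.g. MinHash) and richer quality signals, but the two-stage shape, dedup then filter, is the same.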
I mean, there is all this pre-training data filtering, and I could certainly believe that the Chinese models are trained on data so thoroughly filtered as to never have seen any document about Tiananmen. I feel that it is not even in the pre-training stage. I think it is in the last stage, because, as I mentioned, there are three stages when training these models: the pre-training, the SFT, supervised fine-tuning, and the preference learning, sometimes called reinforcement learning from human feedback, though there are other methods like DPO or ORPO. And I think in this stage, they teach the model how not to behave. Right. So that's what I want to get at for what you're doing in the Polish context, because, I don't know what the Chinese companies are doing, but I do know that the American companies are developing their model specs or their constitutions. Basically a super long document that says, this is how we want our AI to behave. And to their credit, they're starting to be reasonably transparent about what those are, so that the public at least has a sense of what they're going for. But again, in the US context it's pretty contentious, because everything is contested here. In the Polish context, is it like that, or do you have an easier time? Do you have a constitution for what you want Polish AI to be? There are some strategies for how the AI should behave, or rather how it should not behave. For example, it should be ethical, it should not blame anyone, and it should avoid topics that are very risky, like hate speech. There are some areas that are typical, the same for the models from China or the USA: areas where there is a risk that the model behaves in an unethical way, or in a way for which we can be blamed, or that is simply rude.
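The preference-learning stage mentioned here (DPO is one of the named methods) optimizes a simple pairwise objective. A minimal sketch of the per-example DPO loss, assuming the summed log-probabilities of the chosen and rejected responses are already computed under both the policy and a frozen reference model; the beta value is an illustrative default:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example Direct Preference Optimization loss.

    Inputs are log-probabilities of the chosen/rejected responses
    under the trained policy and the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response than the reference model does.
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(beta * margin)); shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference (zero margin), the loss is log 2; training pushes the margin up, which is how "how not to behave" gets baked in without an explicit reward model.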
But of course, there we do run into some political tensions. Generally, though, I think we don't currently have such huge constraints as you mentioned. We don't have a constitution with plenty of points that you have to obey. I feel we mostly want the models to be as ethical as we can make them, but we don't give too many constraints to the model, because I think we are at a different level of development than the Chinese or the US governments and companies. We are, I think, a few years behind them, or two years or three, it's hard to say. But generally, we have our own regulations, though not regulations about how the model should behave, but rather about what kind of data we are able to use for training. We have many constraints focused on the data rather than on how the model should behave. Yeah, interesting. So, I mean, I've never even been to Poland, so I obviously should be very humble in my ability to describe it. But one high-level fact that I know is that the large majority of Polish people identify as Catholic. It seems to be so. For many years that was a safe statement. Currently, I think it depends on how big the city is, the inhabitants of the cities versus the inhabitants of the villages. I think those should be weighed differently. Yeah. So how do you think about that dimension? I just happened to have done an episode not long ago about Catholic AI, with a company that is literally building AI that embodies Catholic values, specifically for religious Catholics. But in your context, you've got this sense that, okay, maybe a majority of people are Catholic, but maybe that's on the decline, and maybe it depends on an urban-rural divide.
Do you have some sort of decision-making process where you think, okay, how Catholic should our Polish AI be? Does it vary in different situations? Are you getting explicit about articulating goals there? I think we are much more liberal now. Currently we don't have such ideas, to create a reflection of our world. But as I mentioned, the models can be very easily constrained by preference learning, and you can train them to behave in some special way. Currently, when we produce the family of models, we produce not only the chat models but also the instruct models and the base models, and we give companies the possibility to use any kind of model, because we know that some constraints may have a disruptive effect on some business cases. But generally, I don't think that we, as the producers, the releasers, the builders of the models, should gather all the people together and decide how to skew the models. Gotcha. Okay. Yeah, very interesting. Do you envision that this will become something, as you presumably go on to train more future models and they become even more powerful, and I don't know to what degree you aspire to serve a consumer use case versus empowering businesses in the country. But do you think this becomes a challenge at some point? Do you envision a future where there is a sort of Polish constitution for AI that actually seeks to answer that question? And if not, how do you think you ultimately get around it? Because it seems to be a very central thing that the American companies feel they need to grapple with. So if you think you can avoid that problem indefinitely, I'm kind of wondering how. I think we have much harder problems, because, for example, you have the AI constitutions in the companies on the US market, but, for example, you can very easily use all the data you have without any constraints.
Of course, there is a problem with some lawsuits and many other cases, but it's a long process. I think most of the companies in the USA can take this risk, because they are still profitable enough to pay for some adverse decisions by the judiciary and so on, some arbitral decisions. But generally, in Poland and the European Union we have the AI Act and our local regulations, like, for example, the acts concerning authorship rights, copyright. And legally speaking, I think these documents have a much harder impact on the quality of our models. It's a big enough constraint not to go further, because, as I mentioned, in the European Union you have the AI Act concerning general-purpose models, and we also have our local regulations, like the act about authorship rights, and both of them combined create much harder constraints than, for example, any kind of constitution, which I think is more flexible than our regulations. We don't currently think about an AI constitution, though maybe in one or two years, something like that. But currently, in the European Union and Poland, we are grappling with the existing regulations and obeying them. Yeah. And I think they are much harder and much more impactful than those you mentioned in the USA, because, for example, the AI Act or the authorship rights regulations can eliminate 80% of the data from your training data sets. So they have a huge impact on the quality of models. Yeah. The worst thing about automation is how often it breaks. You build a structured workflow, carefully map every field from step to step, and it works in testing. But when real data hits or something unexpected happens, the whole thing fails. What started as a time saver is now a fire you have to put out. Tasklet is different. It's an AI agent that runs 24-7.
Just describe what you want in plain English, send a daily briefing, triage support emails, or update your CRM. And whatever it is, Tasklet figures out how to make it happen. Tasklet connects to more than 3,000 business tools out of the box, plus any API or MCP server. It can even use a computer to handle anything that can't be done programmatically. Unlike ChatGPT, Tasklet actually does the work for you. And unlike traditional automation software, it just works. No flow charts, no tedious setup, no knowledge silos where only one person understands how it works. Listen to my full interview with Tasklet founder and CEO, Andrew Lee. Try Tasklet for free at tasklet.ai and use code COGREV to get 50% off your first month of any paid plan. That's code COGREV at tasklet.ai. Being an entrepreneur, I can say from personal experience, can be an intimidating and at times lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just one of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right one and the technology can play important roles for you. Pick the wrong one and you might find yourself fighting fires alone. In the e-commerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all e-commerce in the United States. From household names like Mattel and Gymshark to brands just getting started. With hundreds of ready-to-use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team. 
And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert with world-class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha-ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com slash cognitive. Visit shopify.com slash cognitive. Once more, that's shopify.com slash cognitive. Okay, that's interesting. So turning to data then, we can check back in on the state of the Polish AI constitution in a year. On the data front, you had mentioned that in the biggest open source models, maybe 1% of the data is Polish. Quick back-of-the-envelope math: I think the Llama models are trained on maybe up to 15 trillion tokens. I don't know if they disclose their data mix, but that would cash out to something roughly on the order of 100 billion tokens in Polish that the biggest projects might be using. I understand you have quite a bit more data than that, but also your last comment... We don't have a trillion tokens, because, as I mentioned, even in Llama there was about one percent Polish language, or maybe less. We now have several hundred billion tokens. We don't even have a trillion tokens because, as I mentioned, the deduplication stage and the filtering-out stage eliminate lots of data, and we don't have one trillion tokens after these data curation steps. So where are you getting your data?
And your comment about the difference, the sort of regulatory arbitrage that the American companies are potentially taking advantage of: are they able to use some Polish data that is on the internet? Some time ago, there were people who analyzed the crawlers visiting websites in Poland, and they identified that there are plenty of Anthropic crawlers, and plenty of robots.txt files on those websites that disallow these Anthropic crawlers from getting the data. I mean, there are plenty of crawlers from US companies and research companies that are crawling Polish data, even if they are not allowed to do it. Because, as I mentioned, it's much harder, for example, to go to court in the USA, to accuse them of using the data, and to be able to fight with them in a US court, even if you have proof that they used the data despite the disallow clauses. And so, in that way, if there's 100 billion tokens that they're getting off the internet, it sounds like you can only use a fraction of that, and then you had to go elsewhere to find the few hundred billion tokens that you have. Yes. For example, we are using the central libraries, some sources of data that are not from the web. Because, as I mentioned, the vast majority of data used by the big vendors, and also by us, is web data. But, for example, we also have some data that is not published on the web, and we can, to some extent, use it. But as I mentioned, it is a minority of the data. Still, even for us, even if we have some access to the local organizations and so on, the vast majority of data we use is web data. And the problem is the same for all the other players. Maybe we can more easily identify some websites that are not easily crawled by the external crawlers. But generally, I think most of the companies, like OpenAI or Anthropic, still have maybe 80 or 90% of our data. Yeah.
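The robots.txt mechanism described here can be checked with Python's standard-library robot parser. The robots.txt content and the example URL below are hypothetical, modeled on the kind of per-crawler disallow rules being discussed:

```python
from urllib import robotparser

# A hypothetical robots.txt of the kind described: specific AI crawlers
# are disallowed site-wide while ordinary crawlers remain allowed.
ROBOTS_TXT = """\
User-agent: anthropic-ai
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def build_parser(robots_txt):
    """Parse robots.txt text into a stdlib RobotFileParser."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

rp = build_parser(ROBOTS_TXT)
print(rp.can_fetch("anthropic-ai", "https://example.pl/artykul"))  # False
print(rp.can_fetch("GPTBot", "https://example.pl/artykul"))        # False
print(rp.can_fetch("Googlebot", "https://example.pl/artykul"))     # True
```

Of course, as noted in the conversation, robots.txt is only a politeness convention: the point being made is precisely that enforcement against a foreign crawler requires going to court, not that the file itself blocks anything.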
So where else are you going to get data? What is your data process? We have, first of all, the massive mass of web data, but also, for example, there is what is called the library of science, with plenty of publications and so on. We also have some private, bilateral agreements with publishers for data not published on the web. But as I mentioned, that is a small fraction of the data we have in our corpora. And are you also... So you mentioned doing a lot of human review. Yes. Our advantage is not in the data used for the pre-training stage, because, as I mentioned, I think 80% of that is still available in the open repositories, but rather that we have dozens or even hundreds of annotators who create the manual instructions and preferences, because they give us the ability to create new data that is still not published on the internet. Is there a Polish equivalent of Scale AI or Labelbox that you're working with to do this, or is this an internal project? No, we have made our own internal tools, not crowdsourced ones. So you guys have built your own platform for human preference data. The human preferences and the human instructions are built locally, internally. Of course, some samples of them will be published to show the structure of our instructions and preferences and some examples, but most of them are still a closed asset. So how do you think about that? I mean, one question that I've been thinking about in the context of this whole sovereign AI discourse is, obviously, as a national government, you can have different goals and different strategies for what you're trying to do.
One goal is, as you alluded to, we want to make sure we have our own data, our own talent base, and compute, which we'll get to in a minute, so that if we get cut off, or who knows what might happen, we have some sovereignty over what's going on, some possibilities to develop in a different way. Yes, that is the first one, as I mentioned: even if you can do something only a little bit worse, you still have the competency and the possibility to take new paths and make new moves. But I think there is also a second issue: I believe that the AI agentic revolution will be based on smart localized models. Why? Because, first of all, there are some branches or sectors of the economy, or even of our public sector, where you are not allowed to use cloud-based solutions. There are some regulations, or the risk is too high, and there are demands to have on-premise models. When you have on-premise models, you always have some challenges, like the GPUs you have to buy and the energy consumption. And usually, when you realize that, for example, you need to buy 16 GPUs and pay for energy, it always leads to downscaling, to using as small a model as possible to achieve the expected goal. And from our experience, people, especially businesses but also the public sector, don't demand exactly ChatGPT, the general-purpose LLM that is able to resolve a thousand tasks. Usually in business and the public sector we have demands for 10 or 20 use cases, and we are able to create smaller models that resolve those tasks at the same level as the big few-shot-prompted LLMs, and host them as on-premise solutions. And I think when we go to agentic solutions, there are plenty of agents, meaning plenty of models used to resolve some complex scenarios.
You really have to downscale the models, using only as small models as possible, to be energy-efficient, and the economic aspects are also crucial now. And I think this is the place where the small localized models can play well. Yeah, that makes a lot of sense in the business context, and in my experience I would say the same has been true. Like, when I'm really trying to dial in performance for a particular use case, and that's all I care about, and I know that this model is going to be deployed in a controlled environment where, because of the way the system is set up, I know what the inputs are going to be, I know what the outputs are going to be, and I know that I have other layers of control, then, yeah, I can just dial into one task or a few tasks, and often a cheaper model with the right training can do just as well. Especially because mostly people are now using the cloud LLMs in a few-shot manner, right? Because they are very powerful, they are able to resolve a very broad number of tasks, thousands of tasks, and people use them in a zero- or few-shot scenario. It means they integrate their own systems with the cloud-based LLMs' APIs and use them out of the box, easily, because you just create the prompt and use the output, that's all. But when you need to create a much more controllable solution, a closed solution, an on-premise solution, and you're not able to use cloud solutions, you have to make different decisions, as I mentioned. Do I need a multimodal or a text-only model? Do I have a training data set? Maybe if I have a good data set, I can supervised-fine-tune a smaller model and achieve the same quality as the few-shot-applied cloud-based solutions. And we create many deployments currently.
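The downscaling logic described here (16 GPUs plus energy pushes you toward the smallest model that meets the goal) is just arithmetic. All prices and GPU counts below are made-up illustrative numbers, not figures from the conversation:

```python
def onprem_monthly_cost(gpus, gpu_month_usd=1500.0, energy_usd=300.0):
    """Rough monthly cost of an on-premise deployment:
    hardware amortization plus energy, per GPU (illustrative rates)."""
    return gpus * (gpu_month_usd + energy_usd)

# e.g. a fine-tuned ~8B model serving 10-20 known use cases:
small_model = onprem_monthly_cost(gpus=2)
# e.g. a big general-purpose model for the same workload:
large_model = onprem_monthly_cost(gpus=16)
print(small_model, large_model)  # 3600.0 28800.0
```

The ratio, not the absolute numbers, is the point: once the task list is fixed, an 8x difference in GPU count is why agentic deployments with many models favor the small localized ones.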
We have identified that when you have a few or 10 different use cases, and you create for them at least 1,000 or 3,000 instructions, and you SFT, supervised fine-tune, the smaller models, you achieve almost the same quality, or sometimes even higher quality, than using the very big cloud-based LLMs in a few-shot approach. But of course, you have to prepare training data sets of at least 1,000 instructions; a higher number of instructions is better, but 1,000 is enough. Mostly organic ones, or maybe semi-automatically created with a human in the loop. And when you have 1,000 or more instructions, you can SFT smaller models, and on this one task or a few tasks you will have the same quality as using the zero- or few-shot cloud-based LLMs. And I think this is the future, because when businesses calculate the risks, the money, the possibilities, how they control the solutions, and what the impact of their decisions is, they will finally, in these AI agentic environments, choose the small localized models and fit them to their demands. And there is also one risk I will show you today, because we discussed it recently with some of my colleagues: in the Anthropic models, Claude, the LLMs from Anthropic, the quality of their abilities and knowledge of the Polish language and culture is going down. Because, for example, they decided to focus more on software developer assistance. And if they focus on that fraction of the market, they go down in the quality of generating text in the niche languages. And imagine that, for example, in Poland you apply such a model from Anthropic, you integrate it with your environment, with your ecosystem, and over the next months or years, the next releases of that model go down on the competencies you demand.
And then you have to choose another model, or you roll back, if you can, to the previous model, or you have to choose another vendor. I think building huge integrations on the cloud LLMs is a big risk, because, as I mentioned, during the next years they can change their target objectives and no longer need to focus on the Polish or Czech languages, because those are not the market for them. Yeah, that's really interesting. I've never heard that before. So just to make sure I understood correctly: you are seeing worsening performance over time in Polish, on issues of Polish culture and general world knowledge, as the cloud models have progressed through generations? Yes. I can say it is the Anthropic-based models, I think the Claude and the Haiku, or maybe just the Haiku, but we also identify this problem with the GPT models. For example, there are GPT models that don't go up in the quality of Polish language and cultural competencies; some of them even go down with the next releases. I think the problem is much broader. I think that the creators of the models analyze the market, and if they need to focus on some competencies and they improve those competencies, sometimes there is a trade-off and other competencies go down. For example, in this case, the Polish cultural and linguistic competencies. I can check it; it was the Claude model from Anthropic, and that version goes down on our PSC benchmark when you compare it to the previous releases. Wow. Okay, that's a really interesting data point. Yeah, and I guess it maybe answers the next question that I had for you, which is: my general working model has been that the frontier AI model developers want as much data as they can get, and if you had any data for them, they would be happy to take it and maybe even pay you for it.
But what you're saying suggests, well, maybe not always, because they're trying to make the smallest models they can as well, while they're doing all this distillation. They're going for efficiency, trying to serve the core use cases that they're getting paid for, which is a lot of coding. And so maybe if you showed up at their doorstep with a few hundred billion tokens' worth of Polish data and said, hey, would you like to use this, maybe at this point they'd say, not really, because we aren't that focused on that use case, and we'd rather do another however many billion generations of coding tasks and use those tokens instead. I guess, would you guys ever consider... I know it sounds like you have some open data, but not all of it is open. If I'm thinking as the government, I guess another goal that I might have is: I want my users, retail users, the general public, to be as well served by AI as possible. And I don't know if you have stats on what the Polish retail consumer is using right now. Are they going to ChatGPT? Are they going to Gemini, or something else, Mistral, who knows? You may know, I don't. But if I were the Polish government and I said, okay, here's what my people are doing, they're using these other companies, and we've gone and collected all this data: is there some sort of deal to be made with the AI companies, where you might say, hey, we'll either give you this data, or perhaps license you this data and you pay us for it, and that way you can incorporate it into your process and serve the Polish market better? I've wondered if there's a trade there. It's a good point of view. I think it is the next natural step, because we are not able to get more data without some assistance or cooperation with other players.
But to my previous statement about serving our citizens as well as we can, that is still one of the goals of our project. Because PLUM, the family of models we call PLUM, Polish language models, is not only a family of models but also assistants and chatbots for citizens and city inhabitants. We don't focus only on the models themselves, because the models are a very good asset, but on how to build, based on those models, the chatbots and RAG-style solutions that can work for citizens nationwide, and also for city inhabitants, for example local chatbots in the city halls. And it is worth mentioning that sometimes we are trying to create better and better models, but the problem is actually at another level. For example, the problem is not that the model is a little bit worse or better, but that there are no chatbots for the cities' municipal halls at all. There are two issues. First of all, the deployment issue: the model should be somewhat customized, supervised-fine-tuned, to be able to work as part of the chatbots and assistants for citizens and city inhabitants. That is one issue. And the second issue, as I mentioned, is that maybe it is time for some cooperation, because only cooperation gives you the ability to improve your data sets and improve your models. I think it's a very good step. As I mentioned, we are developing in our own place, but we know there is a point beyond which you are not able to go further and you have to be supported by someone else. It's normal. The same is true in business: you develop to some level, you reach some point, but then you have to be supported by bigger or different players to finally get better. Yeah. So do you know what that kind of market share breakdown is today?
And is there a sort of established goal that you have, to win market share with the models? It's very hard, because the models are not a corporate initiative. It's not private money and funds; this is a project supported and funded by the Ministry of Digital Affairs. It's a consortium of six institutes and universities, now eight, because we enlarged it in the second year. We are a public initiative, and if you are a public initiative, you don't think too much about return on investment or the number of customers. We are much more focused on how to be as open as possible; how to be legal, because you have to be compliant with the regulators; as transparent as possible; organic, because that improves the linguistic capabilities; and secure; and on how the models can be used by the public sector as much as possible, because mostly in the public sector the models should be closed, on-premise deployments. Yeah, gotcha. You've shared a lot about how you train these models, but what is the base? You're not doing all the pre-training from scratch, right? My understanding is you're starting with a base. I can say a bit more about it and explain it in more detail. We tried to create the models from scratch, that is, from random weights, but the problem is the number of tokens you have for the pre-training stage. As I mentioned, if you look at the published reports on LLMs, they show that even if you have an 8-billion-parameter model, you need at least one trillion tokens to have stable training, stable meaning that it gives you a moderate- or high-quality base model. And in our case, as I mentioned, the deduplication and filtering-out stages give us around 200 billion tokens.
That was too little to create a model from scratch, so we used, of course, Llama. Now the Llama models are much more closed, but one year ago they were still open, and the Llama license was not as prohibitive in Europe as it is now. Usually we use the Llama-based models and the Mistral-based models, and we continue pre-training. We perform language adaptation. Language adaptation means we continue pre-training them on our corpora of Polish text. After that, we have a new base model, and this new model can be SFT-ed and preference-optimized in the second and third stages. But as you mentioned, we are not able to create a moderate-quality or good-enough model without one trillion tokens, and we don't have one trillion tokens in the Polish language. Now we have made some experiments with a mixture of languages. We use not only the Polish language but also other languages, a mixture of languages, to get to that one trillion tokens and do some pre-training from scratch, from random weights. But the effects will be known in a few weeks. Hmm, okay. On this language adaptation step, I have a couple of questions. One is: do you continue to mix in English? Do you try to preserve the model's ability to speak English, or after this language adaptation does it only speak Polish? Of course, with cascade learning, when you use a base model that was already created and you perform the language adaptation, there is always the problem of forgetting. Some knowledge is forgotten from the previous learning stages. But there is still some persistence. Even if we train a few epochs on our Polish data, the models still have the competencies, for example, to write something in English. We don't prune any competencies in other languages. They are somewhat forgotten, because there is the problem of forgetting in cascade learning.
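The mixture-of-languages experiment described here (combining corpora to reach a one-trillion-token budget for from-scratch pre-training) can be sketched as a simple weighting calculation. All corpus sizes below are made-up illustrative numbers, not the project's actual mix:

```python
# Hypothetical corpus sizes in billions of tokens (illustrative only).
corpora = {"pl": 200, "en": 500, "de": 250, "cs": 50}

def mixture_tokens(corpora, target_b, max_epochs=2.0):
    """Scale each corpus by a common factor, capped at `max_epochs`
    repetitions, so the combined mix approaches the target budget."""
    total = sum(corpora.values())
    scale = min(max_epochs, target_b / total)
    return {lang: size * scale for lang, size in corpora.items()}

mix = mixture_tokens(corpora, target_b=1000)  # 1T tokens, in billions
print(sum(mix.values()))  # 1000.0
```

The `max_epochs` cap reflects the common practice of repeating scarce data only a limited number of times before quality degrades; real training mixes also up-weight the target language rather than scaling uniformly.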
But generally speaking, we don't prune it manually. We only take the base model and perform the language adaptation, continued pre-training for a few epochs. And this is how we improve the abilities in the Polish language. Of course, there is a trade-off: some other languages go down, but they are not pruned entirely. And that's also where the world knowledge comes from, right? And by world knowledge, obviously, there are all these local details of life. You can surely give better examples than I would, but I'm thinking of things like: what are the names of the Polish candies that kids like? And how does one file a document if you want to sell a car to somebody else? There's surely some filing process. All these little details are absorbed in that stage as well, right? Yes, and you have to know that the general knowledge is usually absorbed in pre-training. But then you ask about factuality, how good the model is on facts, on regulatory issues, on legal issues that change over time. That is always a problem in any kind of LLM, because you pre-train the LLM, for example, on data up to March 2025, and you don't have in this memory store information about changes in the law, or some regulations, or even some events, accidents, and the names of new politicians after that time point. Generally, we use RAG approaches for that data. It means you have a retrieval stage, where you have a database, a knowledge base that is kept up to date, so it is much easier to update. And then, when you get some retrieval results, you use the models to synthesize or generate the answer based on them. And this is how we handle the factuality issue, because I think none of the providers of the LLMs, even the big ones, is able to re-pre-train every time a new interval of data comes in. Yeah. Do you think this would work for companies?
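The retrieve-then-generate pattern described here can be sketched as a toy. The knowledge-base "facts", the scoring function, and the `answer` helper are all hypothetical stand-ins: a real system would use an embedding-based retriever and send the assembled prompt to an LLM, which is only stubbed out here.

```python
# Toy sketch of the RAG pattern: retrieve up-to-date facts from a
# knowledge base, then let the model answer from them.
KNOWLEDGE_BASE = [
    # Illustrative entries, not real regulations.
    "As of 2025, car sales in Poland must be reported within 30 days.",
    "PLUM is a family of Polish language models.",
]

def retrieve(query, docs, k=1):
    """Rank documents by naive word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def answer(query, docs):
    """Assemble the prompt a generator model would receive."""
    context = "\n".join(retrieve(query, docs))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return prompt  # in a real system, this goes to the LLM

print(answer("How fast must car sales be reported in Poland?",
             KNOWLEDGE_BASE))
```

Because the knowledge base, not the model weights, holds the time-sensitive facts, updating it is a database edit rather than a re-pre-training run, which is exactly the point being made.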
This is a bit of a digression, but it's a question I've had in my mind for a long time. Almost two years ago now, I did an episode of the podcast with a company called MosaicML. What they were doing, among other things, was this sort of continued pre-training for businesses. They would go into a business and say: let's get all your tokens — all the Google Docs you've got, the Slack history, all these various things — let's compile that, and now we can pre-train on it, and hopefully the model will start to speak your internal native dialect of whatever language you're speaking.

As you mentioned, I'd call that domain adaptation. For example, you have some closed data — internal documents about insurance, about customers, reports that are not public — and you would like to pre-train the model on this data to make it more adapted to the domain. We have done such a project for the biggest bank in Central and Eastern Europe: PKO BP, one of the biggest banks in Europe and the biggest in Central and Eastern Europe. We performed exactly the domain adaptation you mentioned — they have their own closed domain data, and we continued pre-training the models on it. I think it's a very good approach. We showed that results vary by task, but there are tasks where this domain adaptation gives you a huge gain in quality, in some financial measures, and so on.

But there is one remark I have to make. I think only the huge companies have enough data to make domain adaptation worthwhile, because we know that, after deduplication and filtering, you have to reach at least around 10 billion tokens. If you don't have 10 billion tokens, it's not worth performing the adaptation.
And if you would like to end up with 10 billion tokens in a domain corpus, you need at least 30 billion tokens before the deduplication and filtering stage. And 30 or 40 billion tokens — I think there are only a few companies, maybe not a few, but fewer than 100 in Europe, that have that much very closed internal data.

Yeah, I guess my intuition is that it depends what you count. I mean, I started a company that's, you know, 40 people, and I don't know how many tokens we have, but over all the Slack messages, all the Google Docs, all the Jira tickets, all the contract proposals we've sent and the revision history on all of those, I do feel like it adds up pretty quickly. So maybe one of the barriers is just how deep these companies are willing to mine into their own data — like whether they're actually willing to go get email data from their employees.

I think if you count this data, there are maybe billions of conversation turns, hundreds or thousands of agreements or proposed agreements. But if you sum them up and count them, and then deduplicate and filter them, it's very hard to get to 10 billion tokens. You can do the math: 10 billion tokens is roughly 10 billion words. And to have 10 billion tokens left after deduplication, you can imagine you need at least 30 or 40 billion tokens in the raw domain corpus. I think it's not so easy. It seems easy to many people, but when we start counting, it's not so easy to get 10 billion tokens.

Yeah, interesting. And I assume that what you're counting probably doesn't include individual employees' email histories and all that sort of stuff — that's kind of out of scope?

I think emails between, for example, the sales force — call-center or sales emails — those you can use.
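The back-of-the-envelope arithmetic here — roughly 30-40B raw tokens shrinking to ~10B after cleaning — can be made explicit. The retention rates below are illustrative assumptions chosen to match the ratio in the conversation, not measured values from any real pipeline.

```python
# Rough feasibility check for domain adaptation, following the rule of
# thumb in the conversation: ~10B tokens are needed *after* deduplication
# and quality filtering, which implies 30-40B raw tokens.
# Retention rates are illustrative assumptions.

def usable_tokens(raw_tokens, dedup_retention=0.5, quality_retention=0.6):
    """Estimate tokens surviving deduplication, then quality filtering."""
    return raw_tokens * dedup_retention * quality_retention

def worth_adapting(raw_tokens, threshold=10e9):
    """Is continued pre-training on this corpus worth it?"""
    return usable_tokens(raw_tokens) >= threshold

print(worth_adapting(35e9))  # ~35B raw -> ~10.5B usable: clears the bar
print(worth_adapting(5e9))   # a small company's corpus: far short
```

Under these assumed retention rates, a 40-person company with even a generous 5B raw tokens ends up an order of magnitude below the threshold, which is the point Marek is making.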
But, for example, there are also kinds of emails that are not suitable to use, because of undefined problems with intellectual property, or cybersecurity risks, and so on. I think it's not so easy to use just any kind of email, because some emails are too risky or too... not too delicious...

Sensitive, yeah.

Yeah. Sensitive.

Yeah, that's really interesting. It also makes me wonder whether new organizational structures are going to be advantaged in some of these dimensions. Because I totally understand the difficulty that would arise if you said: okay, hey everybody, we know you've been working here for all these years, sending all these emails — by the way, we're going to take all of that and put it into our training process. You might have a revolt.

It's a bit different: most organizations, even the big ones, are not aware of what kind of data they have. Before you adapt AI, before you train AI, you first have to clean your data stores: identify what data you have, whether it's clean or not, high quality or low quality. The data curation process, the data organization process — everything around how to organize your data so you can find the high-quality fraction of it — is a problem in itself. Very often companies try to integrate, deploy, or even train AI without any data curation or data organization process, and usually it collapses. I think this is the most important thing: to be aware of what data inventory you have, what quality levels it has, and what data you can use without breaching internal or external regulations. This is the most important step. After it, when you have properly identified, described, and well-organized data sets, you can go on to invest in AI training and AI deployment on top of that.
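The inventory-before-training step Marek describes can be pictured as a triage pass over a document store: tag each item as usable, restricted, or low quality before any model ever sees it. The markers, labels, and thresholds below are illustrative placeholders — a real pipeline would use PII detection, policy checks, and learned quality classifiers.

```python
# Sketch of a data-inventory triage pass: before any training, label each
# document so the "high-quality, permitted fraction" can be identified.
# Markers and thresholds are illustrative placeholders only.

SENSITIVE_MARKERS = ("password", "personal id", "confidential")

def triage(doc):
    """Return a coarse usability label for one document."""
    text = doc["text"].lower()
    if any(marker in text for marker in SENSITIVE_MARKERS):
        return "restricted"      # blocked by internal/external regulations
    if len(text.split()) < 20:
        return "low_quality"     # too short to carry training signal
    return "usable"

docs = [
    {"text": "Quarterly report on loan portfolio risk, " + "word " * 30},
    {"text": "confidential: customer personal id list"},
    {"text": "ok thanks"},
]
labels = [triage(d) for d in docs]
print(labels)  # -> ['usable', 'restricted', 'low_quality']
```

Running a pass like this first is what makes the later token-counting honest: only the `usable` bucket counts toward the 10-billion-token threshold.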
That's all really interesting. Thank you — definitely great food for thought for me. Going back, just one more question on models: I know you mentioned that some models have licenses that don't allow you to use them in the EU?

Mostly it was the newer Llama models — in their license there is a prohibition; they are forbidden for use in the European Union. And now there are the Kimi models; they have the same restriction — they cannot be used in the European Union. I think it's a consequence of the AI Act, because the AI Act — its second chapter took effect in August 2025 — demands that for general-purpose models you produce a model card: what data was used for training, how it was secured, what data sets and what resources were used to create the model, and many other points. And I think the providers of those models don't want to reveal what was used and what their training stages even look like.

Yeah. If that weren't a problem, would you be open to using Chinese models, or are Chinese models not appealing for other reasons?

I think it depends on the task. For example, I would have, as you say, a huge aversion to the risk of using a Chinese model to create long-form text — stories, essays, emails, and so on — because I think the longer the text you generate, the higher the chance that evidence of censorship becomes easily noticeable. But when you use a model for tasks like understanding — analytical, extractive tasks, for example extracting specific information from documents — I am open to using a Chinese model, because the risk of censorship in typical analytical, extractive tasks is very low.

Yeah, that makes sense.
Turning to compute and talent, the other two big legs of the AI stool: how AGI-pilled would you say the Polish government is? I mean, it's pretty remarkable that all this is going on at the governmental level already — I'd say that speaks to a pretty situationally aware and generally agile government. But how committed is the government? How big a deal does the government understand this to be? And downstream of that are questions around how much funding there is.

As you mentioned during the first minutes of our conversation, there are three pillars of the revolution. First, the data: data organization, data curation, and generally treating data as the crucial input for AI training. Next, the compute power: the GPUs, the AI factories, the data centers, and so on. And finally, the talent — the people. The three pillars combined create the fuel for the AI revolution. As I mentioned, I think in Poland we are currently focused mostly on the AI factories: buying as many GPUs as we can and building data centers for GPU workloads. Of course, we have some projects like PLLuM — that is a very good example; I think there are maybe two or three projects similar to PLLuM in the European Union. The Ministry of Digital Affairs has funds, and they funded a consortium of universities and institutes that are able to develop the models and the competencies. In that way they also support the talent, because you have money for the people.

But I don't think we will be able to compete with the US market, because in the US market, AI engineers are paid like NFL players. I heard something like that — the best AI engineers and AI researchers have contracts like NFL quarterbacks. They are treated as stars.
And I think we don't have the maturity in our decision-making to pay people — to overpay people — for their niche competencies, because in the European Union, and especially in Poland, it's much harder to say what final objective function we would like to reach. For example, 10 million customers, or $10 million a week in subscriptions — it's much harder for the public sector to define objectives that are easily monetized and easily evaluated. I think in the USA they can pay those huge contracts because they are able to evaluate, even with some margin or buffer for the future, what kind of innovation a given talent can deliver and how that innovation will pay itself back.

And this is the difference. In the USA, when I was in Las Vegas and mentioned that we are working on LLMs in the public ecosystem — that the Ministry of Digital Affairs funded us, that we created a consortium funded by the public sector — most of the people I met were surprised that the public sector invests in LLMs. In the USA, it's hard to imagine that the public sector has the intuition, the knowledge, and the money to invest in such a sexy and revolutionary topic as AI.

So how do you think this will evolve over the next couple of years? Obviously, the amount of resources that the frontier companies are putting into their current and future models just continues to grow, right? That's expected to be...

We have to be humble. Even now we see it: when I look, for example, at the GPT models and compare them — of course, the reasoning abilities keep going up across different kinds of models — but generally, GPT-5 was not such a huge improvement over GPT-4. Of course, there are some reasoning gains, but generally, there's a plateau.
The models keep improving, but past a certain level the improvements become very steady — it's horizontal improvement, not vertical. It isn't shooting up; it's much more of a plateau, and the development is not as sexy as it used to be. Remember ChatGPT in 2022, GPT-4 and multimodality in 2023-2024 — there were moments in the history of this revolution that were shocking, that set our imaginations working, so that each year we had something huge, something changing the rules of the game. Now, I think, the models improve much more steadily; there are no huge jumps.

And generally, I think we are now starting to count the costs — the cost of the energy, and the cost relative to what the models are actually used for. It's the same as meeting people: you meet someone, you have five minutes in which you are impressed or maybe not impressed, and afterwards you have to evaluate whether it's worth meeting that person again. Maybe it takes five minutes, maybe hours. In the same way, I think we're coming to the time when we evaluate what the real cost of these tools is, how we can use them, and what they give us if we use them properly. Like a verification stage. And I think the verification stage will show us that we don't need very huge LLMs — we should invest in smart, localized LLMs, especially when you are working with on-premise solutions.

Yeah, I have sort of mixed feelings about that. On the one hand, I do think the models that already exist are amazing artifacts, for one thing. And very often, especially if you take the time to do the supervised fine-tuning and really dial in their performance, they can work perfectly well for all sorts of use cases. At the same time, of course, we've got the leading companies saying: we're nowhere near done; this is definitely going to keep going.
You should expect more progress. We're going to have AI scientists; we're going to have AI researchers. How much does your strategy depend on — or how would it change — if it turns out there's not so much a plateau, and you do still see significant capability jumps, albeit with exponentially more resources required to achieve them? How do you navigate a world where a $10 billion training run really is that much better than a $1 billion training run?

I think the most important issue is the first one: the demands — what is expected by our customers. As I also mentioned, we are very often biased by general-purpose benchmarks, but they don't match the benchmarks and expectations that a business has. For example, if you have a business and you know where you would like to use AI, and what metrics you need in those places, that gives you very good information about what benchmarks you should create to evaluate which model can reach those expected metrics. And I think this is the most important point: very often we analyze general-purpose benchmarks — factual, reasoning, extractive competencies and so on — but for a business, the problem is slightly different. For example, they need something that writes a beautiful email to the customer, or something that generates an email that enables cross-selling. Very often we don't know what should be done, because the business doesn't define the requirements explicitly. So I would start from there. First and foremost: what would we like to improve in your business? What kinds of tasks would you like to hand over to the AI? Next, create benchmarks for those tasks. And then you choose the LLMs.
Because I think most of the business cases I have seen don't demand reasoning — you can handle them with ordinary LLMs, without reasoning stages. I think we should slow down a little and analyze what actually needs to be done, and especially what the real business factor is: not just what's sexy and good for public relations, but what makes money for the business, or what generates savings.

Yeah, there's often a massive disconnect, I think, between general business culture and the culture of tech.

There are two trains, but they are not on one track, one after the other — they run next to each other. One is much faster; the other goes at its own pace. And, as you mentioned, very often there is no crossing between them. There are two roads, but the crossing is still far ahead of us.

Yeah, that's really interesting. Who are your allies in this? You mentioned using multiple languages, and I assume that's in some sort of partnership with other neighboring countries' national institutes. I'm curious how you think the international dynamics will play out. Historically, in the Cold War, we had the US and the USSR — two great powers engaged in proxy conflict all over the world. And a lot of other countries understandably said: this is bullshit from our perspective. There was a movement of countries that said: we don't really want to be in either of your camps; we would rather be independent. The beef you have between yourselves — we don't want to be a pawn in that game. Now it's the US and China, obviously, that are the two big poles of AI power. How do you think the countries that are — as I sometimes say — countries 3 through 193 on the AI power rankings will react?
Do you see alliances forming, or countries working together to share resources and data sets, to try to create some sort of way forward in the AI space?

There are some movements in the European Union. For example, there are international projects that gather different kinds of people from different countries to do something together. But I think there's a fundamental tension: when you want products or outcomes very fast, you have to centralize. The problem is always the same. Ideally, the best way is a federation — everything spread out, different people in different countries collaborating with each other, so that the wealth rises everywhere, in a distributed but harmonized way. But when you want results very fast — outcomes and products in months, not years — you usually have to centralize the assets in one place, and that's the problem, because these are two different approaches. If you want to do it the ideal way, you should create unions: unions of countries, unions of states, unions of partners, networks, consortia with hundreds of stakeholders, to spread this knowledge and this power everywhere.
But usually, when you have to produce outputs very, very fast, you have to centralize. These are two opposite approaches; you are not able to follow both at once, and I think that's the problem. Generally, when there is huge pressure for outcomes, for new models, you always prefer centralization — like Silicon Valley. You have the huge USA, but 90 percent of the startups are in Silicon Valley: centralization in one place where the money and the assets are. From an economic point of view, the best thing would be to distribute those companies across the whole USA. And that's the problem: if you want to monetize something and get very fast outputs, you have to centralize; but generally, what's best for the economy and for society is to distribute, in a balanced way, across the country and across the continent.

What about geopolitically?

I think there are still two players, China and the USA. They have the two biggest economies; they have money. I heard that the Chinese companies now pay their researchers the same as the US ones — the contracts are currently similar, meaning they pay very well. There is no risk that people will be poached by US companies, because they are well paid in China. In Europe, I don't think there is... well, there is Mistral, the European-funded startup. It's not a startup now, but it was two years ago. But I heard that 30 or 40 percent of the shares are now held by Microsoft — they're not as independent as they used to be, because there are stakeholders from the USA. I think the real question is slightly different: are China and the USA going to remain rivals, or is there also a chance for cooperation? That is the question.
Maybe there is still a chance for cooperation.

Yeah, from your lips to God's ears. Maybe just one little follow-up — and I think this has been excellent; I really appreciate all your time and all these thoughtful answers. Is there anything you have seen, on a technical or socio-technical level, that can help with the cooperation of decentralized AI? Here I'm thinking about things like the NEAR Protocol — I recently did an episode with its creator, Illia Polosukhin — or the Intelligent Internet, which is Emad Mostaque's project. There are others as well. These things have this idea that if we create the right scheme, it might be somewhat cryptographically enabled.

There is a topic called federated learning: you can use different data sets, somehow anonymized and secured, in networks where you cannot identify the sensitive data, but you can still use the data to train your models. There are different kinds of ideas about federated learning, but I don't know of any huge deployments of such approaches, even though it's a very good approach to have such federation networks for cooperating with each other and sharing data in, as I mentioned, a secure way. But if we're talking about business and the economy, I think we are not yet at the level to use it, because we still have the problem that companies mostly don't know what data they have, what its quality is, or what its value is. If you aren't able to measure your own in-house data repositories, how can you go further and create a data mixture or a network of data repositories? I think this is maybe for the future. In the future, I think there is a chance, with distributed data repositories, with security layers and anonymization, for huge consortia to use as much data as we can.
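The federated-learning idea Marek refers to can be illustrated with the classic federated-averaging step: each site trains on its own private data and shares only model weights, which a coordinator averages, weighted by each site's example count. This is a deliberately tiny sketch — weights are plain Python lists, and the two "sites" and their numbers are invented for illustration.

```python
# Toy federated-averaging step: sites share only trained weights, never
# raw data; the coordinator computes an example-count-weighted average.

def fed_avg(site_updates):
    """site_updates: list of (weights, n_examples) pairs.
    Returns the weighted average of the weight vectors."""
    total = sum(n for _, n in site_updates)
    dim = len(site_updates[0][0])
    return [sum(w[i] * n for w, n in site_updates) / total
            for i in range(dim)]

# Two hypothetical sites; their raw training data never leaves them.
site_a = ([0.0, 2.0], 100)   # weights after local training, 100 examples
site_b = ([4.0, 2.0], 300)
print(fed_avg([site_a, site_b]))  # -> [3.0, 2.0]
```

Real deployments add secure aggregation and differential privacy on top of this step, which is where the "secured and anonymized" part of the discussion comes in.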
But I think this is not for next year.

Yeah. So much depends on whether there really is a plateau, or whether the frontier companies are just going to keep scaling successfully.

And now I heard that the plateau is caused by a lack of organic data. The biggest companies have already collected almost all the organic data that could be collected from the internet. Some of them even scan books that never appeared on the internet to keep their organic data gains going. But the problem remains: I think this reservoir of organic data is almost exhausted.

Yeah, that's why we're now seeing all these simulated worlds, and the strategies to overcome that are going to be definitely fascinating to watch. I, for one, will bet on them working, but to a certain degree it certainly remains to be seen. Again, a really fascinating conversation — it's been awesome to get your perspective. Anything else you want to share, or anything we didn't touch on that you want to comment on, before we break for today?

I can recommend our arXiv paper on the PLLuM family of models. It was released on arXiv in the first days of November, and I recommend that readers and viewers of this podcast look inside it. The PLLuM family is in the title of the paper.

Yeah, that's P-L-L-U-M, and the P, of course, is for Polish — the Polish large language models. Marek Kozlowski, this has been amazing. Thank you for being part of the Cognitive Revolution.

Yeah, thank you very much.

If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube.
Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts — now part of a16z — where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.