

Building an AI Mathematician with Carina Hong - #754
TWIML AI Podcast
What You'll Learn
- Large language models have advanced reasoning and math capabilities, setting the stage for an 'AI mathematician'
- The Lean programming language has seen growing adoption among mathematicians for writing formal proofs
- Applying code generation techniques to mathematics could lead to performance gains, but data scarcity is a challenge
- Lean allows mathematicians to write proofs in a rigorous, programming-like way, with the compiler catching errors
- The goal is a bidirectional system that can convert between natural language math proofs and formal Lean proofs
AI Summary
This episode discusses the work Carina Hong and her company Axiom are doing to build an 'AI mathematician' by bringing together AI, programming languages, and mathematics. The key points are that the timing is right due to advancements in large language models, the growth of the Lean programming language used for formal mathematical proofs, and the potential to apply code generation techniques to mathematics. Lean allows mathematicians to write proofs in a rigorous, programming-like way, with the compiler catching errors. The goal is to create a bidirectional system where the AI can convert natural language math proofs into Lean and vice versa, allowing human intuition and formal verification to inform each other.
Key Points
1. Large language models have advanced reasoning and math capabilities, setting the stage for an 'AI mathematician'
2. The Lean programming language has seen growing adoption among mathematicians for writing formal proofs
3. Applying code generation techniques to mathematics could lead to performance gains, but data scarcity is a challenge
4. Lean allows mathematicians to write proofs in a rigorous, programming-like way, with the compiler catching errors
5. The goal is a bidirectional system that can convert between natural language math proofs and formal Lean proofs
Topics Discussed
- Large language models
- Lean programming language
- Code generation
- Mathematical reasoning
- Formal verification
Frequently Asked Questions
What is "Building an AI Mathematician with Carina Hong - #754" about?
This episode discusses the work Carina Hong and her company Axiom are doing to build an 'AI mathematician' by bringing together AI, programming languages, and mathematics. The key points are that the timing is right due to advancements in large language models, the growth of the Lean programming language used for formal mathematical proofs, and the potential to apply code generation techniques to mathematics. Lean allows mathematicians to write proofs in a rigorous, programming-like way, with the compiler catching errors. The goal is to create a bidirectional system where the AI can convert natural language math proofs into Lean and vice versa, allowing human intuition and formal verification to inform each other.
What topics are discussed in this episode?
This episode covers the following topics: Large language models, Lean programming language, Code generation, Mathematical reasoning, Formal verification.
What is key insight #1 from this episode?
Large language models have advanced reasoning and math capabilities, setting the stage for an 'AI mathematician'
What is key insight #2 from this episode?
The Lean programming language has seen growing adoption among mathematicians for writing formal proofs
What is key insight #3 from this episode?
Applying code generation techniques to mathematics could lead to performance gains, but data scarcity is a challenge
What is key insight #4 from this episode?
Lean allows mathematicians to write proofs in a rigorous, programming-like way, with the compiler catching errors
Who should listen to this episode?
This episode is recommended for anyone interested in large language models, the Lean programming language, and code generation, and for those who want to stay updated on the latest developments in AI and technology.
Episode Description
In this episode, Carina Hong, founder and CEO of Axiom, joins us to discuss her work building an "AI Mathematician." Carina explains why this is a pivotal moment for AI in mathematics, citing a convergence of three key areas: the advanced reasoning capabilities of modern LLMs, the rise of formal proof languages like Lean, and breakthroughs in code generation. We explore the core technical challenges, including the massive data gap between general-purpose code and formal math code, and the difficult problem of "autoformalization," or translating natural language proofs into a machine-verifiable format. Carina also shares Axiom's vision for a self-improving system that uses a self-play loop of conjecturing and proving to discover new mathematical knowledge. Finally, we discuss the broader applications of this technology in areas like formal verification for high-stakes software and hardware. The complete show notes for this episode can be found at https://twimlai.com/go/754.
Full Transcript
I'd like to thank our friends at Capital One for sponsoring today's episode. Capital One's tech team isn't just talking about multi-agentic AI. They already deployed one. It's called Chat Concierge, and it's simplifying car shopping. Using self-reflection and layered reasoning with live API checks, it doesn't just help buyers find a car they love. It helps schedule a test drive, get pre-approved for financing, and estimate trade-in value. Advanced, intuitive, and deployed. That's how they stack. That's technology at Capital One. Math and coding are two important, perhaps the two biggest, parts of digital work. And coding is heavily invested in. Math is not. Not that math is not tractable, but that math has not been turned into a programming language yet. And there will be new markets and new use cases that get unlocked because of this. All right everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Carina Hong. Carina is founder and CEO at Axiom. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Carina, welcome to the podcast. Thank you for having me. Great to be here. I'm excited to dig into our conversation. We'll be talking about mathematical reasoning, which is what you are working on there at Axiom. And it is a very timely topic. Before we dig in, I'd love to have you share a little bit about your background. I've loved math for as long as I can remember, when most people don't. And I feel like Olympiad math training gives you this sense of a constant dopamine hit. You solve a problem, and then you feel so good about yourself, and then you move on to the next one. A bit later, I went to MIT and started a research mathematics career. And that was a lot more pain and suffering. You usually are stuck on a problem for months. And I remember I was working on a really hard problem for about half a year, and I would meet with my advisor every week and have nothing to report. That's kind of the life of a research mathematician. What particular field in mathematics? Yeah, so I work on number theory and combinatorics. A lot of people who work in combinatorics come from the Olympiad math background, so they can solve problems a lot faster than other combinatorics researchers. But number theory, hopefully, is better. You can read a lot of literature and make really interesting contributions without having this sort of outlier problem-solving skill. Nice. I took a real analysis class in grad school, and I think that was about enough for me. I love graduate analysis. I remember back then in the graduate analysis class, I don't know if it's because no one was quite following, we were just passing notes and saying that the professor looks like Winnie the Pooh. That must mean that the class was a lot easier for you than it was for me, or maybe we were just completely lost, I don't know. Nice. So you're working on this hard problem at MIT, and it took you half a year. Yeah, and then I think I realized at one point, wouldn't it be great if you could just let your intuitions take you? You have some lemma that you want to prove, instead of being stuck there constantly verifying the details. Wouldn't it be great to just, you know, think that there's a high chance, if your intuition is right, that the lemma is correct, and move on to the next one.
A bit like, I mean, those creative AI art startups, right? Like Pika. You want to put a fox in the cartoon and then have the fox talk to a bunny, instead of actually sketching it out; you say it and let it be. Yeah, exactly. I think that might be the difference between, say, Terence Tao and Carina Hong, which is perhaps both can have some sort of lemma they want to prove, but Terry can just move on to the next one, and Carina needs to be stuck there for about six months before moving to the next one, you know. One thing that strikes me about what you're doing is that we seem to be in a particular moment with regards to math and AI and this confluence of math and reasoning. I recently had Christian Szegedy on the podcast. A week after we talked, he left to start another startup, Math Inc, to focus on this area. There's also Harmonic working on this area, and probably untold others. I think the first thing I want to get your take on is: why now? Why is so much investment and effort more broadly going into mathematics and AI? I think what we feel at Axiom is the timing is right. And the timing is right for three reasons. We believe an AI mathematician will be built by bringing three fields together, and that's AI, programming languages, and math. If you look at the current landscape, informal models are really good at reasoning. They're really good at math and coding. Informal models like LLMs. Yeah. So informal models, large language models; the o series of OpenAI models really marked an era of post-training, reasoning, reinforcement learning. These are great technique breakthroughs. So that's number one. Number two is on the formal side. Lean 4, which was rolled out in September 2023, has been widely celebrated by the community. The number of people who learn Lean and use Lean on a daily basis has greatly expanded. And a lot of mathematicians, Fields Medalists, are joining forces and using Lean and seeing, okay, well, how can this formal language, this programming language, help us verify math and, you know, hopefully have more and more digitalized mathematics. So that's number two. Number three, I would say, and this is maybe Axiom's hot take, is that by turning math into a programming language, you really need a lot of the code gen techniques. And code generation has been crossing a threshold. We have seen really impressive performance from reinforcement learning applied to coding. And the hope is, by turning math proofs into a programming language and using a lot of the code gen techniques that we think are a bit overlooked, we can have really amazing performance gains on math as well. So three things are really converging together. Are these code gen techniques overlooked in code gen, or overlooked in application to math? In application to math. And for good reasons. If you look at, for example, there are probably more than a trillion tokens of Python code on the Internet. Yeah, that's a lower bound. I think there are about 10 million tokens of Lean code. So we have LeanUniverse, which is the largest Lean dataset. The developer of LeanUniverse is currently at Axiom. And we are estimating, wow, there is a 100,000-times data gap. And that's ridiculous. Scaling can't quite work with this amount of data scarcity. Yeah. Talk a little bit about Lean. I see that come up frequently in the conversation around mathematical reasoners. You know, it's a programming language. What does it kind of look and feel like?
You know, what problems is it trying to solve? Who's using it? Yeah, 100%. I think the magic of Lean is that human mathematicians have been doing mathematical reasoning on pen and paper, literally, for thousands of years. And because of how they are trained, they do it in a very rigorous way. It is checked by humans to make sure that each step strictly follows from the previous one. Lean makes this more like how computer programs work. You can write a proof in Lean, and the Lean proof will hopefully compile if the proof is correct. And if it's not, then it will report a bug, say a bug at line 37. Then you look at line 37, and you try to analyze exactly what the bug is about. It can be that you got the syntax wrong, like every Python beginner gets the syntax wrong sometimes. It could be, you know, not the mathematical merit of the proof that's wrong, but just the syntax. Or it can be that the math is just wrong. And there's a third kind of problem that's a bit more stubborn. That is, there's something about the types that are not matching. Lean is built on something called dependent type theory, so you have types associated with every object, and sometimes these sorts of bugs are harder to fix. So by feeding the error message back into, say, your prompt, you can do inference with feedback, and you can do the standard RL type of stuff. And hopefully you can have the machine see more Lean, learn more Lean, and generate more Lean, so it kind of stops making those mistakes that are at the syntax level. But whenever there is a mistake, you can be more confident that something about the math is wrong. And what does it look like? You know, I'm thinking about the math that we see in papers. And I'm also thinking about one of the first computer programming languages I learned in education, which was APL, which has all kinds of weird symbols and things like that. And I'm wondering if Lean looks like that. Yeah, Lean has some interesting math symbols, like the inverted A (∀), which means 'for all.' It basically looks like a programming language. It's definitely not quite readable. APL looked a little bit like hieroglyphics. Oh, okay. Yeah, it's very interesting to think about the abstraction of these languages, with natural language being very readable but not quite easy to compute, right? You can't actually do computation with English. And then you go down the stack, and you have programming languages like Python, like Lean. We now think about, for example, Python being translated almost to an intermediate language, and then another intermediate language, and eventually the machine code that is doing the actual computation; that's going down the stack. But it doesn't quite, there are some local examples where it goes up the stack a little, but it doesn't go fully back up to the Python level. That's not how compilers work. On the other hand, if you think about the highest level of the stack being the natural language math proof, and at the bottom you have the Lean proof, in the middle you have things that are needed to convert, to formalize, a natural language proof into Lean. You have maybe finer and finer blueprints, like math that is ten times longer than the original paragraph of proof, and it actually is able to go down and up the stack freely. And I think that's quite interesting. So autoformalization is for a model to convert an English proof into a Lean proof, and informalization is going back up.
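To make the informal-to-formal pairing concrete, here is a minimal sketch, not from the episode, of what a Lean 4 formalization can look like: the informal statement rides along as a comment, and the compiler checks the formal proof. It assumes Mathlib and its `Even.add` lemma.

```lean
import Mathlib

-- Informal statement: "the sum of two even natural numbers is even."
-- A Lean 4 formalization, using the inverted A (∀) mentioned above:
theorem even_add_even : ∀ a b : ℕ, Even a → Even b → Even (a + b) := by
  intro a b ha hb   -- introduce the variables and hypotheses
  exact ha.add hb   -- Mathlib's `Even.add` closes the goal; a wrong step
                    -- here would be flagged by the compiler
```

If the proof were wrong, whether in syntax or in mathematical substance, the compiler would report an error at the offending line, which is exactly the feedback signal described above.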
So then you have a full bidirectional circle, which is very interesting, because then you have human mathematicians reasoning, with high-level intuitions, at the English layer, and then Lean is proving at the bottom layer. And these formal proving signals can then feed back in to advise the mathematician, okay, well, maybe this part of the verification isn't quite working, to then guide the continuation of these high-level intuitions. That process is very fun. It can actually be transferred to a lot of other settings as well, if you think about formal verification in other contexts, such as chip design, such as code verification. Those are quite interesting examples. Are you able to say conclusively that if a program compiles and runs in Lean, then the proof is without error mathematically? It seems like that in and of itself would be a research topic or a significant achievement, having that one-to-one mapping between validity in the Lean space and validity in the mathematical space. Yeah. So I think it's quite interesting that a lot of the mistakes actually happen at a misformalization of the statement. If you're sure that your statement is correctly formalized and your proof compiles, then congratulations, you have a correct proof. Now, there might be multiple correct proofs, and for the same proof there might be multiple ways to write it in Lean, so it's not a one-to-one matching per se. But you do have the provable guarantee that if your statement is correctly formalized and you have a working Lean program, then you have a proof right there. You talked a little bit about the three elements that are coming together to make this an interesting time for mathematical reasoning, Lean being one of those pieces. And now we're talking about autoformalization. Is autoformalization part of Lean as a package, or is it something that folks are working on that utilizes Lean as a foundation? And what's the current state of autoformalization from your perspective? We think of autoformalization, which is the process, again, for a model to convert natural language math proofs into Lean, as surprisingly very hard. I mean, it's not just a translation problem. So I think there are two ways to think about autoformalization. One is as a data bet, right? You can convert a lot of math data in the informal space to the formal space, and then you kind of solve the data scarcity problem of formal math data. So one way to think about it is as a way to gather data. The other way to think about it is that it is also a core technology in and of itself. Autoformalization and proving have very entangled capabilities. I think a model that can autoformalize very well likely can do better at proving, and the other way around. So we strictly don't think that autoformalization and proving should be trained separately. They should be in one loop. They should be in one model. So I think these are the two ways we think of autoformalization. If you look at the entire mathematical archives, there are about 30 to 70 million statement-proof pairs. And almost none of these are, you know, in Lean. What is in is very basic stuff, and domains that happen to be the domains that the Lean developers, the Lean Mathlib contributors, like, such as algebraic number theory. Those are in, but differential topology, for example, is not in.
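A standard textbook pitfall, not an example from the episode, illustrates why misformalization of the statement is where the mistakes hide: subtraction on ℕ in Lean is truncated, so an "obvious" transcription of ordinary algebra can be a false statement even though it looks right.

```lean
import Mathlib

-- Pitfall: in Lean, natural-number subtraction is truncated, so 2 - 5 = 0.
-- The naive formalization of "a - b + b = a" is FALSE as stated
-- (take a = 0, b = 1), even though the informal algebra looks fine:
--   theorem bad (a b : ℕ) : a - b + b = a := by omega   -- does not compile

-- A faithful formalization carries the hypothesis the informal claim
-- silently assumed:
theorem fixed (a b : ℕ) (h : b ≤ a) : a - b + b = a := by omega
```

No proof of the bad statement can ever compile, but a subtly wrong statement sometimes can be proven, which is why, as comes up later in the conversation, experts still need to eyeball formalized statements.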
If you can do autoformalization well, you can get these very high quality seed data at all levels of difficulty of math. And then you can convert these seed data into even more synthetic data. Then you can try to close the data gap we just talked about between math and coding. Lacking those, how do you approach closing that data gap? Yeah, I think autoformalization is important. We also think that the other parts, you know, the synthetic data generation, like after you have a lot of high quality data, how do you make it 100 times, 1,000 times more? I think those are interesting as well. Maybe I'm not hearing this correctly, but part of what I'm hearing there is that we need more data to do autoformalization, but then more data and autoformalization are needed to do proving. But, you know, in the first place, how do we get the data required to do autoformalization, or more broadly, what approach are you taking to tackle autoformalization? I think it is a cold start problem, right? Because machines have not seen enough Lean, they don't know how to do Lean well. So you do have that sort of cold start, chicken-and-egg problem. I think there are mature research techniques out there in the publication space that are trying to solve this problem by bridging the gap between natural language proofs and formal proofs. We think, you know, broadly, reinforcement learning is an important part of that. How do you see RL playing in this space? So we have seen really amazing scores on various benchmarks from models that use reinforcement learning to do proving well. We have seen, for example, SeedProver, a really strong paper from ByteDance's Seed team. There's the Hilbert prover. Before that, there was Kimina Prover, DeepSeek Prover. We have seen reinforcement learning applied to formal math having pretty good results. And is there a general approach that is being taken to apply RL to formal math? I think people try different things, actually. There are many different designs. There are more Monte Carlo tree search based models, and there are more creative agent designs one can do. And there are interesting RL training recipes that each lab chooses to take. It's more like an art than a science, I think. How do folks approach objective design with proofs? Is there a standard way to think about that? Yeah, I think there is a difficulty. I think it's that the signal can be quite sparse, right? Once the proof gets incredibly complicated, it's a little bit hard to get it to work. So some people do see that as a challenge. But I mean, it's a very rapidly evolving field, and people have been trying to, you know, define good curriculums for the model to hill-climb. So currently, formal math models are doing quite well at the high school, Olympiad math level; you have seen two formal models actually do spectacularly in the IMO. And then beyond that, you can have a more fine-grained curriculum to try to teach the model harder and harder math, actually across two axes. One is how abstract the mathematical object is. The other one is how much of a eureka moment it takes, how creative the proof is. And the IMO is something that is low abstraction, high creativity; it requires a lot of genuinely innovative approaches. And then you have, for example, the PhD qualifying exam. That could be relatively straightforward if you know all the concepts, and that's at a higher abstraction level than the IMO. And you can define curriculums for the model to do better and better on both axes.
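Stepping back to the mechanics Carina described earlier: the inference-with-feedback loop, where the compiler's error message is fed back into the prompt, might look roughly like the following sketch. It is hypothetical: `call_llm` and `run_lean_compiler` are stand-in functions, not a real API or Axiom's system.

```python
# Hypothetical sketch of inference-with-feedback for Lean proof generation.
# `call_llm` and `run_lean_compiler` are stand-ins, not real APIs.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a proof-generation model."""
    raise NotImplementedError

def run_lean_compiler(source: str) -> tuple[bool, str]:
    """Stand-in: compile Lean source, return (succeeded, error_message)."""
    raise NotImplementedError

def prove_with_feedback(statement: str, max_attempts: int = 5) -> str | None:
    """Ask the model for a Lean proof; on failure, append the compiler
    error (syntax, type mismatch, or wrong math) to the prompt and retry."""
    prompt = f"Write a Lean 4 proof of the following statement:\n{statement}"
    for _ in range(max_attempts):
        candidate = call_llm(prompt)
        ok, error = run_lean_compiler(candidate)
        if ok:
            return candidate  # the proof compiles, so it is verified
        # The compiler message is the feedback signal; the same
        # (statement, attempt, error) triples can later feed RL training.
        prompt = (f"Write a Lean 4 proof of the following statement:\n"
                  f"{statement}\n"
                  f"A previous attempt was:\n{candidate}\n"
                  f"It failed with this compiler error:\n{error}\n"
                  f"Fix the proof.")
    return None  # unsolved within the attempt budget
```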
And what are the two models that have done well on the IMO through formal methods? I think, for example, the SeedProver model is doing pretty well, and Aristotle is also doing pretty well. What can you say about those models? What approaches have they taken that you think differentiate them among other attempts at this? Both are industry labs with more compute and data budget resources. Always helps. Always helps. We have seen, for example, the Aristotle one being more Monte Carlo tree search based, and then the SeedProver one having pretty intricate system designs. They have a lemma pool; they have very different designs for their light, medium, and heavy inference tiers. So I think really there are many ways, and eventually, I think, people will just try everything and see which one works better. And so taking it back to Axiom, it sounds like you've got this big problem that you're going after and a lot of tools in various tool chests to apply to the problem, a lot of levers to pull. How do you think about where you invest, or what bets you place, or what needs to work well for you in order to be successful? What general direction are you taking through all this? We believe in a system that can prove and conjecture and self-improve by having the prover and the conjecturer talk to each other. I think verification, which, you know, exists here in the domain of formal math, and self-play really tie together. You can have the prover provide the reward signals for the conjecturer. And that is not a new idea; it's an idea that has worked well in other domains before. And we want to take pretty bold bets on the data end as well, through autoformalization and synthetic data generation. So I think of our approach as: we do a lot of things, and they all eventually fit together pretty nicely and coherently. You know, use autoformalization to generate a lot of data, and expand that data by 100 times, 1,000 times through synthetic data generation methods, and then have a really strong prover that shows capabilities across benchmarks, do benchmarking and evaluation very carefully, and then eventually have the prover talk to the conjecturer, for the conjecturer to propose new problems and new curriculums, and the prover to then prove them, and then form a self-improving loop as the prototype of self-improving AI. And do you approach all of those in parallel, or is there a particular starting place that's kind of foundational that you tackle first? We currently are doing more than one thing, for sure. But you're right that a startup is always a kind of constrained optimization problem. We currently feel the different parts are all very important, and by, you know, constantly thinking about these ideas and staffing those teams, it's actually a very cohesive picture, even though, obviously, for an early stage startup there are too many ideas for us to execute all at once. But we're trying our best. You mentioned that one of the elements of a solution that you think is important is self-play. I think we've seen some significant successes there through the AlphaGo, AlphaZero style of models.
Like, once you have the pieces, is it a straightforward thing to set them against each other and say, you know, go off and talk to one another and figure this out? Or are there also a lot of technical challenges just in setting up your self-play system and environments? For sure. There are quite a few difficulties there. For example, if they just cannot prove something, is it the fault of the conjecturer or is it the fault of the prover? Is it that this problem is just bad, or is it a good problem and the prover just can't prove it? Disentangling them is hard. There are a lot of open research directions, I think, that are underexplored. But I think the general idea, which I think is quite creative and novel, is that through math we can try all these things. It's a sandbox for trying self-improving AI. The conjecturer proposes new conjectures based on the current knowledge base of math, and that knowledge base is evolving; it is expanding and contracting, expanding because of new things being proven, contracting because, well, maybe some of them are, you know, low merit. And then provers prove or disprove these conjectures. The results are added back to the knowledge base, the cycle repeats, and the body of knowledge in math expands. I think we're going to see some difficulties in judging what is good mathematics by the model. What is elegance? What is novelty? There are certain rough proxy measures that we can try, but not a conclusive definition. I mean, there are other papers out there. I think Yoshua Bengio has a paper on, you know, looking at building AI mathematicians through the lens of compression. Well, is the first-order-logic viewpoint satisfying? Not really, right? If you have a theorem, and because of including that in your knowledge base many theorems are then born, is that the end-all-be-all for elegance? There are probably some statements that are extremely elegant from the perspective of human mathematicians that don't do that, that instead kind of conclude things, right? Close a thread, you know, nuke a field; after we know this, there's nothing new to be born. So I think that might be an alignment problem with human mathematicians about what is good. Taste, I think, is still very, very hard to measure.
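The conjecture-prove-update cycle Carina describes above could be outlined roughly as below. This is a hypothetical sketch of the loop's shape, not Axiom's system; the `Conjecturer` and `Prover` interfaces and the merit heuristic are assumptions.

```python
# Hypothetical outline of a conjecturer-prover self-play loop.
# The agent interfaces and the merit heuristic are assumptions,
# not Axiom's actual design.

from dataclasses import dataclass, field
from typing import Optional, Protocol

class Conjecturer(Protocol):
    def propose(self, known: list[str]) -> list[str]: ...
    def merit(self, statement: str) -> float: ...  # proxy for "elegance"

class Prover(Protocol):
    def attempt(self, statement: str) -> Optional[str]: ...
    # returns "proof", "disproof", or None if it gives up

@dataclass
class KnowledgeBase:
    proven: list[str] = field(default_factory=list)     # verified statements
    disproven: list[str] = field(default_factory=list)  # refuted conjectures

def self_play(conjecturer: Conjecturer, prover: Prover,
              kb: KnowledgeBase, rounds: int) -> KnowledgeBase:
    for _ in range(rounds):
        for stmt in conjecturer.propose(kb.proven):
            outcome = prover.attempt(stmt)
            if outcome == "proof":
                kb.proven.append(stmt)        # knowledge base expands
            elif outcome == "disproof":
                kb.disproven.append(stmt)
            # outcome None: unclear whether the conjecture is bad or the
            # prover is weak -- the credit-assignment difficulty above.
        # Contract the knowledge base: drop low-merit statements under
        # some proxy measure of taste (itself an open problem).
        kb.proven = [s for s in kb.proven if conjecturer.merit(s) > 0.0]
    return kb
```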
Yeah, taste is something that I think we're grappling with across a lot of different fields in AI, and it's not clear how we necessarily get there, you know, beyond more data that somehow exemplifies better taste, which kind of brings us back to the data problem in your field. And I think the data problem in our field is that even for the really not-so-complicated math, not frontier research math, just undergrad textbook material, there hasn't been that much data. We're having this conversation in the context of, you know, the post-LLM AI era, if that's what we want to call it. And a lot of that was kind of this unlock of self-supervision through next-token prediction, that kind of thing. Do you see an opportunity for self-supervised data exploitation in mathematics? We've got this tremendous, you know, there are tons of papers, many of them digitized, and mathematics, even the ML papers that we look at, have lots of math in them. You know, there's narrative about those proofs. Can we use that without necessarily requiring Axiom or Math Inc or everyone else to take those manually, convert them into Lean, and build these supervised datasets? Do you think anything's there? We at Axiom actually believe in the informal part of reasoning also being important in the loop. I think if you have the informal data quite aligned with your formal proofs and you train on the verified pairs, it's great. We think that's a good idea. If you look at, not even talking about math verification, just talking about Python, right? There is this paper, rStar-Math, I remember, from earlier this year, in April or something, a Microsoft Research Asia paper; they saw that a small model can achieve performance on par with o1. Back then, o1 was still very, very impressive. It's amazing how fast the field moves. But training on the paired data of informal solutions to high school math problems and the Python code, yeah, I think that's interesting, right? And you also see papers like LEGO-Prover trying to have the informal proof as aligned with the formal proof as possible. We don't believe in only formal. We don't believe in purely informal, for sure. We believe in taking a hybrid approach, and the strengths of the informal models, especially maybe in high-level intuitions, could be very interesting to use together with the formal side as well. I have the impression that your team is not exclusively mathematicians or mathematically focused folks, maybe mathematically interested folks. But for folks that are coming into it with a background in AI and an interest in math and this field, what's the way to get started? A shameless plug: we are hiring applied AI researchers who have not worked in AI for math. We really believe in the transferability of techniques from broader, mainstream AI and also from AI for code generation. We think there are a lot of techniques that can be borrowed from these other fields and applied to math. In fact, we find mathematicians really helpful, but only for some parts of this effort, not, for example, for training the prover. We find mathematicians helpful in giving us strategic guidance on what the interesting autoformalization targets are. We find mathematicians helpful in working on the mathematical discoveries part, constructing examples or counterexamples that settle long-standing conjectures, so very specialized problems that have nothing to do with Lean. We haven't actually talked about that part of the Axiom roadmap, led by François Charton, a world-famous AI-for-math researcher, on the discovery side. We find mathematicians extremely helpful to the team. But, you know, for folks who come from the strawberry line of work, I think there's immediate transferability of techniques. And we also have a really strong team coming from applying deep learning to code gen. That team has authored many amazing works like LLM Compiler and KernelLLM, and they have also worked on CWM, the Code World Model. We find those also quite interesting. But to answer your question, for someone who is not super familiar with AI for math, there are many amazing survey papers. I think there's one called 'Formal Mathematical Reasoning: A New Frontier in AI,' whose first author is Kaiyu Yang. They survey the field really well, how different pieces fit together, and they also have many calls to action in the last section, I think Section 5.
What are the open problems to consider? I find that very helpful. There is an AI-for-math GitHub repo listing all the papers, including some that have nothing to do with Lean, just the mathematical discovery side, for example. I find those very helpful as well. There is an amazing community. I mean, there are so many conferences each year that talk about AI for math, like ICERM, I think at Brown, and there's Big Proof, and recently one at Oberwolfach, many conferences. And also I think there was something at the Simons Institute at Berkeley. It's a very welcoming and friendly community. People like to go and just learn stuff. You mentioned some of the work that your team is doing, that François is doing, on the discovery side. Can you talk a little bit about that? So mathematical discovery tackles specialized problems. It will be one specific problem, instead of a benchmark, which usually has problems coming from all branches of math, right? Some algebra, some analysis, some combinatorics. François and his team's work tackles specialized problems that are known to be very hard. Usually these conjectures have been open for a long time; there's one that was open for 130 years. Another one was open for 30 years. And shockingly, the 30-year-old one is actually disproved. So it's actually not true. It's a very different landscape from the Lean theorem proving side. There are many techniques that are helpful for tackling those problems. PatternBoost, for example, is a general method that can tackle these sorts of specialized problems, and I think there are more works coming for that that I can't talk about yet. Well, can we talk about PatternBoost for a second? What's the general idea there? PatternBoost uses transformers to generate examples that satisfy a certain pattern. And there are global techniques and local techniques where, you know, when you look at the examples, you perturb them either locally or globally to synthetically generate more examples. And that has led to the discovery of many interesting mathematical results. PatternBoost kind of extended beyond the paper itself, and it can be a toolkit for, you know, generative methods that can be extended to evolutionary algorithms and diffusion methods, potentially. But those are all under development, very early stage. So similar to the techniques in PatternBoost, you can actually have, say, a 'diffuse boost'; that's work that is upcoming. And there are also other works, like Int2Int. It is an open-source transformer for, you know, translation of integer sequences, developed and open-sourced by François and improved by Alberto Alfarano, who's also at Axiom. And that is sometimes considered the first transformer kit for many mathematicians. And I think there's one other branch of mathematical discoveries that is underexplored. Lean, for example, does not quite cover graph theory very well. It's very hard to represent graphs in Lean, but you can potentially do graph foundation models or world models for graphs or other objects. And these can probably derive from Int2Int. But it's a little bit unclear, you know, how that line of work will go and whether that will end up as a toolkit. So with regards to autoformalization as kind of a linchpin technology in all of this, what is the current state of play? Are we trying to get it to work at all?
Do we, or, um, I take that back, because I'm imagining that if you constrain your problem sufficiently, you can create a trivial autoformalizer. So there are things out there, but how do you characterize how far along we are with autoformalization? And what do those models look like? Do they look like LLMs, where you give them some text, a natural language description of a proof, and they spit out some representation of that proof? Or does it look different than that? I think autoformalization benchmarks are hard, and I don't think there has been a good autoformalization benchmark yet in the community that's widely recognized. So when it comes to autoformalization, there is autoformalization of statements and there is autoformalization of proofs. Once you're sure that your statement is correctly formalized in Lean, autoformalization of the proof is actually easy to verify, right? You see if it compiles or not. When we talk about statements, are we talking about natural language or Lean statements? Statements as in autoformalization of informal statements into Lean statements. Now, in Lean, you can only know if the syntax of the statement is correct or not; it's very hard to verify the statement itself. You can verify that a Lean proof is correctly serving the purpose of proving a certain statement, right? What's an example of a statement that you think of as kind of a canonical thing that we might want to prove? The statement is, say, the problem, right? Like, you know, Fermat's Last Theorem, without the proof, when it was unsolved, that is a statement. And that's the sense in which I'm using it. So if you think about a benchmark of 100 problems, all these 100 problems are statements. When you don't have the proof, you don't actually know if your statement is correctly formalized or not. You still need experts to eyeball it, and that can be very difficult. So for autoformalization of statements, I think there hasn't been a benchmark for testing that capability per se. But once you know that your statement is correctly formalized, you want to check if the autoformalized proof compiles, as in whether it is proving the statement. But you also want to be sure that it is the same underlying mathematics as your informal language proof, not another proof that also happens to be correct. So, you know, one idea in the community for testing the capability of autoformalization of proofs is to select problems that likely only have one proof, problems where you just don't believe the autoformalization model will generate another proof that no human knows. That's kind of the way to know. But measuring autoformalization capabilities is hard. And that's why people have been trying to autoformalize very difficult results, right? The bet is that we don't know another proof of Fermat's Last Theorem, because it was already so much effort to find one. So yeah, measuring the capability of autoformalization is hard. Yeah, there's something in there that I'm trying to get at with regards to autoformalization. So are you saying that there is no autoformalizer model to date, that this is a goal? And even if we constrain the problem to, like, high school geometry proofs, we can't autoformalize those?
And that was my conjecture: that at some point you can limit your universe to proofs that are sufficiently simple, and you can automate those, even if you're just using regular expressions or whatever. You've got to be able to do it at some level of scale. And the issue is that you're trying to solve harder problems. Is that right? So I think what I was saying is, one, there's no good autoformalization benchmark; that test has not quite been established in the community. Second, if you look at an autoformalizer like Kimina, a collaboration effort between Moonshot AI and Numina, a nonprofit organization, they have an autoformalizer and you can play with it. It's a demo, but from my experience of playing with it, it can only autoformalize short statements or extremely short proofs; with, I think, five lines or more of proof, it does actually struggle. So that's at high school, or I guess middle school, level math. There are models that claim to autoformalize, and actually release a Lean proof of very big theorems, but that is with a lot of human intervention. And I think the statements and proofs that are being autoformalized sometimes rely on the author of the paper to break the math down into very fine-grained pieces that are workable for a model to autoformalize. So far, we haven't seen a completely human-intervention-free hard result being autoformalized yet. It's a little surprising to me that there's such a gap, in that I'm thinking about high school geometry. If we're talking about geometry, there's a whole bunch of proofs; they're kind of known. There's a whole bunch of informal descriptions of those proofs. There's the opportunity to generate synthetic data, I think, because the relationships are constrained by geometry and they're really well understood. And so we should be able to have some black box where you give it a natural language description and it'll spit out a proof. Does that not exist? I think your intuition is very right that there's something special about geometry. High school Euclidean geometry is actually not, you know, covered by Lean as a formal language. If you look at AlphaGeometry, that is actually another domain-specific language for Euclidean geometry; it's using a vector-based language. So autoformalization as generally described, converting the natural language proof to Lean, I guess that's a very different system. It's kind of presupposing a different universe than... Yeah, that is a different universe. The graph foundation models and discoveries I was talking about for graph theory are another universe. And the specialized problems that rely heavily on constructions, not proofs,
you know, like PatternBoost, right? That is another universe. Okay, yeah, but the big main universe is Lean-based formal theorem proving for, say, algebra, analysis, these mainstream Mathlib areas. Got it, got it. And so that universe kind of presupposes a certain degree of, I think, both complexity but also abstraction. Yeah, I think so. And it makes it more difficult that, when we talk about a broad universe, sometimes you also run into interesting problems where maybe the proof relies on a dependency that Mathlib doesn't have. So the proof is for an algebra problem, but it relies on a combinatorics lemma, and that combinatorics is not in Mathlib yet; the majority of combinatorics doesn't have as much coverage as algebra. So then you're stuck. Library learning, which is the ability for a model to come up with definitions, or bring a new definition into the knowledge space, has also been hard. For the model to say, okay, well, I'm creating a definition, which again is not a proof, right? So it cannot be verified; like the statement being hard to check, the definition is hard to check as well. So there are a lot of roadblocks in this. Is there a role for simulation in these kinds of proofs? So, yeah, simulations are going to be a bit hard, in that, if you think about formal verification, it's quite rigorous. It's like a proof; it's almost like many screwdrivers working together to make sure that the wheel actually is turning as it should. I think you would need other stuff to help you do simulations, or say approximations and computations. I think it will eventually be many parts working together, but proving that the property is sound is going to be an inevitable part of the engine. I think what I hear you saying is that all roads lead back to autoformalization and then proving. Yeah, I think we need autoformalization and proving. We also need other things. I think we also need things that help us compute things faster. Eventually, if you think about simulations, it will definitely involve other things as well. I think there's a certain degree to which, when we talk about AI, mathematics, and reasoning, it just kind of makes sense. Like, of course, this would be a way to go; mathematics is formalization, you know, and if we can solve math, it addresses a lot of the issues that LLMs have. But then at the same time, it's like, you know, mathematics and mathematicians are these kind of geeky things in the corner, and we don't really want to think about that stuff, and the commercialization of it always expresses itself through some other, more practical thing. What do you think the end benefits of these systems that you're trying to create are? I think that's a great question. You know, some people think of Lean as almost like a subculture, the people who love Lean, and they use Zulip. It's not even Discord. It's definitely not Slack. It is a group of people who are very passionate. Some of my friends, who are now also joining Axiom, are among the initial Mathlib contributors. And I remember in 2020, it was during COVID, and that friend was teaching me how to play chess. And I asked him, what do you do all day besides school during COVID? It's so boring.
He said, I'm doing chess and I'm doing Lean. So, you know, a lot of people think of it as a subculture. I would say that a lot of things looked like subcultures when they were just early. I remember, say, GPUs used to be a subculture; they used to be just for, you know, graphics. And then the applications of that, you know, came down the road in the future. I think this sort of geeky thing is representing something that is going to be big. It is going to be a fundamental technology where, for the first time, you have provable guarantees of things in large language models. I think math and coding are two important, perhaps the two biggest, parts of digital work. And coding is heavily invested in. Math is not. Not that math is not tractable, but that math has not been turned into a programming language yet. And by the Curry-Howard correspondence and Lean, you are able to do that. It is sort of the natural direction, but also there are existing markets that will generate commercial value once the technology works: formal verification of software, formal verification of hardware. And that is only the formal verification part, right? And also not the superintelligence part. Yeah, I think there are two things going on here. One, the main capability of large language models is not fully trusted for the most high-stakes, you know, safety-critical domains. The other thing is they are not insanely smart. They're not superintelligent yet. And both angles, you know, unlock a huge number of fields as applications. One can even push further and say, hey, the Jevons paradox is our opportunity. Once the price of outlier abstract mathematical, quantitative reasoning drops, you will have elastic demand as well. And there will be new markets and new use cases that get unlocked because of this. Yeah, the pushback that comes to mind for me is that with so much investment in AI right now, someone going directly after, you know, verification for code or verification for hardware might get there a lot quicker than going broadly after mathematical verification and then trying to apply it to this. So maybe the problem is solved by the time you get there. I think the fundamental technology problem is so stubborn, you know, that it's very hard to get around. It's like, you know, pre-ChatGPT, building a wrapper; that seems a bit odd to me, at least. Are you arguing that you need the mathematical verification to do software and hardware verification? We're doing it to some degree now, right? Well, not quite well. I think the fully automated way, with no or very minimal human intervention, is pretty bad. We do believe that you can feel free to say, hey, I want to be a product company in 2025, and I will develop better capabilities once my product is going, and then I will make the model better, which is, I think, for example, the Cursor way. Or you can say, well, I think the technology is so far behind that I can't really build a good product if I don't have it working, and so therefore I choose to be a model-layer company and then unlock products in these vertical applications. I'll be diligent in doing product-market-fit iteration, going deep in each of the vertical domains, which, you know, could be the Claude way. You have a really good model, and then you have a delightful, delightful product. And we'd like to go the Claude way rather than the Cursor way.
Yeah, that's kind of the lab versus product company argument, right? But I also agree with you that, because Lean is somehow, you know, geeky, there needs to be good product design; we're not slacking on that. For example, I think Lean should be mostly hidden in any sort of interface with mathematicians, because it's just not that readable. The right level of abstraction at which one communicates should be the natural language layer. But then there needs to be a Lean system working underneath, you know, to make sure that it's provable, and that it is actually sound and correct and rigorous. Right. That's a really interesting point and insight. And I think it kind of speaks to this question: thinking back to ChatGPT, right? We had generative models. The unlock was, to a large degree, UI and making it accessible. What's the unlocking UI for a mathematical reasoner or an autoformalization model? I don't think we want that on the internet as a chatbot. It's got to be something else, right? Yeah, we have interesting product brainstorming sessions already. We think that mathematicians work in a very specific way, and researchers with a large amount of quantitative reasoning in their daily workflow also work in a very specific way. And it's a little bit more like dreaming. It's a very interesting product problem, right? It's a technology that hasn't quite existed yet. The audience generally are not the product people; it's very hard to find the intersection of product and mathematicians. But I think there are cute designs that can make this a very frictionless, delightful experience. I don't know that there's an answer for this that's different from the entirety of our conversation, but in terms of, you know, priorities and the next things that are on your mind, how do you think about that? So we're back to the technology, right? I think we want to have a very good system that can prove very well, that can autoformalize very well, and that can conjecture very well, even though the 'very well' part is still under exploration. It is a deep tech company kind of profile, a startup where we expect, you know, like SpaceX having many rocket crashes before it surely works, to have the eventual renaissance. We expect a very intense, R&D-focused period of the startup cycle as well. And we'll try to reach these milestones one at a time, and eventually have a comprehensive system that is really similar to what the words 'AI mathematician' represent. Awesome. Well, Carina, thanks so much for jumping on and sharing a bit about what you're up to. Thank you so much. This was a really delightful conversation.
Related Episodes

Rethinking Pre-Training for Agentic AI with Aakanksha Chowdhery - #759
TWIML AI Podcast
52m

#228 - GPT 5.2, Scaling Agents, Weird Generalization
Last Week in AI
1h 26m

Exploring GPT 5.2: The Future of AI and Knowledge Work
AI Applied
12m

AI to AE's: Grit, Glean, and Kleiner Perkins' next Enterprise AI hit — Joubin Mirzadegan, Roadrunner
Latent Space

Why Vision Language Models Ignore What They See with Munawar Hayat - #758
TWIML AI Podcast
57m

What We Learned About Amazon’s AI Strategy
The AI Daily Brief
26m