> AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice-versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions
So, the problem wasn't translated to Lean first. But did the model use Lean, or internet search, or a calculator or Python or any other tool during its internal thinking process? OpenAI said theirs didn't, and I'm not sure if this is exactly the same claim. More clarity on this point would be nice.
I would also love to know the rough order of magnitude of the amount of computation used by both systems, measured in dollars. Being able to do it at all is of course impressive, but not useful yet if the price is outrageous. In the absence of disclosure I'm going to assume the price is, in fact, outrageous.
Edit: "No tool use, no internet access" confirmed: https://x.com/FredZhang0/status/1947364744412758305
We're told that formal verification tools like Lean are not used to solve the actual IMO problems, but are they used in training the model to solve the problems?
We know from Google's 2024 IMO work that they have a way to translate natural language proofs to formally verifiable ones. It seems like a natural next step would be to leverage this for RLVR in training/fine-tuning. During training, any piece of reasoning generated by the math LLM could be translated, verified, and assigned an appropriate reward, making the reward signal much denser.
Reward for a fully correct proof of a given IMO problem would still be hard to come by, but you could at least discourage the model from doing wrong or indecipherable things. That plus tons of compute might be enough to solve IMO problems.
In fact it probably would be, right? We already know from AlphaProof that by translating LLM output back and forth between formal Lean proofs, you can search the space of reasoning moves efficiently enough to solve IMO-class problems. Maybe you can cut out the middleman by teaching the LLM via RLVR to mimic formal reasoning, and that gets you roughly the same efficiency and ability to solve hard problems.
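To make that concrete, here is the kind of loop I have in mind, as a pure sketch: `autoformalize` and `lean_check` are made-up stand-ins for a natural-language-to-Lean translator and a Lean checker, not anything DeepMind has described.

    # Pure sketch of the RLVR loop described above; nothing here reflects any lab's
    # actual pipeline. `autoformalize` and `lean_check` are hypothetical stubs that
    # stand in for a natural-language-to-Lean translator and a Lean proof checker.

    def autoformalize(step: str) -> str | None:
        """Hypothetical: translate one natural-language step into a Lean statement."""
        return step  # stub: pretend the translation always succeeds

    def lean_check(formal_step: str) -> bool:
        """Hypothetical: return True iff Lean accepts the formalized step."""
        return bool(formal_step)  # stub: pretend every non-empty step checks out

    def verified_reward(steps: list[str]) -> float:
        """Dense reward: the fraction of reasoning steps that survive formal checking."""
        if not steps:
            return 0.0
        checked = [s for s in steps if (f := autoformalize(s)) is not None and lean_check(f)]
        return len(checked) / len(steps)

    # A fully correct proof is still rare, so this per-step signal is what makes
    # the reward denser than a single pass/fail judgment at the end.
    print(verified_reward(["Let n be even.", "Then n = 2k for some integer k."]))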
It seems very likely from the description in the link that formal verification tools for mathematical proofs were used in part of the RL training for this model. On the other hand, OpenAI claims "We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling," which might suggest that they don't specifically use e.g. Lean in their training process. But it's not explicitly stated; all we can really do is speculate unless they publish more detail.
> This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit
I interpreted that bit as meaning they did not manually alter the problem statement before feeding it to the model - they gave it the exact problem text issued by IMO.
It is not clear to me from that paragraph if the model was allowed to call tools on its own or not.
I don't think tool use would detract from the achievement, necessarily. I'm just interested to know.
I'd also be curious why they didn't use Lean. Is it that Lean at this point makes the problems too easy to brute force? Or does Lean at this point just get in the way?
Lean is an interactive prover, not an automated prover. Last year a lot of human effort was required to formalise the problems in Lean before the machines could get to work. This year you get natural language input and output, and it's much faster.
The advantage of Lean is that the system checks the solutions, so hallucination is impossible. Of course, one still relies on the problems and solutions being translated to natural language correctly.
Some people prefer difficult-to-read, formally checked solutions over informal but readable ones. The two approaches are just solving different problems.
But there is another important reason to want to do this reliably in natural language: you can't use Lean for other domains (with a few limited exceptions). They want to train their RL pipelines for general intelligence and make them reliable for long-horizon problems. If a tool is needed as a crutch, that more or less demonstrates that LLMs on their own will not be enough, and we'll have to wait for traditional AI to catch up in every domain.
I think it's because you want to build mathematical proof intuition (heuristics) into the models themselves, so they can grasp our reality better than they would by just using tools without much of a clue.
I wonder if "no tool use, no internet access" means it can run without Google infrastructure, and offline, meaning it could be deployed locally for people who need that.
> This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions
I think I have a minority opinion here, but I’m a bit disappointed they seem to be moving away from formal techniques. I think if you ever want to truly “automate” math or do it at machine scale, e.g. creating proofs that would amount to thousands of pages of writing, there is simply no way forward but to formalize. Otherwise, one cannot get past the bottleneck of needing a human reviewer to understand and validate the proof.
If a Language Model is capable of producing rigorous natural language proofs then getting it to produce Lean (or whatever) proofs would not be a big deal.
Lean use in AlphaProof was something of a crutch (not saying this as a bad thing). Very specialized, very narrow, with little use in any other domain.
On the other hand, if you can achieve the same with general RL techniques and natural language then other hard-to-verify (a whole lot) domains are on the table.
> If a Language Model is capable of producing rigorous natural language proofs then getting it to produce Lean (or whatever) proofs would not be a big deal.
This is a wildly uninformed take. Even today there are plenty of basic statements for which LLMs can produce English-language proofs but which have not been formalized.
I'm a mathematician, although not doing research anymore. I can maybe offer a little bit of perspective on why we tend to be a little cooler on the formal techniques, which I think I've said on HN before.
I'm actually prepared to agree wholeheartedly with what you say here: I don't think there'd be any realistic way to produce thousand-page proofs without formalization, and certainly I wouldn't trust such a proof without some way to verify it formally. But I also don't think we really want them all that much!
The ultimate reason I think is that what really lights a fire under most mathematicians is the desire to know why a result is true; the explanation is really the product, much more so than just the yes-or-no answer. For example, I was never a number theorist, but I think most people who are informed enough to have an opinion think that the Riemann Hypothesis is probably true, and I know that they're not actually waiting around to find out. There are lots of papers that get published whose results take the form "If the Riemann Hypothesis is true then [my new theorem]."
The reason they'd still be excited by a proof is the hope, informed by experience with proofs of earlier long-standing open problems, that the proof would involve some exciting new method or perspective that would give us a deeper understanding of number theory. A proof in a formal language that Lean says is true but which no human being has any hope of getting anything from doesn't accomplish that.
A proof written in a formal language can absolutely be illuminating to a human, but you have to pick the correct formal language and ecosystem.
Writing proofs in Agda is like writing programs in a more expressive variant of Haskell. Abelson said that “programs must be written for people to read, and only incidentally for machines to execute”, and by the Curry-Howard isomorphism, proofs can be seen as programs. All the lessons of software engineering can and indeed should be applied to making proofs easier for humans to read.
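As a tiny illustration of the proofs-as-programs point, here is the same idea in Lean 4 rather than Agda (chosen only because Lean is the system being discussed in this thread):

    -- Curry-Howard in miniature: the proposition A → B → A is proved by the
    -- program that takes a proof of A, ignores the proof of B, and returns
    -- the proof of A (the K combinator). The proof is itself a readable program.
    theorem k_combinator (A B : Prop) : A → B → A :=
      fun a _ => a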
For a quick example, check out my mechanization of Martin-Löf’s 2006 paper on the axiom of choice:
https://research.mietek.io/mi.MartinLof2006.html
Recent HN discussion:
https://news.ycombinator.com/item?id=44269002
Thanks for the reply. I am also a no-longer-practicing mathematician :)
I completely agree that a machine-generated formal proof is not the same thing as an illuminating human-generated plain-language proof (and in fact I suspect without further guidance they will be quite different, see my other stream of thought comment). However, I do think machine-generated formal proofs would be interesting for a few reasons:
1. Sometimes the obvious thing is not true!
2. I think the existence or non-existence of a machine-generated proof of a mathematical claim is interesting in its own right. E.g. what kinds of claims are easy versus hard for machines to prove?
3. In principle, I would hope they could at least give a starting point for a proper illuminating proof. E.g. the process of refinement and clarification, which is present today even for human proofs, could become more important, and could itself be machine-assisted.
> The ultimate reason I think is that what really lights a fire under most mathematicians is the desire to know why a result is true; the explanation is really the product, much more so than just the yes-or-no answer
Of course, but a formal system like Lean doesn't merely spit out a yes-or-no answer, it gives you a fully-fledged proof. Admittedly, it may be harder to read than natural language, but that only means we could benefit from having another tool that translates Lean proofs into natural language.
Ideally we'd be able to get a little of both. A proof of such magnitude should likely come with new definitions that are easy to search for in the code and start to reason about independently. Even without looking at the rest of the proof, I imagine we'd be able to deduce a fair amount of the structure just by understanding the new things that are being defined, what existing theorems are being imported, and connecting the dots.
Your comment reminds me of Tao's comment on the ABC conjecture: usually with a big proof, you progressively get new tools and examples of how they can be applied to other problems. But if it's hundreds of pages of formulas that just spit out an answer at the end, that's not how math usually works. https://galoisrepresentations.org/2017/12/17/the-abc-conject...
If these provers do end up spitting out 1000-page proofs that are all calculation with no net-new concepts, I agree they'll be met with a shrug.
I have always wondered about what could be recovered if the antecedent (i.e. in this case the Riemann hypothesis) does actually turn out to be false. Are the theorems completely useless? Can we still infer some knowledge or use some techniques? Same applies to SETH and fine-grained complexity theory.
That said, letting machines go wild in the depths of the consequences of some axiomatic system like ZFC may reveal a method of proof mathematicians would find to be monstrous. So like, if ZFC is inconsistent, then anything can be proven. But short of that, maybe the machines will find extremely powerful techniques which “almost” prove inconsistency that nevertheless somehow lead to logical proofs of the desired claim. I’m thinking by analogy here about how speedrunning seems to often devolve into exploiting an ACE glitch as early as possible, thus meeting the logical requirements of finishing the game while violating the spirit. Maybe we’d have to figure out what “glitchless ZFC” should mean. (Maybe this is what logicians have already been doing, heh.)
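To spell out the "anything can be proven" clause: it's just the principle of explosion, which in Lean is a one-liner (illustrative only, and nothing specific to ZFC):

    -- Ex falso quodlibet: from a proof of False, any proposition P follows.
    -- This is the precise sense in which a single inconsistency lets a prover
    -- "finish any level it wants".
    theorem anything_goes (P : Prop) (h : False) : P :=
      h.elim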
The horror scenario you describe would actually be more valuable than the insinuated "spirit", I believe:
Suppose it faithfully reasons and attempts to find proofs of claims; in the best case you have found a proof of a specific claim (IN AN INCONSISTENT SYSTEM).
Suppose in the "horror scenario" that the machine has surreptitiously found a proof of False in ZFC (and can now prove any claim), and is not disclosing it, but abusing it to present 'actual proofs in inconsistent ZFC' for whatever claims the user asks it about. In this case we can just ask for a proof of A and a proof of !A; if it proves both, it has leaked the fact that it found and exploits an inconsistency in the formal system! That's worth more than a hard-to-find proof in an otherwise inconsistent system.
These problems are designed to be solvable by humans without tools. No reason we can't give tools to the models when they go after harder problems. I think it's good to at least reproduce human-level skill without tools first.
Oh so to be clear, I view formal methods as less of a useful tool, and more as enforcing a higher standard of proof. E.g. it’s not clear to me that having access to Lean would actually help a human in the IMO; certainly most professional mathematicians are not yet getting a positive ROI from formalization. But I’m sort of an armchair expert here; I could be wrong!
Typically formalization is actually harder than solving a problem. You almost always solve before formalizing. And it can be surprisingly hard to formalize problems that are easy to solve.
For example, is there a polygon of area 100 that you can fit 99 circles of area 1 inside it, without overlapping? Yes, obviously, it's very easy to prove this informally. Now try formalizing it! You will find it takes a while to formalize a number of fairly obvious geometric statements.
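Even writing down the statement forces decisions. A rough Lean sketch of just the statement, with the proof left as sorry; `IsPolygon` and `area` are placeholder axioms here, because pinning those definitions down is exactly the work I'm describing:

    import Mathlib

    -- Placeholders: precisely defining "polygon" and "area" is itself part of the work.
    axiom IsPolygon : Set (EuclideanSpace ℝ (Fin 2)) → Prop
    axiom area : Set (EuclideanSpace ℝ (Fin 2)) → ℝ

    -- "There is a polygon of area 100 containing 99 pairwise-disjoint discs of area 1."
    -- A disc of area 1 has radius √(1/π).
    theorem ninety_nine_circles :
        ∃ P : Set (EuclideanSpace ℝ (Fin 2)),
          IsPolygon P ∧ area P = 100 ∧
          ∃ c : Fin 99 → EuclideanSpace ℝ (Fin 2),
            (∀ i, Metric.closedBall (c i) (Real.sqrt (1 / Real.pi)) ⊆ P) ∧
            Pairwise (Function.onFun Disjoint
              (fun i => Metric.closedBall (c i) (Real.sqrt (1 / Real.pi)))) := by
      sorry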
Google https://storage.googleapis.com/deepmind-media/gemini/IMO_202...
OpenAI https://github.com/aw31/openai-imo-2025-proofs/
Comparing the answers between OpenAI and Gemini, the writing style of Gemini is a lot clearer. It could be presented a bit better, but it's easy enough to follow the proof. This also makes it a lot shorter than the answer given by OpenAI, and it uses proper prose.
Gemini is clearer but MY GOD is it verbose. E.g. look at problem 1, section 2, "Analysis of the Core Problem" - there's nothing at all deep here, but it seems the model wants to spell out every single tiny logical step. I wonder if this is a stylistic choice or something that actually helps the model get to the end.
Based on the Google-presented answers, it's possible that the report is generated post hoc as a summarization of the prior thoughts. One could also presume that this summarization step is part of the mechanism for running the Tree of Thoughts too, so that this wasn't some manual "now give the final answer" command.
I think we are having a Deep Blue vs. Kasparov moment in competitive math right now. This is enormous progress from just a few years ago, and yet I think we are still really far from even a semi-respectable AI mathematician. What an exciting time to be alive!
Terence Tao, in a recent podcast, said that he's very interested in "working alongside these tools". He sees the best use in the near future as "explorers of human-set vision", in a way (i.e. set some ideas/parameters and let the LLMs explore and do parallel search / proof / etc.).
Your comparison with chess engines is pretty spot-on; that's how the best of the best chess players do prep nowadays. Gone are the multi-person expert teams that analysed positions and offered advice. They now have analysts who use supercomputers to search through bajillions of positions, extract the best ideas, and distill them for their players.
I was recently researching AIs for this; it seems it would be a huge unlock for some parts of science where, as in chess, this is the case.
The Wikipedia article doesn't have much info on the results, but from other reading I got the impression that the combination produced results stronger than any individual human or computer player.
The difference here is that no amount of brute force can produce the winning score in the timeframe. This is more of a legitimate technical milestone and less of a moral, 'in principle' one like the one we saw with Deep Blue.
Both OpenAI and Google pointed this out, but does that matter a lot? They could have spun up a million parallel reasoning processes to search for a proof that checks out - though of course some large amount of computation would have to be reserved for some kind of evaluator model to rank the proofs and decide which one to submit. Perhaps it was hundreds of years of GPU time.
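Schematically, that brute-force picture is just best-of-N sampling plus a ranker, something like the sketch below; `generate_proof` and `score_proof` are hypothetical stand-ins, and nothing here is claimed about either lab's actual setup:

    # Illustrative only: a best-of-N "spin up many reasoning processes and let an
    # evaluator pick" loop. `generate_proof` and `score_proof` are hypothetical
    # stand-ins for the prover model and the evaluator/ranker model.
    import concurrent.futures
    import random

    def generate_proof(problem: str, seed: int) -> str:
        """Hypothetical prover: one independent reasoning attempt."""
        return f"candidate proof #{seed} for: {problem}"

    def score_proof(proof: str) -> float:
        """Hypothetical evaluator: higher means more likely to be a correct proof."""
        return random.random()

    def best_of_n(problem: str, n: int = 64) -> str:
        with concurrent.futures.ThreadPoolExecutor() as pool:
            candidates = list(pool.map(lambda s: generate_proof(problem, s), range(n)))
        # In a real system almost all of the budget goes into generation and scoring.
        return max(candidates, key=score_proof)

    print(best_of_n("IMO 2025, Problem 1"))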
Though of course it remains remarkable that this kind of process finds solutions at all and is even parallelizable to this degree, perhaps that is what they meant. And I also don't want to diminish the significance of the result, since in the end it doesn't matter if we get AGI with overwhelming compute or not. The human brain doesn't scale as nicely, even if it's more energy efficient.
> They could have spun up a million parallel reasoning processes
But alas, they did not, and in fact nobody did (yet). Enumerating proofs is notoriously hard for deterministic systems. I strongly recommend reading Aaronson's paper about the intersection of philosophy and complexity theory, which touches on these points in more detail: [1]
[1]: https://www.scottaaronson.com/papers/philos.pdf
Seems like they actually did that:
> We achieved this year’s result using an advanced version of Gemini Deep Think – an enhanced reasoning mode for complex problems that incorporates some of our latest research techniques, including parallel thinking. This setup enables the model to simultaneously explore and combine multiple possible solutions before giving a final answer, rather than pursuing a single, linear chain of thought.
Super interesting that they moved away from their specialized, Lean-based system from last year to a more general-purpose LLM + RL approach. I would suspect this likely leads to improved performance even outside of math competitions. It’ll be fascinating to see how much further this frontier can go.
The article also suggests that the system used isn’t too far ahead of their upcoming general "DeepThink" model / feature, which they announced for this summer.
@tao I think there is a broader problem wherein competitions (math, programming, games, whatever) are meant to measure something difficult for humans, but tools work so fundamentally differently from us that success for a tool isn't even necessarily meaningful. AI companies have long viewed the IMO Grand Challenge as a sign of achieving "AGI," but no matter what set of rules a machine follows, there's no reason to believe success for a machine will correlate with broader mathematical or "reasoning" abilities in the way it does for human participants.
I'm a little sad about how this genuine effort - coordinated with and judged by the IMO - is "front-run" by a few random tweets from OpenAI, though. Says a lot about the current state of this industry.