As a former problem designer for the IMO and similar contests, I deeply enjoyed reading this paper. At the same time, I'd like to point out that it was clear Geometry had to be the first topic to fall to AI (i.e., to smart indexing of knowledge and inference methods).
Among math olympiad topics, Geometry problems are often the most "mechanical." Once you can express the problem in terms of coordinates (think XY or complex plane) you're looking at a finite set of steps a computer can use to find a solution. Of course, time limits and human error get in the way of this being practical during the IMO itself. Back in the day, I'd use WolframAlpha to verify geometry proofs for problems I designed (+conjectures) using this approach.
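For what it's worth, here is a minimal sketch of that coordinate approach (sympy assumed; the concurrency of a triangle's medians stands in for a real contest statement):

    import sympy as sp

    ax, ay, bx, by, cx, cy = sp.symbols('ax ay bx by cx cy', real=True)
    A, B, C = (ax, ay), (bx, by), (cx, cy)

    def collinear(p, q, r):
        # 2D cross product of (q - p) and (r - p); identically zero iff the points are collinear
        return sp.expand((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])) == 0

    mid = lambda p, q: ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    G = ((ax + bx + cx) / 3, (ay + by + cy) / 3)   # centroid

    # Each median passes through the centroid, so the three medians are concurrent.
    print(collinear(A, mid(B, C), G), collinear(B, mid(C, A), G), collinear(C, mid(A, B), G))

Once the problem is phrased in coordinates, "prove" often reduces to "expand and check the polynomial is identically zero," which is exactly the kind of step a machine never fumbles.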
Algebra (especially inequalities) can be similar – pull off some intense calculations and you often have your answer.
Where I'm really excited to see intelligent systems make progress is with Number Theory and Combinatorics problems. The search spaces are far more complex and they often require proving that something is impossible. These are the problems that would be difficult to solve with brute force computation.
As a consumer of those problems, first of all, thank you! Even decades on from high school I occasionally enjoy working on them.
But agree that geometry was obviously going to be first. From what I’ve gathered here, it’s not “brute forcing” in terms of relying on algebraic geometry, vectors, or complex number solutions, but it is brute force in that it’s exhaustively looking for “interesting” constructions.
Geometry was always my worst subject, but even so I felt like if given the right construction the problem was much easier. Unfortunately I never developed the intuition to hit on those constructions quickly. It seems this AI doesn't either, but it can churn through them much faster: there are only so many perpendiculars, parallels, and bisectors you can construct, which you can more or less mechanically evaluate (map out all angles and ratios, try power of a point, etc.)
While this is incredibly impressive, it seems like Deep Blue:Kasparov::AlphaGeo:Terry Tao in the "engine vs AI" sense.
I agree Algebra is likely next. Like geometry, more often than not you “just” need a clever substitution or 3, of which there are only so many options to choose from.
Some combinatorics problems I think would also be amenable to this search strategy (e.g., finding things to count 2 ways), but that seems like a bridge further away, and only gets you a subset of problems.
Number theory would be my guess for the final frontier before a perfect 42.
I like your positive attitude to this. Do you feel any sense of loss over the possibility your skill (being extremely good in math) might soon be overtaken by machines? Or do you see no possibility of that happening anytime soon?
How did you get into such a position? Is there an application process of some sort?
After verifying solvability, what is the process for selecting the specific problems that will show up in the finished set? Is it by vote or some other evaluation?
Each year, national math olympiad committees submit their toughest problems to the IMO. A select group makes the "shortlist," a pool of ~32 questions (roughly 8 each in geometry, algebra, number theory, and combinatorics). From this pool, the final six IMO problems are chosen through a rigorous voting process.
What makes a great IMO problem? It's all about originality. Ideal problems aren't just about applying advanced theorems, but rather test creative thinking and mastery of fundamental principles. Of course, there were cases in which the IMO committee didn't realize a problem was an obvious corollary of a grad-level theorem, and some well-read (high-school) students benefited from knowing that!
My Journey: I was once a top-ranked competitor myself. After high school, I joined my country's math olympiad committee.
I expected three-variable inequalities to give in first, as there is less of a question about what counts as a proof. I didn't know that the geometry case was already solved in 2000 (http://www.mmrc.iss.ac.cn/~xgao/paper/jar-gdbase.pdf).
Someone really should make an adventure game out of synthetic geometry, as it allows for a proof-writing language simpler than Lean's and allows for nice visuals.
If I read their paper right, this is legit work (much more legit than DeepMind's AI math paper last month falsely advertised as solving an open math research problem) but it's still pretty striking how far away the structure of it is from the usual idea of automated reasoning/intelligence.
A transformer is trained on millions of elementary geometry theorems and used as brute search for a proof, which because of the elementary geometry context has both a necessarily elementary structure and can be easily symbolically judged as true or false. When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work. [edit: as corrected by Imnimo, I got this backwards - the brute search is just pure brute search, the transformer is used to predict which extra geometric construction to add]
Also (not mentioned in the blog post) the actual problem statements had to be modified/adapted, e.g. the actual problem statement "Let AH1, BH2 and CH3 be the altitudes of a triangle ABC. The incircle W of triangle ABC touches the sides BC, CA and AB at T1, T2 and T3, respectively. Consider the symmetric images of the lines H1H2, H2H3, and H3H1 with respect to the lines T1T2, T2T3, and T3T1. Prove that these images form a triangle whose vertices lie on W." had to be changed to "Let ABC be a triangle. Define point I such that AI is the bisector of angle BAC and CI is the bisector of angle ACB. Define point T1 as the foot of I on line BC. Define T2 as the foot of I on line AC. Define point T3 as the foot of I on line AB. Define point H1 as the foot of A on line BC. Define point H2 as the foot of B on line AC. Define point H3 as the foot of C on line AB. Define point X1 as the intersection of circles (T1,H1) and (T2,H1). Define point X2 as the intersection of circles (T1,H2) and (T2,H2). Define point Y2 as the intersection of circles (T2,H2) and (T3,H2). Define point Y3 as the intersection of circles (T2,H3) and (T3,H3). Define point Z as the intersection of lines X1X2 and Y2Y3. Prove that T1I=IZ."
>A transformer is trained on millions of elementary geometry theorems and used as brute search for a proof, which because of the elementary geometry context has both a necessarily elementary structure and can be easily symbolically judged as true or false. When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work.
I don't think this is quite right. The brute force search is performed by a symbolic solver, not the transformer. When it runs out of new deductions, the transformer is asked to suggest possible extra constructions (not randomly added).
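If I understand the paper correctly, the control flow is roughly the following (a hedged sketch; `deduce` and `propose` are stand-ins for the symbolic engine and the language model, not the actual API):

    from typing import Callable, Optional, Set

    def solve(premises: Set[str], goal: str,
              deduce: Callable[[Set[str]], Set[str]],   # symbolic engine: forward-deduction closure
              propose: Callable[[Set[str], str], str],  # language model: one auxiliary construction
              max_rounds: int = 16) -> Optional[Set[str]]:
        """Sketch of the AlphaGeometry-style loop (simplified: the real system
        beam-searches over many proposed constructions in parallel)."""
        state = set(premises)
        for _ in range(max_rounds):
            state = deduce(state)             # brute-force symbolic deduction until fixpoint
            if goal in state:
                return state                  # goal reached; a traceback would give the proof
            state.add(propose(state, goal))   # stuck: add an LM-suggested construction as a premise
        return None

So the neural network never writes the proof; it only decides which extra point or line to hand the symbolic solver next.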
> it's still pretty striking how far away the structure of it is from the usual idea of automated reasoning/intelligence.
How so? Reasoning is fundamentally a search problem.
The process you described is exactly the process humans use: make a guess about what's useful, then try to work out the details mechanically. If you get stuck, make another guess, and so on. So the process is like searching through a tree.
People figured out this process back in 1955 (and made a working prototype which can prove theorems): https://en.wikipedia.org/wiki/Logic_Theorist but it all hinges on using good 'heuristics'. Neural networks are relevant here as they can extract heuristics from data.
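That "guess, then grind mechanically" loop is just heuristic-guided search; a generic best-first skeleton might look like the following, where the heuristic is exactly the piece a neural network could learn from data (purely illustrative):

    import heapq
    from typing import Callable, Iterable, Optional

    State = frozenset  # the set of facts derived so far

    def best_first_proof_search(start: State, is_goal: Callable[[State], bool],
                                expand: Callable[[State], Iterable[State]],
                                heuristic: Callable[[State], float],
                                budget: int = 10_000) -> Optional[State]:
        # "Guess what's useful" = the heuristic ordering of the frontier;
        # "work out details mechanically" = the expand step.
        frontier = [(heuristic(start), 0, start)]
        seen, counter = {start}, 0
        while frontier and budget > 0:
            _, _, state = heapq.heappop(frontier)
            budget -= 1
            if is_goal(state):
                return state
            for nxt in expand(state):
                if nxt not in seen:
                    seen.add(nxt)
                    counter += 1  # unique tie-breaker so the heap never compares states
                    heapq.heappush(frontier, (heuristic(nxt), counter, nxt))
        return None

Swap the hand-written heuristic for a learned one and the search skeleton does not change at all.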
What do you think is "the usual idea of automated reasoning"? Some magic device which can solve any problem using a single linear pass?
I think the intuition here is that an "intelligent" system would lean much more heavily on the heuristics than on the search, like humans. It's hard to quantify in this case, but certainly in the case of chess, when engines were about as good as the best human players, they were doing orders of magnitude more search than humans. Which made them feel more like chess _engines_ than chess AI. AlphaZero certainly made a step in the right direction, but it's still far from how humans play.
> When the brute search fails, an extra geometric construction is randomly added (like adding a midpoint of a certain line) to see if brute search using that extra raw material might work
That's exactly how I was taught geometry in school, and I hated it with all my guts. Only after making it into the math department at university did I learn to do it properly and to enjoy it.
The problem is that using an LLM to propose auxiliary lines is too inefficient; it is hard to imagine deploying a large number of machines to solve a single IMO problem. This field must still be in an early stage of development, and much work remains unfinished. A reasonable view is that the search part should be handled by a small neural network instead, while the reasoning part is not difficult and does not require much improvement. Now is the time to use self-play to improve performance: treat the conclusion to be proved in a plane geometry problem as one point in the diagram and the given conditions as another, then have two players try to move toward each other as far as possible while sharing data. The contribution each player makes in this process plays the role that wins and losses play in Go, so performance can be improved through self-play.
You can edit a comment for about an hour if I remember correctly, regardless of it being visible. You can't delete it after someone responds to it though.
Though this particular model doesn't sound generalizable, the neuro-symbolic approach seems very promising to me:
- linking the (increasingly powerful) "system 1" tools that make up most of current ML with more structured "system 2" tools, like logical proof generation, which can plan and check the veracity / value of output.
- System 2 chugs along till it gets stuck, then system 1 jumps in to provide an intuitive guess about what part of the state space to check next.
- Here they leveraged the ability to generate proofs by computer to create a data set of 100M proofs, enabling scalable self-supervised learning. It seems to me the symbolic domains are well suited to this kind of data generation, which, while low value in each instance, might allow valuable pre-training in aggregate.
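For what it's worth, a rough sketch of that kind of data-generation pipeline (the callables are hypothetical stand-ins, not the paper's actual code):

    from typing import Callable, List, Set, Tuple

    def generate_synthetic_proofs(sample_premises: Callable[[], Set[str]],
                                  deduce: Callable[[Set[str]], Set[str]],
                                  minimal_premises: Callable[[Set[str], str], Set[str]],
                                  n_examples: int) -> List[Tuple[Set[str], str]]:
        """Sketch: sample random premise sets, run the symbolic engine to closure,
        and record (minimal premises, derived fact) pairs as training examples.
        Each instance is cheap and individually uninteresting, but the aggregate
        gives a large self-supervised corpus."""
        data: List[Tuple[Set[str], str]] = []
        while len(data) < n_examples:
            premises = sample_premises()                      # e.g. a random geometric configuration
            for fact in deduce(premises) - premises:          # every new fact is a "theorem"
                needed = minimal_premises(premises, fact)     # traceback: which premises were actually used
                data.append((needed, fact))
                if len(data) >= n_examples:
                    break
        return data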
Putting these elements together is an approach that could get us quite far.
The key milestone will be moving away from the need to use specific formal / symbolic domains, and to generate a pretrained system that can generalize the skills learned from those domains.
> The key milestone will be moving away from the need to use specific formal / symbolic domains, and to generate a pretrained system that can generalize the skills learned from those domains.
You do not need to solve everything at once. This approach has the potential to revolutionize both math and programming by moving formal verification from being a niche tool to a regular part of every practitioner's toolbox.
It also completely solves (within the domain where it applies) one of the most fundamental problems of AI, the one the current round calls "hallucinations"; however, that solution only works because we have a non-AI system to prove correctness.
At a high level, this approach is not really that new. Biochem has been using AI to help find candidate molecules, which are then verified by physical experimentation.
Combinatorial game AI has been using the AI as an input to old-fashioned Monte Carlo searches.
>however that solution only works because we have a non AI system to prove correctness
But this is actually a really common scenario. Checking a solution for correctness is often much easier than actually finding a correct solution in the first place.
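A toy illustration of that asymmetry, using factoring purely for the shape of the argument:

    def verify(n: int, p: int, q: int) -> bool:
        # Checking a claimed factorization is a single multiplication.
        return p * q == n and p > 1 and q > 1

    def find_factors(n: int):
        # Finding the factorization in the first place means searching.
        for d in range(2, int(n ** 0.5) + 1):
            if n % d == 0:
                return d, n // d
        return None

    print(verify(2021, 43, 47))   # True, instant
    print(find_factors(2021))     # (43, 47), after trial division

The generator can be as sloppy and "hallucination-prone" as it likes, as long as the cheap checker gets the final word.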
> however that solution only works because we have a non AI system to prove correctness.
I'm not sure why you are calling it non-AI. There's no reason why some AI system has to be some single neural network like GPT and not a hacked together conglomeration of a bunch of neural networks and symbolic logic systems. Like is it cheating to use a SAT solver? Is a SAT solver itself not artificial intelligence of a kind?
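And gluing a SAT solver into a larger system really is only a few lines; a minimal sketch assuming the python-sat package:

    from pysat.solvers import Glucose3  # assumes the python-sat package is installed

    # Encode a tiny formula: (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
    with Glucose3() as solver:
        solver.add_clause([1, 2])
        solver.add_clause([-1, 3])
        solver.add_clause([-2, -3])
        if solver.solve():
            print(solver.get_model())  # e.g. [1, -2, 3]: a satisfying assignment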
I appreciate that the authors released code and weights with their paper! This is the first high-profile DeepMind paper I can recall that has runnable inference code + checkpoints released. (Though I'm happy to be corrected by earlier examples I've missed)
I don't yet see a public copy of the training set / example training code, but still this is a good step towards providing something other researchers can build on -- which is after all the whole point of academic papers!
Yep, I'm missing the datasets as well. They have generated 100M synthetic examples ... Were these examples generated with AlphaGeometry? Where are the filtering code and the initial input used to generate these synthetic examples?
Am I wrong that they are using a T5 model? At least they are using the SentencePiece T5 vocabulary.
How many GPU hours did they spend training this model? Which training parameters were used?
Don't get me wrong, I find this system fascinating; it is what applied engineering should look like. But I'd like to know more about the training details, the initial data they used, and the methods of synthetic data generation.
I'm very curious how often the LM produces a helpful construction. Surely it must be doing better than random chance, but is it throwing out thousands of constructions before it finds a good one, or is it able to generate useful proposals at a rate similar to human experts?
They say in the paper, "Because the language model decoding
process returns k different sequences describing k alternative auxiliary
constructions, we perform a beam search over these k options, using
the score of each beam as its value function. This set-up is highly parallelizable across beams, allowing substantial speed-up when there are
parallel computational resources. In our experiments, we use a beam
size of k = 512, the maximum number of iterations is 16 and the branching factor for each node, that is, the decoding batch size, is 32."
But I don't totally understand how 512 and 16 translate into total number of constructions proposed. They also note that ablating beam size and max iterations seems to only somewhat degrade performance. Does this imply that the model is actually pretty good at putting helpful constructions near the top, and only for the hardest problems does it need to produce thousands?
Interesting that the transformer used is tiny. From the paper:
"We use the Meliad library for transformer training with its base settings. The transformer has 12 layers, embedding dimension of 1,024, eight heads of attention and an inter-attention dense layer of dimension 4,096 with ReLU activation. Overall, the transformer has 151 million parameters, excluding embedding layers at its input and output heads. Our customized tokenizer is trained with ‘word’ mode using SentencePiece and has a vocabulary size of 757. We limit the maximum context length to 1,024 tokens and use T5-style relative position embedding. Sequence packing is also used because more than 90% of our sequences are under 200 in length."
It's not tiny; this is quite a normal size outside the field of LLMs, e.g. for ordinary language models, translation models, or acoustic models. Some people would even call it large.
This suggests there may be some more low hanging fruit in the hard sciences for transformers to bite at, so long as they can be properly formulated. This was not a problem of scaling it seems.
The real TIL (to me) is that the previous state-of-the-art could solve 10 of these! I'd heard there was a decision algorithm for plane geometry problems but I didn't know it was a practical one. Some searching turned up http://www.mmrc.iss.ac.cn/~xgao/paper/book-area.pdf as a reference.
Yes, and even the non-neural-network, symbolic plus linear algebra component of AlphaGeometry is able to outperform the previous state of the art. So a decent amount of work here went into the components that aren't neural networks at all.
This is nice, but I suspect that a barycentric coordinate bash with the formulas in Evan Chen's book could also do 30% of the IMO (seeing that most of it is about triangles) on a modern laptop.
(via https://news.ycombinator.com/item?id=39030186, but we'll merge that thread hither)
> What do you think is "the usual idea of automated reasoning"? Some magic device which can solve any problem using a single linear pass?
You mean like a calculator?
> But I don't totally understand how 512 and 16 translate into total number of constructions proposed.
But let's try -- TL;DR 262,144, but don't take it literally:
- The output of a decoding step is a token, roughly 3/4 of a word; let's just call it 1 word.
- Token considerations per output token = beam_size * branching_factor * max_iterations = 512 * 32 * 16 = 262,144.
- Take their sample solution and count tokens: https://storage.googleapis.com/deepmind-media/DeepMind.com/B...
- Total tokens in the solution = 2,289.
- Total tokens considered = 262,144 * 2,289 = 600,047,616.
- Hack: "number of solutions considered" = total tokens considered / total tokens in the solution
- = 262,144 (the same number as tokens considered per output token, which makes sense).
"We use the Meliad library for transformer training with its base settings. The transformer has 12 layers, embedding dimension of 1,024, eight heads of attention and an inter-attention dense layer of dimension 4,096 with ReLU activation. Overall, the transformer has 151 million parameters, excluding embedding layers at its input and output heads. Our customized tokenizer is trained with ‘word’ mode using SentencePiece and has a vocabulary size of 757. We limit the maximum context length to 1,024 tokens and use T5-style relative position embedding. Sequence packing is also used because more than 90% of our sequences are under 200 in length."