1. We had a no-data retention agreement with them. We were assured by the highest level of their company + security division that the box our test was run on would be wiped after testing
2. We only tested o3 against the semi-private set. We didn't test it with the private eval.
These benchmarks, and specifically the constraints placed on solving them (compute, etc.), seem to me to incentivize the opposite of "general intelligence".
Have any of the technical contributions used to win the past competition been used to advance general AI in any way?
We have transformer-based systems constantly gaining capabilities. On the other hand, have any of the Kaggle submissions actually advanced the field in any way outside of the ARC Challenge?
To me (a complete outsider, admittedly) the ARC prize seems like an operationalization of the bitter lesson
We had 40 papers submitted last year and 8 were awarded prizes. [1]
One of the main teams, MindsAI, just published their paper on their novel test-time fine-tuning approach. [2]
Jan/Daniel (1st place winners last year) talk all about their progress and journey building out here [3]. Stories like theirs help push the field forward.
[1] https://arcprize.org/blog/arc-prize-2024-winners-technical-r...
[2] https://github.com/MohamedOsman1998/deep-learning-for-arc/bl...
1. Public Train - 1,000 tasks that are public
2. Public Eval - 120 tasks that are public
So for those two we don't have protections.
3. Semi Private Eval - 120 tasks that are exposed to 3rd parties. We sign data agreements where we can, but we understand this is exposed and not 100% secure. It's a risk we are open to in order to keep testing velocity. In practice it is very difficult to secure this 100%. The cost to create a new semi-private test set is lower than the effort needed to secure it 100%.
4. Private Eval - Only on Kaggle, not exposed to any 3rd parties at all. Very few people have access to this. Our trust vectors are with Kaggle and the internal team only.
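For anyone who wants to poke at the public sets: each task is a small JSON file with "train" demonstration pairs and "test" pairs, and every grid is a list of rows of integers 0-9. Here is a minimal sketch of loading and inspecting one task; the data/training path and file layout are assumptions based on the public ARC-AGI repos, not anything official.

    import json
    from pathlib import Path

    def load_task(path):
        """Load one ARC task: {'train': [...], 'test': [...]},
        where each item is {'input': grid, 'output': grid} and a
        grid is a list of rows of integers 0-9."""
        with open(path) as f:
            return json.load(f)

    def grid_shape(grid):
        return len(grid), len(grid[0])

    if __name__ == "__main__":
        # Assumed layout: data/training/*.json, one file per task.
        task_path = next(Path("data/training").glob("*.json"))
        task = load_task(task_path)
        for i, pair in enumerate(task["train"]):
            print(f"demo {i}: input {grid_shape(pair['input'])} "
                  f"-> output {grid_shape(pair['output'])}")
        print(f"{len(task['test'])} test input(s) to solve")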
Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.
In Dec ‘24, ARC-AGI-1 (launched in 2019) pinpointed the moment AI moved beyond pure memorization, as demonstrated by OpenAI's o3.
ARC-AGI-2 targets test-time reasoning.
My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.
Base LLMs (no reasoning) currently score 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) score under 4%.
Every ARC-AGI-2 task (100%), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.
Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.
Change log from ARC-AGI-1 to ARC-AGI-2:

* The two main evaluation sets (semi-private, private eval) have increased to 120 tasks
* Solving tasks requires more reasoning vs. pure intuition
* Each task has been confirmed to have been solved by at least 2 people (often many more) out of an average of 7 test takers, in 2 attempts or less
* Non-training task sets are now difficulty-calibrated
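For context on what "solved in 2 attempts or less" means: an attempt only counts if the predicted output grid matches the expected grid exactly, cell for cell. Here is a minimal sketch of that kind of exact-match, two-attempt scoring; the solver interface and function names are illustrative, not the official Kaggle harness.

    def grids_equal(a, b):
        """Exact match: same dimensions and same integer in every cell."""
        return a == b  # grids are lists of lists of ints, so == is cellwise

    def score_task(task, solver, attempts=2):
        """Fraction of a task's test inputs solved within `attempts` tries.
        `solver(train_pairs, test_input, attempt)` is an illustrative
        interface that returns a predicted output grid."""
        solved = 0
        for pair in task["test"]:
            for attempt in range(attempts):
                prediction = solver(task["train"], pair["input"], attempt)
                if grids_equal(prediction, pair["output"]):
                    solved += 1
                    break
        return solved / len(task["test"])

    def score_eval_set(tasks, solver):
        """Average per-task score across an evaluation set."""
        return sum(score_task(t, solver) for t in tasks) / len(tasks)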
The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) had 1.5K teams participate and produced 40+ research papers.
The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition
We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.
Happy to answer questions.