1. We had a no-data retention agreement with them. We were assured by the highest level of their company + security division that the box our test was run on would be wiped after testing
2. We only tested o3 against the semi-private set. We didn't test it with the private eval.
These benchmarks, and specifically the constraints placed on solving them (compute, etc.), seem to me to incentivize the opposite of "general intelligence".
Have any of the technical contributions used to win the past competition been used to advance general AI in any way?
We have transformer-based systems constantly gaining capabilities. On the other hand, have any of the Kaggle submissions actually advanced the field in any way outside of the ARC Challenge?
To me (a complete outsider, admittedly) the ARC prize seems like an operationalization of the bitter lesson
We had 40 papers submitted last year and 8 were awarded prizes. [1]
One of the main teams, MindsAI, just published their paper on their novel test-time fine-tuning approach. [2]
Jan/Daniel (1st place winners last year) talk all about their progress and journey building out here [3]. Stories like theirs help push the field forward.
[1] https://arcprize.org/blog/arc-prize-2024-winners-technical-r...
[2] https://github.com/MohamedOsman1998/deep-learning-for-arc/bl...
1. Public Train - 1,000 tasks that are public
2. Public Eval - 120 tasks that are public
So for those two we don't have protections.
3. Semi Private Eval - 120 tasks that are exposed to 3rd parties. We sign data agreements where we can, but we understand this is exposed and not 100% secure. It's a risk we are open to in order to keep testing velocity. In practice it is very difficult to secure this 100%. The cost to create a new semi-private test set is lower than the effort needed to secure it 100%.
4. Private Eval - Only on Kaggle, not exposed to any 3rd parties at all. Very few people have access to this. Our trust vectors are with Kaggle and the internal team only.
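For anyone who wants to poke at the public sets: each task is a small JSON file with "train" demonstration pairs and "test" pairs, and every grid is a list of rows of integers 0-9. Here is a minimal sketch of loading and inspecting one task; the data/training path and file layout are assumptions based on the public ARC-AGI repos, not anything official.

    import json
    from pathlib import Path

    def load_task(path):
        """Load one ARC task: {'train': [...], 'test': [...]},
        where each item is {'input': grid, 'output': grid} and a
        grid is a list of rows of integers 0-9."""
        with open(path) as f:
            return json.load(f)

    def grid_shape(grid):
        return len(grid), len(grid[0])

    if __name__ == "__main__":
        # Assumed layout: data/training/*.json, one file per task.
        task_path = next(Path("data/training").glob("*.json"))
        task = load_task(task_path)
        for i, pair in enumerate(task["train"]):
            print(f"demo {i}: input {grid_shape(pair['input'])} "
                  f"-> output {grid_shape(pair['output'])}")
        print(f"{len(task['test'])} test input(s) to solve")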
Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.
In Dec ‘24, ARC-AGI-1 (launched in 2019) pinpointed the moment AI moved beyond pure memorization, as demonstrated by OpenAI's o3.
ARC-AGI-2 targets test-time reasoning.
My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.
Base LLMs (no reasoning) currently score 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) score under 4%.
Every ARC-AGI-2 task (100%), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.
Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.
Change log from ARC-AGI-1 to ARC-AGI-2:

* The two main evaluation sets (semi-private, private eval) have increased to 120 tasks
* Solving tasks requires more reasoning vs. pure intuition
* Each task has been confirmed to have been solved by at least 2 people (often many more) out of an average of 7 test takers, in 2 attempts or less
* Non-training task sets are now difficulty-calibrated
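For context on what "solved in 2 attempts or less" means: an attempt only counts if the predicted output grid matches the expected grid exactly, cell for cell. Here is a minimal sketch of that kind of exact-match, two-attempt scoring; the solver interface and function names are illustrative, not the official Kaggle harness.

    def grids_equal(a, b):
        """Exact match: same dimensions and same integer in every cell."""
        return a == b  # grids are lists of lists of ints, so == is cellwise

    def score_task(task, solver, attempts=2):
        """Fraction of a task's test inputs solved within `attempts` tries.
        `solver(train_pairs, test_input, attempt)` is an illustrative
        interface that returns a predicted output grid."""
        solved = 0
        for pair in task["test"]:
            for attempt in range(attempts):
                prediction = solver(task["train"], pair["input"], attempt)
                if grids_equal(prediction, pair["output"]):
                    solved += 1
                    break
        return solved / len(task["test"])

    def score_eval_set(tasks, solver):
        """Average per-task score across an evaluation set."""
        return sum(score_task(t, solver) for t in tasks) / len(tasks)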
The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) had 1.5K teams participate and produced 40+ research papers.
The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition
We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.
Happy to answer questions.