This is part of the ethical morass around the benchmark, and part of why some more serious researchers aren't touching it. People are not going to take it seriously if it continues like this!
In the case of TTT, I wouldn’t really describe that as a ‘new AGI reasoning approach’. People have been fine tuning deep learning models on specific tasks for a long time.
The fundamental instinct driving the creation of ARC -- that ‘deep learning cannot do system 2 thinking’ -- is under threat of being proven wrong very soon. Attempts to define the approaches that are working as somehow not ‘traditional deep learning’ really seem like moving the goalposts.
The new and surprising thing about test-time training (TTT) is how effective it is as an approach to dealing with novel abstract reasoning problems like ARC-AGI.
TTT was pioneered by Jack Cole last year and popularized this year by several teams, including this winning paper: https://ekinakyurek.github.io/papers/ttt.pdf
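For readers who haven't seen it, here's roughly what TTT looks like for a single ARC task: fine-tune a copy of the model on that task's few demonstration pairs at inference time, then predict the test output with the task-specialized weights. This is a minimal sketch assuming a Hugging Face-style causal LM; the grid serialization, hyperparameters, and lack of data augmentation are my simplifications, not the winning papers' exact recipe.

```python
import copy
import torch

def serialize(grid):
    # Grids are small 2D lists of ints; render each row as a string of digits.
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def ttt_solve(base_model, tokenizer, task, steps=32, lr=1e-5):
    # 1) Copy the base model so each task gets its own temporary weights.
    model = copy.deepcopy(base_model).train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    # 2) Fine-tune on the task's demonstration pairs at inference time.
    demos = [
        f"INPUT:\n{serialize(p['input'])}\nOUTPUT:\n{serialize(p['output'])}"
        for p in task["train"]
    ]
    for _ in range(steps):
        for text in demos:
            batch = tokenizer(text, return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

    # 3) Predict the held-out test output with the task-specialized model.
    model.eval()
    prompt = f"INPUT:\n{serialize(task['test'][0]['input'])}\nOUTPUT:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])
```

The real systems also augment the demonstrations (rotations, reflections, color permutations) before fine-tuning, which matters a lot in practice.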
It is a great unit test for reasoning -- that's fantastic! And maybe it is indeed the best way to test for this -- who knows exactly. But the claim is a little grandiose for what it is; it's somewhat similar to saying that testing on string parity is the One True Test of an optimizer's efficiency.
I'd heartily recommend taking the marketing vibrance down a notch and keeping things a bit more measured. It's not entirely a meme, but some of the more serious researchers don't take it as seriously as a result. And that's the kind of people you want to attract to this sort of thing!
I think there is a potentially good future for ARC! But it might struggle to attract some of the kind of talent that you want to work on this problem as a result.
This is a fair critique. ARC Prize's 2024 messaging was sharp in order to break through the noise floor -- ARC has been around since 2019 but most people only learned about it this summer. Now that it has garnered awareness, that sharpness is no longer useful, and in some cases is hurting progress, as you point out. The messaging needs to evolve and mature next year to be more neutral/academic.
One big update since June is that progress is no longer stalled. Coming into 2024, the public consensus vibe was that pure deep learning / LLMs would continue scaling to AGI. The fundamental architecture of these systems hasn't changed since ~2019.
But this flipped late summer. AlphaProof and o1 are evidence of this new reality. All frontier AI systems are now incorporating components beyond pure deep learning like program synthesis and program search.
I believe ARC Prize played a role here too. All the winners this year are leveraging new AGI reasoning approaches like deep-learning guided program synthesis, and test-time training/fine-tuning. We'll be seeing a lot more of these in frontier AI systems in coming years.
And I'm proud to say that all the code and papers from this year's winners are now open source!
We're going to keep running this thing annually until it's defeated. And we've got ARC-AGI-2 in the works to improve on several of the v1 flaws (more here: https://arcprize.org/blog/arc-prize-2024-winners-technical-r...)
The ARC-AGI community keeps surprising me. From initial launch, through o1 testing, to the final 48 hours when the winning team jumped 10% and both winning papers dropped out of nowhere. I'm incredibly grateful to everyone and we will do our best to steward this attention towards AGI.
We'll be back in 2025!
We still have a long way to go for the grand prize -- we'll be back next year. Also got some new stuff in the works for 2025.
Watch for the official ARC Prize 2024 paper coming Dec 6. It will give an overview of all the new AI reasoning code and approaches open sourced via the competition [3].
[1] https://deepmind.google/discover/blog/ai-solves-imo-problems...
I don’t think that’s true though, it’s hard to be more fair and explicit than:
> OpenAI o1-preview and o1-mini both outperform GPT-4o on the ARC-AGI public evaluation dataset. o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.
I.e., it’s just not that great, and it’s enormously slow.
That probably wasn’t what people wanted to hear, even if it is literally what the results show.
You can't run away from the numbers:
> It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.
(Side note: readers may be getting confused about what “test-time scaling” is, and why it's important. TL;DR: spending more compute at inference time is getting better results. That's a big deal, because previously, throwing more compute at inference didn't seem to make much real difference; but overall I don't see how anything you've said is either inaccurate or misleading.)
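To make that concrete: the simplest well-known way to spend extra inference compute is to sample many candidate answers and keep the most common one (self-consistency). The sketch below only illustrates that general pattern; it is not a claim about how o1 implements its reasoning, and `ask_model` is a placeholder for whatever sampling LLM call you use.

```python
from collections import Counter

def ask_model(prompt: str, temperature: float = 0.8) -> str:
    # Placeholder: plug in your own LLM client here (sampling enabled).
    raise NotImplementedError

def solve_with_more_compute(prompt: str, n_samples: int = 32) -> str:
    # Spend n_samples times the compute of a single call, then majority-vote.
    answers = [ask_model(prompt).strip() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```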
Curiosity is the first step towards new ideas.
ARC Prize's whole goal is to inspire curiosity like this and to encourage more AI researchers to explore and openly share new approaches towards AGI.
So, how well might o1 do with Greenblatt's strategy?
https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fab...
Anyways, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.
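For context, Greenblatt's GPT-4o approach sampled thousands of candidate Python programs per task, kept the ones that reproduced every demonstration pair, and applied a survivor to the test input. Below is a rough sketch of that loop under my own assumptions (a placeholder `sample_program` for the LLM call, an illustrative candidate count), not his actual code.

```python
def sample_program(task_description: str) -> str:
    # Placeholder: prompt the model to write a `transform(grid)` function.
    raise NotImplementedError

def run_candidate(source: str, grid):
    # Execute a candidate program and apply its transform to a grid.
    namespace = {}
    exec(source, namespace)
    return namespace["transform"](grid)

def greenblatt_style_solve(task, task_description: str, n_candidates: int = 1000):
    for _ in range(n_candidates):
        src = sample_program(task_description)
        try:
            # Keep a program only if it matches every demonstration output.
            if all(run_candidate(src, p["input"]) == p["output"]
                   for p in task["train"]):
                return run_candidate(src, task["test"][0]["input"])
        except Exception:
            continue  # most sampled programs crash or are wrong; skip them
    return None
```

A stronger sampler like o1 should need fewer candidates to find a program that fits the demonstrations, which is why I think it's a leap for this style of solution.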