~$3,400 per single task to meet human performance on this benchmark is a lot. Also, the chart labels the points as "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (e.g. via the API they showed off last week), so even more compute went into this result.
We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance, in my subjective experience) between 5 seconds and 5 minutes to solve a task.
(So I'd argue a human is at $0.03 - $1.67 per puzzle at $20/hr, and their document quotes an average Mechanical Turker at about $2 per task.)
Going the other direction: I interpret this result as saying that human-level reasoning now costs roughly $41k/hr to $2.5M/hr with current compute.
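For anyone checking the arithmetic behind those numbers, a quick sketch (the $3,400/task, 5 s-5 min, and $20/hr figures are the assumptions stated above):

```python
# Rough cost arithmetic behind the figures above (assumptions, not official numbers).
O3_COST_PER_TASK = 3400.0       # estimated $ per task, o3 high-compute
HUMAN_HOURLY = 20.0             # assumed human wage, $/hr
FAST, SLOW = 5 / 3600, 5 / 60   # human time per task, in hours (5 seconds to 5 minutes)

# Human cost per puzzle: ~$0.03 to ~$1.67
print(f"human: ${HUMAN_HOURLY * FAST:.2f} - ${HUMAN_HOURLY * SLOW:.2f} per puzzle")

# o3's cost expressed as an hourly rate for the same work: ~$41k/hr to ~$2.4M/hr
print(f"o3: ${O3_COST_PER_TASK / SLOW:,.0f}/hr - ${O3_COST_PER_TASK / FAST:,.0f}/hr")
```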
Super exciting that OpenAI pushed the compute out this far so we could see the o-series scaling continue and intersect humans on ARC; now we get to work towards making this economical!
Some other important quotes: "Average human off the street: 70-80%. STEM college grad: >95%. Panel of 10 random humans: 99-100%" -@fchollet on X
So, considering that the $3400/task system isn't able to compete with a STEM college grad yet, we still have some room (but it is shrinking; I expect even more compute will be thrown at this and we'll see these barriers broken in coming years).
Also, some other back of envelope calculations:
The gap in cost is roughly 10^3 between o3 High and avg. Mechanical Turkers (humans).
Pure GPU cost improvement (~doubling every 2-2.5 years) puts us at 20-25 years.
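Spelling out where the 20-25 years comes from (a sketch; the 10^3 gap and the doubling cadence are the assumptions above):

```python
import math

# Closing a ~10^3 cost gap on GPU price/performance alone, at an assumed doubling cadence.
cost_gap = 1_000
doublings = math.log2(cost_gap)            # ~9.97 doublings needed
for years_per_doubling in (2.0, 2.5):
    print(f"{years_per_doubling} yr/doubling -> ~{doublings * years_per_doubling:.0f} years")
# -> ~20 and ~25 years
```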
The question now is: can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting 20-25 years for GPU improvements? (I think it feels obvious: this is new technology, things are moving fast, the chance for algorithmic innovation here is high!)
I also personally think that we need to adjust our efficiency priors, and start looking not at "humans" as the bar to beat, but at theoretical computable limits (which show much larger gaps, ~10^9-10^15, for modest problems). Though it may simply be the case that tool/code use + AGI at near-human cost covers a lot of that gap.
It's also worth keeping in mind that AIs are a lot less risky to deploy for businesses than humans.
You can scale them up and down at any time, they can work 24/7 (including holidays) with no overtime pay and no breaks, they need no corporate campuses, office space, HR personnel or travel budgets, you don't have to worry about key employees going on sick/maternity leave or taking time off the moment they're needed most, they won't assault a coworker, sue for discrimination or secretly turn out to be a pedophile and tarnish the reputation of your company, they won't leak internal documents to the press or rage quit because of new company policies, they won't even stop working when a pandemic stops most of the world from running.
I don't follow how 10 random humans can beat the average STEM college grad and average humans in that tweet. I suspect it's really "a panel of 10 randomly chosen experts in the space" or something?
I agree the most interesting thing to watch will be cost for a given score more than maximum possible score achieved (not that the latter won't be interesting by any means).
The report says it is $17 per task, and ~$6k for the whole dataset of 400 tasks.
The low-compute config was $17 per task. Speculating 172 * $17 for the high-compute config gives $2,924 per task, so I am also confused about the $3,400 number.
The compute for the high one is ~172x the low one according to the article, so ~$2,900.
Other important quotes: "o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence. Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training)."
So ya, working on efficiency is important, but we're still pretty far away from AGI even ignoring efficiency. We need an actual breakthrough, which I believe will not be possible by simply scaling the transformer architecture.
You are missing that the cost of electricity is also going to keep falling because of solar and batteries. This year in China my tablecloth math says it is $0.05 per kWh, and following the cost-decline trajectory it will be under $0.01 in 10 years.
The trend for power consumption of compute (megaflops per watt) has generally tracked Koomey's law, with a doubling every 1.57 years.
Then you also have model performance improving with compression: for example, Llama 3.1's 8B outperforming the original Llama 65B.
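For comparison, the same 10^3-gap arithmetic using that 1.57-year efficiency doubling (a sketch; energy is only one slice of the total cost):

```python
import math

# If efficiency doubled every ~1.57 years (Koomey's law, as cited above),
# closing a 10^3 gap on efficiency alone would take roughly:
print(f"~{math.log2(1_000) * 1.57:.0f} years")   # ~16 years
```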
Let's say that Google is already 1 generation ahead of Nvidia in terms of efficient AI compute. ($1700)
Then let's say that OpenAI brute-forced this without any meta-optimization of the hypothesized search component (they just set a compute budget). This is probably low-hanging fruit and another 2x in compute reduction. ($850)
Then let's say that OpenAI was pushing really, really hard for the numbers, was willing to burn cash, and so didn't bother with serious thought around hardware-aware distributed inference. This could be more than a 2x decrease in cost (we've seen better attention mechanisms deliver 10x cost reductions), but let's go with 2x for now. ($425)
So I think we've got about an 8x reduction in cost sitting there once Google steps up. This is probably 4-6 months of work flat out if they haven't already started down this path, but with what they've got with deep research, maybe it's sooner?
Then if "all" we get is hardware improvements we're down to what 10-14 years?
I mean, considering the big breakthrough this year for o1/o3 seems to have been "models having internal thoughts might help reasoning", to everyone outside of the AI field it seems like it was sort of a "duh" moment.
I'd hope we see more internal optimizations and improvements to the models. The idea that the big breakthrough was "don't spit out the first thought that pops into your head" seems obvious to everyone outside of the field, but guess what: it turned out to be a big improvement when the devs decided to add it.
Who in this field is anticipating the impact of near-AGI on society? Maybe I'm too anxious, but not planning for a potentially workless life seems dangerous (but maybe I'm just not following the right groups).
> are we stuck waiting for the 20-25 years for GPU improvements
If this turns out to be hard to optimize / improve then there will be a huge economic incentive for efficient ASICs. No freaking way we’ll be running on GPUs for 20-25 years, or even 2.
Maybe another plane with a bunch of semiconductor people will disappear over Kazakhstan or something. Capitalist communism gets bossier in stealth mode.
But sorry, blablabla, this shit is getting embarrassing.
> The question is now, can we close this "to human" gap
You won’t close this gap by throwing more compute at it. Anything in the sphere of creative thinking eludes most people in the history of the planet. People with PhDs in STEM end up working in IT sales not because they are good or capable of learning but because more than half of them can’t do squat shit, despite all that compute and all those algorithms in their brains.
> Super exciting that OpenAI pushed the compute out this far
it's even more exciting than that. the fact that you even can use more compute to get more intelligence is a breakthrough. if they spent even more on inference, would they get even better scores on arc agi?
> the fact that you even can use more compute to get more intelligence is a breakthrough.
I'm not so sure—what they're doing by just throwing more tokens at it is similar to "solving" the traveling salesman problem by just throwing tons of compute into a breadth first search. Sure, you can get better and better answers the more compute you throw at it (with diminishing returns), but is that really that surprising to anyone who's been following tree of thought models?
All it really seems to tell us is that the type of model that OpenAI has available is capable of solving many of the types of problems that ARC-AGI-PUB has set up given enough compute time. It says nothing about "intelligence" as the concept exists in most people's heads—it just means that a certain very artificial (and intentionally easy for humans) class of problem that wasn't computable is now computable if you're willing to pay an enormous sum to do it. A breakthrough of sorts, sure, but not a surprising one given what we've seen already.
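To make the analogy concrete (a toy exhaustive-search TSP solver, nothing to do with how o3 actually works): more compute buys better tours, but nobody would call it intelligence.

```python
from itertools import permutations

# Brute-force TSP: spend more compute, get better (eventually optimal) answers.
cities = {"A": (0, 0), "B": (1, 5), "C": (4, 3), "D": (6, 1)}

def tour_length(order):
    # Total round-trip distance for a given visiting order.
    pts = [cities[c] for c in order] + [cities[order[0]]]
    return sum(((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

best = min(permutations(cities), key=tour_length)
print(best, round(tour_length(best), 2))
```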
I don't think this is only about efficiency. The model I have here is that this is similar to when we beat chess. Yes, it is impressive that we made progress on a class of problems, but is this class aligned with what the economy or the society needs?
Simple turn-based games such as chess turned out to be too far away from anything practical and chess-engine-like programs were never that useful. It is entirely possible that this will end up in a similar situation. ARC-like pattern matching problems or programming challenges are indeed a respectable challenge for AI, but do we need a program that is able to solve them? How often does something like that come up really? I can see some time-saving in using AI vs StackOverflow in solving some programming challenges, but is there more to this?
I mostly agree with your analysis, but just to drive home a point here - I don't think that algorithms to beat Chess were ever seriously considered as something that would be relevant outside of the context of Chess itself. And obviously, within the world of Chess, they are major breakthroughs.
In this case there is more reason to think these things are relevant outside of the direct context - these tests were specifically designed to see if AI can do general-thinking tasks. The benchmarks might be bad, but that's at least their purpose (unlike in Chess).
ARC is designed to be hard for current models. It cannot be a proxy for how useful they are. It says something else. Most likely these models won't replace humans at their tasks in their organizations. Instead "we" will design pipelines so that the tasks align with the ability of the model, and we will put the human at the periphery. Think of how a factory is organised for the robots.
I wonder if we'll start seeing a shift in compute spend, moving away from training time, and toward inference time instead. As we get closer to AGI, we probably reach some limit in terms of how smart the thing can get just training on existing docs or data or whatever. At some point it knows everything it'll ever know, no matter how much training compute you throw at it.
To move beyond that, the thing has to start thinking for itself, some auto feedback loop, training itself on its own thoughts. Interestingly, this could plausibly be vastly more efficient than training on external data because it's a much tighter feedback loop and a smaller dataset. So it's possible that "nearly AGI" leads to ASI pretty quickly and efficiently.
Of course it's also possible that the feedback loop, while efficient as a computation process, isn't efficient as a learning / reasoning / learning-how-to-reason process, and the thing, while as intelligent as a human, still barely competes with a worm in true reasoning ability.
Interesting times.
> I am interpreting this result as human level reasoning now costs (approximately) 41k/hr to 2.5M/hr with current compute.
On a very simple, toy task, which ARC-AGI basically is. ARC-AGI tests are not hard per se; LLMs just find them hard.
We do not know how this scales for more complex, real world tasks.
Right. Arc is meant to test the ability of a model to generalize. It's neat to see it succeed, but it's not yet a guarantee that it can generalize when given other tasks.
The other benchmarks are a good indication though.
Fundamentally it's a search through some enormous state space. Advancements are "tricks" that let us find useful subsets more efficiently.
Zooming way out, we have a bunch of social tricks, hardware tricks, and algorithmic tricks that have resulted in a super useful subset. It's not the subset that we want though, so the hunt continues.
Hopefully it doesn't require revising too much in the hardware & social bag of tricks; those are a lot more painful to revisit...
I am not so sure, but indeed it is perhaps also a sad realization.
You compare this to "a human" but also admit there is a high variation.
And, I would say there are a lot of humans being paid ~$3,400 per month. Not for a single task, true, but honestly for no value-creating task at all. Just for their time.
So what if we think in terms of output rather than time?
Let's see when this will be released to the free tier. Looks promising, although I hope they will also be able to publish more details on this, as part of the "open" in their name
I think the real key is figuring out how to turn the hand-wavy promises of this making everything better into policy long fucking before we kick the door open. It's self-evident that this being efficient and useful would be a technological revolution; what's not self-evident is that it wouldn't benefit the large corporate entities that control it even more disproportionately than it already does, to the detriment of many other people.
The programming task they gave o3-mini high (creating a Python server that allows chatting with the OpenAI API and running some code in a terminal) didn't seem very hard? Strange choice of example for something that's claimed to be a big step forwards.
Looks like quite shoddy code, though. The procedure for running a shell command is pure side-effect procedural code, neither returning the exit code of the command nor its output - like the incomplete Stack Overflow answer it was probably trained on. It might do one job at a time, but once this stuff gets integrated into one coherent thing, one needs to rewrite lots of it to actually be composable.
Though, of course, one can argue that lots of human-written code is not much different from this.
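For reference, the composable version the parent is describing is not much code (a sketch of what the generated helper could have returned, not the demo's actual code):

```python
import subprocess

def run(cmd: list[str], timeout: float = 30.0) -> tuple[int, str, str]:
    """Run a command and return (exit_code, stdout, stderr) instead of just printing."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr

code, out, err = run(["echo", "hello"])
print(code, out.strip())
```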
I would say they didn't need to demo anything, because if you are going to use the output code live in a demo it may produce compile errors, and then they'd look stupid trying to fix it live.
If it was a safe bet problem, then they should have said that. To me it looks like they faked excitement for something not exciting which lowers credibility of the whole presentation.
How mini is o3-mini compared to Sonnet and why does it matter whether it's mini or not? Isn't the point of the demo to show what's now possible that wasn't before?
4o is cheaper than o1 mini so mini doesn't mean much for costs.
Yeah I agree that wasn't particularly mind blowing to me and seems fairly in line with what existing SOTA models can do. Especially since they did it in steps. Maybe I'm missing something.
YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)
Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aa...
I've been doing similar stuff in Claude for months, and it's not that impressive when you see how limited they really are once you go beyond boilerplate.
Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far.
A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.
It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.
We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!
There is a benchmark, NovelQA, that LLMs don't dominate when it feels like they should. The benchmark is to read a novel and answer questions about it.
LLMs are below human evaluation, as I last looked, but it doesn't get much attention.
Once it is passed, I'd like to see one that is solving the mystery in a mystery book right before it's revealed.
We'd need unpublished mystery novels to use for that benchmark, but I think it gets at what I think of as reasoning.
https://novelqa.github.io/
I would think this is not so good a benchmark; authors don't write logically, they write for entertainment.
NovelQA is a great one! I also like GSM-Symbolic -- a benchmark based on making _symbolic templates_ of quite easy questions, and sampling them repeatedly, varying things like which proper nouns are used, what order relevant details appear, how many irrelevant details (GSM-NoOp) and where they are in the question, things like that.
LLMs are far, _far_ below human on elementary problems, once you allow any variation and stop spoonfeeding perfectly phrased word problems. :)
https://machinelearning.apple.com/research/gsm-symbolic
https://arxiv.org/pdf/2410.05229
Paper came out in October, I don't think many have fully absorbed the implications.
It's hard to take any of the claims of "LLMs can do reasoning!" seriously, once you understand that simply changing which names are used in an 8th-grade math word problem can have a dramatic impact on the accuracy.
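A made-up illustration of the kind of templating GSM-Symbolic describes (not an actual template from the paper): the surface details vary, the arithmetic and answer do not.

```python
import random

TEMPLATE = ("{name} picks {a} apples on {day}. {name} then picks {b} more, "
            "but gives {c} away. {noop} How many apples does {name} have?")
NAMES = ["Sophie", "Omar", "Priya", "Jack"]
NOOPS = ["", "Five of the apples are a bit smaller than the rest."]  # GSM-NoOp-style distractor

def sample():
    a, b = random.randint(2, 20), random.randint(2, 20)
    c = random.randint(1, a)   # keep the answer non-negative
    question = TEMPLATE.format(name=random.choice(NAMES), a=a, b=b, c=c,
                               day=random.choice(["Monday", "Friday"]),
                               noop=random.choice(NOOPS))
    return question, a + b - c   # ground truth is unaffected by the surface variation

q, answer = sample()
print(q, "->", answer)
```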
"The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning."
Not sure I understand how this follows. The fact that a certain type of model does well on a certain benchmark means that the benchmark is relevant for a real-world reasoning? That doesn't make sense.
It shows objectively that the models are getting better at some form of reasoning, which is at least worth noting. Whether that improved reasoning is relevant for the real world is a different question.
This emphasizes persons and a self-conceived victory narrative over the ground truth.
Models have regularly made progress on it, this is not new with the o-series.
Doing astoundingly well on it, and having a mutually shared PR interest with OpenAI in this instance, doesn't mean a pile of visual puzzles is actually AGI or some well thought out and designed benchmark of True Intelligence(tm). It's one type of visual puzzle.
I don't mean to be negative, but to inject a memento mori. Real story is some guys get together and ride off Chollet's name with some visual puzzles from ye olde IQ test, and the deal was Chollet then gets to show up and say it proves program synthesis is required for True Intelligence.
Getting this score is extremely impressive but I don't assign more signal to it than any other benchmark with some thought to it.
Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front. (Not sure I believe the speculation about o3's internals in the link.)
What I'm saying is the fact that as models are getting better at reasoning they are also scoring better on ARC proves that it is measuring something relating to reasoning. And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs. Even today, let alone five years ago when ARC was released. ARC was visionary.
I won't be as brutal in my wording, but I agree with the sentiment. This was something drilled into me as someone with a hobby in PC Gaming and Photography: benchmarks, while handy measures of potential capabilities, are not guarantees of real world performance. Very few PC gamers completely reinstall the OS before benchmarking to remove all potential cruft or performance impacts, just as very few photographers exclusively take photos of test materials.
While I appreciate the benchmark and its goals (not to mention the puzzles - I quite enjoy figuring them out), successfully passing this benchmark does not demonstrate or guarantee real world capabilities or performance. This is why I increasingly side-eye this field and its obsession with constantly passing benchmarks and then moving the goal posts to a newer, harder benchmark that claims to be a better simulation of human capabilities than the last one: it reeks of squandered capital and a lack of a viable/profitable product, at least to my sniff test. Rather than simply capitalize on their actual accomplishments (which LLMs are - natural language interaction is huge!), they're trying to prove to Capital that with a few (hundred) billion more in investments, they can make AGI out of this and replace all those expensive humans.
They've built the most advanced prediction engines ever conceived, and insist they're best used to replace labor. I'm not sure how they reached that conclusion, but considering even their own models refute this use case for LLMs, I doubt their execution ability on that lofty promise.
Highly challenging for LLMs because it has nothing to do with language. LLMs and their training processes have all kinds of optimizations for language and how it's presented.
This benchmark has done a wonderful job with marketing by picking a great name. It's largely irrelevant for LLMs despite the fact it's difficult.
Consider how much of the model is just noise for a task like this given the low amount of information in each token and the high embedding dimensions used in LLMs.
The benchmark is designed to test for AGI and intelligence, specifically the ability to solve novel problems.
If the hypothesis is that LLMs are the “computer” that drives the AGI then of course the benchmark is relevant in testing for AGI.
I don’t think you understand the benchmark and its motivation. ARC AGI benchmark problems are extremely easy and simple for humans. But LLMs fail spectacularly at them. Why they fail is irrelevant, the fact they fail though means that we don’t have AGI.
You're right, I was wrong to say "most challenging" as there have been harder ones coming out recently. I think the correct statement would be "most challenging long-standing benchmark" as I don't believe any other test designed in 2019 has resisted progress for so long. FrontierMath is only a month old. And of course the real key feature of ARC is that it is easy for humans. FrontierMath is (intentionally) not.
I liked the SimpleQA benchmark that measures hallucinations. OpenAI models did surprisingly poorly, even o1. In fact, it looks like OpenAI often does well on benchmarks by taking the shortcut to be more risk prone than both Anthropic and Google.
Because LLMs are on an off-ramp path towards AGI. A generally intelligent system can brute force its way with just memory.
Once a model recognizes a weakness through reasoning with CoT when posed a certain problem, and gets the agency to adapt to solve that problem, that's a precursor to real AGI capability!
It's the least interesting benchmark for language models among all they've released, especially now that we already had a large jump in its best scores this year. It might be more useful as a multimodal reasoning task since it clearly involves visual elements, but with o3 already performing so well, this has proven unnecessary. ARC-AGI served a very specific purpose well: showcasing tasks where humans easily outperformed language models, so these simple puzzles had their uses. But tasks like proving math theorems or programming are far more impactful.
ARC wasn't designed as a benchmark for LLMs, and it doesn't make much sense to compare them on it since it's the wrong modality. Even an MLM with image inputs can't be expected to do well, since these puzzles are nothing like 99.999% of the training data. The fact that even a text-only LLM can solve ARC problems with the proper framework is important, however.
> The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.
One might also interpret that as "the fact that models which are studying to the test are getting better at the test" (Goodhart's law), not that they're actually reasoning.
I am confused because this dataset is visual, and yet it's being used to measure 'LLMs'. I feel like the visual nature of it was really the biggest hurdle to solving it.
This means we have an algorithm to get to human level performance on this task.
If you think this task is an eval of general reasoning ability, we have an algorithm for that now.
There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.
Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!
As excited as I am by this, I still feel like this is still just a small approximation of a small chunk of human reasoning ability at large. o3 (and whatever comes next) feels to me like it will head down the path of being a reasoning coprocessor for various tasks.
I'd like to see this o3 thing play 5d chess with multiverse time travel or baba is you.
The only effect smarter models will have is that intelligent people will have to use less of their brain to do their work. As has always been the case, the medium is the message, and climate change is one of the most difficult and worst problems of our time.
If this gets software people to quit en-masse and start working in energy, biology, ecology and preservation? Then it has succeeded.
You'd be surprised what the AVERAGE human fails to do that you think is easy; my mom can't fucking send an email without downloading a virus, and I have a coworker who believes beyond a shadow of a doubt that the world is flat.
The average human is a lot dumber than people on Hacker News and Reddit seem to realize; shit, the people on MTurk are likely smarter than the AVERAGE person.
What's interesting is that it might be much closer to human intelligence than to some "alien" intelligence, because after all it is an LLM trained on human-made text, which kind of represents human intelligence.
In that vein, perhaps the delta between o3 @ 87.5% and Human @ 85% represents a deficit in the ability of text to communicate human reasoning.
In other words, it's possible humans can reason better than o3, but cannot articulate that reasoning as well through text - only in our heads, or through some alternative medium.
Agreed. I think what really makes them alien is everything else about them besides intelligence. Namely, no emotional/physiological grounding in empathy, shame, pride, and love (on the positive side) or hatred (negative side).
Human performance is much closer to 100% on this, depending on your human. It's easy to miss the dot in the corner of the headline graph in TFA that says "STEM grad."
Deterministic (IEEE 754 floats), terminates on all inputs, correctness (produces loss < X on N training/test inputs).
At most you can argue that there isn't a useful bounded loss on every possible input, but it turns out that humans don't achieve useful bounded loss on identifying arbitrary sets of pixels as a cat or whatever, either. Most problems NNs are aimed at are qualitative or probabilistic where provable bounds are less useful than Nth-percentile performance on real-world data.
How do you define "algorithm"? I suspect it is a definition I would find somewhat unusual. Not to say that I strictly disagree, but only because to my mind "neural net" suggests something a bit more concrete than "algorithm", so I might instead say that an artificial neural net is an implementation of an algorithm, or something like that.
But, to my mind, something of the form "Train a neural network with an architecture generally like [blah], with a training method+data like [bleh], and save the result. Then, when inputs are received, run them through the NN in such-and-such way." would constitute an algorithm.
I'll believe it when the AI can earn money on its own. I obviously don't mean someone paying a subscription to use the AI; I mean letting the AI loose on the Internet with only the goal of making money and putting it into a bank account.
You don't think there are already plenty of attempts out there?
When someone is "disinterested enough" to publish though, note the obvious way to launch a new fund or advisor with a good track record: crank out a pile of them, run them one or two years, discard the many losers and publish the one or two top winners. I.E. first you should be suspicious of why it's being published, then of how selected that result is.
Let me go against some skeptics and explain why I think full o3 is pretty much AGI or at least embodies most essential aspects of AGI.
What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often comes out lacking if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, although that is not the case with these models.)
ARC has been challenging precisely because solving its problems often requires:
1) using multiple different *kinds* of core knowledge [1], such as symmetry, counting, color, AND
2) using the right level(s) of abstraction
Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and Frontier Math, suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve.
It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.
Thanks to lswainemoore below for the link to Chollet's posts. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still-image data.
I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
Quote from the creators of the AGI-ARC benchmark: "Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence."
I like the notion, implied in the article, that AGI won't be verified by any single benchmark, but by our collective inability to come up with benchmarks that defeat some eventual AI system. This matches the cat-and-mouse game we've been seeing for a while, where benchmarks have to constantly adapt to better models.
I guess you can say the same thing for the Turing Test. Simple chat bots beat it ages ago in specific settings, but the bar is much higher now that the average person is familiar with their limitations.
If/once we have an AGI, it will probably take weeks to months to really convince ourselves that it is one.
I'd need to see what kinds of easy tasks those are and would be happy to revise my hypothesis if that's warranted.
Also, it depends a great deal on what we define as AGI and whether they need to be a strict superset of typical human intelligence. o3's intelligence is probably superhuman in some aspects but inferior in others. We can find many humans who exhibit such tendencies as well. We'd probably say they think differently but would still call them generally intelligent.
They say it isn't AGI, but I think the way o3 functions can be refined into AGI - it's learning to solve new, novel problems. We just need to make it do that more consistently, which seems achievable.
Have we really watered down the definition of AGI that much?
LLMs aren't really capable of "learning" anything outside their training data. Which I feel is a very basic and fundamental capability of humans.
Every new request thread is a blank slate utilizing whatever context you provide for the specific task, and after the thread is done (or the context limit runs out) it's like it never happened. Sure, you can use databases, do web queries, etc., but these are inflexible band-aid solutions, far from what's needed for AGI.
> LLMs aren't really capable of "learning" anything outside their training data.
ChatGPT has had for some time the feature of storing memories about its conversations with users. And you can use function calling to make this more generic.
I think drawing the boundary at “model + scaffolding” is more interesting.
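A minimal sketch of the "memory via function calling" idea mentioned above, using the OpenAI Python SDK (the save_memory tool, model name, and storage are placeholders I made up):

```python
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "save_memory",  # hypothetical tool defined by us, not built into the API
        "description": "Persist a fact about the user for future conversations.",
        "parameters": {
            "type": "object",
            "properties": {"fact": {"type": "string"}},
            "required": ["fact"],
        },
    },
}]

memories: list[str] = []  # stand-in for a real datastore
resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Remember that I prefer metric units."}],
    tools=tools,
)
for call in resp.choices[0].message.tool_calls or []:
    if call.function.name == "save_memory":
        memories.append(json.loads(call.function.arguments)["fact"])
print(memories)
```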
That's true for vanilla LLMs, but also keep in mind that there are no details about o3's architecture at the moment. Clearly they are doing something different given the huge performance jump on a lot of benchmarks, and it may well involve in-context learning.
What's your explanation for why it can only get ~70% on SWE-bench Verified?
I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems, and to boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even know about the individual issues themselves (e.g. you can ask what GitHub issue #4145 in project foo was, and there's a decent chance it can tell you exactly what the issue was about!)
I've spent tons of time evaluating o1-preview on SWEBench-Verified.
For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.
For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall in this category, but I suspect it's more than on a simpler benchmark like MATH.
One possibility is that it may not yet have sufficient experience and real-world feedback for resolving coding issues in professional repos, as this involves multiple steps and very diverse actions (or branching factor, in AI terms). They have committed to not training on API usage, which limits their ability to directly acquire training data from it. However, their upcoming agentic efforts may address this gap in training data.
GPQA scores are mostly from pre-training, against content in the corpus. They have gone silent but look at the GPT4 technical report which calls this out.
We are nowhere close to what Sam Altman calls AGI and transformers are still limited to what uniform-TC0 can do.
As an example the Boolean Formula Value Problem is NC1-complete, thus beyond transformers but trivial to solve with a TM.
As it is now proven that the frame problem is equivalent to the halting problem, even if we can move past uniform-TC0 limits, novelty is still a problem.
I think the advancements are truly extraordinary, but unless you set the bar very low, we aren't close to AGI.
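For the curious, the Boolean Formula Value Problem mentioned above is the kind of thing an ordinary program handles in a few lines (a sketch), which is what makes its NC1-completeness a pointed example against constant-depth, TC0-style architectures:

```python
def evaluate(formula: str) -> bool:
    """Evaluate fully parenthesized formulas like '(T&(F|!F))' over T, F, &, |, !."""
    pos = 0

    def parse() -> bool:
        nonlocal pos
        ch = formula[pos]; pos += 1
        if ch == 'T': return True
        if ch == 'F': return False
        if ch == '!': return not parse()
        assert ch == '('        # grammar: ( expr op expr )
        left = parse()
        op = formula[pos]; pos += 1
        right = parse()
        pos += 1                # consume ')'
        return (left and right) if op == '&' else (left or right)

    return parse()

print(evaluate("(T&(F|!F))"))   # True
```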
> Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforce, AIME, and Frontier Math strongly suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it.
The article notes, "o3 still fails on some very easy tasks". What explains these failures if o3 can solve "any problem" at the human level? Do these failed cases require some essential knowledge that has eluded the massive OpenAI training set?
Great point. I'd love to see what these easy tasks are and would be happy to revise my hypothesis accordingly. o3's intelligence is unlikely to be a strict superset of human intelligence. It is certainly superior to humans in some respects and probably inferior in others. Whether it's sufficiently generally intelligent would be both a matter of definition and empirical fact.
Please stop calling it AGI; we don't even know or universally agree what that should actually mean. How far did we get with hype, calling a lossy probabilistic compressor that slowly fires words at us AGI? That's a real bummer to me.
Personally I find "human-level" to be a borderline meaningless and limiting term. Are we now super human as a species relative to ourselves just five years ago because of our advances in developing computer programs that better imitate what many (but far from all) of us were already capable of doing? Have we reached a limit to human potential that can only be surpassed by digital machines? Who decides what human level is and when we have surpassed it? I have seen some ridiculous claims about ai in art that don't stand up to even the slightest scrutiny by domain experts but that easily fool the masses.
No I think we're just tired and depressed as a species... Existing systems work to a degree but aren't living up to their potential of increasing happiness according to technological capabilities.
The problem with ARC is that there is a finite number of heuristics that could be enumerated and trained for, which would give a model a substantial leg up on this evaluation but would not generalize to other domains.
For example, if they produce millions of examples of the type of problems o3 still struggles on, it would probably do better at similar questions.
Perhaps the private data set is different enough that this isn't a problem, but the ideal situation would be unveiling a truly novel dataset, which it seems like ARC aims to do.
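A toy sketch of what "enumerate a heuristic family and synthesize training data" could look like (real ARC tasks are far more varied; this is just to illustrate the concern):

```python
import random

# One trivially enumerable "heuristic": reflect the grid left-right.
def make_mirror_task(size: int = 4, colors: int = 5):
    grid = [[random.randrange(colors) for _ in range(size)] for _ in range(size)]
    return {"input": grid, "output": [list(reversed(row)) for row in grid]}

# Generate as many examples as you like; tuning on them helps with mirror puzzles,
# not with reasoning in general.
dataset = [make_mirror_task() for _ in range(10_000)]
print(dataset[0]["input"], "->", dataset[0]["output"])
```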
I mean, fuck me when they have those things; however, maybe they are just lazy and their judgement is fine, for a lazy intelligence. Its inner self thinks "why are these bastards asking me to do this?". I doubt that is actually happening, but now... prove it isn't.
Incredibly impressive. Still can't really shake the feeling that this is o3 gaming the system more than it is actually being able to reason. If the reasoning capabilities are there, there should be no reason why it achieves 90% on one version and 30% on the next. If a human maintains the same performance across the two versions, an AI with reason should too.
The point of ARC is NOT to compare humans vs AI, but to probe the current boundary of AIs weaknesses. AI has been beating us at specific tasks like handwriting recognition for decades. Rather, it's when we can no longer readily find these "easy for human, hard for AI" reasoning tasks that we must stop and consider.
If you look at the ARC tasks failed by o3, they're really not well suited to humans. They lack the living context humans thrive on, and have relatively simple, analytical outcomes that are readily processed by simple structures. We're unlikely to see AI as "smart" until it can be asked to accomplish useful units of productive professional work at a "seasoned apprentice" level. Right now they're consuming ungodly amounts of power just to pass some irritating, sterile SAT questions. Train a human for a few hours a day over a couple weeks and they'll ace this no problem.
But does it matter if it "really, really" reasons in the human sense, if it's able to prove some famous math theorem or come up with a novel result in theoretical physics?
While beyond current models, that would be the final test of AGI capability.
AI models have historically found lots of ways to game systems. My favorite example is exploiting bugs in simulator physics to "cheat" at games of computer tag. Another is a model for radiology tasks finding biases in diagnostic results using dates on the images. And of course whenever people discuss a benchmark publicly it leaks the benchmark into the training set, so the benchmark becomes a worse measure.
Nope. AlphaZero taught itself to play games like chess, shogi, and Go through self-play, starting from random moves. It was not given any strategies or human gameplay data but was provided with the basic rules of each game to guide its learning process.
Humans and AIs are different; the next benchmark would be built so that it emphasizes the weak points of current AI models where a human is expected to perform better, but I guess you could also make a benchmark that is the opposite, where humans struggle and o3 has an easy time.
I think you've hit the nail on the head there. If these systems of reasoning are truly general then they should be able to perform consistently in the same way a human does across similar tasks, barring some variance.
AGI wouldn't necessarily entail any autonomy or goals though. In principle there could be a superintelligent AI that's completely indifferent to such outcomes, with no particular goals beyond correctly answering question or what not.
Not sure why I am being downvoted. Why would a sufficiently advanced intelligence reveal its full capabilities knowing fully well that it would then be subjected to a range of constraints and restraints?
If you disagree with me, state why instead of opting to downvote me
The cost to run the highest performance o3 model is estimated to be somewhere between $2,000 and $3,400 per task.[1] Based on these estimates, o3 costs about 100x what it would cost to have a human perform the exact same task. Many people are therefore dismissing the near-term impact of these models because of these extremely expensive costs.
I think this is a mistake.
Even if very high costs make o3 uneconomic for businesses, it could be an epoch defining development for nation states, assuming that it is true that o3 can reason like an averagely intelligent person.
Consider the following questions that a state actor might ask itself: What is the cost to raise and educate an average person? Correspondingly, what is the cost to build and run a datacenter with a nuclear power plant attached to it? And finally, how many person-equivalent AIs could be run in parallel per datacenter?
There are many state actors, corporations, and even individual people who can afford to ask these questions. There are also many things that they'd like to do but can't because there just aren't enough people available to do them. o3 might change that despite its high cost.
So if it is true that we've now got something like human-equivalent intelligence on demand - and that's a really big if - then we may see its impacts much sooner than we would otherwise intuit, especially in areas where economics takes a back seat to other priorities like national security and state competitiveness.
Your economic analysis is deeply flawed. If there was anything that valuable and that required that much manpower, it would already have driven up the cost of labor accordingly. The one property that could conceivably justify a substantially higher cost is secrecy. After all, you can't (legally) kill a human after your project ends to ensure total secrecy. But that takes us into thriller novel territory.
I don't think that's right. Free societies don't tolerate total mobilization by their governments outside of war time, no matter how valuable the outcomes might be in the long term, in part because of the very economic impacts you describe. Human-level AI - even if it's very expensive - puts something that looks a lot like total mobilization within reach without the societal pushback. This is especially true when it comes to tasks that society as a whole may not sufficiently value, but that a state actor might value very much, and when paired with something like a co-located reactor and data center that does not impact the grid.
That said, this is all predicated on o3 or similar actually having achieved human level reasoning. That's yet to be fully proven. We'll see!
I disagree because the job market is not a true free market. I mean it mostly is, but there's a LOT of politics and shady stuff that employers do to purposely drive wages down. Even in the tech sector.
Your secrecy comment is really intriguing actually. And morbid lol.
ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.
Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.
The high compute variant sounds like it cost around *$350,000*, which is kinda wild. Lol, the blog post specifically mentioned how OpenAI asked ARC-AGI not to disclose the exact cost for the high compute version.
Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as "tuned" (this was not displayed in the live demo graph). This suggests in those cases that the model was trained to better handle these types of questions, so I do wonder about data / answer contamination in those cases…
> Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned”
Something I missed until I scrolled back to the top and reread the page was this
> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set
So yeah, the results were specifically from a version of o3 trained on the public training set
Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any spoken rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.
On the other hand though, I don't think the o1 models nor Claude were trained on the dataset, in which case it isn't a completely fair competition. If I had to guess, you could probably get 60% on o1 if you trained it on the public dataset as well.
Great catch. Super disappointing that AI companies continue to do things like this. It’s a great result either way but predictably the excitement is focused on the jump from o1, which is now in question.
> An acid test is a qualitative chemical or metallurgical assay utilizing acid. Historically, it often involved the use of a robust acid to distinguish gold from base metals. Figuratively, the term represents any definitive test for attributes, such as gauging a person's character or evaluating a product's performance.
Specifically here, they're using the figurative sense of "definitive test".
~=$3400 per single task to meet human performance on this benchmark is a lot. Also it shows the bullets as "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (eg. via the API they showed off last week), so even more compute went into this task.
We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance in my subjective experience) between 5 second and 5 minutes to solve the task. (So i'd argue a human is at 0.03USD - 1.67USD per puzzle at 20USD/hr, and they include in their document an average mechancal turker at $2 USD task in their document)
Going the other direction: I am interpreting this result as human level reasoning now costs (approximately) 41k/hr to 2.5M/hr with current compute.
Super exciting that OpenAI pushed the compute out this far so we could see he O-series scaling continue and intersect humans on ARC, now we get to work towards making this economical!
So, considering that the $3400/task system isn't able to compete with STEM college grad yet, we still have some room (but it is shrinking, i expect even more compute will be thrown and we'll see these barriers broken in coming years)
Also, some other back of envelope calculations:
The gap in cost is roughly 10^3 between O3 High and Avg. mechanical turkers (humans). Via Pure GPU cost improvement (~doubling every 2-2.5 years) puts us at 20~25 years.
The question is now, can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting for the 20-25 years for GPU improvements. (I think it feels obvious: this is new technology, things are moving fast, the chance for algorithmic innovation here is high!)
I also personally think that we need to adjust our efficiency priors, and start looking not at "humans" as the bar to beat, but theoretical computatble limits (show gaps much larger ~10^9-10^15 for modest problems). Though, it may simply be the case that tool/code use + AGI at near human cost covers a lot of that gap.
You can scale them up and down at any time, they can work 24/7 (including holidays) with no overtime pay and no breaks, they need no corporate campuses, office space, HR personnel or travel budgets, you don't have to worry about key employees going on sick/maternity leave or taking time off the moment they're needed most, they won't assault a coworker, sue for discrimination or secretly turn out to be a pedophile and tarnish the reputation of your company, they won't leak internal documents to the press or rage quit because of new company policies, they won't even stop working when a pandemic stops most of the world from running.
I agree the most interesting thing to watch will be cost for a given score more than maximum possible score achieved (not that the latter won't be interesting by any means).
So ya, working on efficiency is important, but we're still pretty far away from AGI even ignoring efficiency. We need an actual breakthrough, which I believe will not be possible by simply scaling the transformer architecture.
Then let's say that OpenAI brute forced this without any meta-optimization of the hypothesized search component (they just set a compute budget). This is probably low hanging fruit and another 2x in compute reduction. ($850)
Then let's say that OpenAI was pushing really really hard for the numbers and was willing to burn cash and so didn't bother with serious thought around hardware aware distributed inference. This could be more than a 2x decrease in cost like we've seen deliver 10x reductions in cost via better attention mechanisms, but let's go with 2x for now. ($425).
So I think we've got about an 8x reduction in cost sitting there once Google steps up. This is probably 4-6 months of work flat out if they haven't already started down this path, but with what they've got with deep research, maybe it's sooner?
Then if "all" we get is hardware improvements we're down to what 10-14 years?
I'd hope we see more internal optimizations and improvements to the models. The idea that the big breakthrough being "don't spit out the first thought that pops into your head" seems obvious to everyone outside of the field, but guess what turns out it was a big improvement when the devs decided to add it.
The trend for power consumption of compute (Megaflops per watt) has generally tracked with Koomey’s law for a doubling every 1.57 years
Then you also have model performance improving with compression. For example, Llama 3.1’s 8B outperforming the original Llama 65B
If this turns out to be hard to optimize / improve then there will be a huge economic incentive for efficient ASICs. No freaking way we’ll be running on GPUs for 20-25 years, or even 2.
But sorry, blablabla, this shit is getting embarrassing.
> The question is now, can we close this "to human" gap
You won’t close this gap by throwing more compute at it. Anything in the sphere of creative thinking eludes most people in the history of the planet. People with PhDs in STEM end up working in IT sales not because they are good or capable of learning but because more than half of them can’t do squat shit, despite all that compute and all those algorithms in their brains.
it's even more exciting than that. the fact that you even can use more compute to get more intelligence is a breakthrough. if they spent even more on inference, would they get even better scores on arc agi?
I'm not so sure—what they're doing by just throwing more tokens at it is similar to "solving" the traveling salesman problem by just throwing tons of compute into a breadth first search. Sure, you can get better and better answers the more compute you throw at it (with diminishing returns), but is that really that surprising to anyone who's been following tree of thought models?
All it really seems to tell us is that the type of model that OpenAI has available is capable of solving many of the types of problems that ARC-AGI-PUB has set up given enough compute time. It says nothing about "intelligence" as the concept exists in most people's heads—it just means that a certain very artificial (and intentionally easy for humans) class of problem that wasn't computable is now computable if you're willing to pay an enormous sum to do it. A breakthrough of sorts, sure, but not a surprising one given what we've seen already.
Simple turn-based games such as chess turned out to be too far away from anything practical and chess-engine-like programs were never that useful. It is entirely possible that this will end up in a similar situation. ARC-like pattern matching problems or programming challenges are indeed a respectable challenge for AI, but do we need a program that is able to solve them? How often does something like that come up really? I can see some time-saving in using AI vs StackOverflow in solving some programming challenges, but is there more to this?
In this case there is more reason to think these things are relevant outside of the direct context - these tests were specifically designed to see if AI can do general-thinking tasks. The benchmarks might be bad, but that's at least their purpose (unlike in Chess).
To move beyond that, the thing has to start thinking for itself, some auto feedback loop, training itself on its own thoughts. Interestingly, this could plausibly be vastly more efficient than training on external data because it's a much tighter feedback loop and a smaller dataset. So it's possible that "nearly AGI" leads to ASI pretty quickly and efficiently.
Of course it's also possible that the feedback loop, while efficient as a computation process, isn't efficient as a learning / reasoning / learning-how-to-reason process, and the thing, while as intelligent as a human, still barely competes with a worm in true reasoning ability.
Interesting times.
On a very simple, toy task, which arc-agi basically is. Arc-agi tests are not hard per se, just LLM’s find them hard. We do not know how this scales for more complex, real world tasks.
The other benchmarks are a good indication though.
report says it is $17 per task, and $6k for whole dataset of 400 tasks.
The low compute was $17 per task. Speculate 172*$17 for the high compute is $2,924 per task, so I am also confused on the $3400 number.
The number for the high-compute one is ~172x the first one according to the article so ~=$2900
Deleted Comment
Fundamentally it's a search through some enormous state space. Advancements are "tricks" that let us find useful subsets more efficiently.
Zooming way out, we have a bunch of social tricks, hardware tricks, and algorithmic tricks that have resulted in a super useful subset. It's not the subset that we want though, so the hunt continues.
Hopefully it doesn't require revising too much in the hardware & social bag of tricks, those are lot more painful to revisit...
You compare this to "a human" but also admit there is a high variation.
And, I would say there are a lot humans being paid ~=$3400 per month. Not for a single task, true, but for honestly for no value creating task at all. Just for their time.
So what about we think in terms of output rather than time?
Deleted Comment
YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)
Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aa...
Though, of course one can argue, that lots of human written code is not much different from this.
4o is cheaper than o1 mini so mini doesn't mean much for costs.
Deleted Comment
I've been doing similar stuff in Claude for months and it's not that impressive when you see how limited they really are when going non boilerplate.
A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.
It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.
We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!
LLMs are below human evaluation, as I last looked, but it doesn't get much attention.
Once it is passed, I'd like to see one that is solving the mystery in a mystery book right before it's revealed.
We'd need unpublished mystery novels to use for that benchmark, but I think it gets at what I think of as reasoning.
https://novelqa.github.io/
LLMs are far, _far_ below human level on elementary problems, once you allow any variation and stop spoonfeeding perfectly phrased word problems. :)
https://machinelearning.apple.com/research/gsm-symbolic
https://arxiv.org/pdf/2410.05229
Paper came out in October, I don't think many have fully absorbed the implications.
It's hard to take any of the claims of "LLMs can do reasoning!" seriously once you understand that simply changing which names are used in an 8th-grade math word problem can have a dramatic impact on accuracy.
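To make the GSM-Symbolic point concrete, here's a hypothetical sketch of the kind of templated perturbation the paper describes: hold the arithmetic fixed, vary only names and numbers, and measure accuracy across variants (ask_model is a placeholder for whatever LLM call you'd use, not a real API):

    import random

    # A GSM8K-style word problem reduced to a template: the reasoning needed
    # is identical across variants; only names and numbers change.
    TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday, "
                "then gives {c} apples to {friend}. "
                "How many apples does {name} have left?")

    NAMES = ["Sophia", "Liam", "Priya", "Mateo", "Yuki"]

    def make_variant(rng: random.Random):
        name, friend = rng.sample(NAMES, 2)
        a, b, c = rng.randint(5, 40), rng.randint(5, 40), rng.randint(1, 10)
        question = TEMPLATE.format(name=name, friend=friend, a=a, b=b, c=c)
        return question, a + b - c   # ground-truth answer

    def accuracy(ask_model, n: int = 50, seed: int = 0) -> float:
        rng = random.Random(seed)
        correct = 0
        for _ in range(n):
            question, answer = make_variant(rng)
            if ask_model(question) == answer:   # ask_model stands in for an LLM call
                correct += 1
        return correct / n

If accuracy swings across seeds even though every variant needs the exact same additions and one subtraction, the model is matching surface patterns rather than doing the arithmetic.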
I would think this is not such a good benchmark. Authors don't write logically; they write for entertainment.
Not sure I understand how this follows. The fact that a certain type of model does well on a certain benchmark means that the benchmark is relevant for a real-world reasoning? That doesn't make sense.
Models have regularly made progress on it; this is not new with the o-series.
Doing astoundingly well on it, and having a mutually shared PR interest with OpenAI in this instance, doesn't mean a pile of visual puzzles is actually AGI or some well thought out and designed benchmark of True Intelligence(tm). It's one type of visual puzzle.
I don't mean to be negative, but to inject a memento mori. Real story is some guys get together and ride off Chollet's name with some visual puzzles from ye olde IQ test, and the deal was Chollet then gets to show up and say it proves program synthesis is required for True Intelligence.
Getting this score is extremely impressive but I don't assign more signal to it than any other benchmark with some thought to it.
What I'm saying is that as models get better at reasoning they also score better on ARC, which proves it is measuring something related to reasoning. And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs - even today, let alone five years ago when ARC was released. ARC was visionary.
While I appreciate the benchmark and its goals (not to mention the puzzles - I quite enjoy figuring them out), successfully passing this benchmark does not demonstrate or guarantee real world capabilities or performance. This is why I increasingly side-eye this field and its obsession with constantly passing benchmarks and then moving the goal posts to a newer, harder benchmark that claims to be a better simulation of human capabilities than the last one: it reeks of squandered capital and a lack of a viable/profitable product, at least to my sniff test. Rather than simply capitalize on their actual accomplishments (which LLMs are - natural language interaction is huge!), they're trying to prove to Capital that with a few (hundred) billion more in investments, they can make AGI out of this and replace all those expensive humans.
They've built the most advanced prediction engines ever conceived, and insist they're best used to replace labor. I'm not sure how they reached that conclusion, but considering even their own models refute this use case for LLMs, I doubt their execution ability on that lofty promise.
This benchmark has done a wonderful job with marketing by picking a great name. It's largely irrelevant for LLMs despite the fact it's difficult.
Consider how much of the model is just noise for a task like this given the low amount of information in each token and the high embedding dimensions used in LLMs.
If the hypothesis is that LLMs are the “computer” that drives the AGI then of course the benchmark is relevant in testing for AGI.
I don’t think you understand the benchmark and its motivation. ARC-AGI benchmark problems are extremely easy and simple for humans, but LLMs fail spectacularly at them. Why they fail is irrelevant; the fact that they fail means we don’t have AGI.
This[1] is currently the most challenging benchmark. I would like to see how O3 handles it, as O1 solved only 1%.
1. https://epoch.ai/frontiermath/the-benchmark
https://youtu.be/SKBG1sqdyIU?t=4m40s
Once a model recognizes a weakness through reasoning with CoT on a given problem and gets the agency to adapt to solve that problem, that's a precursor to real AGI capability!
One might also interpret that as "the fact that models which are studying to the test are getting better at the test" (Goodhart's law), not that they're actually reasoning.
I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.
[1]: https://arcprize.org/2024-results
This means we have an algorithm to get to human level performance on this task.
If you think this task is an eval of general reasoning ability, we have an algorithm for that now.
There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.
Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!
[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1
But, still, this is incredibly impressive.
The only effect smarter models will have is that intelligent people will have to use less of their brain to do their work. As has always been the case, the medium is the message, and climate change is one of the most difficult and worst problems of our time.
If this gets software people to quit en-masse and start working in energy, biology, ecology and preservation? Then it has succeeded.
The average human is a lot dumber than people on Hacker News and Reddit seem to realize; shit, the people on MTurk are likely smarter than the AVERAGE person.
There are blind spots, doesn't take away from 'general'.
In other words, it's possible humans can reason better than o3, but cannot articulate that reasoning as well through text - only in our heads, or through some alternative medium.
From a post elsewhere the scores on ARC-AGI-PUB are approx average human 64%, o3 87%. https://news.ycombinator.com/item?id=42474659
Though also elsewhere, o3 seems very expensive to operate. You could probably hire a PhD researcher for cheaper.
How does a giant pile of linear algebra not meet that definition?
At most you can argue that there isn't a useful bounded loss on every possible input, but it turns out that humans don't achieve useful bounded loss on identifying arbitrary sets of pixels as a cat or whatever, either. Most problems NNs are aimed at are qualitative or probabilistic where provable bounds are less useful than Nth-percentile performance on real-world data.
But, to my mind, something of the form "Train a neural network with an architecture generally like [blah], with a training method+data like [bleh], and save the result. Then, when inputs are received, run them through the NN in such-and-such way." would constitute an algorithm.
When a NN is trained, it produces a set of parameters that basically define an algorithm to do inference with: it's a very big one though.
We also call that a NN (the joy of natural language).
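As a toy illustration of that framing (plain numpy, obviously nothing like a real frontier training recipe): "train with architecture/method like [blah]/[bleh], save the result, then run inputs through it" is one fixed, end-to-end procedure, and the learned weights are just data it produces and later consumes.

    import numpy as np

    def train(X, y, hidden=16, lr=0.5, steps=5000, seed=0):
        """The [blah]/[bleh] part: a fixed architecture and training procedure."""
        rng = np.random.default_rng(seed)
        W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden))
        W2 = rng.normal(0.0, 0.5, (hidden, 1))
        for _ in range(steps):
            h = np.tanh(X @ W1)                       # hidden layer
            p = 1.0 / (1.0 + np.exp(-(h @ W2)))       # sigmoid output
            grad_out = (p - y[:, None]) / len(X)      # dLoss/dlogits for BCE
            W1 -= lr * X.T @ ((grad_out @ W2.T) * (1.0 - h ** 2))
            W2 -= lr * h.T @ grad_out
        return W1, W2                                 # "save the result"

    def infer(params, X):
        """Run inputs through the saved network in such-and-such way."""
        W1, W2 = params
        logits = np.tanh(X @ W1) @ W2
        return (1.0 / (1.0 + np.exp(-logits)) > 0.5).astype(int).ravel()

    # XOR: the whole pipeline, weights included, is one deterministic procedure.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0], dtype=float)
    print(infer(train(X, y), X))   # typically [0 1 1 0]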
When someone is "disinterested enough" to publish though, note the obvious way to launch a new fund or advisor with a good track record: crank out a pile of them, run them one or two years, discard the many losers and publish the one or two top winners. I.e., first you should be suspicious of why it's being published, then of how selected that result is.
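That selection effect is easy to simulate; in this made-up example every fund has zero true edge, yet the best-of-the-pile track record still looks like skill:

    import random

    def best_of_pile(n_funds=100, years=2, mu=0.0, sigma=0.2, seed=1):
        """Every fund is pure noise (mu=0); return the best total return found."""
        rng = random.Random(seed)
        best = float("-inf")
        for _ in range(n_funds):
            growth = 1.0
            for _ in range(years):
                growth *= 1 + rng.gauss(mu, sigma)   # annual return: noise only
            best = max(best, growth)
        return best - 1   # total return of the fund that gets "published"

    print(f"best-of-100 two-year return with zero edge: {best_of_pile():.0%}")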
- 64.2% for humans vs. 82.8%+ for o3.
...
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
[2] https://arcprize.org/blog/oai-o3-pub-breakthrough
[3] https://arxiv.org/abs/2409.01374
Their post has the STEM grad at nearly 100%.
It really calls into question two things.
1. You don't know what you're talking about.
2. You have a perverse incentive to believe this such that you will preach it to others and elevate some job salary range or stock.
Either way, not a good look.
What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often comes out lacking if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, although that is not the case with these models.)
ARC has been challenging precisely because solving its problems often requires exactly that kind of abstraction, built on the core knowledge priors humans acquire early on [1].
Achieving human-level performance on the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and FrontierMath, suggests the model can potentially solve any problem at the human level if it possesses the essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve. It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.
[1] https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...
ADDED:
Thanks to the link to Chollet's posts by lswainemoore below. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still image data.
I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
I guess you can say the same thing for the Turing Test. Simple chat bots beat it ages ago in specific settings, but the bar is much higher now that the average person is familiar with their limitations.
If/once we have an AGI, it will probably take weeks to months to really convince ourselves that it is one.
Also, it depends a great deal on what we define as AGI and whether they need to be a strict superset of typical human intelligence. o3's intelligence is probably superhuman in some aspects but inferior in others. We can find many humans who exhibit such tendencies as well. We'd probably say they think differently but would still call them generally intelligent.
LLMs aren't really capable of "learning" anything outside their training data, which I feel is a very basic and fundamental human capability.
Every new request thread is a blank slate that uses whatever context you provide for the specific task, and after the thread is done (or the context limit runs out) it's like it never happened. Sure, you can use databases, do web queries, etc., but these are inflexible band-aid solutions, far from what's needed for AGI.
ChatGPT has had for some time the feature of storing memories about its conversations with users. And you can use function calling to make this more generic.
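A minimal, provider-agnostic sketch of what that generic memory-via-function-calling scaffolding could look like (the tool names and the local store here are my own invention; how the model actually emits tool calls differs per API):

    import json

    MEMORY: dict[str, str] = {}

    # Tool schemas offered to the model (shape follows the common JSON-schema
    # convention for function calling; names are invented for this sketch).
    TOOLS = [
        {"name": "save_memory",
         "description": "Persist a fact about the user across conversations.",
         "parameters": {"type": "object",
                        "properties": {"key": {"type": "string"},
                                       "value": {"type": "string"}},
                        "required": ["key", "value"]}},
        {"name": "recall_memory",
         "description": "Look up a previously saved fact.",
         "parameters": {"type": "object",
                        "properties": {"key": {"type": "string"}},
                        "required": ["key"]}},
    ]

    def execute_tool_call(name: str, arguments_json: str) -> str:
        """The scaffolding routes the model's tool calls here and feeds the
        return value back into the next turn's context."""
        args = json.loads(arguments_json)
        if name == "save_memory":
            MEMORY[args["key"]] = args["value"]
            return "saved"
        if name == "recall_memory":
            return MEMORY.get(args["key"], "no memory for that key")
        return f"unknown tool: {name}"

    print(execute_tool_call("save_memory", '{"key": "editor", "value": "vim"}'))
    print(execute_tool_call("recall_memory", '{"key": "editor"}'))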
I think drawing the boundary at “model + scaffolding” is more interesting.
I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems. And to boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even of the individual issues themselves (e.g. you can ask what GitHub issue #4145 in project foo was, and there's a decent chance it can tell you exactly what the issue was about!).
For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.
For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall in this category, but I suspect it's more than on a simpler benchmark like MATH.
We are nowhere close to what Sam Altman calls AGI and transformers are still limited to what uniform-TC0 can do.
As an example the Boolean Formula Value Problem is NC1-complete, thus beyond transformers but trivial to solve with a TM.
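To illustrate the "trivial with a TM" half of that claim, here's a toy linear-time recursive evaluator for fully parenthesized Boolean formulas (my own ad-hoc encoding); the NC1-completeness result is what the argument above leans on to place the same problem beyond uniform-TC0 models:

    def eval_formula(s: str, i: int = 0):
        """Evaluate a fully parenthesized Boolean formula like "((1&0)|(!0))".
        Returns (value, next_index). Linear time on a sequential machine."""
        if s[i] in "01":
            return s[i] == "1", i + 1
        assert s[i] == "("
        if s[i + 1] == "!":
            value, j = eval_formula(s, i + 2)
            return (not value), j + 1             # skip closing ')'
        left, j = eval_formula(s, i + 1)
        op = s[j]                                 # '&' or '|'
        right, k = eval_formula(s, j + 1)
        value = (left and right) if op == "&" else (left or right)
        return value, k + 1                       # skip closing ')'

    print(eval_formula("((1&0)|(!0))")[0])        # True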
As it is now proven that the frame problem is equivalent to the halting problem, even if we can move past uniform-TC0 limits, novelty is still a problem.
I think the advancements are truly extraordinary, but unless you set the bar very low, we aren't close to AGI.
Heck we aren't close to P with commercial models.
The article notes, "o3 still fails on some very easy tasks". What explains these failures if o3 can solve "any problem" at the human level? Do these failed cases require some essential knowledge that has eluded the massive OpenAI training set?
Regardless, the critical point is valid: AGI would be something like Cortana from Halo.
For example, if they produce millions of examples of the type of problems o3 still struggles on, it would probably do better at similar questions.
Perhaps the private data set is different enough that this isn't a problem, but the ideal situation would be unveiling a truly novel dataset, which it seems like ARC aims to do.
I think that's where most hardware startups will specialize in the coming decades: different industries with different needs.
https://arxiv.org/abs/2402.10013
Every human does this dozens, hundreds or thousands of times ... during childhood.
If you look at the ARC tasks failed by o3, they're really not well suited to humans. They lack the living context humans thrive on, and have relatively simple, analytical outcomes that are readily processed by simple structures. We're unlikely to see AI as "smart" until it can be asked to accomplish useful units of productive professional work at a "seasoned apprentice" level. Right now they're consuming ungodly amounts of power just to pass some irritating, sterile SAT questions. Train a human for a few hours a day over a couple weeks and they'll ace this no problem.
It works the same with humans. If they spend more time on the puzzle they are more likely to solve it.
While beyond current models, that would be the final test of AGI capability.
If you disagree with me, state why instead of opting to downvote me
I think this is a mistake.
Even if very high costs make o3 uneconomic for businesses, it could be an epoch defining development for nation states, assuming that it is true that o3 can reason like an averagely intelligent person.
Consider the following questions that a state actor might ask itself: What is the cost to raise and educate an average person? Correspondingly, what is the cost to build and run a datacenter with a nuclear power plant attached to it? And finally, how many person-equivalent AIs could be run in parallel per datacenter?
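Spelled out as arithmetic, with every number below a placeholder I invented purely to show the shape of the comparison:

    # Hypothetical, round-number inputs -- every figure here is an assumption.
    cost_to_raise_and_educate_person = 500_000          # USD over ~20 years (assumed)
    datacenter_plus_reactor_capex = 10_000_000_000      # USD (assumed)
    annual_operating_cost = 1_000_000_000               # USD/yr (assumed)
    cost_per_ai_worker_year = 100_000                   # USD/yr of o3-style inference (assumed)

    ai_workers_per_year = annual_operating_cost // cost_per_ai_worker_year
    people_raised_for_capex = datacenter_plus_reactor_capex // cost_to_raise_and_educate_person

    print(f"AI 'workers' runnable per year of opex: {ai_workers_per_year:,}")
    print(f"people you could raise for the capex:   {people_raised_for_capex:,}")

The point isn't the specific outputs, just that the comparison is a few divisions a planner can run under whatever cost assumptions they believe.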
There are many state actors, corporations, and even individual people who can afford to ask these questions. There are also many things that they'd like to do but can't because there just aren't enough people available to do them. o3 might change that despite its high cost.
So if it is true that we've now got something like human-equivalent intelligence on demand - and that's a really big if - then we may see its impacts much sooner than we would otherwise intuit, especially in areas where economics takes a back seat to other priorities like national security and state competitiveness.
[1] https://news.ycombinator.com/item?id=42473876
That said, this is all predicated on o3 or similar actually having achieved human level reasoning. That's yet to be fully proven. We'll see!
Your secrecy comment is really intriguing actually. And morbid lol.
“SO IS IT AGI?
ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.
Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”
The high compute variant sounds like it cost around *$350,000*, which is kinda wild. Lol, the blog post specifically mentions how OpenAI asked ARC-AGI not to disclose the exact cost of the high compute version.
Also, one odd thing I noticed is that the graph in their blog post shows the top two scores as "tuned" (this was not displayed in the live demo graph). This suggests that in those cases the model was trained to better handle these types of questions, so I do wonder about data / answer contamination there...
Something I missed until I scrolled back to the top and reread the page was this
> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set
So yeah, the results were specifically from a version of o3 trained on the public training set
Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any spoken rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.
On the other hand though, I don't think the o1 models or Claude were trained on the dataset, in which case it isn't a completely fair comparison. If I had to guess, you could probably get 60% out of o1 if you trained it on the public dataset as well.
Yeah, that makes this result a lot less impressive for me.
The css acid test? This can be gamed too.
> An acid test is a qualitative chemical or metallurgical assay utilizing acid. Historically, it often involved the use of a robust acid to distinguish gold from base metals. Figuratively, the term represents any definitive test for attributes, such as gauging a person's character or evaluating a product's performance.
Specifically here, they're using the figurative sense of "definitive test".