Because I believe the Pareto Principle (https://en.wikipedia.org/wiki/Pareto_principle) applies to most aspects of computing, I believe it applies to this case too, and I find that it tracks with the progress of LLMs/AIs.
Breaking past 80% accuracy and solving the remaining 20% of problems will be the main challenge for the next generation (or two) of LLMs, not to mention they still need to bring computing costs down.
EDIT: that said, solving 80% of problems at 80% accuracy with significant time savings is a solution worth considering, though we should stay skeptical: the remaining 20% may get much worse if the 80% that was solved is of bad quality.
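A quick back-of-the-envelope sketch of that 80%-at-80% arithmetic (the rates and the "wrong but plausible" framing are illustrative assumptions, not figures from the thread):

```python
# Rough arithmetic behind "80% of problems at 80% accuracy": how the work
# splits once you separate correct output from bad output that still needs review.
tasks = 100
attempt_rate = 0.80      # assumed: fraction of problems the LLM attempts
accuracy = 0.80          # assumed: fraction of attempts that are actually usable

solved = tasks * attempt_rate * accuracy            # done correctly
bad_output = tasks * attempt_rate * (1 - accuracy)  # wrong but plausible, must be caught
untouched = tasks * (1 - attempt_rate)              # never attempted

print(f"correct: {solved:.0f}, wrong-but-plausible: {bad_output:.0f}, untouched: {untouched:.0f}")
```

Under these assumptions only 64 of 100 tasks are genuinely finished, and the 16 wrong-but-plausible ones are exactly the "bad quality" residue the comment worries about.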
There is a big difference between LLMs and most other tech improvements, though: with most technologies I can think of that solve 80% of the problem, it's easy to find out whether the technology works. When you're working with an LLM, it's really hard to know whether the answer is correct/usable or not.
For those people who won’t read anything more than the headline, this is a silly paper based on a metric that considers only “task completion time” at “a specified degree of reliability, such as 50 percent” for “human programmers”.
Then, in a truly genius stroke of AI science, the current article extrapolates this to infinity and beyond, while hand-waving away the problem of “messiness”, which clearly calls the extrapolation into question:
> At the heart of the METR work is a metric the researchers devised called “task-completion time horizon.” It’s the amount of time human programmers would take, on average, to do a task that an LLM can complete with some specified degree of reliability, such as 50 percent. A plot of this metric for some general-purpose LLMs going back several years [main illustration at top] shows clear exponential growth, with a doubling period of about seven months. The researchers also considered the “messiness” factor of the tasks, with “messy” tasks being those that more resembled ones in the “real world,” according to METR researcher Megan Kinniment. Messier tasks were more challenging for LLMs [smaller chart, above]
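For a sense of what that doubling period implies, here is a rough extrapolation sketch. The one-hour starting horizon is an assumption for illustration; the quoted passage only gives the ~7-month doubling period:

```python
# Back-of-the-envelope extrapolation of METR's "task-completion time horizon".
# Assumptions (not from the article): a current horizon of roughly 1 hour,
# and that the ~7-month doubling period holds indefinitely.
import math

doubling_months = 7
current_horizon_hours = 1.0     # assumed starting point
target_hours = 4 * 40           # one month of 40-hour workweeks = 160 h

doublings_needed = math.log2(target_hours / current_horizon_hours)
months_needed = doublings_needed * doubling_months
print(f"{doublings_needed:.1f} doublings ≈ {months_needed:.0f} months (~{months_needed/12:.1f} years)")
```

About 7.3 doublings, or roughly 4.3 years, which is how one arrives at a "month-long tasks by 2030" headline if, and only if, the trend holds.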
The classic mistake is assuming that if one worker produces 10 products a day, ten workers will produce 100. The fact is, what one software developer will do in a week, ten will do in a year. Copypasta can be fast and very inaccurate today -- it will be faster and much more inaccurate later.
The Skynet Funding Bill is passed. The system goes on-line August 4th, 1997. Human decisions are removed from strategic defense. Skynet begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th
I feel like it takes a human a month to write a novel or start up a company only if you're talking about a very constrained version of the task. People do write novels in a month -- that's the whole premise of National Novel Writing Month, or NaNoWriMo -- but they aren't finished products, they're first drafts.
Similarly, while I’m sure you could make good progress on starting a business in a month, it seems like that would take longer to genuinely complete from start to finish. Also, it seems like it’s necessarily a task that relies on external factors: waiting for approval to come from various agencies, hiring employees, waiting for other parties to sign contracts, etc.
"By 2030, the most advanced LLMs should be able to complete, with 50 percent reliability, a software-based task that takes humans a full month of 40-hour workweeks."
That is nothing. "git clone" can, with 100% reliability, "complete" tasks in a minute that take over 1,000,000 man hours. It even keeps the license.
I’m sure someone more knowledgeable and well-spoken than I will provide a more scathing takedown of this article soon, but even I can laugh at its breathless endorsement of some very dubious claims with no supporting evidence.
“AI might write a decent novel by 2030”? Have you read the absolute dreck they produce today? An LLM will NEVER produce a decent novel, for the same reason it will never independently create a decent game or movie: It can’t read the novel, play the game, or watch the movie, and have an emotional response to it or gauge its entertainment value. It has no way to judge if a work of art will have an emotional impact on its audience or dial in the art to enhance that impact or make a statement that resonates with people. Only people can do that.
All in all, this article is unscientific, filled with hand-waving “and then a miracle occurs”, and meaningless graphs that in no way indicate that LLMs will undergo the kind of step change transformation needed to reliably and independently accomplish complex tasks this decade. The study authors themselves give the game away when they use “50% success rate” as the yardstick for an LLM. You know what we call a human with a 50% success rate in the professional world? Fired.
I don’t think it was responsible of IEEE to publish this article and I expect better from the organization.
Likely due to my nearly 40 years of experience in the tech industry, and knowing where we were then compared to where we are now, I am floored by what LLMs are doing and how much better they have gotten in just the last 2 years I have been tracking them.
That said, I will make no definitive statements like “never” and “can’t” as it relates to AI in the next 5 years because it is already doing things that I would have thought unlikely just 5 years ago…and frankly would have thought functionally impossible back 40 years ago.
LLMs are cool and they're amazing for what they are, but the hype is just ridiculous right now, and the extrapolation fallacy is still a fallacy. Without a good structural reason to assume exponential growth (e.g. organism reproduction, which is itself not actually exponential), it's kind of the Godwin's Law of AI debate: the first person to say "if we only project forward X years..." terminates the conversation.
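To illustrate the extrapolation hazard numerically: a logistic curve with a ceiling tracks an exponential closely in its early phase, then flattens, so early data points cannot distinguish the two. A minimal sketch (the 7-month doubling and the ceiling of 100 are arbitrary illustration values, not claims about LLMs):

```python
# Why early exponential-looking growth can't be naively extrapolated:
# a logistic (S-shaped) curve is nearly indistinguishable from an
# exponential while far below its ceiling, then it flattens out.
import math

def exponential(t, doubling=7):
    # pure exponential, doubling every `doubling` months
    return 2 ** (t / doubling)

def logistic(t, doubling=7, ceiling=100):
    # logistic curve with matching early growth rate and a hard ceiling
    k = math.log(2) / doubling
    return ceiling / (1 + (ceiling - 1) * math.exp(-k * t))

for t in (0, 14, 28, 56, 112):  # months
    print(f"t={t:3d}  exp={exponential(t):8.1f}  logistic={logistic(t):6.1f}")
```

At t=14 the two curves are nearly identical (4.0 vs ~3.9); at t=112 the exponential has hit 65536 while the logistic has saturated just below its ceiling of 100.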
I appreciate your unwillingness to say "never" here, but I think the parent comment deserves credit for calling out something important that rarely gets discussed: the importance of emotion for producing great art. This is one of the classic themes of Asimov's entire Robot oeuvre, which spends many books digging into the differences between (far more advanced) AI and actual human intelligence.
There are fundamental, definable, structural deficiencies that separate LLMs from human thought; it's plainly incorrect to pretend otherwise, and the...extrapolationists...are neglecting that we have no idea how to solve these problems.
I think it'll be possible to publish a "book" as a series of prompts, which LLMs can expand out into the narrative story.
It's a novel you can chat with. The new novel for the post-LLM era is more like publishing the whole author, whom you can then "interview" as an LLM (reminiscent of Harry Potter, when Ron's sister finds the evil journal and basically "chats" with the notebook).
No idea why you are getting downvoted for this, it seems to me this would be exactly the kind of thing you could do…even hallucinations would contribute in a meaningful way.
Predictions from the METR AI scaling graph are based on a flawed premise - https://news.ycombinator.com/item?id=43885051 - May 2025 (25 comments)
AI's Version of Moore's Law - https://news.ycombinator.com/item?id=43835146 - April 2025 (1 comment)
Forecaster reacts: METR's bombshell paper about AI acceleration - https://news.ycombinator.com/item?id=43758936 - April 2025 (74 comments)
Measuring AI Ability to Complete Long Tasks – METR - https://news.ycombinator.com/item?id=43423691 - March 2025 (1 comment)
It is a shame the IEEE now promotes this theft.