Readit News
aerhardt · a day ago
I feel that two things are true at the same time:

1) Something happened during 2025 that made the models (or crucially, the wrapping terminal-based apps like Claude Code or Codex) much better. I only type in the terminal anymore.

2) The quality of the code is still quite often terrible. Quadruple-nested control flow abounds. Software architecture, even in rather small scopes, is unsound. People say AI is “good at front end,” but I see the worst kind of atrocities there (a few days ago Codex 5.3 tried to inject a massive HTML element with a CSS ::before hack rather than properly refactoring the markup).

These two forces feel true simultaneously but in permanent tension. I still cannot make up my mind or see the synthesis in the dialectic: where this is truly going, whether we’re meaningfully moving forward or mostly moving in circles.

leoedin · 14 hours ago
This matches my experience too. The models write code that would never normally pass a review: mega functions, "copy and pasted" code with small changes, deeply nested conditionals and loops. All the stuff we've spent a lot of time trying to minimise!
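A hypothetical sketch of the pattern (names and logic invented for illustration; both versions behave identically):

```python
def discount_nested(user, cart):
    # Model-style output: quadruple-nested conditionals.
    if user is not None:
        if user.get("active"):
            if cart:
                if sum(cart) > 100:
                    return 0.1
    return 0.0

def discount_flat(user, cart):
    # Reviewer-preferred: guard clauses keep a single level of nesting.
    if user is None or not user.get("active"):
        return 0.0
    if not cart or sum(cart) <= 100:
        return 0.0
    return 0.1
```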

You could argue it's OK because a model can always fix it later. But the problem comes when there are subtle logic bugs and it's basically impossible to understand. Or fixing the bug in one place doesn't fix it in the 10 other places where almost the same code exists.

I strongly suspect that LLMs, like all technologies, are going to follow an S curve of capability. The question is where in that S curve we are right now.
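As an illustrative-only sketch of that S curve (the midpoint and steepness values here are made up, not a forecast):

```python
import math

# Toy logistic (S-curve) model of capability over time:
# slow start, rapid middle, plateau at the top.
def capability(t, midpoint=5.0, steepness=1.0):
    return 1.0 / (1.0 + math.exp(-steepness * (t - midpoint)))

for t in (0, 3, 5, 7, 10):
    print(f"t={t}: capability {capability(t):.2f}")
```

The hard part is that from inside the curve, the early ramp and the late plateau can look locally similar.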

zx8080 · 14 hours ago
> People say AI is “good at front end” but I see the worst kind of atrocities there

It's almost universal that people say "AI is great at X" when they are not professionals in X. That's because of how AI is designed: to output tokens according to stats, not logic, not semantics, not meaning: stats.

contextfree · 6 hours ago
Reading discussions online and comparing them to my own experience makes me feel crazy, because I've found today's LLMs and agents to be seemingly good at everything except writing code. That includes everything else in software engineering around code (debugging, reviewing, reading code, brainstorming architecture, etc.), as well as discussing various questions in the humanities and sciences where I'm a dilettante. But whenever I've asked them to generate any substantial amount of code, beyond a few lines to demonstrate usage of some API I'm unfamiliar with, the results have always been terrible: I end up either throwing it out or rewriting almost all of it, and spending more time than if I'd just written it myself from the start.

It's occurred to me that maybe this just shows that I'm better at writing code and/or worse at everything else than I'd realized.

jygg4 · a day ago
What I’ve observed is that the models lose the ability to handle subtle and nuanced things as they scale up.
orwin · a day ago
> People say AI is “good at front end”

I only say that because I'm a shit frontend dev. Honestly, I'm not that bad anymore, but I'm still shit, and the AI will probably generate better code than I will.

jygg4 · a day ago
As long as humans are needed to review code, it sounds like your role evolves toward prompting and reviewing.

Which is akin to driving a car - the motor vehicle itself doesn’t know where to go. It requires you to prompt it via steering and braking etc, and then to review what happens in response.

That’s not necessarily a bad thing - reviewing code ultimately matters most, as long as what is produced is more often than not correct and legible. Whether it is, though, is a different issue, one for which there isn’t a consensus among software engineers.

naruhodo · 20 hours ago
> 1) Something happened during 2025 that made the models (or crucially, the wrapping terminal-based apps like Claude Code or Codex) much better. I only type in the terminal anymore.

I have heard it said that the change was better context management and compression.

bbatha · 19 hours ago
A lot of enhancements came on the model side, which in many ways enabled context engineering.

200k and now 1M contexts. Better context management was enabled by improvements in structured outputs/tool calling at the model level. Reasoning models also really upped the game; “plan” mode wouldn’t work well without them.

wongarsu · a day ago
I don't find this very compelling. If you look at the actual graph they are referencing but never showing [1], there is a clear improvement from Sonnet 3.7 -> Opus 4.0 -> Sonnet 4.5. This is just hidden in their graph because they are only looking at the number of PRs that are mergeable with no human feedback whatsoever (a high standard even for humans).

And even if we were to agree that that's a reasonable standard, GPT 5 shouldn't be included. There is only one data point for all OpenAI models, and it is more indicative of the performance of OpenAI models (and the harness used) than of any progression. Once you exclude it, the data matches what you would expect from a logistic model: improvements have slowed down, but not stopped.

1: https://metr.org/assets/images/many-swe-bench-passing-prs-wo...

yorwba · a day ago
Yes, I think this is basically an instance of the "emergent abilities mirage." https://arxiv.org/abs/2304.15004

If you measure completion rate on a task where a single mistake can cause a failure, you won't see noticeable improvements on that metric until all potential sources of error are close to being eliminated, and then if they do get eliminated it causes a sudden large jump in performance.
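A toy sketch of that effect, assuming a task made of n independent steps that each succeed with probability p:

```python
# End-to-end pass rate for an n-step task where any single mistake
# causes failure: p**n stays near zero until per-step accuracy p is
# close to 1, then jumps sharply. n=50 is an arbitrary illustration.
def task_pass_rate(p: float, n_steps: int = 50) -> float:
    return p ** n_steps

for p in (0.90, 0.95, 0.99, 0.999):
    print(f"per-step accuracy {p:.3f} -> task pass rate {task_pass_rate(p):.3f}")
```

Steady per-step gains look like a flat line on the task metric until the very end, which is exactly the "mirage" pattern.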

That's fine if you just want to know whether the current state is good enough on your task of choice, but if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.

thesz · a day ago

  > until all potential sources of error are close to being eliminated
This is what PSP/TSP did - one has to (continually) review one's own work to identify the most frequent sources of (user-facing) defects.

  >  if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
This is also one of the tenets of PSP/TSP. If you have a task with an estimate longer than a day (8 hours), break it down.

This is fascinating. The LLM community is rediscovering PSP/TSP rules that were laid down more than twenty years ago.

What the LLM community misses is that in PSP/TSP it is the individual software developer who is responsible for figuring out what they need to look after.

What I see is LLM users trying to harness LLMs around what they perceive as errors. It's not that the LLMs are learning; it's that their users are trying to corral them with prompts.

Bombthecat · a day ago
That's how the public perceives it, though.

It's useless and never gets better, until suddenly, unexpectedly, it's good enough.

roxolotl · a day ago
I don't know; that graph, to me, shows Sonnet 4.5 as worse than 3.7. Maybe the automated grader is finding code breakages in 3.7 and not breaking that out? I'd much prefer to add code in a different style to my codebase than code that breaks other code. But even ignoring that, the pass rate is almost identical between the two models.
curiouscube · a day ago
There is a decent case for this thesis, especially if we look at the shift in training regimes and benchmarking over the last 1-2 years. Frontier labs don't seem to really push pure size/capability anymore; it's an all-in focus on agentic AI, which is mainly complex post-training regimes.

There are good reasons why they don't or can't do simple param upscaling anymore, but still, it makes me bearish on AGI, since it's a slow but massive shift in goal setting.

In practice this still doesn't mean 50 % of white collar can't be automated though.

lich_king · a day ago
> In practice this still doesn't mean 50 % of white collar can't be automated though.

Let me ask you this, though: if we wanted to, what percentage of white collar jobs could have been automated or eliminated prior to LLMs?

Meta has nearly 80k employees to basically run two websites and three mobile apps. There were 18k people working at LinkedIn! Many big tech companies are massive jobs programs with some product on the side. Administrative business partners, program managers, tech writers, "stewards", "champions", "advocates", 10-layer-deep reporting chains... engineers writing cafe menu apps and pet programming languages... a team working on in-house typefaces... the list goes on.

I can see AI producing shifts in the industry by reducing demand for meaningful work, but I doubt the outcome here is mass unemployment. There's an endless supply of bs jobs as long as the money is flowing.

jmalicki · a day ago
Meta has 80k employees to run the world's most massive engine of commerce through advertising and matching consumers to products.

They build generative AI tools so people can make ads more easily.

They have some of the most sophisticated tracking out there. They have shadow profiles on nearly everyone. Have you visited a website? You have a shadow profile even if you don't have a Facebook account. They know who your friends are based on who you are near. They know what stores you visit when.

Large fractions of their staff are making imperceptible changes to ads tracking and feed ranking that are making billions of dollars of marginal revenue.

What draws you in as a consumer is a tiny tip of the iceberg of what they actually do.

ehnto · a day ago
There are many reasons why we are seeing cuts economically, but the fact that it is possible to make such large cuts is because there were way too many people working at these companies. They had so much cheap money that they over-hired, now money isn't so cheap and they need to reduce headcount. AI need not enter the conversation to get to that point.
suttontom · a day ago
This is unfair and dismissive of many roles. Coordination in a massive, technically complex company that has to adhere to laws and regulations is a critical role. I don't get why people shit on certain roles (I'm a SWE). Our PgMs reduce friction and help us be more productive and focused. Technical writers produce customer-facing content and code, and have nothing to do with supporting internal bureaucracy. There are arguments against this in Bullshit Jobs, but do you think companies pay PgMs or HR employees hundreds of thousands of dollars a year out of the goodness of their hearts? Or maybe they actually help the business?
sunaurus · a day ago
I am pretty convinced that for most types of day to day work, any perceived improvements from the latest Claude models for example were total placebo. In blind tests and with normal tasks, people would probably have no idea if they're using Opus 4.5 or 4.6.
sumeno · a day ago
This has basically been my experience since Sonnet 3.5. I've been working on a personal project on and off with various models since then, and the biggest difference between then and now is that it will do larger chunks of work than it did before. But the quality of the code is not particularly better: I still have to do a lot of cleanup, and it still goes off the rails pretty frequently. I have to do fewer individual prompts, but reviewing the code takes longer, because I also have to mentally process and fix larger chunks of code.

Is it a better user experience now? Yes. Has it boosted my productivity on this project? Absolutely.

But it still needs a ton of hand holding for anything complicated and I still deal with tons of "OK, this bug is fixed now!" followed by manually confirming a bug still exists.

BoumTAC · a day ago
It's because they are getting so good it's impossible to tell them apart.

Haiku 4.5 is already so good it's ok for 80% (95%?) of dev tasks.

FuckButtons · a day ago
I must be writing very different software than you, I keep opus on a tight leash and it still comes to the strangest conclusions.
Bolwin · a day ago
I've found Haiku to be truly mediocre to work with. If you want a cheap model, the open source ones are much better
AussieWog93 · a day ago
I'd agree with you on 4.5 to 4.6, but going from gpt-5 or 4.0 to 4.5 was night and day.
NewLogic · a day ago
Because post 4.0 dropped the sycophancy?
butILoveLife · a day ago
GPT5 added the router, which was def a downgrade. 4.5 was probably the best non-CoT model humanity has made. But too expensive to run.
Incipient · a day ago
I feel that even if the models are stagnating, the tooling around them, and the integrations and harnesses they have, are getting significantly more capable (if not always 'better' - the recent vscode update really handicapped them for some reason). Things like the new agent from booking.com or whatever: if it could integrate with all hotels, activities, mapping tools, flight systems, etc., it could be hugely powerful.

Assuming we get no better than opus 4.6, they're very capable. Even if they make up nonsense 5% of the time!

boonzeet · a day ago
Interesting article, although with so few data points and such a specific time slice it is difficult to draw serious conclusions about the "improvement" of LLMs.

It's notably lacking newer models (4.5 Opus, 4.6 Sonnet) and models from Gemini.

LLMs appear to naturally progress in short leaps followed by longer plateaus, as breakthroughs are developed such as chain-of-thought, mixture-of-experts, sub-agents, etc.

hrmtst93837 · a day ago
Focusing on flashy breakthroughs hides the issue that bigger models and merge benchmarks rarely translate to reliability in real codebases. For routine merges, subtle regressions and context quirks matter more than headline progress. Unless evals stress nasty scenarios like multi-file renames with tricky conflicts, the numbers are mostly for show. Progress will plateau until someone tunes for the boring, messy cases that waste dev time.
BoppreH · a day ago
Controversial opinion from a casual user, but state-of-the-art LLMs now feel to me more intelligent than the average person on the steet. This also explains why training on more average-quality data (if there's any left) is not producing improvements.

But LLMs are hamstrung by their harnesses. They are doing the equivalent of providing technical support via phone call: little to no context, and limited to a bidirectional stream of words (tokens). The best agent harnesses have the equivalent of vision-impairment accessibility interfaces, and even those are still subpar.

Heck, giving LLMs time to think was once a groundbreaking idea. Yesterday I saw Claude Code editing a file using shell redirects! It's barbaric.

I expect future improvements to come from harness improvements, especially around sub agents/context rollbacks (to work around the non-linear cost of context) and LLM-aligned "accessibility tools". That, or more synthetic training data.

8note · a day ago
> But LLMs are hamstrung by their harnesses

entirely so. i think anthropic updated something about the compact algorithm recently, and it's gone from working well over long sessions to basically garbage whenever a compact happens

xyzsparetimexyz · a day ago
Steet? Do you mean street? They're smarter in the same way a search engine is smarter.
BoppreH · a day ago
Yes, "street". Typing from my phone, sorry.

And search engines are narrow tools that can only output copies of their dataset. An LLM is capable of surprisingly novel output, even if its exact level of creativity is heavily debated.

globular-toast · 17 hours ago
It's so disrespectful to say an LLM is more intelligent than a person on the street. The LLM has nothing at stake, cares not a sausage about the consequences of what it spits out. People have all kinds of pressures, dependants, and personal issues like health. Our thoughts and actions have real consequences. It's so easy to be intelligent when you're the pretend human that gets switched on for five minutes then switched off again.
BoppreH · 15 hours ago
It's not a value judgement; I'm no misanthrope. But it's a fact of life that we humans must specialize, while LLMs can afford to have "studied" a staggering variety of topics. It's no different from being slower than a car, or weaker than a hydraulic press.

On a different note, LLMs are still not very wise, as displayed by all the prompt attacks and occasional inane responses like walking to the car wash.

idorozin · a day ago
My experience has been that raw “one-shot intelligence” hasn’t improved as dramatically in the last year, but the workflow around the models has improved massively.

When you combine models with:

- tool use

- planning loops

- agents that break tasks into smaller pieces

- persistent context / repos

the practical capability jump is huge.
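A minimal sketch of how those layers fit together (all names and tools here are invented for illustration, not any real harness's API):

```python
# Toy agent loop: a planner breaks the task into steps, each step
# calls a tool, and results accumulate in a persistent context that
# later steps can read.
def run_agent(task, plan, tools, context=None):
    context = dict(context or {})
    for tool_name, arg in plan(task):        # planning loop -> smaller pieces
        result = tools[tool_name](context, arg)  # tool use
        context[tool_name] = result              # persistent context
    return context

# Tiny demo: a two-step "plan" wired to fake tools.
tools = {
    "read": lambda ctx, arg: f"contents of {arg}",
    "summarize": lambda ctx, arg: ctx["read"].upper(),
}
result = run_agent("demo", lambda t: [("read", "notes.txt"), ("summarize", None)], tools)
```

Even with a fixed model, improving the planner, the tools, and the context hand-off improves end-to-end results, which is the point being made above.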