> The Claude C Compiler illustrates the other side: it optimizes for
> passing tests, not for correctness. It hard-codes values to satisfy
> the test suite. It will not generalize.
This is one of the pain points I am suffering at work: workers ask coding agents to generate some code, and then to generate test coverage for the code. The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.
I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.
This is why you write the tests first and then the code. Especially when fixing bugs, since you can be sure that the test properly fails when the bug is present.
When fixing bugs, yes. When designing an app, not so much, because you realize many unexpected things while writing the code and seeing how it behaves. Often the original test code would test something that is never built. It's obvious for integration tests, but it happens for tests of API calls and even for unit tests. One could start writing unit tests for a module or class and eventually realize that it must be implemented in a totally different way. I prefer experimenting with the implementation and writing tests only once it settles down into something I'm confident will go to production.
I'd argue the AI writing the tests shouldn't even know about the implementation at all. You only want to pass it the interface (or function signatures) together with javadocs/docstrings/equivalent.
Writing the tests first and then writing code to pass the tests is no better than writing the code first and then writing tests that pass. What matters is that both the code and the tests are written independently, from specs, not from one another.
I think it is better not to have access to the tests when first writing code, so as to make sure you code to the specs and not to the tests that test the specs, as something may be lost in translation. It means I have a preference for code first, but the ideal case would be for different people to do both in parallel.
Anyway, about AI: if an AI writes both the tests and the code, it will make sure they match no matter which comes first. It may even go back and forth between the tests and the code, but that doesn't mean it is correct.
Agreed 1000%. But that can be a lot of work; creating a good set of tests is nearly as much or often even more effort than implementing the thing being tested.
When LLMs can assist with writing useful tests before having seen any implementation, then I’ll be properly impressed.
At my job we have a requirement for 100% test coverage. So everyone just uses AI to generate 10,000 line files of unit tests and nobody can verify anything.
Yeah this is the exact kind of ridiculousness I've noticed as well - everything that comes out of an LLM is optimized to give you what you want to hear, not what's correct.
A long time ago in France, the mainstream view among computer people was that code or compute weren't what's important when dealing with computers; it is information that matters and how you process it in a sensible way (hence the name of computer science in French: informatique. And also the name for computer: “ordinateur”, literally: that which sets things into order).
As a result, computer science students were taught a lot (too much for most people's taste, it seems) about data modeling and not much about code itself, which was viewed as mundane and uninteresting, until the US hacker culture finally took over in the late 2000s.
Turns out that the French were just right too early, like with the Minitel.
> The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code.
I always felt like that's the main issue with unit testing. That's why I used it very rarely.
Maybe keeping the tests in a separate module, not letting the agent see the source while writing tests, and not letting the agent see the tests while writing the implementation would help? They could just share the API and the spec.
And when tests fail, another agent with full context could decide whether the fix should be delegated to the coding agent or to the testing agent.
> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
Obvious question: why not? Let’s say you have competent devs, fair assumption. Maybe it’s because they don’t have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.
It’s because people will do what they’re incentivized to do. And if no one cares about anything but whether the next feature goes out the door, that’s what programmers will focus on.
Honestly I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.
We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.
Developers aren't given time to test and aren't rewarded if they do, but management will rain down hellfire upon their heads if they don't churn out code quickly enough.
How about a subsequent review where a separate agent analyzes the original issue and resultant code and approves it if the code meets the intent of the issue. The principle being to keep an eye out for manual work that you can describe well enough to offload.
Depending on your success rate with agents, you can have one that validates multiple criteria or separate agents for different review criteria.
You are fighting nondeterministic behavior with more nondeterministic behavior, or in other words, fighting probability with probability. That doesn't necessarily make things any better.
My only hope is that all of this push leads in the end to the adoption of more formal verification languages and tools.
If people are having to specify things in TLA+ etc -- even with the help of an LLM to write that spec -- they will then have something they can point the LLM at in order for it to verify its output and assumptions.
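For a taste of what "something the LLM can verify its output against" looks like, here is a tiny Lean example (a toy property, not from the article): the theorem statement is the spec, and the checker either accepts the proof or rejects it, with no room for a model to talk its way past it.

```lean
-- The spec: reversing a list twice gives back the original list.
-- If the proof below were wrong, Lean would reject it -- there is
-- no "looks plausible" middle ground for an LLM to exploit.
theorem reverse_twice (l : List Nat) : l.reverse.reverse = l := by
  simp  -- closed by the standard-library lemma List.reverse_reverse
```

The same dynamic applies to a TLA+ spec checked by TLC: the tool's verdict, not the model's confidence, is what counts.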
> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
It's fun having LLMs, because it makes it quite clear that a lot of testing has been cargo-culting. Did people ever actually check that the tests check for anything meaningful?
15 years ago, I had testers writing "UI tests" / "user tests" that matched what the software was cranking out. At that time I had just joined to continue on the client side, so I hadn't really worked on anything yet.
I had a fun discussion when the client tried to change values... Why is it still 0? Didn't you test?
And that was at that time I had to dive into the code base and cry.
I think it boils down to how companies view LLMs and their engineers.
Some companies will do as you say - have (mostly clueless) engineers feed high level "wishes" to (entirely clueless) LLMs, and hope that everyone kind of gets it. And everyone will kind of get it. And everyone will kind of get it wrong.
Other companies will have their engineers explicitly treat the LLMs as collaborators / pair programmers, not independent developers. As an engineer in such a company, YOU are still the author of the code even if you "prompted" it instead of typing it. You can't just "fix this high level thing for me brah" and get away with it, but instead need to continuously interact with the LLM as you define and it implements the detailed wanted behaviors. That forces you to know _exactly_ what you want and ask for _exactly_ what you want without ambiguity, like in any other kind of programming. The difference is that the LLM is a heck of a lot quicker at typing code than you are.
Building a C compiler should not have this problem. There are probably a million test suites coming from outside the LLM that it can use to verify correctness.
Honestly, unit tests (at least on the front-end) are largely wasted time in the current state of software development. Taking the time that would have been spent on writing unit tests and instead using it to write functionally pure, immutable code would do much more to prevent bugs.
There's also the problem that when stack rank time comes around each year no one cares about your unit tests. So using AI to write unit tests gives me time to work on things that will actually help me avoid getting arbitrarily fired.
I wish that software engineers were given the time to write both clean code and unit tests, and I wish software engineers weren't arbitrarily judged by out of touch leadership. However, that's not the world we live in so I let AI write my unit tests in order to survive.
You are overvaluing “clean code.” Code is code: it either works within spec or it doesn't; or it does, but there are errors, more or less catastrophic, waiting to show themselves at any moment. But even in that latter case, no single individual can know for certain, no matter how much work they put in, that their code is perfect. But they can know it's usable, and someone else can check to make sure it doesn't blow something else up, and that is the most important thing.
>LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
You can use spec driven development and TDD. Write the tests first. Write failing code. Modify the code to pass the tests.
Mwahahahahaha! Suffer, devs, SUFFER! KNOW MY PAIN!
Ah hem... Welcome to the wonderful world of Quality Assurance, software developing audience. That part of the job, after you yeet your code over the fence, where the job is to bridge the gap between your madness, and the madness of the rest of the business. Here you will find: frustration, an ever present sense the rest of the world is just out to make your life more difficult, a creeping sense of despair, a hot ice pick in the back of your mind every time the language model does something syntactically valid, but completely nonsensical in the real world, the development of an ever increasing time horizon over which you can accurately predict the future, but no one will believe you anyway, a smoldering hatred of the overly confident executive with an over developed capacity for risk tolerance; a desire to run away and start a farm, and finally, a fundamental distrust of everything software, and all the people who write it.
Don't forget your complimentary test framework and swag bag on your way out, and remember, you're here forever. You can try to check out, but you can never leave.
Actually, they're extremely bad at that. All training data contains code + tests, even if the tests were created first. So far, every model I've tried has failed to implement tests for interfaces without access to the actual code.
They can, but they need to be explicitly told to do that. Otherwise they just do everything in batches. Anyway, pure TDD or not, tests catch only what you tell the AI to write. The AI does not know what is right; it does what you told it to do. The above problem wouldn't be solved by pure TDD.
I encourage everyone to RTFA and not just respond to the headline. This really is a glimpse into where the future is going.
I've been saying "the last job to be automated will be QA" and it feels more true every day. It's one thing to be a product engineer in this era. It's another to be working at the level the author is, where code needs to be verifiable. However, once people stop vibing apps and start vibing kernels, it really does fundamentally change the game.
I also have another saying: "any sufficiently advanced agent is indistinguishable from a DSL." I hadn't considered Lean in this equation, but I put these two ideas together and I feel like we're approaching some world where Lean eats the entire agentic framework stack and the entire operating system disappears.
If you're thinking about building something today that will still be relevant in 10 years, this is insightful.
This is a very strange statement. People don't always announce when they use AI for writing their software since it's a controversial topic. And it's a sliding scale. I'm pretty sure a large fraction of new software has some AI involved in its development.
Apps are a strange measure because there aren't really any new, groundbreaking ones. PCs and smartphones have mostly done what people have wanted them to do for a while.
I think this might miss the point. We put off upgrading to an new RMM at work because I was able to hack together some dashboards in a couple days. It's not novel and does exactly what we need it to do, no more. We don't need to pay 1000's of dollars a month for the bloated Solarwinds stack. We aren't saving lives, we're saving PDFs so any arguments about 5 9s and maintainability are irrelevant. LLMs are going to give us on demand, one off software. I think the SaaS market is terrified right now because for decades they've gouged customers for continual bloat and lock in that now we can escape from. In a single day I was able to build an RMM that fits our needs exactly. We don't need to hire anyone to maintain it because it's simple, like most business applications should be, but SV needs to keep complicating their offerings with bloat to justify crazy monthly costs that should have been a one time purchase from the start. SV shot itself in the face with AI.
To be fair, Claude Code is vibe-coded. It's a terrible piece of software from an engineering (and often usability) standpoint, and the problems run deeper than just the choice of JavaScript. But it is good enough for people to get what they want out of it.
Then it should have talked about the rest, instead of starting with rather graceless and ugly LLM-written generic prose about AI topics that many readers already find tiresomely familiar, and which was doubtless tl;dr even for the readers who aren't automatically repelled by it.
I am as enthusiastic about formal methods as the next guy, but I very much doubt any LLM-based technique will make it economical to write a substantial fraction of application software in Lean. The LLM can play a powerful heuristic role in searching for proof-bearing code in areas where there is good training data. Unfortunately those areas are few and far between.
Moreover, humans will still need to read even rigorously proved code if only to suss out performance issues. And training people to read Lean will continue to be costly.
Though, as the OP says, this is a very exciting time for developing provably correct systems programming.
LLMs are writing non-trivial math proofs in Lean, and software proofs tend to be individually easier than proofs in math, just more tedious because there's so much more of them in any non-trivial development.
Some performance issues (asymptotics) can be addressed via proof, others are routinely verified by benchmarking.
This assumes everything about current capabilities stay static, and it wasn't long ago before LLMs couldn't do math. Many were predicting the genAI hype had peaked this time last year.
If you want it to be a question of economics, I think the answer is in whether this approach is more economical than the alternative, which is having people run this substrate. There's a lot of enthusiasm here and you can't deny there has been progress.
I wouldn't be so quick to doubt. It costs nothing to be optimistic.
If you give an agent a task, the typical agentic pattern is that it calls tools in some non-deterministic loop, feeding the tool output back into the LLM, until it deems the task complete. The LLM internalizes an algorithm.
Another way of doing it is the agent just writes an algorithm to perform the task and runs it. In this world, tools are just APIs and the agent has to think through its entire process end to end before it even begins and account for all cases.
Only the latter is Turing complete, but the former approaches the latter as it improves.
My read was roughly that agents require constraining scaffolding (CLAUDE.md) and careful phrasing (prompt engineering) which together is vaguely like working in a DSL?
Sigh. Is there any LLM solution for HN readers to filter out all the top-level commenters who haven't RTFA? I don't need the (micro-)shitstorms these people spawn, even if the general HN algo scores them as "interesting".
I believe there is a Verification Complexity Barrier
As you add components to a system, the time it takes to verify that the components work together increases superlinearly.
At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
AI coding agents hit this barrier faster than ever, because of how quickly they can generate components (and how poorly they manage complexity).
I think verification is now the problem of agentic software engineering. I think formal methods will help, but I don't see how they will apply to messy situations like end-to-end UI testing or interactions between the system and the real world.
> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
Could you elaborate on this? Your post makes it sound as if the verification complexity diverges as the number of components n approaches a certain finite value n_0, but that seems unlikely to me. If, in contrast, the verification complexity remains finite at n_0, then verification should still be possible in finite time, shouldn't it? Yes, it might be a considerable amount of time, but I assume your theorem doesn't predict lower bounds for the involved constants?
Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though? AI might also lead us to removing unnecessarily complex dependencies from our software supply chain or stripping them down to the few features we need.
Thank you for reading and the very thoughtful observations.
>> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
> Could you elaborate on this?
I plan to publish a thorough post with an interactive model. Whether human or AI, you are capacity constrained, and I glossed over `C` (capacity within a given timeframe) in the X post.
You are correct that verification complexity remains finite at n_0. The barrier is practical: n_0 is where V(n) exceeds your available capacity C. If V(n) = n^(1+k), then n_0 = C^(1/(1+k)). Doubling your capacity doesn't double n_0. It increases by a factor of 2^(1/(1+k)), which is always less than 2.
So the barrier always exists for, say, a given "dev year" or "token budget," and the cost to push it further grows superlinearly. It's not absolutely immovable, but moving it gets progressively harder. That's what I mean by "literally run out of time." At any given capacity, there is a finite n beyond which complete verification is not possible. Expanding capacity buys diminishing returns.
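The diminishing-returns claim above is easy to check numerically. Assuming the model as stated, V(n) = n^(1+k) and capacity C, the largest fully verifiable system is n_0 = C^(1/(1+k)):

```python
# Verification cost model from the comment above:
#   V(n) = n**(1 + k), and n_0 is where V(n_0) = C,
#   so n_0 = C**(1 / (1 + k)).
def max_verifiable_components(capacity: float, k: float) -> float:
    return capacity ** (1.0 / (1.0 + k))

k = 0.5  # superlinearity exponent (illustrative value)
n0 = max_verifiable_components(1000, k)          # 1000**(2/3) = 100
n0_doubled = max_verifiable_components(2000, k)

# Doubling capacity multiplies n_0 by 2**(1/(1+k)), which is < 2:
growth = n0_doubled / n0
print(round(growth, 3))  # 2**(2/3) ≈ 1.587, well short of 2
```

So under this model, each doubling of capacity buys you less than a doubling of verifiable system size, which is the "barrier" in practical terms.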
> Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though?
You are getting at my core motivation for exploring this question.
Verification requires a definition of "done" and I wonder if it will ever be possible (or desirable) for AI to define done on its own, let alone verify it and simplify software based on its understanding of our needs.
You make a great point that we are not required to add more components and "go right" along the curve. We can choose to simplify, and that is absolutely the right takeaway. AI has made many people believe that by generating more code at a faster pace they are accomplishing more. But that's not how software productivity should be judged.
To answer your question about assumptions, while AI can certainly be prompted to help reduce n or k in isolated cases where "done" is very clear, I don't think it's realistic to expect this in aggregate for complex systems where "done" is subjective and dynamic.
I'm speaking mainly in the context of commercial software dev here, informed by my lived experience building hundreds of apps. I often say software projects have a fractal complexity. We're constantly identifying new needs and broader scope the deeper we go, not to mention pivots and specific customer asks. You rarely get to stand still.
I don't mean to be pessimistic, but my hunch is that complexity growth outpaces the rate of simplification in almost every software project. This model attempts to explain why that is so. And notably, simplification itself requires verification and so it is in a sense part of the verification cost, too.
Thank you for the post. It's a good read. I'm working on governance/validation layers for n-LLMs and making them observable so your comments on runaway AIs resonated with me. My research is pointing me to reputation and stake consensus mechanisms being the validation layer either pre inference or pre-execution, and the time to verify decisions can be skipped with enough "decision liquidity" via reputation alone aka decision precedence.
The fundamental problem is the verification loop for the average developer is grounded not in tests, but with results. Write code, reload browser, check output. Does it work the way I want? Good. We're done here.
Not write code, write tests, ensure all test-cases are covered. Now, imagine such a flaky foundation is used to build on top of even more untested code. That's how bad quality software (that's usually unfixable without a major re-write) is born.
Also, most vibe-coders don't have enough experience/knowledge to figure out what is wrong with the code generated by the AI. For that, you need to know more than the AI and have a strong foundation in the domain you're working on. Here is an example: You ask the AI to write the code for a comment form. It generates the backend and frontend code for you (let's say React/Svelte/Vue/whatever). The vibe-coder sees the UI - most likely written in Tailwind CSS - and thinks "wow, that looks really good!" and they click approve. However, an experienced person might notice the form does not have CSRF protection in place. The vibe-coder might not even be aware of the concept of CSRF (let alone the OWASP top 10 security risks).
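For readers unfamiliar with the class of bug being described: a CSRF defence typically ties each rendered form to a secret token and rejects submissions that don't echo it back. A framework-free sketch (all names illustrative; a real app should use its framework's built-in protection):

```python
import hmac
import secrets

# Server-side session store (illustrative; a real app would use
# its framework's session machinery).
session = {}

def render_comment_form() -> str:
    # Issue a per-session token and embed it in the form.
    token = secrets.token_hex(16)
    session["csrf_token"] = token
    return (f'<form method="post">'
            f'<input type="hidden" name="csrf" value="{token}">'
            f'...</form>')

def handle_post(form_data: dict) -> int:
    # Reject any submission that doesn't echo the token back:
    # a cross-site forged request can't read the token out of the
    # victim's page, so it can't supply it.
    expected = session.get("csrf_token", "")
    supplied = form_data.get("csrf", "")
    if not expected or not hmac.compare_digest(expected, supplied):
        return 403
    return 200  # ... save the comment ...
```

The point of the example: nothing about the rendered UI reveals whether this check exists, which is exactly why it slips past a visual "looks good, approve" review.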
Hence, the fundamental problem is not knowing the domain better than the AI and thus being unable to pick up the flaw. Unless this fundamental problem is solved - which I don't think it will be anytime soon, because everyone can generate code + UI these days - I don't see a solution to the verification problem.
However, this is good news for consultants and the like, because it creates more work down the line to fix the vibe-coded mess when they get hacked the very next day - and we can charge them a rush fee on top of it, too. So, it's not all that bad.
My impression has been with frontend tests is they come in two flavors: extremely hard to write to the point of impracticality, and useless. Most orgs settle for the latter and end up testing easily testable things like the final rendered dom and rely on human qa for all the hard bits like "does the page actually look like it's supposed to, does interacting with the UI elements behave correctly, and do the flows actually work."
All largely stemming from the fact that tests can't meaningfully see and interact with the page like the end user will.
> extremely hard to write to the point of impracticality, and useless
> Most orgs settle for the latter and end up testing easily testable things like the final rendered dom and rely on human qa for all the hard bits like "does the page actually look like it's supposed to
Not disagreeing with you here, but what ends up happening is the frontend works flawlessly in the browser/device being tested but has tons of bugs in the others. The best examples of this are most banking apps, corporate portals, etc. Honestly, you can get away with skipping tests on the frontend. But the backend? That directly affects the security of the software, and there you cannot afford to skip the important tests, at least.
>can't meaningfully see and interact with the page like the end user will
Isn't this a great use case for LLM tests? Have a "computer use agent" and then describe the parameters of the test as "load the page, then navigate to bar, expect foo to happen". You don't need the LLM to generate a test using puppeteer or whatever which is coupled to the specific dom, you just describe what should happen.
This is actually how a lot of software is written, sadly. I used to call it "trial and error programming". I've observed people writing C++ who do not have a mental model of memory but just try stuff until it compiles. For some classes of software, like games, that is acceptable, but for others it would be horrifying.
Now, it is actually completely possible to write UI code without any unit tests in a completely safe way. You use the functional core, imperative shell approach. When all your domain logic is in a fully tested, functional core, you can just go ahead and write "what works" in a thin UI shell. Good luck getting an LLM to rigidly conform to such an architecture, though.
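A minimal sketch of that architecture (domain names invented for illustration): all decisions live in pure functions you can test exhaustively, and the UI shell just relays them:

```python
# Functional core: pure, deterministic, fully unit-testable.
def next_counter_state(state: int, event: str) -> int:
    if event == "increment":
        return state + 1
    if event == "reset":
        return 0
    return state  # unknown events are ignored

# Imperative shell: thin "what works" glue. It holds the mutable
# state and does I/O, but contains no decisions of its own.
class CounterUI:
    def __init__(self):
        self.state = 0

    def on_click(self, event: str) -> None:
        self.state = next_counter_state(self.state, event)
        print(f"label = {self.state}")  # stand-in for real rendering
```

Because the shell contains no branching logic, leaving it untested costs little; every behaviour worth testing is reachable through the pure core.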
The article says that AWS's Cedar authorization policy engine is written in Lean, but it's actually written in Dafny. Writing Dafny is a lot closer to writing "normal" code rather than the proofs you see in Lean. As a non-mathematician I gave up pretty early in the Lean tutorial, while in a recent prototype I learned enough Dafny to be semi-confident in reviewing Claude's Dafny code in about half a day.
The Dafny code formed a security kernel at the core of a service, enforcing invariants like that an audit log must always be written to prior to a mutating operation being performed. Of course I still had bugs, usually from specification problems (poor spec / design) or Claude not taking the proof far enough (proving only for one of a number of related types, which could also have been a specification problem on my part).
In the end I realized I'm writing a bunch of I/O bound glue code and plain 'ol test driven development was fine enough for my threat model. I can review Python code more quickly and accurately than Dafny (or the Go code it eventually had to link to), so I'm back to optimizing for humans again...
Looks like LLMs also find Dafny easier to write than Lean. This study, “A benchmark for vericoding: formally verified program synthesis”, reports:
> We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications … We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs.
Not surprising, as Dafny is a bit less expressive (refinement instead of dependent types) and therefore easier to write. IMHO, it hits a very nice sweet spot. The disadvantage of Dafny is the lack of manual tactics to prove things when SAT/SMT automation fails. But this is getting fixed.
There are multiple Lean tutorials, some of which are more mathy than others. One of the things I like about Lean is precisely that it's an ordinary, Haskell-style functional programming language in addition to having all the Curry-Howard-isomorphism-based mathematical proof machinery. You can write `cat` in Lean.
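For instance, a plausible `cat` for standard input in Lean 4 (a sketch; exact stream API details may vary across Lean versions):

```lean
-- Read stdin line by line and echo it back: `cat` as ordinary
-- functional programming, no proofs involved.
partial def echoLoop (stdin : IO.FS.Stream) : IO Unit := do
  let line ← stdin.getLine
  if line.isEmpty then
    pure ()  -- an empty read signals end of input
  else do
    IO.print line
    echoLoop stdin

def main : IO Unit := do
  echoLoop (← IO.getStdin)
```

No theorem in sight - it's just a functional program with monadic I/O, which is the point.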
Maybe I'm missing something, but isn't this the same as writing code, but with extra steps?
Currently, engineers work with loose specifications, which they translate into code. With the proposed approach, they would need to first convert those specifications into a formally verifiable form before using LLMs to generate the implementation.
But to be production-ready, that spec would have to cover all possible use-cases, edge cases, error handling, performance targets, security and privacy controls, etc. That sounds awfully close to being an actual implementation, only in a different language.
The key bit is that specifications don't need to be "obviously computable", so they can be a lot simpler than the code that implements them. Consider the property "if some function has a reference to a value, that value will not change unless that function explicitly changes it". It's simple enough to express, but to implement it Rust needs the borrow checker, which is a pretty heavy piece of engineering. And proving the implementation actually guarantees that property isn't easy, either!
Formal specifications can be easier to write than code. Take refinement types in Dafny, for example. Because they are higher-level, you don't need to bother with tedious implementation details. By shifting our focus to formal specifications rather than manual coding, we could possibly leverage automation to not only develop software faster but also achieve a higher degree of assurance.
Of course, this remains largely theoretical for now, but it is an exciting possibility. Note high-level specifications often overlook performance issues, but they are likely sufficient for most scenarios. Regardless, we have formal development methodologies able to decompose problems to an arbitrary level of granularity since the 1990s, all while preserving correctness. It is likely that many of these ideas will be revisited soon.
I think the issue goes even deeper than verification. Verification is technically possible. You could, in theory, build a C compiler or a browser and use existing tests to confirm it works.
The harder problem is discovery: how do you build something entirely new, something that has no existing test suite to validate against?
Verification works because someone has already defined what "correct" looks like. There is possibly a spec, or a reference implementation, or a set of expected behaviours. The system just has to match them.
But truly novel creation does not have ground truth to compare against and no predefined finish line. You are not just solving a problem. You are figuring out what the problem even is.
Well that's a problem the software industry has been building for itself for decades.
The software industry has, since at least the adoption of "agile", created a culture of not just refusing to build to specs but insisting that specs are impossible to get from a customer.
Agile hasn't been insisting that specs are impossible to get from a customer. They have been insisting that getting specs from a customer is best performed as a dynamic process. In my opinion, that's one of agile's most significant contributions. It lines up with a learning process that doesn't assume the programmer or the customer knows the best course ahead of time.
I don’t want to put words in your mouth but I think I agree. It’s called requirements engineering. It’s hard, but it’s possible and waterfall works fine for many domains. Agile teams I see burning resources doing the same thing 2-3x or sprinting their way into major, costly architectural mistakes that would have been easily avoided by upfront planning and specs.
Agile is a pretty badly defined beast at the best of times, but even the most twisted interpretation doesn't mean that. It's mainly just a rejection of BDUF.
> Most people think of verification as a cost, a tax on development, justified only for safety-critical systems. That framing is outdated.
> The value is not in the verification workforce. It is in what verified delivery enables
These takes misrepresent safety-critical software verification. The verification workforce is indispensable. Whereas the software engineers are usually highly specialized, the verification team owns the domain and project knowledge to make sure the whole thing integrates. They consult on requirements, anticipate oversights, build infrastructure and validate it, and execute the verification.
So when the business needs a go / no-go call, they ask the verification lead.
Automation and proofs have their place, but I don't see how verification without human accountability is workable.
> passing tests, not for correctness. It hard-codes values to satisfy
> the test suite. It will not generalize.
This is one of the pain points I am suffering at work: workers ask coding agents to generate some code, and then to generate test coverage for the code. The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.
I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.
Writing the tests first and then writing code to pass the tests is no better than writing the code first and then writing tests that pass. What matters is that both the code and the tests are written independently, from specs, not from one another.
I think it is better not to have access to the tests when first writing the code, so as to make sure you code to the specs and not to the tests that test the specs, since something may be lost in translation. It means that I have a preference for code first, but the ideal case would be for different people to do both in parallel.
Anyway, about AI: if an AI writes both the tests and the code, it will make sure they match no matter which comes first; it may even go back and forth between the tests and the code, but that doesn't mean either is correct.
When LLMs can assist with writing useful tests before having seen any implementation, then I’ll be properly impressed.
As a result, computer science students were lectured a lot (too much for most people's taste, it seems) about data modeling and not much about code itself, which was viewed as mundane and uninteresting until the US hacker culture finally took over in the late 2000s.
Turns out that the French were just right too early, like with the Minitel.
I always felt like that's the main issue with unit testing. That's why I used it very rarely.
Maybe keeping the tests in a separate module, not letting the agent see the source while writing the tests, and not letting the agent see the tests while writing the implementation would help? They could just share the API and the spec.
And in case of tests failing another agent with full context could decide if the fix should be delegated to coding agent or to testing agent.
Obvious question: why not? Let’s say you have competent devs, fair assumption. Maybe it’s because they don’t have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.
Honestly I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.
We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.
Depending on your success rate with agents, you can have one that validates multiple criteria or separate agents for different review criteria.
If people are having to specify things in TLA+ etc -- even with the help of an LLM to write that spec -- they will then have something they can point the LLM at in order for it to verify its output and assumptions.
It's fun having LLMs because it makes it quite clear that a lot of testing has been cargo-culting. Did people ever check that the tests actually test for anything meaningful?
I had a fun discussion when the client tried to change values... Why is it still 0? Didn't you test?
And that was at that time I had to dive into the code base and cry.
I don't understand the value of that much code. What features are worth that much more than stability?
Some companies will do as you say - have (mostly clueless) engineers feed high level "wishes" to (entirely clueless) LLMs, and hope that everyone kind of gets it. And everyone will kind of get it. And everyone will kind of get it wrong.
Other companies will have their engineers explicitly treat the LLMs as collaborators / pair programmers, not independent developers. As an engineer in such a company, YOU are still the author of the code even if you "prompted" it instead of typing it. You can't just "fix this high level thing for me brah" and get away with it, but instead need to continuously interact with the LLM as you define and it implements the detailed wanted behaviors. That forces you to know _exactly_ what you want and ask for _exactly_ what you want without ambiguity, like in any other kind of programming. The difference is that the LLM is a heck of a lot quicker at typing code than you are.
There's also the problem that when stack rank time comes around each year no one cares about your unit tests. So using AI to write unit tests gives me time to work on things that will actually help me avoid getting arbitrarily fired.
I wish that software engineers were given the time to write both clean code and unit tests, and I wish software engineers weren't arbitrarily judged by out of touch leadership. However, that's not the world we live in so I let AI write my unit tests in order to survive.
You can use spec-driven development and TDD. Write the tests first. Write a stub that fails them. Then modify the code until the tests pass.
Ahem... Welcome to the wonderful world of Quality Assurance, software-developing audience. That part of the job, after you yeet your code over the fence, where the job is to bridge the gap between your madness and the madness of the rest of the business. Here you will find: frustration, an ever present sense the rest of the world is just out to make your life more difficult, a creeping sense of despair, a hot ice pick in the back of your mind every time the language model does something syntactically valid but completely nonsensical in the real world, the development of an ever increasing time horizon over which you can accurately predict the future but no one will believe you anyway, a smoldering hatred of the overly confident executive with an overdeveloped capacity for risk tolerance, a desire to run away and start a farm, and finally, a fundamental distrust of everything software and all the people who write it.
Don't forget your complimentary test framework and swag bag on your way out, and remember, you're here forever. You can try to check out, but you can never leave.
This is true for humans too. Tests should not be written or performed by the same person that writes the code
I can't wait. Maybe when shitty vibe coded software starts to cause real pain for people we can return to some sensible software engineering
I'm not holding my breath though
Dead Comment
I've been saying "the last job to be automated will be QA" and it feels more true every day. It's one thing to be a product engineer in this era. It's another to be working at the level the author is, where code needs to be verifiable. However, once people stop vibing apps and start vibing kernels, it really does fundamentally change the game.
I also have another saying: "any sufficiently advanced agent is indistinguishable from a DSL." I hadn't considered Lean in this equation, but I put these two ideas together and I feel like we're approaching some world where Lean eats the entire agentic framework stack and the entire operating system disappears.
If you're thinking about building something today that will still be relevant in 10 years, this is insightful.
Does it need to be HN-popular or a household name? Be in the news?
Or something that saves 50% of time by automating inane manual work from a team?
1. widely considered successful
2. made by humans from scratch in 2025
It looks like humans and AI are on par in this realm.
This is an example of an article which 'buries the lede'†.
It should have started with the announcement of the new zlib autoformalization (!) https://leodemoura.github.io/blog/2026/02/28/when-ai-writes-... to get you excited.
Then it should have talked about the rest - instead of starting with rather graceless and ugly LLM-written generic prose about AI topics that to many readers is already tiresomely familiar and doubtless was tldr for even the readers who aren't repelled automatically by that.
† or in my terms, fails to 'make you care': https://gwern.net/blog/2026/make-me-care
Moreover, humans will still need to read even rigorously proved code if only to suss out performance issues. And training people to read Lean will continue to be costly.
Though, as the OP says, this is a very exciting time for developing provably correct systems programming.
Some performance issues (asymptotics) can be addressed via proof, others are routinely verified by benchmarking.
If you want it to be a question of economics, I think the answer is in whether this approach is more economical than the alternative, which is having people run this substrate. There's a lot of enthusiasm here and you can't deny there has been progress.
I wouldn't be so quick to doubt. It costs nothing to be optimistic.
I don't quite follow but I'd love to hear more about that.
Another way of doing it is the agent just writes an algorithm to perform the task and runs it. In this world, tools are just APIs and the agent has to think through its entire process end to end before it even begins and account for all cases.
Only the latter is Turing complete, but the former approaches the latter as it improves.
Sigh. Is there any LLM solution for HN readers to filter out all the top-level commenters who haven't RTFA? I don't need the (micro-)shitstorms these people spawn, even if the general HN algo scores them as "interesting".
As you add components to a system, the time it takes to verify that the components work together increases superlinearly.
At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
AI coding agents hit this barrier faster than ever, because of how quickly they can generate components (and how poorly they manage complexity).
I think verification is now the problem of agentic software engineering. I think formal methods will help, but I don't see how they will apply to messy situations like end-to-end UI testing or interactions between the system and the real world.
I posted more detailed thoughts on X: https://x.com/i/status/2027771813346820349
> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
Could you elaborate on this? Your post makes it sound as if the verification complexity diverged as the number of components n approaches a certain finite value n_0, but that seems unlikely to me. If, in contrast, the verification complexity remains finite at n_0, then verification should still be possible in finite time, shouldn't it? Yes, it might be a considerable amount of time but I assume your theorem doesn't predict lower bounds for the involved constants?
Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though? AI might also lead us to removing unnecessarily complex dependencies from our software supply chain or stripping them down to the few features we need.
>> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
> Could you elaborate on this?
I plan to publish a thorough post with an interactive model. Whether human or AI, you are capacity constrained, and I glossed over `C` (capacity within a given timeframe) in the X post.
You are correct that verification complexity remains finite at n_0. The barrier is practical: n_0 is where V(n) exceeds your available capacity C. If V(n) = n^(1+k), then n_0 = C^(1/(1+k)). Doubling your capacity doesn't double n_0. It increases by a factor of 2^(1/(1+k)), which is always less than 2.
So the barrier always exists for, say, a given "dev year" or "token budget," and the cost to push it further grows superlinearly. It's not absolutely immovable, but moving it gets progressively harder. That's what I mean by "literally run out of time." At any given capacity, there is a finite n beyond which complete verification is not possible. Expanding capacity buys diminishing returns.
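The diminishing return is easy to make concrete. A small sketch under the assumptions stated above (verification cost V(n) = n^(1+k), capacity C, so n_0 = C^(1/(1+k))); the numbers are illustrative only:

```python
# Largest component count whose verification cost still fits in capacity C,
# given V(n) = n**(1 + k):  n_0 = C**(1 / (1 + k)).
def n0(C, k):
    return C ** (1 / (1 + k))

k = 0.5                 # assumed superlinearity of verification cost
small = n0(100, k)      # budget C = 100
big = n0(200, k)        # doubled budget
print(small, big)
print(big / small)      # 2**(1/(1+k)) = 2**(2/3) ~ 1.587, not 2
```

Doubling capacity here grows the verifiable system by only ~59%, and the gap widens as k grows.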
> Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though?
You are getting at my core motivation for exploring this question.
Verification requires a definition of "done" and I wonder if it will ever be possible (or desirable) for AI to define done on its own, let alone verify it and simplify software based on its understanding of our needs.
You make a great point that we are not required to add more components and "go right" along the curve. We can choose to simplify, and that is absolutely the right takeaway. AI has made many people believe that by generating more code at a faster pace they are accomplishing more. But that's not how software productivity should be judged.
To answer your question about assumptions, while AI can certainly be prompted to help reduce n or k in isolated cases where "done" is very clear, I don't think it's realistic to expect this in aggregate for complex systems where "done" is subjective and dynamic.
I'm speaking mainly in the context of commercial software dev here, informed by my lived experience building hundreds of apps. I often say software projects have a fractal complexity. We're constantly identifying new needs and broader scope the deeper we go, not to mention pivots and specific customer asks. You rarely get to stand still.
I don't mean to be pessimistic, but my hunch is that complexity growth outpaces the rate of simplification in almost every software project. This model attempts to explain why that is so. And notably, simplification itself requires verification and so it is in a sense part of the verification cost, too.
Vibe-coders neither write the code, nor write the tests, nor ensure all test cases are covered. Now imagine such a flaky foundation being used as a base for even more untested code. That's how bad-quality software (the kind that's usually unfixable without a major rewrite) is born.
Also, most vibe-coders don't have enough experience/knowledge to figure out what is wrong with the code generated by the AI. For that, you need to know more than the AI and have a strong foundation in the domain you're working in. Here is an example: you ask the AI to write the code for a comment form. It generates the backend and frontend code for you (say React/Svelte/Vue/whatever). The vibe-coder sees the UI - most likely written in Tailwind CSS - and thinks "wow, that looks really good!" and clicks approve. However, an experienced person might notice the form has no CSRF protection in place. The vibe-coder might not even be aware of the concept of CSRF (let alone the OWASP Top 10 security risks).
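For readers who haven't met CSRF: the classic synchronizer-token defence is only a few lines, which makes its absence in generated code all the more telling. A framework-free sketch (function names are mine, not from any library):

```python
import hmac
import secrets

def issue_csrf_token(session):
    # On GET: generate a random token, store it server-side in the
    # session, and embed it in the rendered form as a hidden field.
    token = secrets.token_hex(32)
    session["csrf_token"] = token
    return token

def csrf_ok(session, submitted):
    # On POST: the submitted hidden field must match the session copy.
    # A cross-site attacker can make the victim's browser POST, but
    # cannot read the token to include it. compare_digest avoids
    # leaking the comparison result via timing.
    expected = session.get("csrf_token")
    return bool(expected) and hmac.compare_digest(expected, submitted or "")

session = {}
token = issue_csrf_token(session)
print(csrf_ok(session, token))     # legitimate form post
print(csrf_ok(session, "forged"))  # cross-site forgery attempt
```

Real frameworks (Django, Rails, etc.) ship this built in; the point is that reviewing for its presence requires knowing it should be there.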
Hence, the fundamental problem is knowing more about the domain than the AI, enough to pick up the flaw. Unless this fundamental problem is solved - which I don't think it will be anytime soon, because everyone can generate code + UI these days - I don't see a solution to the verification problem.
However, this is good news for consultants and the like, because it creates more work down the line to fix the vibe-coded mess after they got hacked the very next day, and we can charge them a rush fee on top of it, too. So, it's not all that bad.
All largely stemming from the fact that tests can't meaningfully see and interact with the page like the end user will.
Not disagreeing with you here, but what ends up happening is the frontend works flawlessly in the browser/device being tested but has tons of bugs in the others. The best examples of this are most banking apps, corporate portals, etc. But honestly, you can get away without writing tests for the frontend. The backend? That directly affects the security of the software, and you cannot afford to skip important tests there.
Isn't this a great use case for LLM tests? Have a "computer use agent" and then describe the parameters of the test as "load the page, then navigate to bar, expect foo to happen". You don't need the LLM to generate a test using puppeteer or whatever which is coupled to the specific dom, you just describe what should happen.
Now, it is actually completely possible to write UI code without any unit tests in a completely safe way. You use the functional core, imperative shell approach. When all your domain logic is in a fully tested, functional core, you can just go ahead and write "what works" in a thin UI shell. Good luck getting an LLM to rigidly conform to such an architecture, though.
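A tiny illustration of that split (domain and names invented for the example): every decision lives in a pure, easily tested core, while the shell merely performs I/O on the core's answers and needs no unit tests of its own.

```python
# Functional core: pure and deterministic, so it is trivially
# unit-testable with no mocks or UI harness.
def cart_total(prices, discount_threshold=100.0, discount=0.10):
    subtotal = sum(prices)
    if subtotal >= discount_threshold:
        subtotal *= 1 - discount
    return round(subtotal, 2)

# Imperative shell: thin "what works" wiring -- just rendering,
# deliberately left untested.
def checkout_ui(prices):
    total = cart_total(prices)        # the only decision is in the core
    print(f"Total due: ${total:.2f}")
    return total

checkout_ui([40.0, 70.0])  # 110 crosses the threshold, 10% off -> 99.00
```

The bet is that bugs cluster in decisions, not in print statements, so concentrating decisions in the core is where the test budget should go.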
The Dafny code formed a security kernel at the core of a service, enforcing invariants like that an audit log must always be written to prior to a mutating operation being performed. Of course I still had bugs, usually from specification problems (poor spec / design) or Claude not taking the proof far enough (proving only for one of a number of related types, which could also have been a specification problem on my part).
In the end I realized I'm writing a bunch of I/O bound glue code and plain 'ol test driven development was fine enough for my threat model. I can review Python code more quickly and accurately than Dafny (or the Go code it eventually had to link to), so I'm back to optimizing for humans again...
https://aws.amazon.com/blogs/opensource/lean-into-verified-s...
> We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications … We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs.
https://arxiv.org/html/2509.22908v1
Currently, engineers work with loose specifications, which they translate into code. With the proposed approach, they would need to first convert those specifications into a formally verifiable form before using LLMs to generate the implementation.
But to be production-ready, that spec would have to cover all possible use-cases, edge cases, error handling, performance targets, security and privacy controls, etc. That sounds awfully close to being an actual implementation, only in a different language.
Deleted Comment
Of course, this remains largely theoretical for now, but it is an exciting possibility. Note high-level specifications often overlook performance issues, but they are likely sufficient for most scenarios. Regardless, we have formal development methodologies able to decompose problems to an arbitrary level of granularity since the 1990s, all while preserving correctness. It is likely that many of these ideas will be revisited soon.
The harder problem is discovery: how do you build something entirely new, something that has no existing test suite to validate against?
A) They let you verify that the implementation passes its spec, more or less.
B) They are a (trustworthy) description of how the system behaves, they allow you to understand the system better.