> The Claude C Compiler illustrates the other side: it optimizes for
> passing tests, not for correctness. It hard-codes values to satisfy
> the test suite. It will not generalize.
This is one of the pain points I am suffering at work: workers ask coding agents to generate some code, and then to generate test coverage for the code. The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.
I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.
This is why you write the tests first and then the code. Especially when fixing bugs, since you can be sure that the test properly fails when the bug is present.
When fixing bugs, yes. When designing an app, not so much, because you realize many unexpected things while writing the code and seeing how it behaves. Often the original test code would test something that is never built. It's obvious for integration tests, but it happens for tests of API calls and even for unit tests. One could start writing unit tests for a module or class and eventually realize that it must be implemented in a totally different way. I prefer experimenting with the implementation and writing tests only once it settles down into something I'm confident will go to production.
I'd argue the AI writing the tests shouldn't even know about the implementation at all. You only want to pass it the interface (or function signatures) together with javadocs/docstrings/equivalent.
Writing the tests first and then writing code to pass the tests is no better than writing the code first and then writing tests that pass. What matters is that both the code and the tests are written independently, from specs, not from one another.
I think it is better not to have access to the tests when first writing code, so as to make sure you code to the specs and not to the tests that test the specs, as something may be lost in translation. It means I have a preference for code first, but the ideal case would be for different people to do both in parallel.
Anyway, about AI: if an AI writes both the tests and the code, it will make sure they match no matter which comes first. It may even go back and forth between the tests and the code, but that doesn't mean it is correct.
Agreed 1000%. But that can be a lot of work; creating a good set of tests is nearly as much or often even more effort than implementing the thing being tested.
When LLMs can assist with writing useful tests before having seen any implementation, then I’ll be properly impressed.
At my job we have a requirement for 100% test coverage. So everyone just uses AI to generate 10,000 line files of unit tests and nobody can verify anything.
Yeah this is the exact kind of ridiculousness I've noticed as well - everything that comes out of an LLM is optimized to give you what you want to hear, not what's correct.
A long time ago in France, the mainstream view among computer people was that code or compute weren't what's important when dealing with computers; it is information that matters and how you process it in a sensible way (hence the name of computer science in French: informatique. And also the name for computer: “ordinateur”, literally: that which sets things into order).
As a result, computer science students were taught a lot (too much for most people's taste, it seems) about data modeling and not much about code itself, which was viewed as mundane and uninteresting, until the US hacker culture finally took over in the late 2000s.
Turns out that the French were just right too early, like with the Minitel.
> The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code.
I always felt like that's the main issue with unit testing. That's why I used it very rarely.
Maybe keeping the tests in a separate module, not letting the agent see the source while writing tests, and not letting the agent see the tests while writing the implementation would help? They could just share the API and the spec.
And when tests fail, another agent with full context could decide whether the fix should be delegated to the coding agent or to the testing agent.
> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
Obvious question: why not? Let’s say you have competent devs, fair assumption. Maybe it’s because they don’t have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.
It’s because people will do what they’re incentivized to do. And if no one cares about anything but whether the next feature goes out the door, that’s what programmers will focus on.
Honestly I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.
We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.
Developers aren't given time to test and aren't rewarded if they do, but management will rain down hellfire upon their heads if they don't churn out code quickly enough.
How about a subsequent review where a separate agent analyzes the original issue and resultant code and approves it if the code meets the intent of the issue. The principle being to keep an eye out for manual work that you can describe well enough to offload.
Depending on your success rate with agents, you can have one that validates multiple criteria or separate agents for different review criteria.
You are fighting nondeterministic behavior with more nondeterministic behavior, or in other words, fighting probability with probability. That doesn't necessarily make things any better.
My only hope is that all of this push leads in the end to the adoption of more formal verification languages and tools.
If people are having to specify things in TLA+ etc -- even with the help of an LLM to write that spec -- they will then have something they can point the LLM at in order for it to verify its output and assumptions.
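For a taste of what "something the LLM can verify its output against" looks like, here is a tiny Lean example (a toy property, not from the article): the theorem statement is the spec, and the checker either accepts the proof or rejects it, with no room for a model to talk its way past it.

```lean
-- The spec: reversing a list twice gives back the original list.
-- If the proof below were wrong, Lean would reject it -- there is
-- no "looks plausible" middle ground for an LLM to exploit.
theorem reverse_twice (l : List Nat) : l.reverse.reverse = l := by
  simp  -- closed by the standard-library lemma List.reverse_reverse
```

The same dynamic applies to a TLA+ spec checked by TLC: the tool's verdict, not the model's confidence, is what counts.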
> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
It's fun having LLMs, because it makes it quite clear that a lot of testing has been cargo-culting. Did people ever actually check that the tests check for anything meaningful?
15 years ago, I had testers writing "UI tests" / "user tests" that matched what the software was cranking out. At that time I had just joined to continue on the client side, so I hadn't really worked on anything yet.
I had a fun discussion when the client tried to change values... Why is it still 0? Didn't you test?
And that was at that time I had to dive into the code base and cry.
I think it boils down to how companies view LLMs and their engineers.
Some companies will do as you say - have (mostly clueless) engineers feed high level "wishes" to (entirely clueless) LLMs, and hope that everyone kind of gets it. And everyone will kind of get it. And everyone will kind of get it wrong.
Other companies will have their engineers explicitly treat the LLMs as collaborators / pair programmers, not independent developers. As an engineer in such a company, YOU are still the author of the code even if you "prompted" it instead of typing it. You can't just "fix this high level thing for me brah" and get away with it, but instead need to continuously interact with the LLM as you define and it implements the detailed wanted behaviors. That forces you to know _exactly_ what you want and ask for _exactly_ what you want without ambiguity, like in any other kind of programming. The difference is that the LLM is a heck of a lot quicker at typing code than you are.
Building a C compiler should not have this problem. There are probably a million test suites coming from outside the LLM that it can use to verify correctness.
Honestly, unit tests (at least on the front-end) are largely wasted time in the current state of software development. Taking the time that would have been spent on writing unit tests and instead using it to write functionally pure, immutable code would do much more to prevent bugs.
There's also the problem that when stack rank time comes around each year no one cares about your unit tests. So using AI to write unit tests gives me time to work on things that will actually help me avoid getting arbitrarily fired.
I wish that software engineers were given the time to write both clean code and unit tests, and I wish software engineers weren't arbitrarily judged by out of touch leadership. However, that's not the world we live in so I let AI write my unit tests in order to survive.
You are overvaluing “clean code.” Code is code: it either works within spec or it doesn't; or it does, but there are errors, more or less catastrophic, waiting to show themselves at any moment. But even in that latter case, no single individual can know for certain, no matter how much work they put in, that their code is perfect. But they can know it's usable, and someone else can check to make sure it doesn't blow something else up, and that is the most important thing.
>LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
You can use spec driven development and TDD. Write the tests first. Write failing code. Modify the code to pass the tests.
Mwahahahahaha! Suffer, devs, SUFFER! KNOW MY PAIN!
Ah hem... Welcome to the wonderful world of Quality Assurance, software developing audience. That part of the job, after you yeet your code over the fence, where the job is to bridge the gap between your madness, and the madness of the rest of the business. Here you will find: frustration, an ever present sense the rest of the world is just out to make your life more difficult, a creeping sense of despair, a hot ice pick in the back of your mind every time the language model does something syntactically valid, but completely nonsensical in the real world, the development of an ever increasing time horizon over which you can accurately predict the future, but no one will believe you anyway, a smoldering hatred of the overly confident executive with an over developed capacity for risk tolerance; a desire to run away and start a farm, and finally, a fundamental distrust of everything software, and all the people who write it.
Don't forget your complimentary test framework and swag bag on your way out, and remember, you're here forever. You can try to check out, but you can never leave.
Actually, they're extremely bad at that. All training data contains code + tests, even if the tests were created first. So far, every model I've tried has failed to implement tests for interfaces without access to the actual code.
They can, but they need to be explicitly told to do that. Otherwise they just do everything in batches. Anyway, pure TDD or not, tests catch only what you tell the AI to write. The AI does not know what is right; it does what you told it to do. The above problem wouldn't be solved by pure TDD.
I encourage everyone to RTFA and not just respond to the headline. This really is a glimpse into where the future is going.
I've been saying "the last job to be automated will be QA" and it feels more true every day. It's one thing to be a product engineer in this era. It's another to be working at the level the author is, where code needs to be verifiable. However, once people stop vibing apps and start vibing kernels, it really does fundamentally change the game.
I also have another saying: "any sufficiently advanced agent is indistinguishable from a DSL." I hadn't considered Lean in this equation, but I put these two ideas together and I feel like we're approaching some world where Lean eats the entire agentic framework stack and the entire operating system disappears.
If you're thinking about building something today that will still be relevant in 10 years, this is insightful.
This is a very strange statement. People don't always announce when they use AI for writing their software since it's a controversial topic. And it's a sliding scale. I'm pretty sure a large fraction of new software has some AI involved in its development.
Apps are a strange measure because there aren't really any new, groundbreaking ones. PCs and smartphones have mostly done what people have wanted them to do for a while.
I think this might miss the point. We put off upgrading to an new RMM at work because I was able to hack together some dashboards in a couple days. It's not novel and does exactly what we need it to do, no more. We don't need to pay 1000's of dollars a month for the bloated Solarwinds stack. We aren't saving lives, we're saving PDFs so any arguments about 5 9s and maintainability are irrelevant. LLMs are going to give us on demand, one off software. I think the SaaS market is terrified right now because for decades they've gouged customers for continual bloat and lock in that now we can escape from. In a single day I was able to build an RMM that fits our needs exactly. We don't need to hire anyone to maintain it because it's simple, like most business applications should be, but SV needs to keep complicating their offerings with bloat to justify crazy monthly costs that should have been a one time purchase from the start. SV shot itself in the face with AI.
To be fair, Claude Code is vibe-coded. It's a terrible piece of software from an engineering (and often usability) standpoint, and the problems run deeper than just the choice of JavaScript. But it is good enough for people to get what they want out of it.
Then it should have talked about the rest, instead of starting with rather graceless and ugly LLM-written generic prose about AI topics that many readers already find tiresomely familiar, and which was doubtless tl;dr even for the readers who aren't automatically repelled by it.
I am as enthusiastic about formal methods as the next guy, but I very much doubt any LLM-based technique will make it economical to write a substantial fraction of application software in Lean. The LLM can play a powerful heuristic role in searching for proof-bearing code in areas where there is good training data. Unfortunately those areas are few and far between.
Moreover, humans will still need to read even rigorously proved code if only to suss out performance issues. And training people to read Lean will continue to be costly.
Though, as the OP says, this is a very exciting time for developing provably correct systems programming.
LLMs are writing non-trivial math proofs in Lean, and software proofs tend to be individually easier than proofs in math, just more tedious because there's so much more of them in any non-trivial development.
Some performance issues (asymptotics) can be addressed via proof, others are routinely verified by benchmarking.
This assumes everything about current capabilities stay static, and it wasn't long ago before LLMs couldn't do math. Many were predicting the genAI hype had peaked this time last year.
If you want it to be a question of economics, I think the answer is in whether this approach is more economical than the alternative, which is having people run this substrate. There's a lot of enthusiasm here and you can't deny there has been progress.
I wouldn't be so quick to doubt. It costs nothing to be optimistic.
If you give an agent a task, the typical agentic pattern is that it calls tools in some non-deterministic loop, feeding the tool output back into the LLM, until it deems the task complete. The LLM internalizes an algorithm.
Another way of doing it is the agent just writes an algorithm to perform the task and runs it. In this world, tools are just APIs and the agent has to think through its entire process end to end before it even begins and account for all cases.
Only the latter is Turing complete, but the former approaches the latter as it improves.
My read was roughly that agents require constraining scaffolding (CLAUDE.md) and careful phrasing (prompt engineering) which together is vaguely like working in a DSL?
Sigh. Is there any LLM solution for HN readers to filter out all the top-level commenters who haven't RTFA? I don't need the (micro-)shitstorms these people spawn, even if the general HN algo scores them as "interesting".
I believe there is a Verification Complexity Barrier
As you add components to a system, the time it takes to verify that the components work together increases superlinearly.
At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
AI coding agents hit this barrier faster than ever, because of how quickly they can generate components (and how poorly they manage complexity).
I think verification is now the problem of agentic software engineering. I think formal methods will help, but I don't see how they will apply to messy situations like end-to-end UI testing or interactions between the system and the real world.
> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
Could you elaborate on this? Your post makes it sound as if the verification complexity diverges as the number of components n approaches a certain finite value n_0, but that seems unlikely to me. If, in contrast, the verification complexity remains finite at n_0, then verification should still be possible in finite time, shouldn't it? Yes, it might be a considerable amount of time, but I assume your theorem doesn't predict lower bounds for the involved constants?
Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though? AI might also lead us to removing unnecessarily complex dependencies from our software supply chain or stripping them down to the few features we need.
Thank you for reading and the very thoughtful observations.
>> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
> Could you elaborate on this?
I plan to publish a thorough post with an interactive model. Whether human or AI, you are capacity constrained, and I glossed over `C` (capacity within a given timeframe) in the X post.
You are correct that verification complexity remains finite at n_0. The barrier is practical: n_0 is where V(n) exceeds your available capacity C. If V(n) = n^(1+k), then n_0 = C^(1/(1+k)). Doubling your capacity doesn't double n_0. It increases by a factor of 2^(1/(1+k)), which is always less than 2.
So the barrier always exists for, say, a given "dev year" or "token budget," and the cost to push it further grows superlinearly. It's not absolutely immovable, but moving it gets progressively harder. That's what I mean by "literally run out of time." At any given capacity, there is a finite n beyond which complete verification is not possible. Expanding capacity buys diminishing returns.
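The diminishing-returns claim above is easy to check numerically. Assuming the model as stated, V(n) = n^(1+k) and capacity C, the largest fully verifiable system is n_0 = C^(1/(1+k)):

```python
# Verification cost model from the comment above:
#   V(n) = n**(1 + k), and n_0 is where V(n_0) = C,
#   so n_0 = C**(1 / (1 + k)).
def max_verifiable_components(capacity: float, k: float) -> float:
    return capacity ** (1.0 / (1.0 + k))

k = 0.5  # superlinearity exponent (illustrative value)
n0 = max_verifiable_components(1000, k)          # 1000**(2/3) = 100
n0_doubled = max_verifiable_components(2000, k)

# Doubling capacity multiplies n_0 by 2**(1/(1+k)), which is < 2:
growth = n0_doubled / n0
print(round(growth, 3))  # 2**(2/3) ≈ 1.587, well short of 2
```

So under this model, each doubling of capacity buys you less than a doubling of verifiable system size, which is the "barrier" in practical terms.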
> Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though?
You are getting at my core motivation for exploring this question.
Verification requires a definition of "done" and I wonder if it will ever be possible (or desirable) for AI to define done on its own, let alone verify it and simplify software based on its understanding of our needs.
You make a great point that we are not required to add more components and "go right" along the curve. We can choose to simplify, and that is absolutely the right takeaway. AI has made many people believe that by generating more code at a faster pace they are accomplishing more. But that's not how software productivity should be judged.
To answer your question about assumptions, while AI can certainly be prompted to help reduce n or k in isolated cases where "done" is very clear, I don't think it's realistic to expect this in aggregate for complex systems where "done" is subjective and dynamic.
I'm speaking mainly in the context of commercial software dev here, informed by my lived experience building hundreds of apps. I often say software projects have a fractal complexity. We're constantly identifying new needs and broader scope the deeper we go, not to mention pivots and specific customer asks. You rarely get to stand still.
I don't mean to be pessimistic, but my hunch is that complexity growth outpaces the rate of simplification in almost every software project. This model attempts to explain why that is so. And notably, simplification itself requires verification and so it is in a sense part of the verification cost, too.
Thank you for the post. It's a good read. I'm working on governance/validation layers for n-LLMs and making them observable so your comments on runaway AIs resonated with me. My research is pointing me to reputation and stake consensus mechanisms being the validation layer either pre inference or pre-execution, and the time to verify decisions can be skipped with enough "decision liquidity" via reputation alone aka decision precedence.
The fundamental problem is the verification loop for the average developer is grounded not in tests, but with results. Write code, reload browser, check output. Does it work the way I want? Good. We're done here.
Not write code, write tests, ensure all test-cases are covered. Now, imagine such a flaky foundation is used to build on top of even more untested code. That's how bad quality software (that's usually unfixable without a major re-write) is born.
Also, most vibe-coders don't have enough experience/knowledge to figure out what is wrong with the code generated by the AI. For that, you need to know more than the AI and have a strong foundation in the domain you're working on. Here is an example: You ask the AI to write the code for a comment form. It generates the backend and frontend code for you (let's say React/Svelte/Vue/whatever). The vibe-coder sees the UI - most likely written in Tailwind CSS - and thinks "wow, that looks really good!" and they click approve. However, an experienced person might notice the form does not have CSRF protection in place. The vibe-coder might not even be aware of the concept of CSRF (let alone the OWASP top 10 security risks).
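For readers unfamiliar with the class of bug being described: a CSRF defence typically ties each rendered form to a secret token and rejects submissions that don't echo it back. A framework-free sketch (all names illustrative; a real app should use its framework's built-in protection):

```python
import hmac
import secrets

# Server-side session store (illustrative; a real app would use
# its framework's session machinery).
session = {}

def render_comment_form() -> str:
    # Issue a per-session token and embed it in the form.
    token = secrets.token_hex(16)
    session["csrf_token"] = token
    return (f'<form method="post">'
            f'<input type="hidden" name="csrf" value="{token}">'
            f'...</form>')

def handle_post(form_data: dict) -> int:
    # Reject any submission that doesn't echo the token back:
    # a cross-site forged request can't read the token out of the
    # victim's page, so it can't supply it.
    expected = session.get("csrf_token", "")
    supplied = form_data.get("csrf", "")
    if not expected or not hmac.compare_digest(expected, supplied):
        return 403
    return 200  # ... save the comment ...
```

The point of the example: nothing about the rendered UI reveals whether this check exists, which is exactly why it slips past a visual "looks good, approve" review.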
Hence, the fundamental problem is not knowing the domain better than the AI and thus being unable to pick up the flaw. Unless this fundamental problem is solved - which I don't think it will be anytime soon, because everyone can generate code + UI these days - I don't see a solution to the verification problem.
However, this is good news for consultants and the like, because it creates more work down the line to fix the vibe-coded mess when they get hacked the very next day - and we can charge them a rush fee on top of it, too. So, it's not all that bad.
My impression has been with frontend tests is they come in two flavors: extremely hard to write to the point of impracticality, and useless. Most orgs settle for the latter and end up testing easily testable things like the final rendered dom and rely on human qa for all the hard bits like "does the page actually look like it's supposed to, does interacting with the UI elements behave correctly, and do the flows actually work."
All largely stemming from the fact that tests can't meaningfully see and interact with the page like the end user will.
> extremely hard to write to the point of impracticality, and useless
> Most orgs settle for the latter and end up testing easily testable things like the final rendered dom and rely on human qa for all the hard bits like "does the page actually look like it's supposed to
Not disagreeing with you here, but what ends up happening is the frontend works flawlessly in the browser/device being tested but has tons of bugs in the others. The best examples of this are most banking apps, corporate portals, etc. Honestly, you can get away with skipping tests on the frontend. But the backend? That directly affects the security of the software, and there you cannot afford to skip the important tests, at least.
>can't meaningfully see and interact with the page like the end user will
Isn't this a great use case for LLM tests? Have a "computer use agent" and then describe the parameters of the test as "load the page, then navigate to bar, expect foo to happen". You don't need the LLM to generate a test using puppeteer or whatever which is coupled to the specific dom, you just describe what should happen.
This is actually how a lot of software is written, sadly. I used to call it "trial and error programming". I've observed people writing C++ who do not have a mental model of memory but just try stuff until it compiles. For some classes of software, like games, that is acceptable, but for others it would be horrifying.
Now, it is actually completely possible to write UI code without any unit tests in a completely safe way. You use the functional core, imperative shell approach. When all your domain logic is in a fully tested, functional core, you can just go ahead and write "what works" in a thin UI shell. Good luck getting an LLM to rigidly conform to such an architecture, though.
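A minimal sketch of that architecture (domain names invented for illustration): all decisions live in pure functions you can test exhaustively, and the UI shell just relays them:

```python
# Functional core: pure, deterministic, fully unit-testable.
def next_counter_state(state: int, event: str) -> int:
    if event == "increment":
        return state + 1
    if event == "reset":
        return 0
    return state  # unknown events are ignored

# Imperative shell: thin "what works" glue. It holds the mutable
# state and does I/O, but contains no decisions of its own.
class CounterUI:
    def __init__(self):
        self.state = 0

    def on_click(self, event: str) -> None:
        self.state = next_counter_state(self.state, event)
        print(f"label = {self.state}")  # stand-in for real rendering
```

Because the shell contains no branching logic, leaving it untested costs little; every behaviour worth testing is reachable through the pure core.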
The article says that AWS's Cedar authorization policy engine is written in Lean, but it's actually written in Dafny. Writing Dafny is a lot closer to writing "normal" code rather than the proofs you see in Lean. As a non-mathematician I gave up pretty early in the Lean tutorial, while in a recent prototype I learned enough Dafny to be semi-confident in reviewing Claude's Dafny code in about half a day.
The Dafny code formed a security kernel at the core of a service, enforcing invariants like that an audit log must always be written to prior to a mutating operation being performed. Of course I still had bugs, usually from specification problems (poor spec / design) or Claude not taking the proof far enough (proving only for one of a number of related types, which could also have been a specification problem on my part).
In the end I realized I'm writing a bunch of I/O bound glue code and plain 'ol test driven development was fine enough for my threat model. I can review Python code more quickly and accurately than Dafny (or the Go code it eventually had to link to), so I'm back to optimizing for humans again...
Looks like LLMs also find Dafny easier to write than Lean. This study, “A benchmark for vericoding: formally verified program synthesis”, reports:
> We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications … We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs.
Not surprising, as Dafny is a bit less expressive (refinement instead of dependent types) and therefore easier to write. IMHO, it hits a very nice sweet spot. The disadvantage of Dafny is the lack of manual tactics to prove things when SAT/SMT automation fails. But this is getting fixed.
There are multiple Lean tutorials, some of which are more mathy than others. One of the things I like about Lean is precisely that it's an ordinary, Haskell-style functional programming language in addition to having all the Curry-Howard-isomorphism-based mathematical proof machinery. You can write `cat` in Lean.
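For instance, a plausible `cat` for standard input in Lean 4 (a sketch; exact stream API details may vary across Lean versions):

```lean
-- Read stdin line by line and echo it back: `cat` as ordinary
-- functional programming, no proofs involved.
partial def echoLoop (stdin : IO.FS.Stream) : IO Unit := do
  let line ← stdin.getLine
  if line.isEmpty then
    pure ()  -- an empty read signals end of input
  else do
    IO.print line
    echoLoop stdin

def main : IO Unit := do
  echoLoop (← IO.getStdin)
```

No theorem in sight - it's just a functional program with monadic I/O, which is the point.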
Maybe I'm missing something, but isn't this the same as writing code, but with extra steps?
Currently, engineers work with loose specifications, which they translate into code. With the proposed approach, they would need to first convert those specifications into a formally verifiable form before using LLMs to generate the implementation.
But to be production-ready, that spec would have to cover all possible use-cases, edge cases, error handling, performance targets, security and privacy controls, etc. That sounds awfully close to being an actual implementation, only in a different language.
The key bit is that specifications don't need to be "obviously computable", so they can be a lot simpler than the code that implements them. Consider the property "if some function has a reference to a value, that value will not change unless that function explicitly changes it". It's simple enough to express, but to implement it Rust needs the borrow checker, which is a pretty heavy piece of engineering. And proving the implementation actually guarantees that property isn't easy, either!
Formal specifications can be easier to write than code. Take refinement types in Dafny, for example. Because they are higher-level, you don't need to bother with tedious implementation details. By shifting our focus to formal specifications rather than manual coding, we could possibly leverage automation to not only develop software faster but also achieve a higher degree of assurance.
Of course, this remains largely theoretical for now, but it is an exciting possibility. Note high-level specifications often overlook performance issues, but they are likely sufficient for most scenarios. Regardless, we have formal development methodologies able to decompose problems to an arbitrary level of granularity since the 1990s, all while preserving correctness. It is likely that many of these ideas will be revisited soon.
I think the issue goes even deeper than verification. Verification is technically possible. You could, in theory, build a C compiler or a browser and use existing tests to confirm it works.
The harder problem is discovery: how do you build something entirely new, something that has no existing test suite to validate against?
Verification works because someone has already defined what "correct" looks like. There is possibly a spec, or a reference implementation, or a set of expected behaviours. The system just has to match them.
But truly novel creation does not have ground truth to compare against and no predefined finish line. You are not just solving a problem. You are figuring out what the problem even is.
Well that's a problem the software industry has been building for itself for decades.
The software industry has, since at least the adoption of "agile", created a culture of not just refusing to build to specs but insisting that specs are impossible to get from a customer.
Agile hasn't been insisting that specs are impossible to get from a customer. They have been insisting that getting specs from a customer is best performed as a dynamic process. In my opinion, that's one of agile's most significant contributions. It lines up with a learning process that doesn't assume the programmer or the customer knows the best course ahead of time.
I don’t want to put words in your mouth but I think I agree. It’s called requirements engineering. It’s hard, but it’s possible and waterfall works fine for many domains. Agile teams I see burning resources doing the same thing 2-3x or sprinting their way into major, costly architectural mistakes that would have been easily avoided by upfront planning and specs.
Agile is a pretty badly defined beast at the best of times, but even the most twisted interpretation doesn't mean that. It's mainly just a rejection of BDUF.
> Most people think of verification as a cost, a tax on development, justified only for safety-critical systems. That framing is outdated.
> The value is not in the verification workforce. It is in what verified delivery enables
These takes misrepresent safety-critical software verification. The verification workforce is indispensable. Whereas the software engineers are usually highly specialized, the verification team owns the domain and project knowledge to make sure the whole thing integrates. They consult on requirements, anticipate oversights, build infrastructure and validate it, and execute the verification.
So when the business needs a go / no-go call, they ask the verification lead.
Automation and proofs have their place, but I don't see how verification without human accountability is workable.
> passing tests, not for correctness. It hard-codes values to satisfy
> the test suite. It will not generalize.
This is one of the pain points I am suffering at work: workers ask coding agents to generate some code, and then to generate test coverage for the code. The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.
I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.
Writing the tests first and then writing code to pass the tests is no better than writing the code first and then writing tests that pass. What matters is that both the code and the tests are written independently, from specs, not from one another.
I think it is better not to have access to the tests when first writing the code, so as to make sure you code to the specs and not to the tests that test the specs, since something may be lost in translation. It means that I have a preference for code first, but the ideal case would be for different people to do both in parallel.
Anyway, about AI: if an AI writes both the tests and the code, it will make sure they match no matter which comes first; it may even go back and forth between the tests and the code, but that doesn't mean either is correct.
When LLMs can assist with writing useful tests before having seen any implementation, then I’ll be properly impressed.
As a result, computer science students were lectured a lot (too much for most people's taste, it seems) about data modeling and not much about code itself, which was viewed as mundane and uninteresting until the US hacker culture finally took over in the late 2000s.
Turns out that the French were just right too early, like with the Minitel.
I always felt like that's the main issue with unit testing. That's why I used it very rarely.
Maybe keeping the tests in a separate module, not letting the agent see the source while writing the tests, and not letting the agent see the tests while writing the implementation would help? They could just share the API and the spec.
And in case of tests failing another agent with full context could decide if the fix should be delegated to coding agent or to testing agent.
Obvious question: why not? Let’s say you have competent devs, fair assumption. Maybe it’s because they don’t have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.
Honestly I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.
We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.
Depending on your success rate with agents, you can have one that validates multiple criteria or separate agents for different review criteria.
If people are having to specify things in TLA+ etc -- even with the help of an LLM to write that spec -- they will then have something they can point the LLM at in order for it to verify its output and assumptions.
It's fun having LLMs because it makes it quite clear that a lot of testing has been cargo-culting. Did people ever check that the tests actually test for anything meaningful?
I had a fun discussion when the client tried to change values... Why is it still 0? Didn't you test?
And that was at that time I had to dive into the code base and cry.
I don't understand the value of that much code. What features are worth that much more than stability?
Some companies will do as you say - have (mostly clueless) engineers feed high level "wishes" to (entirely clueless) LLMs, and hope that everyone kind of gets it. And everyone will kind of get it. And everyone will kind of get it wrong.
Other companies will have their engineers explicitly treat the LLMs as collaborators / pair programmers, not independent developers. As an engineer in such a company, YOU are still the author of the code even if you "prompted" it instead of typing it. You can't just "fix this high level thing for me brah" and get away with it, but instead need to continuously interact with the LLM as you define and it implements the detailed wanted behaviors. That forces you to know _exactly_ what you want and ask for _exactly_ what you want without ambiguity, like in any other kind of programming. The difference is that the LLM is a heck of a lot quicker at typing code than you are.
There's also the problem that when stack rank time comes around each year no one cares about your unit tests. So using AI to write unit tests gives me time to work on things that will actually help me avoid getting arbitrarily fired.
I wish that software engineers were given the time to write both clean code and unit tests, and I wish software engineers weren't arbitrarily judged by out of touch leadership. However, that's not the world we live in so I let AI write my unit tests in order to survive.
You can use spec-driven development and TDD. Write the tests first. Write a stub that fails them. Then modify the code until the tests pass.
Ahem... Welcome to the wonderful world of Quality Assurance, software-developing audience. That part of the job, after you yeet your code over the fence, where the job is to bridge the gap between your madness and the madness of the rest of the business. Here you will find: frustration, an ever present sense the rest of the world is just out to make your life more difficult, a creeping sense of despair, a hot ice pick in the back of your mind every time the language model does something syntactically valid but completely nonsensical in the real world, the development of an ever increasing time horizon over which you can accurately predict the future but no one will believe you anyway, a smoldering hatred of the overly confident executive with an overdeveloped capacity for risk tolerance, a desire to run away and start a farm, and finally, a fundamental distrust of everything software and all the people who write it.
Don't forget your complimentary test framework and swag bag on your way out, and remember, you're here forever. You can try to check out, but you can never leave.
This is true for humans too. Tests should not be written or performed by the same person that writes the code
I can't wait. Maybe when shitty vibe coded software starts to cause real pain for people we can return to some sensible software engineering
I'm not holding my breath though
Dead Comment
I've been saying "the last job to be automated will be QA" and it feels more true every day. It's one thing to be a product engineer in this era. It's another to be working at the level the author is, where code needs to be verifiable. However, once people stop vibing apps and start vibing kernels, it really does fundamentally change the game.
I also have another saying: "any sufficiently advanced agent is indistinguishable from a DSL." I hadn't considered Lean in this equation, but I put these two ideas together and I feel like we're approaching some world where Lean eats the entire agentic framework stack and the entire operating system disappears.
If you're thinking about building something today that will still be relevant in 10 years, this is insightful.
Does it need to be HN-popular or a household name? Be in the news?
Or something that saves 50% of time by automating inane manual work from a team?
1. widely considered successful
2. made by humans from scratch in 2025
It looks like humans and AI are on par in this realm.
This is an example of an article which 'buries the lede'†.
It should have started with the announcement of the new zlib autoformalization (!) https://leodemoura.github.io/blog/2026/02/28/when-ai-writes-... to get you excited.
Then it should have talked about the rest - instead of starting with rather graceless and ugly LLM-written generic prose about AI topics that to many readers is already tiresomely familiar and doubtless was tldr for even the readers who aren't repelled automatically by that.
† or in my terms, fails to 'make you care': https://gwern.net/blog/2026/make-me-care
Moreover, humans will still need to read even rigorously proved code if only to suss out performance issues. And training people to read Lean will continue to be costly.
Though, as the OP says, this is a very exciting time for developing provably correct systems programming.
Some performance issues (asymptotics) can be addressed via proof, others are routinely verified by benchmarking.
If you want it to be a question of economics, I think the answer is in whether this approach is more economical than the alternative, which is having people run this substrate. There's a lot of enthusiasm here and you can't deny there has been progress.
I wouldn't be so quick to doubt. It costs nothing to be optimistic.
I don't quite follow but I'd love to hear more about that.
Another way of doing it is the agent just writes an algorithm to perform the task and runs it. In this world, tools are just APIs and the agent has to think through its entire process end to end before it even begins and account for all cases.
Only the latter is Turing complete, but the former approaches the latter as it improves.
Sigh. Is there any LLM solution for HN readers to filter out all the top-level commenters who haven't RTFA? I don't need the (micro-)shitstorms these people spawn, even if the general HN algo scores them as "interesting".
As you add components to a system, the time it takes to verify that the components work together increases superlinearly.
At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
AI coding agents hit this barrier faster than ever, because of how quickly they can generate components (and how poorly they manage complexity).
I think verification is now the problem of agentic software engineering. I think formal methods will help, but I don't see how they will apply to messy situations like end-to-end UI testing or interactions between the system and the real world.
I posted more detailed thoughts on X: https://x.com/i/status/2027771813346820349
> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
Could you elaborate on this? Your post makes it sound as if the verification complexity diverged as the number of components n approaches a certain finite value n_0, but that seems unlikely to me. If, in contrast, the verification complexity remains finite at n_0, then verification should still be possible in finite time, shouldn't it? Yes, it might be a considerable amount of time but I assume your theorem doesn't predict lower bounds for the involved constants?
Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though? AI might also lead us to removing unnecessarily complex dependencies from our software supply chain or stripping them down to the few features we need.
>> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
> Could you elaborate on this?
I plan to publish a thorough post with an interactive model. Whether human or AI, you are capacity constrained, and I glossed over `C` (capacity within a given timeframe) in the X post.
You are correct that verification complexity remains finite at n_0. The barrier is practical: n_0 is where V(n) exceeds your available capacity C. If V(n) = n^(1+k), then n_0 = C^(1/(1+k)). Doubling your capacity doesn't double n_0. It increases by a factor of 2^(1/(1+k)), which is always less than 2.
So the barrier always exists for, say, a given "dev year" or "token budget," and the cost to push it further grows superlinearly. It's not absolutely immovable, but moving it gets progressively harder. That's what I mean by "literally run out of time." At any given capacity, there is a finite n beyond which complete verification is not possible. Expanding capacity buys diminishing returns.
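The diminishing return is easy to make concrete. A small sketch under the assumptions stated above (verification cost V(n) = n^(1+k), capacity C, so n_0 = C^(1/(1+k))); the numbers are illustrative only:

```python
# Largest component count whose verification cost still fits in capacity C,
# given V(n) = n**(1 + k):  n_0 = C**(1 / (1 + k)).
def n0(C, k):
    return C ** (1 / (1 + k))

k = 0.5                 # assumed superlinearity of verification cost
small = n0(100, k)      # budget C = 100
big = n0(200, k)        # doubled budget
print(small, big)
print(big / small)      # 2**(1/(1+k)) = 2**(2/3) ~ 1.587, not 2
```

Doubling capacity here grows the verifiable system by only ~59%, and the gap widens as k grows.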
> Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though?
You are getting at my core motivation for exploring this question.
Verification requires a definition of "done" and I wonder if it will ever be possible (or desirable) for AI to define done on its own, let alone verify it and simplify software based on its understanding of our needs.
You make a great point that we are not required to add more components and "go right" along the curve. We can choose to simplify, and that is absolutely the right takeaway. AI has made many people believe that by generating more code at a faster pace they are accomplishing more. But that's not how software productivity should be judged.
To answer your question about assumptions, while AI can certainly be prompted to help reduce n or k in isolated cases where "done" is very clear, I don't think it's realistic to expect this in aggregate for complex systems where "done" is subjective and dynamic.
I'm speaking mainly in the context of commercial software dev here, informed by my lived experience building hundreds of apps. I often say software projects have a fractal complexity. We're constantly identifying new needs and broader scope the deeper we go, not to mention pivots and specific customer asks. You rarely get to stand still.
I don't mean to be pessimistic, but my hunch is that complexity growth outpaces the rate of simplification in almost every software project. This model attempts to explain why that is so. And notably, simplification itself requires verification and so it is in a sense part of the verification cost, too.
Vibe-coders neither write the code, nor write the tests, nor ensure all test cases are covered. Now imagine such a flaky foundation being used as a base for even more untested code. That's how bad-quality software (the kind that's usually unfixable without a major rewrite) is born.
Also, most vibe-coders don't have enough experience/knowledge to figure out what is wrong with the code generated by the AI. For that, you need to know more than the AI and have a strong foundation in the domain you're working in. Here is an example: you ask the AI to write the code for a comment form. It generates the backend and frontend code for you (say React/Svelte/Vue/whatever). The vibe-coder sees the UI - most likely written in Tailwind CSS - and thinks "wow, that looks really good!" and clicks approve. However, an experienced person might notice the form has no CSRF protection in place. The vibe-coder might not even be aware of the concept of CSRF (let alone the OWASP Top 10 security risks).
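For readers who haven't met CSRF: the classic synchronizer-token defence is only a few lines, which makes its absence in generated code all the more telling. A framework-free sketch (function names are mine, not from any library):

```python
import hmac
import secrets

def issue_csrf_token(session):
    # On GET: generate a random token, store it server-side in the
    # session, and embed it in the rendered form as a hidden field.
    token = secrets.token_hex(32)
    session["csrf_token"] = token
    return token

def csrf_ok(session, submitted):
    # On POST: the submitted hidden field must match the session copy.
    # A cross-site attacker can make the victim's browser POST, but
    # cannot read the token to include it. compare_digest avoids
    # leaking the comparison result via timing.
    expected = session.get("csrf_token")
    return bool(expected) and hmac.compare_digest(expected, submitted or "")

session = {}
token = issue_csrf_token(session)
print(csrf_ok(session, token))     # legitimate form post
print(csrf_ok(session, "forged"))  # cross-site forgery attempt
```

Real frameworks (Django, Rails, etc.) ship this built in; the point is that reviewing for its presence requires knowing it should be there.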
Hence, the fundamental problem is knowing more about the domain than the AI, enough to pick up the flaw. Unless this fundamental problem is solved - which I don't think it will be anytime soon, because everyone can generate code + UI these days - I don't see a solution to the verification problem.
However, this is good news for consultants and the like, because it creates more work down the line to fix the vibe-coded mess after they got hacked the very next day, and we can charge them a rush fee on top of it, too. So, it's not all that bad.
All largely stemming from the fact that tests can't meaningfully see and interact with the page like the end user will.
Not disagreeing with you here, but what ends up happening is the frontend works flawlessly in the browser/device being tested but has tons of bugs in the others. The best examples of this are most banking apps, corporate portals, etc. But honestly, you can get away without writing tests for the frontend. The backend? That directly affects the security of the software, and you cannot afford to skip important tests there.
Isn't this a great use case for LLM tests? Have a "computer use agent" and then describe the parameters of the test as "load the page, then navigate to bar, expect foo to happen". You don't need the LLM to generate a test using puppeteer or whatever which is coupled to the specific dom, you just describe what should happen.
Now, it is actually completely possible to write UI code without any unit tests in a completely safe way. You use the functional core, imperative shell approach. When all your domain logic is in a fully tested, functional core, you can just go ahead and write "what works" in a thin UI shell. Good luck getting an LLM to rigidly conform to such an architecture, though.
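A tiny illustration of that split (domain and names invented for the example): every decision lives in a pure, easily tested core, while the shell merely performs I/O on the core's answers and needs no unit tests of its own.

```python
# Functional core: pure and deterministic, so it is trivially
# unit-testable with no mocks or UI harness.
def cart_total(prices, discount_threshold=100.0, discount=0.10):
    subtotal = sum(prices)
    if subtotal >= discount_threshold:
        subtotal *= 1 - discount
    return round(subtotal, 2)

# Imperative shell: thin "what works" wiring -- just rendering,
# deliberately left untested.
def checkout_ui(prices):
    total = cart_total(prices)        # the only decision is in the core
    print(f"Total due: ${total:.2f}")
    return total

checkout_ui([40.0, 70.0])  # 110 crosses the threshold, 10% off -> 99.00
```

The bet is that bugs cluster in decisions, not in print statements, so concentrating decisions in the core is where the test budget should go.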
The Dafny code formed a security kernel at the core of a service, enforcing invariants like that an audit log must always be written to prior to a mutating operation being performed. Of course I still had bugs, usually from specification problems (poor spec / design) or Claude not taking the proof far enough (proving only for one of a number of related types, which could also have been a specification problem on my part).
In the end I realized I'm writing a bunch of I/O bound glue code and plain 'ol test driven development was fine enough for my threat model. I can review Python code more quickly and accurately than Dafny (or the Go code it eventually had to link to), so I'm back to optimizing for humans again...
https://aws.amazon.com/blogs/opensource/lean-into-verified-s...
> We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications … We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs.
https://arxiv.org/html/2509.22908v1
Currently, engineers work with loose specifications, which they translate into code. With the proposed approach, they would need to first convert those specifications into a formally verifiable form before using LLMs to generate the implementation.
But to be production-ready, that spec would have to cover all possible use-cases, edge cases, error handling, performance targets, security and privacy controls, etc. That sounds awfully close to being an actual implementation, only in a different language.
Deleted Comment
Of course, this remains largely theoretical for now, but it is an exciting possibility. Note high-level specifications often overlook performance issues, but they are likely sufficient for most scenarios. Regardless, we have formal development methodologies able to decompose problems to an arbitrary level of granularity since the 1990s, all while preserving correctness. It is likely that many of these ideas will be revisited soon.
The harder problem is discovery: how do you build something entirely new, something that has no existing test suite to validate against?
A) They let you verify that the implementation passes its spec, more or less.
B) They are a (trustworthy) description of how the system behaves, they allow you to understand the system better.