KurSix · 3 months ago
There's a catch with 100% coverage. If the agent writes both the code and the tests, we risk falling into a tautology trap. The agent can write flawed logic and a test that verifies that flawed logic (which will pass). 100% coverage only makes sense if tests are written before the code or rigorously verified by a human. Otherwise, we're just creating an illusion of reliability by covering hallucinations with tests. An "executable example" is only useful if it's semantically correct, not just syntactically correct.
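To make the tautology trap concrete, here is a toy sketch (hypothetical function and values): the agent introduces a divide-by-ten bug, then writes a test that encodes the same bug, so the suite is green with full coverage.

```python
# Hypothetical example of the tautology trap: the agent's code divides
# by 10 instead of 100, and its own test asserts the flawed result.
def discounted_price(price: float, percent: float) -> float:
    return price * (1 - percent / 10)  # bug: should be percent / 100

def test_discounted_price():
    # Encodes the same bug: a 10% discount should give 90.0, not 0.0.
    assert discounted_price(100.0, 10) == 0.0

test_discounted_price()  # green, 100% line coverage, wrong semantics
```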
ben_w · 3 months ago
All the problems you list are true, but the solutions not so much.

I've seen this problem with humans even back at university when it was the lecturer's own example attempting to illustrate the value of formal methods and verification.

I would say the solution is neither "get humans to do it" nor "do it before writing code", but rather "get multiple different minds involved to check each other's blind spots, and no matter how many AI models you throw at it they only count as one mind even when they're from different providers". Human tests and AI code, AI tests and human code, having humans do code reviews of AI code or vice-versa, all good. Two different humans usually have different blind spots, though even then I've seen some humans bully their way into being the only voice in the room with the full support of their boss, not that AI would help with that.

godelski · 3 months ago

  > "get multiple different minds involved to check each other's blind spots
This is actually my big gripe about chatbot coding agents. They are trained on human preference and thus they optimize for errors that are in our blind spots.

I don't think people take this subtlety seriously enough. Unless we have an /objective/ ground truth, we end up proxying our optimization. So we don't optimize for code that /is/ correct, we optimize for code that /looks/ correct. It may seem like a subtle difference but it is critical.

The big difference is when they make errors they are errors that are more likely to be difficult for humans to detect.

Good tools should complement tool users. Fill in gaps. But as we've been trying to train agents to replace humans we are not focusing on this distinction. I want my coding agent to make errors that are obvious to me just as I want errors I make to be obvious to it (or for it to be optimized to detect errors I make)

melagonster · 3 months ago
Maybe this is because humans have good intuition for the differences between us, but that kind of intuition does not work on the behaviour of LLMs.
joshribakoff · 3 months ago
That’s why you’ve gotta test your tests. Insert bugs and ensure they fail.

As the sibling comments alluded to, it’s not exclusively an AI problem since multiple people can miss the issue too.

It’s wonderful that AI is an impetus for so many people to finally learn proper engineering principles though!

KurSix · 2 months ago
Mutation testing is becoming the only way to catch AI red-handed. Without mutations you'll be staring at a perfect CI/CD dashboard, unaware that your tests verify absolutely nothing.

Yeah, it burns CPU like crazy, but CPU time is dirt cheap right now compared to the cost of an engineer debugging that self-deception in production.
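A hand-rolled sketch of the idea, with made-up names (real tools like mutmut or Stryker generate the mutants automatically): mutate the code, and check that the suite notices.

```python
def is_adult(age: int) -> bool:
    return age >= 18

def is_adult_mutant(age: int) -> bool:
    return age > 18  # mutation: ">=" became ">"

def weak_suite(impl) -> bool:
    # Never probes the boundary, so it can't tell the two apart.
    return impl(30) and not impl(5)

# Both pass: the mutant "survives", so the suite proves little.
assert weak_suite(is_adult) and weak_suite(is_adult_mutant)

def strong_suite(impl) -> bool:
    return impl(18) and not impl(5)  # boundary case added

# The boundary assertion kills the mutant.
assert strong_suite(is_adult) and not strong_suite(is_adult_mutant)
```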

vrighter · 3 months ago
but who will test the tests of tests?
smarx007 · 3 months ago
I think the phase change hypothesis* is a bit wrong.

I think it happens not at 100% coverage but at, say, 100% MC/DC test coverage. This is what SQLite and avionics software aim for.

*has not been confirmed by peer-reviewed research.

cloudhead · 3 months ago
What's MC/DC?
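MC/DC (modified condition/decision coverage) requires, beyond executing every line, test cases showing that each boolean condition independently flips the decision. A toy sketch with hypothetical names:

```python
def release_brake(speed_ok: bool, doors_closed: bool) -> bool:
    return speed_ok and doors_closed

# A single call already gives 100% line coverage:
assert release_brake(True, True) is True

# MC/DC additionally demands cases where each condition alone
# changes the outcome while the other is held fixed:
assert release_brake(False, True) is False   # speed_ok flips the decision
assert release_brake(True, False) is False   # doors_closed flips the decision
```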
closeparen · 3 months ago
Also true of human-written unit tests. You probably also want to have integration or UI automation tests that cover the end-user scenarios in your product requirements, and invariants that are checked against large numbers of examples either taken from production (sanitized of course), in a shadow environment, or generated if you absolutely must.
theptip · 3 months ago
It’s true - but there are “good code” solutions to this already. For example, BDD / Acceptance Tests can be used to write human-readable specs.

IMO it’s quite boilerplate-y to set this up pre-LLM but probably the ROI is favorable now.

Furthermore, as Uncle Bob has written a lot about, putting effort into structuring your tests well is another area that’s usually under-invested. LLMs often write very repetitive tests, but are happy to DRY out, write factories, etc if you ask them.

notimetorelax · 3 months ago
You’re right. What I like doing in those cases is to review very closely the tests and the assertions. Frequently it’s even faster than looking at the SUT itself.
ruszki · 3 months ago
I've heard this "review very closely" thing many times, and it rarely means reviewing very closely. Maybe 5% of developers ever really do this, and I'm probably overestimating. When people post AI-generated code here, it's quite obvious that they don't review it properly. There are videos where people recorded how we should use LLMs, and they clearly don't do this.
eternityforest · 3 months ago
Tests freeze behavior in place, and manual end-to-end testing can confirm that the most common paths are at least kind of correct-ish.

Obviously that's not good enough, but I'd much rather have AI tests than poor test coverage.

dbdoskey · 3 months ago
In theory, that is the benefit of having one agent that is limited to writing only the tests and another that only writes the code, run separately; that way, the coding agent can't make a failing test pass by changing the test, etc.
eru · 3 months ago
Well, we let humans write both business logic code and tests often enough, too.

Btw, you can get a lot further in your tests, if you move away from examples, and towards properties.
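A minimal hand-rolled sketch of what testing with properties looks like (libraries like Hypothesis or QuickCheck generate and shrink the inputs for you); `my_sort` is a stand-in for whatever is under test:

```python
import random
from collections import Counter

def my_sort(xs: list[int]) -> list[int]:
    return sorted(xs)  # stand-in for the implementation under test

# Properties must hold for *all* inputs, not just hand-picked examples.
for _ in range(200):
    xs = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
    out = my_sort(xs)
    assert all(a <= b for a, b in zip(out, out[1:]))  # output is ordered
    assert Counter(out) == Counter(xs)                # same multiset of elements
```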

jaynetics · 3 months ago
Can you give an example (pun not intended) of testing with properties?
machomaster · 3 months ago
You could mitigate that risk by using different agents (versions, companies).
tombert · 3 months ago
Something I just started doing yesterday, and I'm hoping it catches on, is that I've been writing the spec for what I want in TLA+/PlusCal at a pretty high level, and then I tell Codex implement exactly to the spec. I tell it to not deviate from the spec at all, and be as uncreative as possible.

Since it sticks pretty close to the spec and since TLA+ is about modifying state, the code it generates is pretty ugly, but ugly-and-correct code beats beautiful code that's not verified.

It's not perfect; something that naively adheres to a spec is rarely optimized, and I've had to go in and replace stuff with Tokio or Mio or optimize a loop because the resulting code is too slow to be useful, and sometimes the code is just too ugly for me to put up with so I need to rewrite it, but the amount of time to do that is generally considerably lower than if I were doing the translation myself entirely.

The reason I started doing this: the stuff I've been experimenting with lately has been lock-free data structures, and I guess what I am doing is novel enough that Codex does not really appear to generate what I want; it will still use locks and lock files and when I complain it will do the traditional "You're absolutely right", and then proceed to do everything with locks anyway.

In a sense, this is close to the ideal case that I actually wanted: I can focus on the high-level mathey logic while I let my metaphorical AI intern deal with the minutia of actually writing the code. Not that I don't derive any enjoyment out of writing Rust or something, but the code is mostly an implementation detail to me. This way, I'm kind of doing what I'm supposed to be doing, which is "formally specify first, write code second".

BrittonR · 3 months ago
This is how I’m also developing most of my code these days as well. My opinions are pretty similar to the pig book author https://martin.kleppmann.com/2025/12/08/ai-formal-verificati....
tombert · 3 months ago
For the first time I might be able to make a case for TLA+ being used in a workplace. I've been trying for the last nine years, with managers who constantly say "they'll look into it".
jnpnj · 3 months ago
Interesting. Just the other day I tried asking whether iterating in Haskell or Prolog wouldn't help both convergence speed and token use. I wish there were a group studying how to do proper engineering with LLMs without losing the modeling / verification aspect.
baq · 3 months ago
You might find success with having the LLM contribute to the spec itself. It suddenly started to work with the most recent frontier models, to the point that the economics of writing them shifted, since each turn got 10-100x cheaper to get right.
pgroves · 3 months ago
This is sort of why I think software development might be the only real application of LLMs outside of entertainment. We can build ourselves tight little feedback loops that other domains can't. I somewhat frequently agree on a plan with an LLM and a few minutes or hours later find out it doesn't work, and then the LLM is like "that's why we shouldn't have done it like that!". Imagine building a house from scratch and finding out that it was using some American websites to spec out your electric system, and not noticing the problem until you're installing your Canadian dishwasher.
mrtksn · 3 months ago
> Imagine building a house from scratch

That's why those engineering fields have strict rules, often require formal education, and someone can even end up in prison if they screw up badly enough.

Software is so much easier and safer. Until very recently anonymous engineering was the norm, and people are very annoyed with Apple pushing for signing off the resulting product.

Highly paid software engineers across the board must have been an anomaly that is ending now. Maybe in the future only those who code actually novel solutions or high risk software will be paid very well - just like engineers in the other fields.

zarzavat · 3 months ago
> people are very annoyed with Apple pushing for signing off the resulting product.

Apple is very much welcome to push for signing off of software that appears on their own store. That is nothing new.

What people are annoyed about is Apple insisting that you can only use their store, a restriction that has nothing to do with safety or quality and everything to do with the stupendous amounts of money they make from it.

PunchyHamster · 3 months ago
Software developers being paid well is a result of demand, not because it's very hard.

The skill and strictness required are only vaguely related to pay; if there are enough people for the job it won't pay amazingly, regardless of how hard it is.

> Software is so much easier and safer, till very recently anonymous engineering was the norm and people are very annoyed with Apple pushing for signing off the resulting product.

That has nothing to do with engineering quality; it's just to make it harder to go around their ecosystem (and skip paying the store fee), with the additional benefit that a signed package is harder to attack. You can still deliver absolute slop, but the slop will be from you, not a middleman that captured the delivery process.

ptx · 3 months ago
I don't understand why the experience you describe would lead you to conclude that LLMs might be useful for software development.

The response "that's why we shouldn't have done it like that!" sounds like a variation on the usual "You're absolutely right! I apologize for any confusion". Why would we want to get stuck in a loop where an AI produces loads of absolute nonsense for us to painstakingly debug and debunk, after which the AI switches track to some different nonsense, which we again have debug and debunk, and so on. That doesn't sound like a good loop.

ogogmad · 3 months ago
> This is sort of why I think software development might be the only real application of LLMs outside of entertainment.

Wow. What about also, I don't know, self-teaching*? In general, you have to be very arrogant to say that you've experienced all the "real" applications.

* - For instance, today and yesterday, I've been using LLMs to teach myself about RLC circuits and "inerters".

Larrikin · 3 months ago
I would absolutely not trust an LLM to teach me anything alone. I've had it introduce ideas I hadn't heard about which I looked up from actual sources to confirm it was a valid solution. Daily usage has shown it will happily lead you down the wrong path and usually the only way to know that it is the wrong path, is if you already knew what the solution should be.

LLMs MAY be a version of office hours or asking the TA, if you only have the book and no actual teacher. I have seen nothing that convinces me they are anything more than the latest version of the hammer in our toolbox. Not every problem is a nail.

array_key_first · 3 months ago
Self-teaching pretty much doesn't work. For many decades now, the barrier has not been access to information, it's been the "self" part. Turns out most people need regimen, accountability, and strictness, which AI just doesn't provide because it's a yes-man.
skywhopper · 3 months ago
It’s somewhat delusional and potentially dangerous to assume that chatting with an LLM about a specific topic is self-teaching beyond the most surface-level understanding of a topic. No doubt you can learn some true things, but you’ll also learn some blatant falsehoods and a lot of incorrect theory. And you won’t know which is which.

One of the most important factors in actually learning something is humility. Unfortunately, LLM chatbots are designed to discourage this in their users. So many people think they’re experts because they asked a chatbot. They aren’t.

stackghost · 3 months ago
Why would you think that a machine known to cheerfully and confidently assert complete bullshit is suitable to learn from?
pgroves · 3 months ago
Thinking about this some more, maybe I wasn't considering simulators (aka digital twins), which are supposed to be able to create fairly reliable feedback loops without building things in reality. Eg will this plane design be able to take off? Still, I feel fortunate I only have to write unit tests to get a bit of contact with reality.
redox99 · 3 months ago
Simulations in general are pretty flawed, and AIs will usually find ways to "cheat" the simulation.

It's a very useful tool of course, but not as good as the software situation.

knollimar · 3 months ago
You can install a Canadian one just fine; authorities might not like it in some jurisdictions, but it's safe and might even be to code.

I literally just had this exact argument; the biggest issue is they're only tested for smaller amperages, but you just downsize the breaker.

toxic72 · 3 months ago
It's more like you're installing the dishwasher and the dishwasher itself yells at you "I told you so" ;)
lvspiff · 3 months ago
I think of it as: you say "install dishwasher" and its plan looks like all the right steps, but as it builds it out, somehow you end up hiring a maid and buying a drying rack.
tempodox · 3 months ago
This is hallucination. Or maybe a sales pitch. If production bugs and the requirement to retain a workable code base don’t get us to write “good” code, then nothing will. And at the current state of the art, “AI” will tend to make it worse.
reedlaw · 3 months ago
The first sentence is problematic:

> For decades, we’ve all known what “good code” looks like.

When relatively trivial concerns such as the ideal length of methods haven't achieved consensus, I doubt there can be any broadly accepted standard for software quality. There are plenty of metrics such as test coverage, but anyone with experience could tell you how easy it is to game those and that enforcing arbitrary standards can even cause harm.

tempodox · 3 months ago
I agree. Moreover, I submit that “good code” isn’t even a universal constant, but context-sensitive along several dimensions.
deaux · 3 months ago
> When relatively trivial concerns such as the ideal length of methods haven't achieved consensus

Is the consensus not that there isn't one? Surely that's the only consensus to reach? I don't see how there could possibly be an "ideal length", whatever you pick it'd be much too dogmatic.

stingraycharles · 3 months ago
Yeah, test coverage isn't a replacement for good code. Worse yet, it may give you false confidence, especially if it's the AI that's writing the tests (which in practice very often is the case).

Deleted Comment

zwnow · 3 months ago
Shhhh the original poster is the CEO of an AI based company. I am sure there is no bias here. /s
mkozlows · 3 months ago
I like this. "Best practices" are always contingent on the particular constellation of technology out there; with tools that make it super-easy to write code, I can absolutely see 100% coverage paying off in a way that doesn't for human-written code -- it maximizes what LLMs are good at (cranking out code) while giving them easy targets to aim for with little judgement.

(A thing I think is under-explored is how much LLMs change where the value of tests are. Back in the artisan hand-crafted code days, unit tests were mostly useful as scaffolding: Almost all the value I got from them was during the writing of the code. If I'd deleted the unit tests before merging, I'd've gotten 90% of the value out of them. Whereas now, the AI doesn't necessarily need unit tests as scaffolding as much as I do, _but_ having them put in there makes future agentic interactions safer, because they act as reified context.)

Waterluvian · 3 months ago
It might depend on the lifecycle of your code.

The tests I have for systems that keep evolving while being production critical over a decade are invaluable. I cannot imagine touching a thing without the tests. Many of which reference a ticket they prove remains fixed: a sometimes painfully learned lesson.

zmgsabst · 3 months ago
Also the lifecycle of your system, eg, I’ve maintained projects that we no longer actively coded, but we used the tests to ensure that OS security updates, etc didn’t break things.
johnnyfived · 3 months ago
I've said this before here, but "best practices" in code are indeed fairly uniform even across different implementations and architectures. You can ask an LLM to write the best possible code for a scenario, and your own implementation likely wouldn't differ much.

Writing, art, creative output, that's nothing at all like code, which puts the software industry in a more particular spot than anything else in automation.

afro88 · 3 months ago
Without having tried it (caveat), I worry that 100% coverage to an LLM will lock in bad assumptions and incorrect functionality. It makes it harder for it to identify something that is wrong.

That said, we're not talking about vibe coding here, but properly reviewed code, right? So the human still goes "no, this is wrong, delete these tests and implement for these criteria"?

realusername · 3 months ago
That's already what I'm experiencing even without forcing anything, the LLM creates a lot of "is 1 = 1?" tests
sgk284 · 3 months ago
Yep, 100% correct. We're still reviewing and advising on test cases. We also write a PRD beforehand (with the LLM interviewing us!) so the scope and expectations tend to be fairly well-defined.
danieka · 3 months ago
I thought that the article would be about if we want AI to be effective, we should write good code.

What I notice is that Claude stumbles more on code that is illogical, unclear, or has bad variable names. For example, a variable named "iteration_count" that actually contains a sum will "fool" the AI.

So keeping the code tidy gives the AI clearer hints on what's going on which gives better results. But I guess that's equally true for humans.

asielen · 3 months ago
Relatedly, it seems AI has been effective at forcing my team to care about documentation, including good comments. Before, when it was just humans reading these things, there was less motivation to keep them up to date. Now the idea that AI may be using that documentation as part of a knowledge base, or in evaluating processes, seems to motivate people to actually spend time updating the internal docs (with some AI help, of course).

It is kind of backwards because it would have been great to do it before. But it was never prioritized. Now good internal documentation is seen as essential because it feeds the models.

sleepy_keita · 3 months ago
Humans can work with these cases better though because they have access to better memory. Next time you see "iteration_count", you'll know that it actually has a sum, while a new AI session will have to re-discover it from scratch. I think this will only get better as time goes on, though.
charcircuit · 3 months ago
You are underestimating how lazy humans can be. Humans are going to skim code, scroll down into the middle of some function and assume iteration count means iteration count. AI on the other hand will have the full definition of the function in its context every time.
drak0n1c · 3 months ago
Unfortunately, so far coding models seem to perform worse and break in other ways as context grows, so it's still best practice to start a new conversation even when iterating. Luckily, high-end reasoning models are now catching when var names don't match what they actually do (as long as the declaration is provided in context).
rsyring · 3 months ago
Or you immediately rename it to avoid the need to remember? :)
CharlieDigital · 3 months ago
What I find works really well: scaffold the method signature and write your intent in the comment for the inputs, outputs, and any mutations/business logic + instructions on approach.

An LLM has a very high chance of one-shotting this and doing it well.
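For instance (hypothetical names, with a plausible one-shot completion filled in): the signature and intent comment pin down inputs, outputs, and edge cases, leaving the model little room to improvise.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    subtotal: float
    country: str

def total_with_tax(order: Order, rates: dict[str, float]) -> float:
    """Return subtotal plus tax.

    - Look up the rate for order.country in `rates`.
    - Unknown countries get a 0% rate; never raise.
    - Must not mutate `order` or `rates`.
    """
    # Plausible one-shot completion given the scaffold above:
    rate = rates.get(order.country, 0.0)
    return order.subtotal * (1 + rate)
```

With the intent spelled out, e.g. `total_with_tax(Order(50.0, "??"), {"DE": 0.19})` falls back to the 0% rate instead of raising, exactly as the comment demands.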

Philip-J-Fry · 3 months ago
This is what I tend to do. I still feel like my expertise in architecting the software and abstractions is like 10x better than I've seen an LLM do. I'll ask it to do X, and then ask it to do Y, and then ask it to do Z, and it'll give you the most junior looking code ever. No real thought on abstractions, maybe you'll just get the logic split into different functions if you're lucky. But no big picture thinking, even if I prompt it well it'll then create bad abstractions that expose too much information.

So eventually it gets to the point where I'm basically explaining to it what interfaces to abstract, what should be an implementation detail and what can be exposed to the wider system, what the method signatures should look like, etc.

So I had a better experience when I just wrote the code myself at a very high level. I know what the big picture look of the software will be. What types I need, what interfaces I need, what different implementations of something I need. So I'll create them as stubs. The types will have no fields, the functions will have no body, and they'll just have simple comments explaining what they should do. Then I ask the LLM to write the implementation of the types and functions.

And to be fair, this is the approach I have taken for a very long time now. But when a new more powerful model is released, I will try and get it to solve these types of day to day problems from just prompts alone and it still isn't there yet.

It's one of the biggest issues with LLM first software development from what I've seen. LLMs will happily just build upon bad foundations and getting them to "think" about refactoring the code to add a new feature takes a lot of prompting effort that most people just don't have. So they will stack change upon change upon change and sure, it works. But the code becomes absolutely unmaintainable. LLM purists will argue that the code is fine because it's only going to be read by an LLM but I'm not convinced. Bad code definitely confuses the LLMs more.

zahlman · 3 months ago
What if I write the main function but stub out calls to functions that don't exist yet; how will it do with inferring what's missing?
sandblast2 · 3 months ago
The expertise in software engineering typical of these promptfondling companies shines through this blog post.

Surely they know 100% code coverage is not a magic bullet, because code flow and behavior can differ depending on the input. Just because you found a few examples that happen to hit every line of code doesn't mean you hit every possible combination of inputs. You are living in a fool's paradise, which is no surprise, because only fools believe in LLMs. What you are really looking for is a formal proof of the codebase, which of course no one does because the costs would be astronomical (and LLMs are useless for it, which is not at all unique, because they are useless for everything software related, but they are particularly unusable for this).

visarga · 3 months ago
So, what is the solution? A senior engineer looks over the PR and signs LGTM? That is just "vibe testing", the worst kind of testing. I think the author is right: setting up tests to form a reactive environment for coding agents will lead us to a new golden age. If you later find some issue with your test coverage, you expand it. But it is good to do it from the start as thoroughly as possible.
sandblast2 · 3 months ago
> So, what is the solution?

1. Clearly explain the massive harm LLMs cause society and the environment to everyone. (Mass media should do this instead of parroting every nonsense the promptfondlers feed them.)

2. Ban them all. Don't tell me it's impossible just because it's widespread. Asbestos was everywhere.

SR2Z · 3 months ago
It's a bold claim that LLMs are useless for formal verification when people have been hooking them up to proof assistants for a while. I think that it's probably not a terrible idea; the LLM might make some mistakes in the spec but 99% of the time there are a lot of irrelevant details that it will do a serviceable job with.