I have a lot of bitter things to say about automated testing, having spent 14 years of my life trying to knead it into a legitimate profession, but here's the most significant:
Your test case is more useless than a turd in the middle of the dining room table unless you put a comment in front of it that explains what it assumes, what it attempts, and what you expect to happen as a result.
Because if you just throw in some code, you're only giving the poor bastard investigating it two puzzles to debug instead of one.
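(For instance, even a tiny made-up pytest example like this removes one of the two puzzles; the code under test here is just Python's built-in round:)
def test_round_half_to_even():
    # Assumes: Python's built-in round(), which uses banker's rounding.
    # Attempts: round the midpoint values 0.5 and 1.5.
    # Expects: each rounds to the nearest even integer (0 and 2), not "up".
    assert round(0.5) == 0
    assert round(1.5) == 2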
At an old job, one manager would put in his employees' annual reports stuff like "Developer X wrote N automated tests, fixed M bugs, and filed P new bugs this quarter..."
The obvious result of Goodhart's Law ensued, leading to test cases like you mention.
Lesson to leaders: Please stop your bad managers from pulling stupid crap like this. It wastes a lot more time in the long run.
Which is funny, as the purpose of testing is to explain to other developers what the code under test assumes and what should be expected of it under various conditions. It is documentation.
If you have to document your documentation, you might be missing something fundamental in how you are writing your first order documentation. Not to mention that in doing so you defeat the reason for writing your documentation in an executable form (to be able to automatically validate that the documentation is true).
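(As a minimal sketch of what "documentation in an executable form" can look like, here is a made-up doctest example; the mean function is only an illustration:)
def mean(xs):
    """Return the arithmetic mean of a non-empty list of numbers.

    >>> mean([1.0, 2.0, 3.0])
    2.0
    """
    return sum(xs) / len(xs)

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # checks every example in the docstrings and reports failures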
So if I understand correctly, your position is "code is the documentation"?
Over time I'm inclined to value human-written documentation, especially when things involve integrations of multiple systems. I had real cases where two parties point at their code and say it is correct, and in isolation the code does look correct. But when the time comes to integrate these systems, it breaks. And if you have a human-readable document where intentions and expectations are specified, it's much easier to come to a common (working) solution.
Not all languages have the capability to express complex intentions, so code-as-documentation does not work most of the time.
Disregarding the "code is doc" position, it's still common to have an overview or index for documentation, which points readers in the right direction instead of dumping pages of detailed docs on them.
Now, you could also have a well organized test suite that goes from most obvious to most detailed, split into sections for each use-case, but this sounds a lot more tedious than "write a one-line comment describing the unit test".
>the purpose of testing is to explain to other developers what the code under test assumes and what should be expected of it under various conditions
No, the point of automated testing is to verify that what is under test behaves correctly and to be able to scale this verification more cheaply than having humans do it. Documenting what it verifies and under what conditions is just a side effect.
A test must be reproducible. If it is not, it is not a test.
>Your test case is more useless than a turd in the middle of the dining room table unless you put a comment in front of it that explains what it assumes, what it attempts, and what you expect to happen as a result.
This is why I found Gherkin/Cucumber (and BDD in general) to be a total revelation when I first encountered it. No one should be writing tests any other way IMO.
https://cucumber.io/docs/gherkin/reference/
Gherkin/Cucumber reintroduce the very problem TDD/BDD was intended to solve: Documentation falling out of sync with the implementation.
The revelation of TDD, which was later rebranded as BDD to deal with the confusion that arose with other types of testing, was that if your documentation was also executable, the machine could be used to prove that the documentation is true. The Gherkin/Cucumber specs themselves are not executable and require you to re-document the function in another language, with no facilities to ensure that the two are consistent with each other.
If you are attentive enough to ensure that the documentation and the implementation are aligned, you may as well write it in plain English. It will give you all of the same benefits without the annoying syntax.
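(To make the duplication concrete, here is a minimal made-up pair: the Gherkin text and the separate step definitions it maps to, behave-style; all names and values are illustrative. Nothing but discipline keeps the prose and the bindings in sync:)
Feature: List average
  Scenario: Average lies between the minimum and the maximum
    Given the list 1.0, 2.0, 3.0
    When I compute its average
    Then the result is between 1.0 and 3.0

# steps/average_steps.py
from behave import given, when, then

@given("the list {values}")
def step_list(context, values):
    context.xs = [float(v) for v in values.split(",")]

@when("I compute its average")
def step_average(context):
    context.avg = sum(context.xs) / len(context.xs)

@then("the result is between {low:g} and {high:g}")
def step_check(context, low, high):
    assert low <= context.avg <= high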
This sounds like a good theory but the practice of it is really hard. Pretty quickly you end up with tests that "say" one thing but have nuanced different behavior in the underlying implementation.
Then try to debug a "document"...
I like the idea. But having tried it at scale, it becomes a mess. Code I can understand. I can read English comments. I can't debug English.
@Test
public void myTestMethod_Scenario_ShouldReturnThis() {....
It("throws when the object belongs to another user")
It("does a business thing when thing is in state BLAH")
I don't think it quite does it right, but it is of note.
(I would buy a Copilot subscription for this)
I agree. One nice feature of property-driven testing is that assumptions often end up causing test failures. For example (in ScalaTest):
"Average of list" should "be within range" in {
forAll() {
(l: List[Float]) => {
val avg = l.average
assert(avg >= l.min && avg <= l.max)
}
}
This test will fail, since it doesn't hold for e.g. empty lists. Requiring non-empty lists will still fail, if we have awkward values like NaNs, etc. The following version has a better chance of passing:
"Average of list" should "be within range" in {
forAll() {
(raw: List[Float]) => {
val l = raw.filter(n => !n.isNaN && !n.isInfinite)
whenever (l.nonEmpty) {
val avg = l.average
assert(avg >= l.min && avg <= l.max)
}
}
}
Getting this test to pass required us to make those assumptions explicit. Of course, it doesn't spot everything; here's an article which explores this example in more depth (in Python) https://hypothesis.works/articles/calculating-the-mean
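(For comparison, roughly the same property in Hypothesis, the library behind the linked article; the naive sum-based average below is only a stand-in for the code under test, and as the article discusses, even finite inputs can push it out of range via overflow:)
from hypothesis import given, strategies as st

finite_floats = st.floats(allow_nan=False, allow_infinity=False)

@given(st.lists(finite_floats, min_size=1))
def test_average_is_within_range(xs):
    avg = sum(xs) / len(xs)  # naive mean, standing in for the code under test
    assert min(xs) <= avg <= max(xs)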
We have a policy of making each test a spec. That is, a test requires a plain text spec to be attached to it in its doc string. It's kind of like BDD but without all the weird DSLs.
Which would replace all those humans producing perfectly valid-sounding explanations that, if you invest some research effort, turn out to have no basis in (the usually far more complex, but also far more fascinating and infinitely deep) reality. So yes, I think AI can indeed replace lots of human-produced thoughts :-)
I admit to having been guilty of this myself. I have a go-to anecdote: I had a very well-paid contractor job and explained something about how my then department's software worked to someone from another department. I think I must have sounded very convincing; the person went off to change something in how they used our stuff. A few minutes later, after accidentally meeting and casually chatting with my boss for that job, I realized that everything I had said was total garbage. I quickly excused myself from my boss and hurried after the person to tell them to forget and ignore everything I had just explained to them, because it was all wrong. I think that last step is the part that usually doesn't happen in these cases, because we don't normally realize that such a thing just occurred.
The brain, or parts of it, are great at producing "explanations". I think that it was part of the more established and reproducible results of psychology that our brain first decides and acts, and only then produces some (often bullshit) "reason" when/if our conscious self asks for one? Does anybody remember if this is true and has a link?
> During day-to-day development, the important bit isn't that there are no failures. The important bit is that there are no regressions.
And that's why we test and why tests shouldn't be allowed to fail.
Just because the scenarios described make testing hard does not change the reality of what makes tests valuable.
If pre-existing failures are halting the production pipeline and you don't like it, switch off trunk based development and see if you like the waits and constant rebasing in large projects/teams. But don't eff with the bloody tests!
When the codebase gets large enough, you need to allow some tests to "fail". What I really mean is that you need a way to quickly mark a failing test as flakey, so the author can fix it while everyone else gets on with their day and merges code.
At $dayjob this works well: if your CI comes up red with some unrelated test failing, you can mark the test as flakey in the UI, CI will allow your code to merge, and a Jira ticket will be created for the test owner to fix their test (and it will be disabled for future test runs).
I think for small to medium projects, you can have all tests succeed but once the repo is large enough / has frequent enough changes, flakey tests are bound to slip in.
Our setup just reruns the tests a few times which sorts out flaky tests. The page then shows the most frequently failing tests so they can be properly fixed.
I think GitHub does something similar - public website tests must always pass but if you break GitHub Enterprise you get three days to fix it (or something like that - I think they had a blog post on it).
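(If the suite happens to run under pytest, the rerun-on-failure approach can lean on the pytest-rerunfailures plugin; the flags and marker below belong to that plugin, everything else is illustrative:)
# Run the whole suite with automatic reruns of failing tests:
#   pytest --reruns 3 --reruns-delay 1
# Or mark an individual test that is known to flake:
import pytest

@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_eventually_consistent_read():
    ...  # placeholder body for the flaky check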
If testing that way is painful (and it is), then work with people to remove the pain. Tests are supposed to help developers, not constrain or punish them.
Put tests in the same repo as the SUT. Do more testing closer to the code (more service and component tests) and do less end-to-end testing. Ban "flakey" tests - they burn engineering time for questionable payoff.
Test failures can be thought of as "things developers should investigate." Make sure the tests are focused on telling you about those things as fast as possible.
Also, take the human out of the "wait for green, then submit PR" steps. Open a PR but don't alert everyone else about it until you run green, maybe?
It would work for most "classical" software development. In this case, the author talks about conformance tests (a HUGE collection) from an external vendor. Most of them will fail at first, then you make them pass slowly but steadily.
The problem becomes: I want to know if there are significant regressions in the vendor tests, i.e. tests that were green for a long time and suddenly changed. You could flag any test that became green at some point as "required" to pass the CI, but then you have tests that randomly succeed or fail depending on code you have not yet written (e.g. locking around concurrent structures). Marking these tests manually is impractical and could definitely be replaced by tooling that supports some statistical modeling of success/failure.
You may have the best testing strategy for internal code but as long as you have to test against these conformance tests it's simply unfeasible to say "sorry, only green allowed".
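(A rough sketch of that tooling idea: track each conformance test's recent history and only gate CI on tests with a long passing streak. The file name and threshold are made up:)
import json
from pathlib import Path

HISTORY_FILE = Path("conformance_history.json")  # {test_name: [true, false, ...]}
REQUIRED_STREAK = 20  # consecutive passes before a test becomes gating

def record_run(results):
    """results: dict mapping test name -> passed (bool) for this run."""
    history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else {}
    for name, passed in results.items():
        history.setdefault(name, []).append(passed)
    HISTORY_FILE.write_text(json.dumps(history))

def gating_tests():
    """Tests that have passed the last REQUIRED_STREAK runs."""
    history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else {}
    return {name for name, runs in history.items()
            if len(runs) >= REQUIRED_STREAK and all(runs[-REQUIRED_STREAK:])}

def check_for_regressions(results):
    """Fail the pipeline only if a long-green test regressed."""
    regressions = sorted(n for n in gating_tests() if results.get(n) is False)
    if regressions:
        raise SystemExit(f"Regressions in previously-green tests: {regressions}")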
> take the human out of the "wait for green, then submit PR"
It'd be great if GitHub could open a PR for reviews (aka un-draft) automatically after CI succeeds. (If not in the core product, is there a bot that does that?)
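(Not aware of a built-in option, but a small script run as the final CI job can do it with the GitHub CLI; `gh pr ready` is the real command, while how the PR number reaches the script is an assumption about your CI setup:)
import os
import subprocess
import sys

# Runs as the last CI job: if we got this far, everything before it passed.
pr_number = os.environ.get("PR_NUMBER")  # however your CI exposes it
if not pr_number:
    sys.exit("PR_NUMBER not set; nothing to un-draft")

# Mark the draft PR as ready for review.
subprocess.run(["gh", "pr", "ready", pr_number], check=True)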
My company uses a workflow where we don't use PRs for code reviews. Instead we each have our own git repo that's a fork of the tech lead's, with some git rules in place to impose a branch namespace. To open a review request you push a branch into the reviewer's repository. Our CI system detects the new branch and starts running it. Once CI passes that updates the bug tracker which triggers a notification to the reviewer.
The reviewer then does a git fetch, and then checks out the newly created rr/ branch. They make any small changes that aren't worth a roundtrip and push them to the rr branch. They add FIXME comments for bigger changes. They then either assign the ticket back to the developer, or go ahead and merge straight into their own dev branch. Once an rr branch is merged it's simply deleted. The dev branch is then pushed and CI will merge it to that user's master when it's green.
IntelliJ will show branches in each origin organized by "folder" if you use slashes in branch names, and gitolite (which is what we use to run our repos) can impose ACLs by branch name too. So for example only user alice can push to a branch named rr/alice/whatever in each person's repo. That ensures it's always clear where a PR/RR is coming from.
Because each user gets their own git repo and cloned set of individual CI builds, you can push experimental or WIP branches to your personal area and iterate there without bothering other people.
This workflow gets rid of things like draft PRs (which are a contradiction), it ensures each reviewer has a personal review queue, it means work and progress is tracked via the bug tracker (which understands commands in commit messages so you can mark bugs as fixed when they clear CI automatically) and it eliminates the practice of requesting dozens of tiny changes that'd be faster for the reviewer to apply themselves, because reviewer and task owner can trade commits on the rr branch using git's features to keep it all organized and mergeable.
Seems to me like you're underinvesting in tooling. It's a mistake a lot of development shops make - you focus on your product, so you can't spend time building something completely orthogonal, and in the process you waste man-years on a broken PR process instead of spending a month early on building some tooling that would have removed the pain in the first place.
The continuous testing is something I've thought about and it's a tricky one. We use property tests[1] so here's a quick stab at how I'd like it to look:
Test starts failing, immediately send a report with the failing input, then continue with the test case minimisation and send another report when that finishes.
Concurrently, start up another long-running process to look for other failures, skipping the input that caused the previous failure. We do want new inputs for the same failure though. This is the tricky one. We could probably make it work by having the prop test framework not reuse previously-failing inputs, but that's one of the big strategies it uses to catch regressions.
[1] Specifically, Hypothesis in Python.
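(A framework-agnostic sketch of that "keep searching, but don't re-trip over the same input" behaviour; the generator and the property are made-up stand-ins:)
import random

def generate_input(rng):  # stand-in for the framework's generator
    return [rng.uniform(-1e6, 1e6) for _ in range(rng.randrange(0, 10))]

def property_holds(xs):  # stand-in for the property under test
    return not xs or min(xs) <= sum(xs) / len(xs) <= max(xs)

def continuous_search(seed=0, budget=100_000):
    rng = random.Random(seed)
    reported = set()
    for _ in range(budget):
        xs = generate_input(rng)
        if tuple(xs) in reported:
            continue  # skip inputs that have already been reported
        if not property_holds(xs):
            reported.add(tuple(xs))
            print("new failing input (report, then minimise):", xs)
    return reported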
> The above development practice works well when the SUT and TB are both defined by the same code repository and are developed together.
I once witnessed a team creating an app, specs and tests in three respective repositories. For no other reason than "each project should be in its own repository".
The added work/maintenance around that is crazy, for absolutely no gain in that case.
Phase 1. Code and test basic functions concerning any kind of arithmetic, mathematical distribution, state machines, file operations and datetimes. This documents any assumptions and makes a solid foundation.
Phase 2. Write a simulation for generating randomized inputs to test the whole system. Run it for hours. If I can't generate the inputs, find as big a variety of inputs as possible. Collect any bugs, fix, repeat. This reduces the chances of finding real time bugs by three orders of magnitude.
This has worked really well in the past, whether I'm working on games, parsers or financial software. I don't conform to corporate whatever-driven testing patterns, because they are usually missing the crucial phase 2 and time phase 1 incorrectly.
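(A bare-bones version of that phase 2 loop; the system_factory, apply and check_invariants interface is invented for the sketch, and the important detail is logging the seed so any failure can be replayed:)
import random
import time

def run_simulation(system_factory, make_random_input, hours=4.0):
    deadline = time.time() + hours * 3600
    failures = []
    while time.time() < deadline:
        seed = random.randrange(2**32)
        rng = random.Random(seed)
        system = system_factory()
        try:
            for _ in range(1000):
                system.apply(make_random_input(rng))
                system.check_invariants()
        except Exception as exc:
            # The seed is enough to replay the exact failing sequence.
            failures.append((seed, exc))
            print(f"invariant violation with seed={seed}: {exc}")
    return failures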
The author's problem is pretty simple: the test repo is required for pre-merge tests to pass, but it can be updated independently, without having pre-merge tests pass.
And the answer is pretty simple: pin the specific test repo version! Use lockfiles, or git submodules, or put "cd tests && git checkout 3e524575cc61" in your CI config file _and keep it in the same repo as source code_ (that part is very important!).
This solves all of the author's problems:
> new test case is added to the conformance test suite, but that test happens to fail. Suddenly nobody can submit any changes anymore.
The conformance test suite is pinned, so the new test is not used. A separate PR has to update the conformance test suite version/revision, and it must go through the regular driver PR process and therefore must pass. Practically, this is a PR with two changes: update the pin and disable the new test.
> are you going to remember to update that exclusion list?
That's why you use "expect fail" list (not exclusion) and keep it in driver's dir. Ad you submit your PR you might see a failure saying: "congrats, test X which was expect-fail is now passing! Please remove it from the list". You'll need to make one more PR revision but then you get working tests.
> allowing tests to be marked as "expected to fail". But they typically also assume that the TB can be changed in lockstep with the SUT and fall on their face when that isn't the case.
And if your TB cannot be changed in lockstep with the SUT, you are going to have a truly miserable time. You cannot even reproduce the problems of the past!
So make sure your kernel is known or at least recorded, repos are pinned. Ideally the whole machine image, with packages and all is archived somehow -- maybe via docker or raw disk image or some sort of ostree system.
> Problem #2 is that good test coverage means that tests take a very long time to run.
The described system sounds very nice, and I would love to have something like this. I suspect it will be non-trivial to get working, however. But meanwhile, there is a manual solution: have more than one test suite. "Pre-merge" tests run before each merge and contain a small subset of the tests. A bigger "continuous" test suite (if you use physical machines) or "every X hours" suite (if you use some sort of auto-scaling cloud) will run a bigger set of tests, and can be triggered manually on PRs if a developer suspects the PR is especially risky.
You can even have multiple levels (pre-merge, once per hour, 4 times per day) but this is often more trouble than it's worth.
And of course it is absolutely critical to have reproducible tests first -- if you come up to work and find a bunch of continuous failures, you want to be able to re-run with extra debugging or bisect what happened.
> And the answer is pretty simple: pin the specific test repo version! Use lockfiles, or git submodules, or put "cd tests && git checkout 3e524575cc61" in your CI config file _and keep it in the same repo as source code_ (that part is very important!).
Indeed. Where I work we have a bunch of repos, but they always reference each other via pinned commits. We happen to use Nix, with its built in 'fetchGit' function; it's also easy to override any of these dependencies with a different revision. For example:
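(A minimal sketch of such a function, with placeholder URLs and pinned revisions; the real values would be project-specific:)
{ helpers ? import (fetchGit {
    url = "git://url-of-helpers.git";  # placeholder
    rev = "0000000000000000000000000000000000000000";  # pinned commit
  }) {}
, some-library ? import (fetchGit {
    url = "git://url-of-some-library.git";  # placeholder
    rev = "1111111111111111111111111111111111111111";  # pinned commit
  }) {}
}:
{
  # ... build the project using helpers and some-library ...
}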
This is a function taking two arguments ('helpers' and 'some-library'), with default arguments that fetch particular git commits. This gives us the option of calling the function with different values, to e.g. build against different commits.
We run our CI on GitHub Actions, which allows some jobs to be marked as 'required' for PRs (using branch protection rules). The normal build/test jobs use the default arguments, and are marked as required: everything is pinned, so there should be no unexpected breakages.
Some of our libraries also define extra CI jobs, which are not marked as required. Those fetch the latest revision of various downstream projects which are known to use that library, and override the relevant argument with themselves. For example, the 'some-library' repo might have a test like this:
import (fetchGit {
  # Fetch the downstream project that consumes some-library.
  url = "git://url-of-downstream-project.git";
  ref = "master";
  # No 'rev' given, so it will fetch 'HEAD'
}) {
  # Build with this checkout of some-library, instead of the pinned version
  some-library = import ./. {};
}
This lets us know if our PR would break downstream projects, if they were to subsequently update their pinned dependencies (either because we've broken the library, or the downstream project is buggy). It's useful for spotting problems early, regardless of whether the root cause is upstream or downstream.
Yeah - developers need to control their own tests. In the weird case where they don't control their tests (conformance tests), you need to control when those tests are added.