> Back when I was a junior developer, there was a smoke test in our pipeline that never passed. I recall asking, “Why is this test failing?” The Senior Developer I was pairing with answered, “Ohhh, that one, yeah it hardly ever passes.” From that moment on, every time I saw a CI failure, I wondered: “Is this a flaky test, or a genuine failure?”
This is a really key insight. A test like that erodes trust in the entire test suite and will lead to false negatives. If I couldn't get the time budget to fix the test, I'd delete it. I think a flaky test is worse than nothing.
"Normalisation of Deviance" is a concept that will change the way you look at the world once you learn to recognise it. It's made famous by Richard Feynman's report about the Challenger disaster, where he said that NASA management had started accepting recurring mission-critical failures as normal issues and ignored them.
My favourite one is: Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors. There's a decent chance that those errors are being ignored by everyone responsible for the system, because they're "the usual errors".
I've seen this go as far as cluster nodes crashing multiple times per day and rebooting over and over, causing mass fail-over events of services. That was written up as "the system is usually this slow", in the sense of "there is nothing we can do about it."
It's not slow! It's broken!
Oof, yes. I used to be an SRE at Google, with oncall responsibility for dozens of servers maintained by a dozen or so dev teams.
Trying to track down issues with requests that crossed or interacted with 10-15 services, when _all_ those services had logs full of 'normal' errors (that the devs had learned to ignore) was...pretty brutal. I don't know how many hours I wasted chasing red herrings while debugging ongoing prod issues.
You don't even have to go as far from your desk as a remote server, or open a log file, to see this happening.
The whole concept of addressing issues on your computer by rebooting it is 'normalization of deviance'. And yet IT support people will rant and rave about how it's the users' fault for not rebooting their systems whenever users with high uptimes complain of performance problems or instability, as if it's not the IT department itself that has loaded those users' computers to the gills with software that's full of memory leaks, litters the disk with files, and so on.
I agree with what you're saying, but this is a bad example:
> Pick a server or a piece of enterprise software and go take a look at its logs. If it's doing anything interesting at all, it'll be full of errors.
It's true, but IME those "errors" are mostly worth ignoring. Developers, in general, are really bad at logging, and so most logs are full of useless noise. Doubly so for most "enterprise software".
The trouble is context. E.g., "malformed email address" is indeed an error that prevents the email process from sending a message, so it's common for someone to put in a log.Error() call for it. In many cases, though, that's just a user problem: the system operator isn't going to address it, and in fact can't. "Email server unreachable", on the other hand, is definitely an error the operator should care about.
I still haven't actually done it yet, but someday I want to rename that call to log.PageEntireDevTeamAt3AM() and see what happens to log quality...
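To make the context point concrete, here's a minimal Go sketch (the sender and all its names are invented, not from any real codebase): expected user-input problems go to Info, operator-actionable failures go to Error, so the error log stays worth reading.

```go
package main

import (
	"errors"
	"log/slog"
	"net/mail"
	"os"
)

// sendEmail is a made-up example; the point is where each failure gets logged.
func sendEmail(logger *slog.Logger, to, body string) error {
	if _, err := mail.ParseAddress(to); err != nil {
		// Malformed address: a user problem the operator can't fix.
		// Keep it out of the error log.
		logger.Info("rejected malformed recipient address", "to", to, "err", err)
		return err
	}
	if err := smtpDeliver(to, body); err != nil {
		// Unreachable mail server: operator-actionable, so this one earns
		// the hypothetical log.PageEntireDevTeamAt3AM() treatment.
		logger.Error("email server unreachable", "err", err)
		return err
	}
	return nil
}

// smtpDeliver stands in for the real SMTP call.
func smtpDeliver(to, body string) error { return errors.New("connection refused") }

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
	_ = sendEmail(logger, "not-an-address", "hi")
	_ = sendEmail(logger, "user@example.com", "hi")
}
```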
Horrors from enterprise - a few weeks ago a solution architect forced me to roll back a fix (a basic null check) that they "couldn't test" because it's not a "real world" scenario (testers creating incorrect data would crash the business process for everyone)...
"Skip your flaky tests" should be a religion. There's nothing else I feel as strongly about regarding CI optimization. If a test is flaky, it gets skipped immediately. Even if you're working on a fix, it stays skipped until you solve it, if you ever do. Most of your CI problems can start to be solved by applying this simple rule.
How do you know if it's flaky? You keep a count, and any time a test fails and then recovers three times it gets skipped, even if there are weeks between failures. You can make this more complex for little gain. I've found this system gets teams to actually prioritize fixing important tests, but mostly it has proven that many "important tests to keep even if they are flaky" never were actually important, or end up getting rewritten in different ways later on.
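For what it's worth, that counting rule is only a few lines to implement. A rough sketch (types and names are mine, not any particular CI system's): count fail-then-pass transitions per test and quarantine at three.

```go
package flakiness

// Tracker implements the rule described above: a test that flips from
// failing back to passing three times is treated as flaky and skipped.
// All names here are illustrative, not a real CI API.
type Tracker struct {
	recoveries  map[string]int  // fail -> pass transitions seen per test
	lastFailed  map[string]bool // whether the previous run of the test failed
	quarantined map[string]bool
}

func NewTracker() *Tracker {
	return &Tracker{
		recoveries:  map[string]int{},
		lastFailed:  map[string]bool{},
		quarantined: map[string]bool{},
	}
}

// Record ingests one test result and returns true if the test should now be skipped.
func (t *Tracker) Record(name string, passed bool) bool {
	if t.quarantined[name] {
		return true
	}
	if passed && t.lastFailed[name] {
		t.recoveries[name]++ // it failed before and passes now: classic flake signature
	}
	t.lastFailed[name] = !passed
	if t.recoveries[name] >= 3 {
		t.quarantined[name] = true // skipped until someone fixes it and resets the counter
	}
	return t.quarantined[name]
}
```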
I feel the same; I can't understand why people value having a flaky test more than not having it.
In both cases the feature is not validated (the test failure is ignored), but not having the test is transparent (it's clear the feature is not being validated), improves pipeline speed, and reduces the noise.
Maybe people just like the lie that the feature is being validated... I don't really understand.
Congratulations, you now have an untested feature.
Flaky tests should be fixed, of course. But just deleting them isn't fixing them, and it leaves you worse off, not better.
May I ask how you feel about TODO comments?
This is indeed a religion, because in my experience people feel strongly while holding very different positions. You can already see it in this thread.
I think quantifying and prioritizing is key, like you wrote. Respected engineering organizations like Google and GitHub have all come to the same place. Flakiness is often unevenly distributed, so find and tackle the worst offenders. Don't try to eliminate flakiness entirely, because that's not economically viable.
I'm trying to put my money where my mouth is... we'll see how it goes.
If you re-read my comment you'll see I already addressed this. Not having a test is better than having a flaky test.
The team should feel the pain of having their shit test skipped without them being able to stop it, and it's up to them to either fix it and bring it back or cope.
If you have flaky tests, you can isolate them into their own workflow and deal with them there, away from the rest of your CI process. It does wonders. The idea of a monolithic CI job seems backward to me now.
Or just fix them? A test shouldn't ever, ever, ever be flaky. It can happen, because we all make assumptions that can be wrong or forget about non-deterministic behaviours, but when it does, the flaky test should be fixed immediately, with the highest priority.
That is out of your control across many teams. Your only hope is having hard rules on what to do with flakiness, so that teams aren't able to spend three months telling you "please don't skip it, we'll fix it in our next sprint".
If you have so many flaky tests, and their flakiness is so intractable, that you actually need to come up with SLOs for handling the flakiness to negotiate being allowed to address the flakiness in the test base, then quite frankly, you should be looking for another job, one where you can actually go ahead and just fix these things and get back to shipping value to customers, instead of one where you play bureaucratic games with management that cares neither about craftsmanship nor about getting value out the door as quickly as possible.
Some software projects, let's call them "integration projects", use third-party software they can do nothing about. And it just doesn't work well. But you have to use it in testing because... well, you are the integrator. The users have already accepted the fact that the software you are integrating doesn't work well, so it's "all good", except it makes it very hard to distinguish between failures that need to be addressed and those that don't.
Just to give you one example of this situation: JupyterLab is an absolute pile of garbage in terms of how it's programmed. For example, the sidebar of the interface quite often doesn't load properly, and you need to click the "reload" button a few times to get it to show up. Now suppose you are the integrator providing features that are supposed to be exposed to the user through the JupyterLab interface. Well, what can you do? Nothing. Just suck it up. You can tune the threshold for how many times you will retry reloading the interface, but you absolutely have to have a threshold, because sometimes the interface will never load (for some other reason), and you will stall the test pipeline if you don't eventually let the test fail.
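Not defending the situation, but if you're stuck with it, at least the retry ceiling can live in one place. A rough Go-flavoured sketch, with reloadSidebar standing in for whatever actually drives the UI:

```go
package uitest

import (
	"fmt"
	"time"
)

// retry runs op up to attempts times, pausing between tries.
// If the sidebar (or whatever op drives) never comes up, we fail the test
// instead of stalling the pipeline forever. Names are purely illustrative.
func retry(attempts int, delay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// Usage in a test, assuming a hypothetical reloadSidebar helper:
//
//	if err := retry(5, 2*time.Second, reloadSidebar); err != nil {
//	    t.Fatalf("sidebar never loaded: %v", err)
//	}
```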
In general, the larger the SUT and the more "foreign" components it has, the harder it is to predict the test behavior, and the flakier the tests are.
But this isn't the only source of test flakiness. Hardware is another, especially in embedded software that has to be tested on hardware the software company has limited access to (think something like a Smart TV, where the TV vendor provides some means of accessing the OS running on the TV set, but deliberately limits that access to keep the software company away from the proprietary bits the vendor installed). So, sometimes things will fail, and you won't know why and won't be able to find out (as in, if you tried to break into the vendor's part of the software, they'd sue you).
> For example, the side-bar of the interface doesn't load properly quite often, and you need to click on "reload" button few times to get it to show up... Well, what can you do? - Yes. Nothing.
So... go debug the JupyterLab code. Or open a bug with upstream. Or talk to their support. Or rant on Twitter to try and mobilize upstream to fix their shit. There's not nothing to be done, that's defeatist nonsense. In the meantime, in your codebase, mock the stuff you can't control and move on. That's why mocks exist, so that your test suite can be fast and deterministic even when speed and determinism are otherwise systematically difficult to achieve.
> hardware
Hardware is a totally different ballgame. And not one where you're trying to come up with SLOs for flakiness in a CI system. You don't continuously deploy hardware. You look over the test results, make a manual decision what needs to be retested, and just retest that before deploying. In dev, hopefully you have some kind of emulated device to develop against, in which case, the point is moot.
It sounds like your integration architecture is a mash-up. It is indeed quite difficult to handle external systems that are so tightly coupled to your own.
For standard server-to-server integrations, where the coupling is limited to a (hopefully, eventually) well-documented API surface, it's much more straightforward to replace an external service with an internal mock.
For CI purposes, we eventually develop a more-or-less complete internal version of every external service, quirks and all. It's fun to see their bug fixes show up in our git history as changes to the behavioral mocks.
We don't put much emphasis on automating integration testing. We've found that, with all the vagaries of flaky external services, it's usually necessary to have a human in the loop somewhere for integration tests.
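For the curious, the "internal version of the external service" usually ends up looking something like this in Go; the endpoint and the quirk it reproduces are invented for illustration:

```go
package partner_test

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
)

// newPartnerMock plays the role of a third-party API in CI. The quirk it
// reproduces (an empty body on the first call) is made up, but this is where
// known vendor bugs get encoded so the suite stays deterministic.
func newPartnerMock() *httptest.Server {
	calls := 0
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		calls++
		if r.URL.Path == "/v1/quote" && calls == 1 {
			w.WriteHeader(http.StatusOK) // vendor quirk: 200 with an empty body
			return
		}
		fmt.Fprint(w, `{"price": 42}`)
	}))
}

func TestClientToleratesEmptyFirstResponse(t *testing.T) {
	srv := newPartnerMock()
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/v1/quote")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	if len(body) != 0 {
		t.Fatalf("expected the documented empty first response, got %q", body)
	}
}
```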
Every flakey test is a production failure that really happens to some users sometimes. How often is just a matter of scale.
A 1-in-10,000 failure will hit a real user every week or two even with just 100 daily active users who each make 10 actions on your app. At "internet scale", a 1-in-10k error frustrates a user every few seconds.
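Back-of-the-envelope (assuming each action exercises the flaky path once; the 10^8 actions/day figure for "internet scale" is just an illustrative assumption):

\[
100 \times 10 \times \tfrac{1}{10{,}000} = 0.1 \text{ failures/day (about one every 10 days)}
\]
\[
10^{8} \times \tfrac{1}{10{,}000} = 10{,}000 \text{ failures/day} \approx \text{one every 9 seconds}
\]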
If your tests are so flaky that you need SLOs … your poor users …
> Every flakey test is a production failure that really happens to some users sometimes
No. Tests can be flakey when the code they're testing is not. Not every flakey test is a customer problem.
Training developers to ignore flakey tests also trains them to ignore real failures. Potentially many of them. Any of those might be real customer problems.
Not really, the one I've seen most often is a shared resource failing - for example a GitLab Runner not handling a new VM for a DB, so tests fail, etc.
Depending on the problem domain, a lot of times the flakiness is in the way the test is written and not the code under test. You have to judge the relative cost of tracking down and solving that.
I don't think the point of an SLO is that flakiness is out of control. The point of framing it as an SLO is the realization that neither extreme is good. Flakiness can't be allowed to get out of control, so some effort must be spent to contain it, but it's also unnecessarily perfectionist, and thus a waste of precious engineering bandwidth, to eliminate it completely. The whole point is to avoid "bureaucratic games", as you call it.
My theory is that the lack of an easy mechanism for measuring flakiness is what's stalling progress. If overall flakiness can be measured, and the top offending tests identified, then I think it becomes a no-brainer to spend effort curtailing them when flakiness gets too high, but otherwise to exclude flaky tests from, say, the PR merge gate.
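As a sketch of how little machinery "measure it" actually needs (the result shape here is invented): treat a test that both passed and failed at the same commit as flaky, and rank tests by how often that happens.

```go
package flakereport

import "sort"

// Result is one test execution pulled from CI history (illustrative shape).
type Result struct {
	Test   string
	Commit string
	Passed bool
}

// TopOffenders returns tests that produced both a pass and a fail for the
// same commit, ranked by how many commits showed that mixed signal.
func TopOffenders(history []Result, n int) []string {
	type key struct{ test, commit string }
	seenPass := map[key]bool{}
	seenFail := map[key]bool{}
	for _, r := range history {
		k := key{r.Test, r.Commit}
		if r.Passed {
			seenPass[k] = true
		} else {
			seenFail[k] = true
		}
	}
	mixed := map[string]int{}
	for k := range seenPass {
		if seenFail[k] {
			mixed[k.test]++ // same code, both outcomes: flaky by definition
		}
	}
	tests := make([]string, 0, len(mixed))
	for t := range mixed {
		tests = append(tests, t)
	}
	sort.Slice(tests, func(i, j int) bool { return mixed[tests[i]] > mixed[tests[j]] })
	if len(tests) > n {
		tests = tests[:n]
	}
	return tests
}
```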
I'm just getting tired of these blog posts with memes scattered throughout. It's so pre-2020. Oh haha, yet another Sean Bean "one does not simply" meme <eyeroll>. Just give me the content and leave the snark and attempts at humor behind. Or maybe it's just me and I'm getting jaded in my old age.
I've probably been in the "corporate" world too long, but I don't have time for that kind of nonsense.
So I got into a discussion with someone about why we need to test post-deployment. My point was: because your environments are different, you want to eliminate a failure point even if you've already tested at build time. You can make environments as similar to each other as possible, but you still want to catch failures from a bad configuration, schema, or integration. I was once at a place where someone deployed something and just let it produce malformed data, because a schema hadn't been applied to the workers' transformation process. You know what would have caught that? Post-deploy testing. How does this fit in an automated pipeline? You automate the test.
At the risk of tooting my own horn too much, this is exactly why I started my company https://checklyhq.com
We approach it a bit differently: we blur the lines between E2E testing and production monitoring. You run an E2E test and promote it to a monitor that runs around the clock.
It's quite powerful. Just an E2E test that logs into your production environment after deploy and then every 10 minutes will catch a ton of catastrophic bugs.
You can also trigger them in CI or right after production deployment.
Big fat disclaimer: I'm a founder and CTO.
I played with this concept a while back when I noticed that the E2E tests I was writing closely matched the monitoring scripts I would write afterwards. At one point I just took the E2E test suite, pointed it at production, and added some provisions for test accounts. Then I just needed to output the E2E test results into the metrics database, and we had some additional monitoring. It's a kind of monitoring-driven development, and as with TDD, it's great for validating your tests as well.
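Roughly what that loop looks like if you roll it yourself; the URL, test account, and metric sink below are placeholders, not any real API:

```go
package main

import (
	"log"
	"net/http"
	"net/url"
	"time"
)

// checkLogin makes the same assertion an E2E test would, but against prod.
// The endpoint and the synthetic account are placeholders.
func checkLogin(client *http.Client) bool {
	resp, err := client.PostForm("https://example.com/login", url.Values{
		"user": {"synthetic-test-account"},
		"pass": {"from-your-secret-store"},
	})
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// recordMetric stands in for whatever you emit to (StatsD, Prometheus, a DB):
// 1 for pass, 0 for fail, and you alert on the series like any other metric.
func recordMetric(name string, value float64) {
	log.Printf("metric %s=%v", name, value)
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	for range time.Tick(10 * time.Minute) { // the "every 10 minutes" monitor
		v := 0.0
		if checkLogin(client) {
			v = 1.0
		}
		recordMetric("synthetic.login.success", v)
	}
}
```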
It amazes me that there's a drive to make deploys as seamless as possible to avoid customer impact, but nobody wants to invest in systems integration testing in order to make sure that what they're actually serving is indeed correct. Most of the time, the "acceptance test" is verifying that it renders and the basic behaviors of the page operate as expected. It is insulting to the people and teams that actually spend the time to do what is right.
If there are enough tests, then flakiness is unavoidable. The key is to separate "flaky test" from "actual failure", so that you can react quickly to actual failures and ignore the flakiness, or at least not address it as urgently.
For 99% of software projects this isn't even necessary, however. If the test suite is a reasonable size (say, it runs in less than a few hours), then you can probably do better with "fix all tests immediately, including addressing flakiness".
But for the large project that runs, say, several hundred thousand tests across a matrix of different platforms, you need to adopt the "we'll never reach 100%, but we can't dip below 99%" mindset. You need to analyze whether a failure in a PR validation build is actually reasonable to expect from the area that was changed, and if it isn't, you need to merge despite the validation not passing; otherwise it'll never be merged. Modularization helps somewhat, but the small tests that stay within modules aren't the ones that cause the problems.
If I understand the article correctly, the author offers SLOs as a way to pressure management into allocating resources to fix CI problems. This would work under the assumption that management has resources to spare, or would be able to divert resources from other departments towards fixing CI problems.
And sometimes this will probably work. But I can easily imagine a situation where I, as the person responsible for CI, come to my manager and tell her that we've burned past our 87% SLO objective, and get the immediate response that, starting today, our SLO objective is 77%.
In my experience, QA (and therefore CI tests) is the first chunk of the overall development budget to be cut if any cutting is to take place. Very few companies bet on the quality of their software as a sales driver. Most would sooner fire the whole QA department and throw away all the tests in tough times than allocate more resources to software quality when they hit some percentage of test failures.