turboponyy · 8 months ago
Nix solves this as a byproduct of its design (as it does with many things). You can make your tests a "build", where the build succeeds if the tests pass. Any build can be cached, which means you're essentially caching the tests you've already run. And since Nix builds are deterministic, you never have to rerun a test until something that could affect its outcome changes.
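A minimal sketch of the same idea outside Nix, in Python, assuming you can enumerate a test suite's inputs up front (the cache layout and paths here are made up for illustration): key each run by a hash of its inputs and skip it on a cache hit.

```python
import hashlib
import pathlib
import subprocess

CACHE = pathlib.Path(".test-cache")  # hypothetical local result cache

def input_hash(paths: list[str]) -> str:
    """Hash the contents of every input file, in a stable order."""
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(p.encode())
        h.update(pathlib.Path(p).read_bytes())
    return h.hexdigest()

def run_cached(test_cmd: list[str], inputs: list[str]) -> bool:
    """Skip the run if this exact set of inputs has already passed."""
    CACHE.mkdir(exist_ok=True)
    key = CACHE / input_hash(inputs)
    if key.exists():
        return True  # cached pass; like Nix, failures are not cached
    passed = subprocess.run(test_cmd).returncode == 0
    if passed:
        key.write_text("passed")
    return passed
```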
fire_lake · 8 months ago
Nix is mostly deterministic due to lots of effort by the community, but it's hard to maintain full determinism in your own tests.
IshKebab · 8 months ago
Basically the only sane answer is Bazel (or similar). We currently use path-based test running because I couldn't convince people it was wrong, and it regularly results in `master` being broken.

I don't really have a great answer for when that stops scaling - especially for low-level things. E.g. when Android changes bionic, do they run every test, because everything depends on it? Technically they should.

The only other cool technique I know of is coverage-based test ranking, but as far as I know it's only used in silicon verification. Basically if you have more tests than you can run, only run the top N based on coverage metrics.
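At its core that ranking is just greedy set cover over coverage data. A minimal sketch in Python, with made-up test names and coverage sets:

```python
def rank_by_coverage(coverage: dict[str, set[str]], budget: int) -> list[str]:
    """Greedily pick up to `budget` tests, each time taking the test that
    adds the most not-yet-covered items (classic greedy set cover)."""
    covered: set[str] = set()
    picked: list[str] = []
    remaining = dict(coverage)
    while remaining and len(picked) < budget:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:
            break  # nothing new left to gain
        picked.append(best)
        covered |= remaining.pop(best)
    return picked

# Example: run only the top 2 of these tests.
cov = {
    "test_a": {"line1", "line2", "line3"},
    "test_b": {"line2", "line3"},
    "test_c": {"line4"},
}
print(rank_by_coverage(cov, budget=2))  # ['test_a', 'test_c']
```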

turboponyy · 8 months ago
There's always the escape hatch of marking a derivation as impure.

However, most tests should really have no reason to be impure by design. And tests that do depend on side effects might as well be re-evaluated every time.

ay · 8 months ago
The better you can describe the interdependencies between the components, the better the chances a selective approach has.

However, often if you knew about a given dependency, you might have avoided bugs in the first place!

A simple scenario to illustrate what I have in mind: a system with two plugins, A and B. They provide completely independent functionality, and are otherwise entirely unrelated.

Plugin A adds a new function which allocates a large amount of memory. All tests for A pass. All tests for B pass. The tests for B when A is loaded fail.

Turns out A has done two things:

1) the new memory allocation, together with the existing memory allocations in B, causes an OOM when both plugins are used.

2) adding the new function shifted the function table in the main program, and plugin B was relying on a hardcoded function index, which, by sheer chance, hadn't changed for years.

Those are tales based on real-world experience, which made me abandon the idea for the project I am working on (VPP) - the trade-offs didn't seem to be worth it. For some other scenario they may be different though, so thanks for looking into this issue!

deathanatos · 8 months ago
Path-based selection of tests, in every single CI system I have ever seen it implemented in, is wrong. TFA thankfully gets the "run the downstream dependent tests" part, which is the biggest miss, but even then, you can get the wrong answer. Say I have the following: paths a/* need to run tests A, and paths b/* need to run tests B. I configure the CI system as such. Commit A' is pushed, changing paths under a/* only, so the CI runs tests A, only. The tests fail. A separate engineer, quickly afterwards (perhaps, let's say, even before the prior commit's CI has finished; this could just be two quick merges) pushes commit B', changing paths under b/* only. The CI system runs tests B only, and they all pass, so the commit is incorrectly marked green. Automated deployment systems, or other engineers, proceed to see passing tests, and use a broken but "green" build.
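To make the failure mode concrete, here's a toy rerun of that scenario in Python (the path rules, commit names, and failure condition are all made up): the second push comes back green even though suite A is broken at the resulting HEAD.

```python
def tests_to_run(changed_paths):
    suites = set()
    for path in changed_paths:
        if path.startswith("a/"):
            suites.add("A")
        if path.startswith("b/"):
            suites.add("B")
    return suites

def suite_passes(suite, repo_state):
    # Suite A is broken as soon as commit A' is in the tree, no matter
    # which paths the triggering push touched.
    return not (suite == "A" and "A'" in repo_state)

head = {"A'"}  # commit A' lands, touching a/* only
print(all(suite_passes(s, head) for s in tests_to_run(["a/foo.py"])))
# False: this push is correctly red

head = head | {"B'"}  # commit B' lands, touching b/* only
print(all(suite_passes(s, head) for s in tests_to_run(["b/bar.py"])))
# True: this push is green, yet suite A still fails at the same HEAD
```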

Since I rarely ever see the downstream-tests requirement done correctly, and almost always see the "successive miss" bug, I'm convinced, for these reasons, that path-based is just basically a broken approach. I think a better approach is to a.) compute the inputs to your tests, and cache results, and b.) have the CI system be able to suss out flakes and non-determinism. But I think a.) is actually quite difficult (enough to be one of the hardest problems in CS, remember?) and b.) is not at all well supported by current CI systems.

(I have seen both of these approaches result in bad (known bad, had the tests run!) builds pushed to production. The common complaint is "but CI is slow", followed by a need to do something, but without care towards the correctness of that something. Responsibility for a slow CI is often diffused across the whole company; managers do not want to do the hard work of getting their engineers to fix the tests they're responsible for, since that just doesn't get a promotion. So CI remains slow, and brittle.)

emidln · 8 months ago
It's possible to use bazel to do this. You need to be very explicit (The Bazel Way(tm)), but in exchange you can ask the graph for everything that is an rdep of a given file. This isn't always necessary in bazel (if you avoid weird integration targets, deploy targets, etc.), where `bazel test //...` generally does this by default anyway. It's sometimes necessary to express it manually due to incomplete graphs, tests that are not executed every time (for non-determinism, execution cost, etc.), and a few other reasons, but at least it's possible.
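Conceptually, the rdep query is just a reverse walk of the dependency graph. A small sketch of that in Python over a hand-written graph (the target names are invented; with bazel itself you'd ask the real graph instead of building one by hand):

```python
from collections import defaultdict, deque

# Target -> direct dependencies (toy, hypothetical graph).
deps = {
    "//app:server_test": ["//app:server"],
    "//app:server": ["//lib:util"],
    "//lib:util_test": ["//lib:util"],
    "//lib:util": [],
}

# Invert the edges once, then walk everything reachable from the changed node.
rdeps = defaultdict(set)
for target, ds in deps.items():
    for d in ds:
        rdeps[d].add(target)

def affected(changed: str) -> set[str]:
    seen, queue = {changed}, deque([changed])
    while queue:
        for r in rdeps[queue.popleft()]:
            if r not in seen:
                seen.add(r)
                queue.append(r)
    return seen

print(sorted(t for t in affected("//lib:util") if t.endswith("_test")))
# ['//app:server_test', '//lib:util_test']
```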
deathanatos · 8 months ago
Yeah, bazel is sort of the exception that proves the rule, here. I wish it did not have such an absurd learning curve; I've found it next to impossible to get started with it.
robertlagrant · 8 months ago
Some of this is pretty simple, though, at the coarse-grained level. If you have a frontend and a backend, and you change the backend, you run all backend unit tests, backend component tests, and the end-to-end tests. If you change the frontend, you run the frontend unit tests, the frontend component tests, and the end-to-end tests.
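That coarse-grained rule is essentially just a prefix table. A sketch in Python (the directory prefixes and suite names are placeholders):

```python
# Any change under a prefix pulls in that prefix's suites; e2e runs either way.
RULES = {
    "frontend/": {"frontend-unit", "frontend-component", "e2e"},
    "backend/": {"backend-unit", "backend-component", "e2e"},
}

def suites_for(changed_files: list[str]) -> set[str]:
    selected = set()
    for f in changed_files:
        for prefix, suites in RULES.items():
            if f.startswith(prefix):
                selected |= suites
    return selected

print(sorted(suites_for(["backend/api/users.py"])))
# ['backend-component', 'backend-unit', 'e2e']
```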

To stop the problem you mentioned, you can either tick the box in GitHub that says only up-to-date branches can merge, or in GitLab you can use merge trains. "Running tests on an older version of the code" is a bigger problem than just this case, but in all the cases I can think of, enabling those features solves it.

deathanatos · 8 months ago
> To stop the problem you mentioned, you can either tick the box in GitHub that says only up-to-date branches can merge

No, that checkbox doesn't save you. The example above still fails: the second branch, B, after rebase, still only touches files in b/*, and thus still fails to notice the failure in A. The branch CI run would be green, and the merge would be green, both false positives.

agos · 8 months ago
what if you ran tests based on the paths touched by the branch, instead of the single commit?
deathanatos · 8 months ago
The example assumes two branches with a single commit being merged to the repo HEAD, so "testing the branch" and "testing the commit" are equivalent / testing the whole branch does not save you.

(But if you only test the last commit — and some implementations of "did path change?" in CI systems do this — then it is more incorrect.)

motorest · 8 months ago
The premise of this article sounds an awful lot like a solution desperately searching for a problem.

The scenario used to drive this idea is a slippery slope fallacy that tests can take over an hour to run after years. That's rather dubious, but it also leaves out the fact that tests can be run in parallel. In fact, that is also a necessary condition for selective testing. So why bother introducing yet more complexity?

To make matters worse, if your project grows so large that your hypothetical tests take over an hour to run, it sounds like the project would already be broken down into modules. That alone already allows tests to run only when a specific part of the project changes.

So it's clear that test run time is not a problem that justifies throwing complexity at it. Excluding the test runtime argument, is there anything at all that justifies this?

atq2119 · 8 months ago
> To make matters worse, if your project grows so large that your hypothetical tests take over an hour to run, it sounds like the project would already be broken down into modules.

Story time so that you can revisit your assumptions.

Imagine your product is a graphics driver. Graphics APIs have extensive test suites with millions of individual tests. Running them serially typically takes many hours, depending on the target hardware.

But over the years you invariably also run across bugs exposed by real applications that the conformance suites don't catch. So, you also accumulate additional tests, some of them distilled versions of those triggers, some of them captured frames with known good "golden" output pictures. Those add further to the test runtime.

Then, mostly due to performance pressures, there are many options that can affect the precise details of how the driver runs. Many of them are auto-tuned heuristics, but the conformance suite is unlikely to hit all the heuristic cases, so really you should also run all your tests with overrides for the heuristics. Now you have a combinatorial explosion that means the space of tests you really ought to run is at least in the quadrillions.

It's simply infeasible to run all tests on every PR, so what tends to happen in practice is that a manually curated subset of tests is run on every commit, and then more thorough testing happens on various asynchronous schedules (e.g. on release branches, daily on the development branch).

I'm not convinced that the article is the solution, but it could be part of one. New ideas in this space are certainly welcome.

lihaoyi · 8 months ago
> So it's clear that test run time is not a problem that justifies throwing complexity at it.

There's a lot I can respond to in this post, but I think the bottom line is: if you have not experienced the problem, count your blessings.

Lots of places do face these problems, with test suites that take hours or days to run if not parallelized. And while parallelization reduces latency, it does not reduce costs, and test suites taking 10, 20, or 50 USD every time you update a pull request are not uncommon either.

If you never hit these scenarios, just know that many others are not so lucky.

motorest · 8 months ago
> There's a lot I can respond to in this post, but I think the bottom line is: if you have not experienced the problem, count your blessings.

You don't even specify what the problem is. Again, this is a solution searching for a problem, and one you can't even describe.

> Lots of places do face these problems, with test suites that take hours or days to run if not parallelized.

What do you mean "if not parallelized"? Are you running tests sequentially and then complaining about how long they take to run?

I won't even touch on the red flag that is the apparent lack of modularization.

> And while parallelization reduces latency, it does not reduce costs, and test suites taking 10, 20, or 50 USD every time you update a pull request are not uncommon either

Please explain exactly how you managed to put together a test suite that costs up to 50 USD to run.

I assure you the list of problems and red flags you state along the way will never even feature selective testing as a factor or as a solution.

csomar · 8 months ago
How do modules solve this problem? If you change module A, don't you also need to test every component that depends on that module, to account for any possible regression?

> The scenario used to drive this idea is a slippery slope fallacy that tests can take over an hour to run after years.

It doesn't really take much for a project's test suite to grow to an hour. If you have a CI/CD pipeline that runs tests on every commit and you have 30-40 commits daily, that's 30-40 hours of compute per day.

Still, I'd rather set up two full machines to run the tests than have to deal with selective testing. It might make sense if you have hundreds of commits per day.

mhlakhani · 8 months ago
In large mono-repos, like the ones this is presumably targeting, running all tests in the repo for a given PR would take years (maybe even decades or centuries) of compute time. You have to do some level of test selection, and there are full-time engineers who just work on optimizing this.

The test runtime argument is the main one IMO.

(source: while I did not work on this at a prior job, I worked closely with the team that did this work).

motorest · 8 months ago
> In large mono-repos, like the ones this is presumably targeting, running all tests in the repo for a given PR would take years (maybe even decades or centuries) of compute time.

No, not really. That's a silly example. Your assertion makes as much sense as arguing that your monorepo would take years to build because all the code is tracked in the same repo. It doesn't, does it? Why not?

How you store your code has nothing to do with how you build it.

rurban · 8 months ago
When your CI is too big, at least do a random selection, so you'll still catch bugs at some point. With selective testing you just ignore them.
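A minimal sketch of that fallback in Python (the fraction and the seeding-by-build-id are arbitrary choices, just to make the slice different per run yet reproducible for a given build):

```python
import random

def random_subset(all_tests: list[str], fraction: float, build_id: int) -> list[str]:
    """Pick a random slice of the suite, seeded by the build id."""
    rng = random.Random(build_id)  # reproducible for a given build
    k = max(1, int(len(all_tests) * fraction))
    return rng.sample(all_tests, k)

tests = [f"test_{i}" for i in range(1000)]
print(len(random_subset(tests, fraction=0.1, build_id=42)))  # 100
```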
brunoarueira · 8 months ago
At my last job, since the project is based on Ruby on Rails, we implemented this https://github.com/toptal/crystalball with additional modifications based on the GitLab setup. After that, the pull request test suite runs pretty fast.
lbriner · 8 months ago
Like others, I think this is a solution describing an idealised problem, but it very quickly breaks down.

Firstly, if we could accurately know the dependencies that potentially affect a top-level test, we would not be likely to have a problem in the first place. Our code base is not particularly complex and is probably around 15 libraries plus a web app and API in a single solution. A change to something in a library potentially affects about 50 places (but might not affect any of them), and most of the time there is no direct/easy visibility of what calls what to call what to call what. There is also no correlation between folders and top-level tests. Most code is shared; how would that work?

Secondly, we use some front-end code (like many on HN), where a simple change could break every single other front-end page. It might be bad architecture, but that is what it is, and so any front-end change would need to run every UI test. The breaks might be subtle, like a specific button now disappearing behind a sidebar: not noticeable on the other pages, but it will definitely break a test.

Thirdly, you have to run all of your tests before deploying to production anyway, so the fast feedback you might get early on is nice, but most likely you won't notice the bad stuff until the 45-minute test suite has run, at which point you have blocked production and will have to prove that you have fixed it before waiting another 45 minutes.

Fourthly, a big problem for us (maybe 50% of the failures) is flaky tests (caused by flaky code, timing issues, database state issues or just hardware problems), and running selective tests doesn't deal with this.

And lastly, we already run tests somewhat selectively - we run unit tests on branch builds before building main, and we have a number of test projects running in parallel. But with less-than-perfect developers, less-than-perfect architecture, and less-than-perfect CI tools and environments, I think we are just left to try and incrementally improve things by identifying parallelisation opportunities, not over-testing functionality that is not on the main paths, etc.

atq2119 · 8 months ago
Another helpful tool that should be mentioned in this context is the idea of merge trains, where a thorough test run is amortized over many commits (that should each have first received lighter selective testing).

This doesn't necessarily reduce the latency until a commit lands, though it might by reducing the overall load on the testing infrastructure. But it ensures that all commits ultimately get the same test coverage with less redundant test expense.

That avoids the problem where a change in component A can accidentally break a test that is only run on changes to component B.

(It also eliminates regressions that occur when two changes land that have a semantic conflict that wasn't detected because there was no textual conflict and the changes were only tested independently.)
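A toy version of that batching in Python, to show where the amortization comes from (the queue, the batch size, and the stand-in run_full_suite are all hypothetical): one thorough run covers the whole batch, and only a failing batch pays for extra runs.

```python
def run_full_suite(state: set[str]) -> bool:
    # Stand-in for the expensive full test job; pretend one queued PR is broken.
    return "bad-pr" not in state

def merge_train(main: set[str], queue: list[str], batch_size: int) -> set[str]:
    while queue:
        batch, queue = queue[:batch_size], queue[batch_size:]
        candidate = main | set(batch)
        if run_full_suite(candidate):
            main = candidate  # the whole batch lands on a single thorough run
        else:
            # Retest each PR of the failed batch alone to evict the offender(s).
            for pr in batch:
                if run_full_suite(main | {pr}):
                    main = main | {pr}
    return main

print(sorted(merge_train(set(), ["pr-1", "bad-pr", "pr-3"], batch_size=3)))
# ['pr-1', 'pr-3']
```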