Mercury: Ultra-fast language models based on diffusion

A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU bottlenecked on testing performance than today, and every team I know of today was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green. Many runs end up bottlenecked on I/O or availability of workers, and so changes can sit in queues for hours, or they flake out and everything has to start again.

As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as it goes. This will make the CI bottleneck even worse.

It feels like there's a lot of low hanging fruit in most projects' testing setups, but for some reason I've seen nearly no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, then stopped trying to improve things. If anything CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching), and move them from on-prem dedicated hardware to expensive cloud VMs with slow IO, which haven't got much faster over time.

Mercury is crazy fast and in a few quick tests I did, created good and correct code. How will we make test execution keep up with it?

kccqzy · 5 months ago

> Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green.

I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem. When I was at Google, it was somewhat common for me to debug non-deterministic bugs such as a missing synchronization or fence causing flakiness; and it was common to just launch 10000 copies of the same test on 10000 machines to find perhaps a single digit number of failures. My current employer has a clunkier implementation of the same thing (no UI), but there's also a single command to launch 1000 test workers to run all tests from your own checkout. The goal is to finish testing a 1M loc codebase in no more than five minutes so that you get quick feedback on your changes.
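A minimal sketch of the "launch many copies of the same test" idea, scaled down to one machine. The helper `count_failures` and the stand-in `simulated_flaky_test` are hypothetical names for illustration; the fleet-level tooling described above is not shown here and is assumed to do the equivalent across machines rather than threads.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def count_failures(test: Callable[[int], bool], runs: int, workers: int = 8) -> int:
    """Run `test` `runs` times concurrently and return how many runs failed.

    `test` receives the run index and returns True on success.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(test, range(runs)))
    return sum(1 for ok in results if not ok)


def simulated_flaky_test(run_index: int) -> bool:
    """Hypothetical stand-in for a racy test that fails on rare runs."""
    rng = random.Random(run_index)  # seeded per run so the demo is repeatable
    return rng.random() >= 1 / 2000


if __name__ == "__main__":
    failures = count_failures(simulated_flaky_test, runs=10_000)
    print(f"{failures} failures out of 10000 runs")
```

The point of the design is only that a rare failure becomes observable once the run count is high enough; how those runs are distributed is an infrastructure detail.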

> make builds fully hermetic (so no inter-run caching)

These are orthogonal. You want maximum deterministic CI steps so that you make builds fully hermetic and cache every single thing.

mike_hearn · 5 months ago

I was also at Google for years. Places like that are not even close to representative. They can afford to just-throw-more-resources, they get bulk discounts on hardware and they pay top dollar for engineers.

In more common scenarios that represent 95% of the software industry CI budgets are fixed, clusters are sized to be busy most of the time, and you cannot simply launch 10,000 copies of the same test on 10,000 machines. And even despite that these CI clusters can easily burn through the equivalent of several SWE salaries.

> These are orthogonal. You want maximum deterministic CI steps so that you make builds fully hermetic and cache every single thing.

Again, that's how companies like Google do it. In normal companies, build caching isn't always perfectly reliable, and if CI runs suffer flakes due to caching then eventually some engineer is gonna get mad and convince someone else to turn the caching off. Blaze goes to extreme lengths to ensure this doesn't happen, and Google spends extreme sums of money on helping it do that (e.g. porting third party libraries to use Blaze instead of their own build system).

In companies without money printing machines, they sacrifice caching to get determinism and everything ends up slow.

IshKebab · 5 months ago

Developer time is more expensive than machine time, but at most companies it isn't 10000x more expensive. Google is likely an exception because it pays extremely well and has access to very cheap machines.

Even then, there are other factors:

* You might need commercial licenses. It may be very cheap to run open source code 10000x, but guess how much 10000 Questa licenses cost.

* Moore's law is dead; Amdahl's law very much isn't. Not everything is embarrassingly parallel.

* Some people care about the environment. I worked at a company that spent 200 CPU hours on every single PR (even to fix typos; I failed to convince them they were insane for not using Bazel or similar). That's a not insignificant amount of CO2.
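To make the Amdahl's law point concrete, here is a small sketch of the standard formula. The function name and the 95% figure are illustrative assumptions, not numbers from the thread.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales.

    Amdahl's law: speedup = 1 / ((1 - p) + p / n)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed example: 95% of a CI run parallelises, 5% is serial setup.
    for n in (10, 100, 10_000):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

Under that assumption, even 10,000 workers give a speedup of just under 20x, because the serial 5% caps the result at 1 / 0.05.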

mark_undoio · 5 months ago

> I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain? It's just a throw-more-resources problem.

I'd personally agree. But this sounds like the kind of thing that, at many companies, could be a real challenge.

Ultimately, you can measure dollars spent on CI workers. It's much harder and less direct to quantify the cost of not having them (until, for instance, people start taking shortcuts with testing and a regression escapes to production).

That kind of asymmetry tends, unless somebody has a strong overriding vision of where the value really comes from, to result in penny pinching on the wrong things.

mystified5016 · 5 months ago

IME it's less of a "throw more resources" problem and more of a "stop using resources in literally the worst way possible"

CI caching is, apparently, extremely difficult. Why spend a couple of hours learning about your CI caches when you can just download and build the same pinned static library a billion times? The server you're downloading from is (of course) someone else's problem and you don't care about wasting their resources either. The power you're burning by running CI for three hours instead of one is also someone else's problem. Compute time? Someone else's problem. Cloud costs? You bet it's someone else's problem.

Sure, some things you don't want to cache. I always do a 100% clean build when cutting a release or merging to master. But for intermediate commits on a feature branch? Literally no reason not to cache builds the exact same way you do on your local machine.
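One common way to cache a pinned dependency safely is to key the cache on the content of whatever pins it. The sketch below assumes a hypothetical `cache_key` helper and a toolchain label; it is not the API of any particular CI service.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def cache_key(lockfiles: Iterable[Path], toolchain: str) -> str:
    """Derive a cache key from the files that pin dependencies.

    The key changes only when a lockfile's content or the toolchain label
    changes, so unchanged pinned libraries can be restored instead of rebuilt.
    """
    digest = hashlib.sha256()
    digest.update(toolchain.encode("utf-8"))
    for path in sorted(lockfiles):
        # Include the file name so swapping two files' contents changes the key.
        digest.update(b"\0" + path.name.encode("utf-8") + b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

A feature-branch build would look up the key first and only download and build on a miss; a release build can simply ignore the cache.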

ronbenton · 5 months ago

>Do companies not just double their CI workers after hearing people complain?

They do not.

I don't know if it's a matter of justifying management levels, but these discussions are often drawn out and belabored in my experience. By the time you get approval, or even worse, rejected, for asking for more compute (or whatever the ask is), you've spent way more money on the human resource time than you would ever spend on the requested resources.

wat10000 · 5 months ago

Many companies are strangely reluctant to spend money on hardware for developers. They might refuse to spend $1,000 on a better laptop to be used for the next three years by an employee, whose time costs them that much money in a single afternoon.

socalgal2 · 5 months ago

Even Google can not buy more old Intel Macs or Pixel 6s or Samsung S20s to increase their testing on those devices (as an example)

Maybe that affects fewer devs, since many don't need to test on actual hardware, but plenty of apps do. Pretty much anything that touches a GPU driver, for example, like a game.

wbl · 5 months ago

No it is not. Senior management often has a barely disguised contempt for engineering and spending money to do a better job. They listen much more to sales complain.

anp · 5 months ago

I’m currently at google (opinions not representative of my employer’s etc) and this is true for things that run in a data center but it’s a lot harder for things that need to be tested on physical hardware like parts of Android or CrOS.

wavemode · 5 months ago

You're confusing throughput and latency. Lengthy CI runs increase the latency of developer output, but they don't significantly reduce overall throughput, given a developer will typically be working on multiple things at once, and can just switch tasks while CI is running. The productivity cost of CI is not zero, but it's way, way less than the raw wallclock time spent per run.

Then also factor in that most developer tasks are not even bottlenecked by CI. They are bottlenecked primarily by code review, and secondarily by deployment.

kevingadd · 5 months ago

My personal experience: We run over 1.1m test cases to verify every PR that I submit, and there are more test cases that don't get run on every commit and instead get run daily or on-demand.

At that scale getting quick turnaround is a difficult infrastructure problem, especially if you have individual tests that take multiple seconds or suites that take multiple minutes (we do, and it's hard to actually pull the execution time down on all of them).

I've never personally heard "we don't have the budget" or "we don't have enough machines" as answers for why our CI turnaround isn't 5 minutes, and it doesn't seem to me like the answer is just doubling the core count in every situation.

The scenario I work on daily (a custom multi-platform runtime with its own standard library) does by necessity mean that builds and testing are fairly complex though. I wouldn't be surprised if your assertion (just throw more resources at it) holds for more straightforward apps.

MangoToupe · 5 months ago

Writing testing infrastructure so that you can just double workers and get a corresponding doubling in productivity is non-trivial. Certainly I've never seen anything like Google's testing infrastructure anywhere else I've worked.

joshstrange · 5 months ago

> I don't understand this. Developer time is so much more expensive than machine time. Do companies not just double their CI workers after hearing people complain?

They don’t, and it’s annoying. Oh some do for sure but it’s the same with developer hardware. I’ve worked places where developers are waiting 4+ minutes for a build to complete when that could be halved or more by better developer hardware but companies are sometimes incredibly “penny wise and pound foolish”.

fy20 · 5 months ago

My last company was unsure about paying $20/mo to get a Copilot license for all the engineers.

noelwelsh · 5 months ago

I believe the solution is to run CI locally, and upload some signed proof that CI completed successfully. Running the same tests locally and then on an ~10 times slower CI build always felt like a ridiculous waste of time to me.
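A rough sketch of what a "signed proof" could look like, using an HMAC over the commit and result. The function names and payload shape are assumptions for illustration. A shared secret on the developer's machine means the developer could forge results, so a real scheme would presumably need asymmetric keys or some trusted execution step, which the comment does not specify.

```python
import hashlib
import hmac
import json


def sign_ci_result(secret: bytes, commit: str, passed: bool) -> dict:
    """Produce an attestation that tests for `commit` finished with `passed`."""
    payload = json.dumps({"commit": commit, "passed": passed}, sort_keys=True)
    signature = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_ci_result(secret: bytes, attestation: dict) -> bool:
    """Check that the attestation was signed with `secret` and not altered."""
    expected = hmac.new(
        secret, attestation["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, attestation["signature"])
```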

physicsguy · 5 months ago

Not really, in most small companies/departments, £100k a month is considered a painful cloud bill and adding more EC2 instances to provide cloud runners can add 10% to that easily.

rafaelmn · 5 months ago

> If anything CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching), and move them from on-prem dedicated hardware to expensive cloud VMs with slow IO, which haven't got much faster over time.

I am guesstimating (based on previous experience self-hosting the runner for MacOS builds) that the project I am working on could get like 2-5x pipeline performance at 1/2 cost just by using self-hosted runners on bare metal rented machines like Hetzner. Maybe I am naive, and I am not the person that would be responsible for it - but having a few bare metal machines you can use in the off hours to run regression tests, for less than you are paying the existing CI runner just for build, that speed up everything massively seems like a pure win for relatively low effort. Like sure everyone already has stuff on their plate and would rather pay an external service to do it - but TBH once you have this kind of compute handy you will find uses anyway and just do things efficiently. And knowing how to deal with bare metal/utilize this kind of compute sounds generally useful skill - but I rarely encounter people enthusiastic about making this kind of move. It's usually - hey, let's move to this other service that has slightly cheaper instances and a proprietary caching layer so that we can get locked into their CI crap.

It's not like these services have 0 downtime/bug free/do not require integration effort - I just don't see why going bare metal is always such a taboo topic even for simple stuff like builds.

mike_hearn · 5 months ago

Yep. For my own company I used a bare metal machine in Hetzner running Linux and a Windows VM along with a bunch of old MacBook Pros wired up in the home office for CI.

It works, and it's cheap. A full CI run still takes half an hour on the Linux machine (the product [1] is a kind of build system for shipping desktop apps cross platform, so there's lots of file IO and cryptography involved). The Macs are by far the fastest. The M1 Mac is embarrassingly fast. It can complete the same run in five minutes despite the Hetzner box having way more hardware. In fairness, it's running both a Linux and Windows build simultaneously.

I'm convinced the quickest way to improve CI times in most shops is to just build an in-office cluster of M4 Macs in an air conditioned room. They don't have to be HA. The hardware is more expensive but you don't rent per month, and CI is often bottlenecked on serial execution speed so the higher single threaded performance of Apple Silicon is worth it. Also, pay for a decent CI system like TeamCity. It helps reduce egregious waste from problems like not caching things or not re-using checkout directories. In several years of doing this I haven't had build caching related failures.

[1] https://hydraulic.dev/

adamcharnock · 5 months ago

> 2-5x pipeline performance at 1/2 cost just by using self-hosted runners on bare metal rented machines like Hetzner

This is absolutely the case. It's a combination of having dedicated CPU cores, dedicated memory bandwidth, and (perhaps most of all) dedicated local NVMe drives. We see a 2x speed up running _within VMs_ on bare metal.

> And knowing how to deal with bare metal/utilize this kind of compute sounds generally useful skill - but I rarely encounter people enthusiastic about making this kind of move

We started our current company for this reason [0]. A lot of people know this makes sense on some level, but not many people want to do it. So we say we'll do it for you, give you the engineering time needed to support it, and you'll still save money.

> I just don't see why going bare metal is always such a taboo topic even for simple stuff like builds.

It is decreasingly so from what I see. Enough people have been variously burned by public cloud providers to know they are not a panacea. But they just need a little assistance in making the jump.

[0] - https://lithus.eu

azeirah · 5 months ago

At the last place I worked at, which was just a small startup with 5 developers, I calculated that a server workstation in the office would be both cheaper and more performant than renting a similar machine in the cloud.

Bare metal makes such a big difference for test and CI scenarios. It even has an integrated GPU to speed up webdev tests. Good luck finding an affordable machine in the cloud that has a proper GPU for this kind of use case.

daxfohl · 5 months ago

There are a couple mitigating considerations

1. As implementation phase gets faster, the bottleneck could actually switch to PM. In which case, changes will be more serial, so a lot fewer conflicts to worry about.

2. I think we could see a resurrection of specs like TLA+. Most engineers don't bother with them, but I imagine code agents could quickly create them, verify the code is consistent with them, and then require fewer full integration tests.

3. When background agents are cleaning up redundant code, they can also clean up redundant tests.

4. Unlike human engineering teams, I expect AIs to work more efficiently on monoliths than with distributed microservices. This could lead to better coverage on locally runnable tests, reducing flakes and CI load.

5. It's interesting that even as AI increases efficiency, that increased velocity and sheer amount of code it'll write and execute for new use cases will create its own problems that we'll have to solve. I think we'll continue to have new problems for human engineers to solve for quite some time.

valenterry · 5 months ago

> 2. I think we could see a resurrection of specs like TLA+.

I think so too. But it's not gonna be TLA+. It's just gonna be programming languages whose type systems catch problems much more comprehensively, allowing AI to iterate quickly without even having to run unit tests.

While developers don't want to spend the time to learn it and prefer easy-to-learn languages such as golang, LLMs only have to be trained once and then you can reap the benefits permanently.

TechDebtDevin · 5 months ago

LLM making a quick edit, <100 lines... Sure. Asking an LLM to rubber-duck your code, sure. But integrating an LLM into your CI is going to end up costing you 100s of hours productivity on any large project. That or spend half the time you should be spending learning to write your own code, dialing down context sizing and prompt accuracy.

I really really don't understand the hubris around llm tooling, and don't see it catching on outside of personal projects and small web apps. These things don't handle complex systems well at all, you would have to put a gun in my mouth to let one of these things work on an important repo of mine without any supervision... And if I'm supervising the LLM I might as well do it myself, because I'm going to end up redoing 50% of its work anyways..

kraftman · 5 months ago

I keep seeing this argument over and over again, and I have to wonder, at what point do you accept that maybe LLM's are useful? Like how many people need to say that they find it makes them more productive before you'll shift your perspective?

mike_hearn · 5 months ago

I've used Claude with a large, mature codebase and it did fine. Not for every possible task, but for many.

Probably, Mercury isn't as good at coding as Claude is. But even if it's not, there's lots of small tasks that LLMs can do without needing senior engineer level skills. Adding test coverage, fixing low priority bugs, adding nice animations to the UI etc. Stuff that maybe isn't critical so if a PR turns up and it's DOA you just close it, but which otherwise works.

Note that many projects already use this approach with bots like Renovate. Such bots also consume a ton of CI time, but it's generally worth it.

blitzar · 5 months ago

Do the opposite - integrate your CI into your LLM.

Make it run tests after it changes your code and either confirm it didn't break anything or go back and try again.
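A minimal sketch of that loop, assuming the agent is represented by a hypothetical `propose_change` callback that edits the working tree given the previous test output. The names and the retry limit are illustrative, not any real agent framework's API.

```python
import subprocess
from typing import Callable, Sequence


def run_tests(command: Sequence[str]) -> tuple[bool, str]:
    """Run the test command; return (passed, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def edit_until_green(
    propose_change: Callable[[str], None],
    test_command: Sequence[str],
    max_attempts: int = 3,
) -> bool:
    """Let the agent edit, run the tests, and feed failures back until green."""
    feedback = ""  # empty on the first attempt, test output afterwards
    for _ in range(max_attempts):
        propose_change(feedback)
        passed, output = run_tests(test_command)
        if passed:
            return True
        feedback = output
    return False  # give up and leave the change for a human to look at
```

Every iteration of this loop is a test run, which is exactly where the CI-speed concern raised at the top of the thread comes back in.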

DSingularity · 5 months ago

He is simply observing that if PR numbers and launch rates increase dramatically CI cost will become untenable.

piva00 · 5 months ago

I haven't worked in places using off-the-shelf/SaaS CI in more than a decade so I feel my experience has been quite the opposite from yours.

We always worked hard to make the CI/CD pipeline as fast as possible. I personally worked on those kinds of projects at 2 different employers as an SRE: a smaller 300-people shop where I was responsible for all their infra needs (CI/CD, live deployments, migrated later to k8s when it became somewhat stable, at least enough for the workloads we ran, but still in its beta-days), then at a different employer some 5k+ strong working on improving the CI/CD setup which used Jenkins as a backend but we developed a completely different shim on top for developer experience while also working on a bespoke worker scheduler/runner.

I haven't experienced a CI/CD setup that takes longer than 10 minutes to run in many, many years. I was quite surprised reading your comment, and I feel spoiled that I haven't felt this pain for more than a decade; I didn't really expect it was still an issue.

mike_hearn · 5 months ago

I think the prevalence of teams having a "CI guy" who often is developing custom glue, is a sign that CI is still not really working as well as it should given the age of the tech.

I've done a lot of work on systems software over the years so there's often tests that are very I/O or computation heavy, lots of cryptography, or compilation, things like that. But probably there are places doing just ordinary CRUD web app development where there's Playwright tests or similar that are quite slow.

A lot of the problems are cultural. CI times are a commons, so it can end in tragedy. If everyone is responsible for CI times then nobody is. Eventually management gets sick of pouring money into it and devs learn to juggle stacks of PRs on top of each other. Sometimes you get a lot of pushback on attempts to optimize CI because some devs will really scream about any optimization that might potentially go wrong (e.g. depending on your build system cache), even if caching nothing causes an explosion in CI costs. Not their money, after all.

grogenaut · 5 months ago

Before cars people spent little on petroleum products or motor oil or gasoline or mechanics. Now they do. That's how systems work. You wanna go faster well you need better roads, traffic lights, on ramps, etc. you're still going faster.

Use AI to solve the IP bottlenecks or build more features that earn more revenue that buys more CI boxes. Same as if you added 10 devs, which you are with AI, so why wouldn't some of the dev support costs go up?

Are you not in a place where you can make an efficiency argument to get more ci or optimize? What's a ci box cost?

pamelafox · 5 months ago

For Python apps, I've gotten good CI speedups by moving over to the astral.sh toolchain, using uv for the package installation with caching. Once I move to their type-checker instead of mypy, that'll speed the CI up even more. The playwright test running will then probably be the slowest part, and that's only in apps with frontends.

(Also, Hi Mike, pretty sure I worked with you at Google Maps back in early 2000s, you were my favorite SRE so I trust your opinion on this!)

mike_hearn · 5 months ago

Hi! :)

Astral's work is great but I wonder how they plan to become sustainable. Maybe it's one of those VC plays where they don't intend to ever really make money and it's essentially a productivity subsidy for the other startups.

My experience has been that most apps are bottlenecked on CPU outside of themselves during CI. Either in JIT runtimes, databases, browsers, or libraries they invoke. I guess now maybe models too. So implementation language won't necessarily make a huge difference to this - we need fresh ideas for how to make order of magnitude improvements here. They will probably vary between ecosystems.

droopyEyelids · 5 months ago

In most companies the CI/Dev Tools team is a career dead end. There is no possibility to show a business impact, it's just a money pit that leadership can't/won't understand (and if they do start to understand it, then it becomes _their_ money pit, which is a career dead end for them) So no one who has their head on straight wants to spend time improving it.

And you can't even really say it's a short sighted attitude. It definitely is from a developer's perspective, and maybe it is for the company if dev time is what decides the success of the business overall.

MangoToupe · 5 months ago

> it's just a money pit that leadership can't/won't understand

In my experience it's the opposite: they want more automated testing, but don't want to pay for the friction this causes on productivity.

hansvm · 5 months ago

- Just spin up more test instances. If the AI is as good as people claim then it's still way cheaper than extra programmers.

- Write fast code. At $WORK we can test roughly a trillion things per CPU physical core year for our primary workload, and that's in a domain where 20 microsecond processing time is unheard of. Orders of magnitude speed improvements pay dividends quickly.

- LLMs don't care hugely about the language. Avoid things like rust where compile times are always a drag.

- That's something of a strange human problem you're describing. Once the PR is reviewed, can't you just hit "auto-merge" and go to the next task, only circling back if the code was broken? Why is that a significant amount of developer time?

- The thing you're observing is something every growing team witnesses. You can get 90% of the way to what you want by giving the build system a greenfield re-write. If you really have to run 100x more tests, it's worth a day or ten sanity checking docker caching or whatever it is your CI/CD is using. Even hermetic builds have inter-run caching in some form; it's just more work to specify how the caches should work. Put your best engineer on the problem. It's important.

- Be as specific as possible in describing test dependencies. The fastest tests are the ones which don't run.

- Separate out unit tests from other forms of tests. It's hard to write software operating with many orders of magnitude of discrepancies, and tests are no exception. Your life is easier if conceptually they have a separate budget (e.g., continuous fuzz testing or load testing or whatever). Unit tests can then easily be fast enough for a developer to run all the changed ones on precommit. Slower tests are run locally when you think they might apply. The net effect is that you don't have the sort of back-and-forth with your CI that actually causes lost developer productivity because the PR shouldn't have a bunch of bullshit that's green locally and failing remotely.
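The "describe test dependencies" point can be sketched as a simple selection step. The dependency map and file names below are hypothetical; build systems that track this for you do it at a much finer grain.

```python
def select_tests(test_deps: dict[str, set[str]], changed_files: set[str]) -> list[str]:
    """Return only the tests whose declared dependencies overlap the change."""
    return sorted(
        test for test, deps in test_deps.items() if deps & changed_files
    )


if __name__ == "__main__":
    # Hypothetical project layout, for illustration only.
    deps = {
        "test_parser": {"parser.py", "tokens.py"},
        "test_renderer": {"renderer.py"},
        "test_cli": {"cli.py", "parser.py"},
    }
    print(select_tests(deps, {"parser.py"}))  # test_renderer is skipped
```

The selection is only as good as the declared dependencies: a missing entry silently skips a test that should have run, which is why the comment stresses being as specific as possible.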

mike_hearn · 5 months ago

These are all good suggestions, albeit many are hard to implement in practice.

> That's something of a strange human problem you're describing.

Are we talking about agent-written changes now, or human? Normally reviewers expect tests to pass before they review something, otherwise the work might change significantly after they did the review in order to fix broken tests. Auto merges can fail due to changes that happened in the meantime, so they aren't auto in many cases.

Once latency goes beyond a minute or two people get distracted and start switching tasks to something else, which slows everything down. And yes code review latency is a problem as well, but there are easier fixes for that.

SoftTalker · 5 months ago

Wow, your story gives me flashbacks to the 1990s when I worked in a mainframe environment. Compile jobs submitted by developers were among the lowest priorities. I could make a change to a program, submit a compile job, and wait literally half a day for it to complete. Then I could run my testing, which again might have to wait for hours. I generally had other stuff I could work on during those delays but not always.

mrkeen · 5 months ago

> Maybe I've just got unlucky in the past, but in most projects I worked on a lot of developer time was wasted on waiting for PRs to go green. Many runs end up bottlenecked on I/O or availability of workers

No, this is common. The devs just haven't grokked dependency inversion. And I think the rate of new devs entering the workforce will keep it that way forever.

Here's how to make it slow:

* Always refer to "the database". You're not just storing and retrieving objects from anywhere - you're always using the database.

* Work with statements, not expressions. Instead of "the balance is the sum of the transactions", execute several transaction writes (to the database) and read back the resulting balance. This will force you to sequentialise the tests (simultaneous tests would otherwise race and cause flakiness) plus you get to write a bunch of setup and teardown and wipe state between tests.

* If you've done the above, you'll probably need to wait for state changes before running an assertion. Use a thread sleep, and if the test is ever flaky, bump up the sleep time and commit it if the test goes green again.
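The inverse of the list above, as a minimal sketch: the balance is a pure expression over transactions, and the storage is hidden behind an interface so tests can pass in an in-memory source. All class and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Transaction:
    account: str
    amount_cents: int


class TransactionSource(Protocol):
    """Anything that can list transactions: a database, a file, a list."""

    def transactions_for(self, account: str) -> Iterable[Transaction]: ...


def balance(transactions: Iterable[Transaction]) -> int:
    """The balance is the sum of the transactions: no writes, no read-back."""
    return sum(t.amount_cents for t in transactions)


def account_balance(source: TransactionSource, account: str) -> int:
    return balance(source.transactions_for(account))


class InMemorySource:
    """Test double: no shared state, so tests can run in parallel, no sleeps."""

    def __init__(self, transactions: Iterable[Transaction]) -> None:
        self._transactions = list(transactions)

    def transactions_for(self, account: str) -> list[Transaction]:
        return [t for t in self._transactions if t.account == account]
```

This covers the logic only. Whether the real database path also needs its own, slower tests is a separate question that the code above does not settle.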

bvrmn · 5 months ago

Nah. Tests could be run in N processes each with own database configured to skip full fsync. It resolves most of the issues and makes testing much much simpler.
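A small sketch of that setup, assuming SQLite as the database (the comment does not name one) and hypothetical function names. Each worker process creates its own throwaway database and turns off synchronous writes via `PRAGMA synchronous = OFF`, which is acceptable here only because the data is disposable.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ProcessPoolExecutor


def run_shard(shard_id: int) -> int:
    """Run one shard's tests against a private database; return the balance read back."""
    with tempfile.TemporaryDirectory() as workdir:
        conn = sqlite3.connect(os.path.join(workdir, f"shard-{shard_id}.db"))
        try:
            # Skip fsync: durability does not matter for a throwaway test DB.
            conn.execute("PRAGMA synchronous = OFF")
            conn.execute("CREATE TABLE transactions (amount INTEGER NOT NULL)")
            conn.executemany(
                "INSERT INTO transactions (amount) VALUES (?)", [(500,), (-200,)]
            )
            conn.commit()
            (total,) = conn.execute("SELECT SUM(amount) FROM transactions").fetchone()
            return total
        finally:
            conn.close()


def run_all_shards(shards: int) -> list[int]:
    """Run every shard in its own process so no database state is shared."""
    with ProcessPoolExecutor(max_workers=shards) as pool:
        return list(pool.map(run_shard, range(shards)))
```

On platforms that spawn rather than fork worker processes, `run_all_shards` should be called from under an `if __name__ == "__main__":` guard.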

zbentley · 5 months ago

> Instead of "the balance is the sum of the transactions", execute several transaction writes (to the database) and read back the resulting balance

Er, doesn’t this boil down to saying “not testing database end state (trusting in transactionality) is faster than testing it”?

I mean sure, trivially true, but not a good idea. I’ve seen lots of bugs caused by code that unexpectedly forced a commit, or even opened/used/committed a whole new DB connection, somewhere buried down inside a theoretically externally-transactional request handler. Bad code, to be sure, but common in many contexts in my experience.

pplonski86 · 5 months ago

We write and run tests to build trust in our code changes. But maybe tests aren’t the only way to achieve that trust.

When I was younger, I had a friend who was a senior software engineer. I remember he would make changes to production systems without even running the application locally or executing any tests, and yet his changes never failed. The team had a high level of trust in all his code changes.

theptip · 5 months ago

This might end up being less of an issue.

If I am coding, I want to stay in the flow and get my PR green asap, so I can continue on the project.

If I am orchestrating agents, I might have 10 or 100 PRs in the oven. In that case I just look at the ones that finish CI.

It’s gonna be less, or at least different, kind of flow IMO. (Until you can just crank out design docs and whiteboard sessions and have the agents fully autonomously get their work green.)

TheDudeMan · 5 months ago

This is because coders didn't spend enough time making their tests efficient. Maybe LLM coding agents can help with that.

dmitrycube · 5 months ago

100% agree.

One of the core premises of what we've been trying to do with our product (Testkube) is to decouple testing from CI/CD systems. Those were never built with testing in mind, let alone scaling to 100s or 1000s of efficient executions. We have a lightweight open-source agent that lives inside a K8s cluster; tests are stored as CRDs cloned from your Git repo and executed as K8s jobs. Create whatever heuristics or parallelization you need, leverage the power of K8s to dynamically scale compute resources as needed, trigger executions by whatever means (GitHub Actions, K8s events, schedule, etc.), and do it all on your existing infra.

Admittedly, we don't solve the test creation problem. If there are new tools out there which could automagically generate tests along with code, please share.

Art9681 · 5 months ago

Any modern MacBook can run those tests 100x faster than the crappy cloud runners most companies use. You can also configure runners that run locally and get the benefit of those speed gains. So all of this is really a business and technical problem that is solved for those who want to solve it. It can be solved very cheap, or it can be solved very expensive. Regardless, it's precisely those types of efficiency gains that motivate companies to finally do something about it.

And if not, then enjoy being paid waiting for CI to go green. Maybe it's a reminder to go take a break.

It will be worse when the process is super optimized and the expectation changes. So now instead of those 2 PRs that went to prod today because everyone knows CI takes forever, you'll be expected to push 8 because in our super optimized pipeline it only takes seconds. No excuses. Now the bottleneck is you.

drzaiusx11 · 5 months ago

The nice part about most CI workloads is that they can almost always be split up and executed in parallel. Make sure you're utilizing every core on every CI worker and your worker pools are appropriately sized for the workload. Use spot instances and add auto scaling where it makes sense. No one should be waiting more than a few minutes for a PR build. Exception being compile time which can vary significantly between languages. I have a couple projects that are stuck on ancient compilers because of CPU architecture and C variant, so those will always be a dog without effort to move to something better. Ymmv

drzaiusx11 · 5 months ago

As an example, we recently had a Ruby application with a test suite that was taking literally an hour per build, but it turned out it was running entirely sequentially by default, using only 1 core. I spent an afternoon migrating our CI runners to split the workload across all available cores and now it's 5 minutes per build. And that was just the low hanging fruit. It can be significantly improved further, but there are obviously diminishing returns.
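
For illustration, a minimal Python sketch of splitting a suite across cores, assuming per-file timings from a previous run are available. The function name `shard_by_duration` and the spec file names are hypothetical, and this is plain greedy longest-first assignment, not necessarily what the Ruby setup above used:

```python
import heapq
import os


def shard_by_duration(durations, workers):
    """Greedy longest-first assignment of test files to workers.

    durations: dict mapping a test file name to its last known runtime in seconds.
    Returns a list of (total_seconds, [files]) tuples, one per worker.
    """
    # Min-heap keyed on accumulated runtime; the worker index breaks ties
    # deterministically and keeps the file lists from ever being compared.
    heap = [(0.0, i, []) for i in range(workers)]
    heapq.heapify(heap)
    # Place the slowest files first so they spread across workers.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        total, i, files = heapq.heappop(heap)
        files.append(name)
        heapq.heappush(heap, (total + seconds, i, files))
    return [(total, files) for total, _, files in sorted(heap, key=lambda t: t[1])]


if __name__ == "__main__":
    # Hypothetical timings; a real suite would read these from an earlier run.
    timings = {"a_spec.rb": 600, "b_spec.rb": 300, "c_spec.rb": 300, "d_spec.rb": 200}
    for total, files in shard_by_duration(timings, workers=os.cpu_count() or 1):
        print(total, files)
```

Each worker would then run only its own list of files. Balancing by measured duration rather than by file count matters when a few slow files dominate the suite.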

ASinclair · 5 months ago

Call me a skeptic but I do not believe LLMs are significantly altering the time between commits so much that CI is the problem.

However, improving CI performance is valuable regardless.

trhway · 5 months ago

>There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.

Testing every change incrementally is a vestige of the code being done by humans (and thus of the current approach where AI helps and/or replaces one given human), in small increments at that, and of the failures being analyzed by individual humans who can keep in their head only a limited number of things/dependencies at once.

mathiaspoint · 5 months ago

Good God I hate CI. Just let me run the build automation myself dammit! If you're worried about reproducibility make it reproducible and hash the artifacts, make people include the hash in the PR comment if you want to enforce it.
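
A rough sketch of the "hash the artifacts" idea, assuming the build writes its outputs into a single directory and is already reproducible. The function name `artifact_digest` and the directory layout are hypothetical:

```python
import hashlib
from pathlib import Path


def artifact_digest(root):
    """Return one SHA-256 hex digest covering every file under root.

    Files are visited in sorted relative-path order, and both the path and
    the content feed the hash, so renames and content changes alter it.
    """
    root = Path(root)
    files = sorted(
        (p.relative_to(root).as_posix(), p) for p in root.rglob("*") if p.is_file()
    )
    h = hashlib.sha256()
    for rel, path in files:
        rel_bytes = rel.encode("utf-8")
        data = path.read_bytes()
        # Length-prefix each field so adjacent files cannot be confused.
        h.update(len(rel_bytes).to_bytes(8, "big") + rel_bytes)
        h.update(len(data).to_bytes(8, "big") + data)
    return h.hexdigest()
```

Two people who build the same commit and get the same digest have produced byte-identical outputs. This only helps if the build itself avoids embedding timestamps, absolute paths, and similar noise.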

The amount of time people waste futzing around in e.g. Groovy is INSANE and I'm honestly inclined to reject job offers from companies that have any serious CI code at this point.

esafak · 5 months ago

It takes more work (serious CI code) to make CI run anywhere, such as your own computer. So you prefer companies that just use GHA? You can't get simpler than that.

mdnahas · 5 months ago

We don’t. We switch to proven-correct code. Languages like Lean, Coq, and Idris allow proofs of correctness for code. The LLM can generate proofs for most of the correctness conditions.

CI is still needed for performance, UI testing, etc. but it can have a much smaller role than it does now.

blitzar · 5 months ago

Yet now that I have added an LLM workflow to my coding, the value of my old and mostly useless workflows has 10x'd.

Git checkpoints, code linting and my naive suite of unit and integration tests are now crucial to my LLM not wasting too much time generating total garbage.

elbear · 5 months ago

CI should just run on each developer's machine. As in, each developer should have a local instance of the CI setup in a VM or a docker container. If tests pass, the result is reported to a central server.

vjerancrnjak · 5 months ago

It’s because people don’t know how to write tests. All of the “don’t do N select queries in a for loop” comments made in PRs are completely ignored in tests.

Each test can output many db queries. And then you create multiple cases.

People don’t even know how to write code that just deals with N things at a time.

I am confident that tests run slowly because the code that is tested completely sucks and is not written for batch mode.
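
To make the "N select queries in a for loop" point concrete, here is a small sketch using Python's built-in `sqlite3`. The `songs` table and the function names are made up for illustration, and an in-memory database hides the network round trips that make the loop version slow in practice:

```python
import sqlite3


def make_db():
    """Create an in-memory database with 100 hypothetical songs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE songs (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO songs (id, title) VALUES (?, ?)",
        [(i, f"song {i}") for i in range(1, 101)],
    )
    return conn


def titles_one_by_one(conn, ids):
    """N queries: one SELECT per id (the pattern reviewers flag)."""
    out = {}
    for song_id in ids:
        row = conn.execute(
            "SELECT title FROM songs WHERE id = ?", (song_id,)
        ).fetchone()
        if row is not None:
            out[song_id] = row[0]
    return out


def titles_batched(conn, ids):
    """One query: a single SELECT ... WHERE id IN (...)."""
    ids = list(ids)
    if not ids:
        return {}
    # Only "?" placeholders are interpolated; the values stay parameterized.
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title FROM songs WHERE id IN ({placeholders})", ids
    ).fetchall()
    return dict(rows)
```

Both functions return the same mapping, but the batched one issues a single statement however many ids are requested. Very large id lists would need chunking, since SQLite caps the number of bound parameters per statement.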

Ignoring batch mode, tests are most of the time written in a way where test cases are run sequentially. Yet attempts to run them concurrently result in flaky tests, because the way you write them and the way you design interfaces does not allow concurrent execution at all.

Another comment: code done by the best AI model still sucks. Anything simple, like a music player with a library of 10000 songs, is something it can’t do. First attempt will be horrible. No understanding of concurrent metadata parsing, lists showing 10000 songs at once in UI being slow, etc.

So AI is just another excuse for people writing horrible code and horrible tests. If it’s so smart, try to speed up your CI with it.

rapind · 5 months ago

> This will make the CI bottleneck even worse.

I agree. I think there are potentially multiple solutions to this since there are multiple bottlenecks. The most obvious is probably network overhead when talking to a database. Another might be storage overhead if storage is being used.

Frankly, another one is language. I suspect type-safe, compiled, functional languages are going to see some big advantages here over dynamic interpreted languages. I think this is the sweet spot that grants you a ton of performance over dynamic languages, gives you more confidence in the model's changes, and requires less testing.

Faster turn-around, even when you're leaning heavily on AI, is a competitive advantage IMO.

mike_hearn · 5 months ago

It could go either way. Depends very much on what kind of errors LLMs make.

Type safe languages in theory should do well, because you get feedback on hallucinated APIs very fast. But if the LLM generally writes code that compiles, unless the compiler is very fast you might get out-run by an LLM just spitting out JavaScript at high speed, because it's faster to run the tests than wait for the compile.

The sweet spot is probably JIT compiled type safe languages. Java, Kotlin, TypeScript. The type systems can find enough bugs to be worth it, but you don't have to wait too long to get test results either.

yieldcrv · 5 months ago

then kill the CI/CD

these redundant processes are for human interoperability

gdiamos · 5 months ago

This sounds like a strawman.

GPUs can do 1 million trillion instructions per second.

Are you saying it’s impossible to write a test that finishes in less than one second on that machine?

Is that a fundamental limitation or an incredibly inefficient test?

mike_hearn · 5 months ago

It's amazing how easy it is to write tests that are slow. Taking >1 second per test is absolutely normal.

> Is that a fundamental limitation or an incredibly inefficient test?

That's the million dollar/month question. If an LLM can diffuse a patch in 3 seconds but it takes 3 hours to test then we have a problem, especially if the LLM needs more test feedback than a human would. But is it a fundamental problem or is it "just" a matter of effort?

I mostly work with JVM based apps in recent years and there's lots of low hanging fruit in tests there. JIT compilation is both a blessing and a curse. You don't waste any time compiling the tests themselves (to machine code), but also, the code that does get compiled is forgotten between runs and build systems like to test different modules in different processes. So every test run of every module starts with slow warmup. There is a lot of work being done at the moment on improving that situation, but a lot of it boils down to poor build systems and that's harder to fix (nobody agrees what a good build system looks like...)

In one of my current projects, I've made the entire test suite run in parallel at the level of individual test classes. This took a bit of work to stop different tests messing with each other's state inside the database, and it revealed some genuine race conditions when apparently unrelated features interacted in buggy ways. But it was definitely worth it for local testing. Unfortunately the CI configuration was then written in such a way that it starts by compiling one of its dependencies, which blows up test time to the point where improvements to the actual tests are nearly irrelevant. This particular CI system is non-standard/in house, and I haven't figured out how to fix it yet.

This kind of story is typical. Many such cases.
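
As a toy illustration of the isolation problem described above (in Python rather than on the JVM, and with hypothetical names throughout), each test below gets its own private in-memory database, so the tests can run concurrently without seeing each other's rows:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def run_isolated(check_fn):
    """Run one check against its own private in-memory database."""
    conn = sqlite3.connect(":memory:")  # each connection is a separate database
    try:
        conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
        check_fn(conn)
        return (check_fn.__name__, "pass")
    except AssertionError as exc:
        return (check_fn.__name__, f"fail: {exc}")
    finally:
        conn.close()


def check_deposit(conn):
    conn.execute("INSERT INTO accounts VALUES ('alice', 10)")
    conn.execute("UPDATE accounts SET balance = balance + 5 WHERE name = 'alice'")
    assert conn.execute("SELECT balance FROM accounts").fetchone()[0] == 15


def check_starts_empty(conn):
    # Against a shared database this could fail whenever check_deposit
    # happened to insert its row first.
    assert conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0] == 0


def run_all(checks, workers=4):
    """Run every check in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_isolated, checks))
```

A real suite that shares one database server would need some other form of isolation, such as a schema or transaction per test. That is where the "bit of work" mentioned above tends to go.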

nradclif · 5 months ago

A million trillion operations per second is literally an exaflop. That's one hell of a GPU you have.
