I have never seen a system with documentation as awful as Jenkins, with plugins as broken as Jenkins, with behaviors as broken as Jenkins. Groovy is a cancer, and the pipelines are half-assed, unfinished and incompatible with most things.
My conclusion was that this is near 100% a design taste and business model problem. That is, to make progress here will require a Steve Jobs of build systems. There are no technical breakthroughs required, but a lot of stuff has to gel together in a way that really makes people fall in love with it. Nothing else can break through the inertia of existing practice.
Here are some of the technical problems. They're all solvable.
• Unifying local/remote execution is hard. Local execution is super fast. The bandwidth, latency and CPU speed issues are real. Users have a machine on their desk that, compared to a cloud, offers vastly higher bandwidth, lower latency to storage, lower latency to input devices and, if they're Mac users, the fastest single-threaded performance on the market by far. It's dedicated hardware with no other users and offers totally consistent execution times. RCE (remote code execution) can easily slow down a build instead of speeding it up, and simulation is tough due to constantly varying conditions.
• As Gregory observes, you can't just do RCE as a service. CI is expected to run tasks devs aren't trusted to do, which means there has to be a way to prove that a set of tasks executed in a certain way even if the local tool driving the remote execution is untrusted, along with a way to prove that to others. As Gregory explores the problem, he ends up concluding there's no way to get rid of CI and that the best you can do is reduce the overlap a bit, which is hardly a compelling enough value prop. I think you can get rid of conventional CI entirely with a cleverly designed build system, but it's not easy.
• In some big ecosystems like JS/Python there aren't really build systems, just a pile of ad-hoc scripts that run linters, unit tests and Docker builds. Such devs are often happy with existing CI because the task DAG just isn't complex enough to be worth automating to begin with.
• In others like Java the ecosystem depends heavily on a constellation of build system plugins, which yields huge levels of lock-in.
• A build system task can traditionally do anything. Making tasks safe to execute remotely is therefore quite hard. Tasks may depend on platform-specific tooling that doesn't exist on Linux, or that only exists on Linux. Installed programs don't helpfully offer their dependency graphs up to you, and containerizing everything is slow and resource intensive (and doesn't help for non-Linux stuff anyway). Bazel has a sandbox that makes it easier to iterate on mapping out dependency graphs, but Bazel comes from Blaze, which was designed for a Linux-only world inside Google, not the real world where many devs run on Windows or macOS, and kernel sandboxing is a mess everywhere. Plus a sandbox doesn't solve the problem; it only gives you better errors as you try to solve it. LLMs might do a good job here.
But the business model problems are much harder to solve. Developers don't buy tools, only SaaS, but they also want to be able to do development fully locally. Because throwing a CI system up on top of a cloud is so easy, it's a competitive space and the possible margins just don't seem that big. Plus, there is no way to market to devs at a reasonable cost. They block ads, don't take sales calls, and some just hate the idea of running proprietary software locally on principle (none hate it in the cloud), so the only thing that works is making the clients open source and then trying to saturate the open source space with free credits in the hope of gaining attention for a SaaS. But giving compute away for free comes at a staggering cost that can eat all your margins. The whole dev tools market has this problem far worse than other markets do, so why would you write software for devs at all? If you want to sell software to artists or accountants it's much easier.
ERP rollouts can "fail" for lots of reasons that aren't to do with the software. They are usually business failures. Mostly, companies end up spending so much on trying to endlessly customize it to their idiosyncratic workflows that they exceed their project budgets and abandon the effort. In really bad cases like Birmingham they go live before actually finishing setup, and then lose control of their books and have to resort to hiring people to do the admin manually.
There's a saying about SAP: at some point gaining competitive advantage in manufacturing/retail became all about who could make SAP deployment a success.
This is no different from many other IT projects; most of them fail too. I think people who have never worked in an enterprise context don't realize that; it's not like working in the tech sector. In the tech industry, if a project fails it's probably because it was too ambitious and the tech itself just didn't work well. Or it was a startup whose tech worked but which couldn't find PMF. But in normal, mature, profitable non-tech businesses a staggering number of business automation projects just fail for social or business reasons.
AI deployments inside companies are going to be like that. The tech works. The business side problems are where the failures are going to happen. Reasons will include:
• Not really knowing what they want the AI to do.
• No way to measure improved productivity, so no way to decide if the API spend is worth it.
• Concluding the only way to get a return is to entirely replace people with AI, and then having to re-hire them because the AI can't handle the last 5% of the work.
• Non-tech executives doing deals to use models or tech stacks that aren't the right kind or good enough.
etc
RiscOS wasn't even on the table for the likes of IBM, and that is what it would have taken to succeed in the business market. But for many years the preferred machine for creating Videotext or ATEX (automatic typesetting system) bitstreams was a BBC Micro, and there were quite a few other such interesting niches. I still know of a few BBCs running art installations that have been going non-stop for close to 45 years now. Power supplies are the biggest problem, but there are people who specialize in repairing them, and there are various DIY resources as well (videos, articles).
I think it was just relative lack of apps in the end. Microsoft commodified the hardware so it became competitive and prices fell dramatically. Every other company stayed attached to their integrated designs and couldn't keep up on cost. Apple held on for a while because of the bigger US ecosystem and economy but nearly got wiped out also.
Also, RiscOS wasn't really backwards compatible with BBC apps and games, IIRC. It was more like a clean-sheet design.
Acorn's CPU division is the most successful CPU design house in the world and sells around 10x more chips than all forms of Intel and Intel-compatible chips put together.
It was named after its first product, the Acorn RISC Machine: ARM. It is still called Arm Ltd. today.
Arm alone is one half of the entire CPU market.
https://morethanmoore.substack.com/p/arm-2025-q4-and-fy-fina...
An Acorn-compatible CPU is inside half of the processor-powered devices in the world.
How is that "dropping the ball"? It is the most successful processor design of all time, bar none.
It was such a pity. I was a British schoolboy in the early 90s; we had a mix of Acorns and PCs at school, and I had a BBC Model B at home and then, a bit later, also a PC. Very lucky in hindsight.
The Acorn machines were ridiculously better, except for having fewer games. At first I don't remember there being much of a gaming gap, and there were plenty of games targeting the BBC Micro, but as games scaled up the bigger US economy started to matter much more and the app/game selection just wasn't as good.
In terms of engineering the GUI was better than Windows, but more importantly the reliability was way higher. My primary school teachers (!) were constantly getting me to fix the computers or install new apps because they always broke. When an Acorn "broke" it was something like the printer being out of paper. When the PC "broke" it was always something much, much harder.
From the AI's perspective a filesystem that vector indexes data on the fly would make sense, perhaps, along with an ability for the user to share out fine-grained permissions with it.
In fact the training process is all about minimizing "perplexity", where perplexity is a measure of how surprised (perplexed) the model is by its training data. It's some exponential inverse of the loss function; I always forget the exact definition.
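For the record, the standard definition: perplexity is the exponential of the average per-token cross-entropy, i.e.

    \mathrm{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i}) \right)

so minimizing the cross-entropy loss and minimizing perplexity are the same thing, and a model that was never surprised by its training data would have perplexity 1.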
With enough parameters the models are able to mix and match things pretty well, so the examples of them generating funny jokes aren't necessarily a great rebuttal: there are so many jokes on the web, and finding them requires nearly exact keyword matching. A better observation is that we haven't heard many stories of LLMs inventing things. I feel I read about AI a lot, and yet the best example I can come up with was some Wordle-like game someone got GPT-4 to invent, and that was a couple of years ago.
I've found this to be consistently true in my own work. Any time I come up with an algorithm or product idea I think might be novel, I've asked a model to suggest solutions to the same problem. They can never do it. With some leading questions the smartest models will understand the proposal and agree it could work, but they never come up with such ideas cold. What they think of is always the most obvious, straight-line, lowest-common-denominator kind of suggestion. It makes sense that this is because they're trained to be unsurprising.
Fixing this is probably the best definition of AGI we're going to get. Being surprising at the right time and unsurprising at others is one of the hardest things to do well, even for people. We've all known the awkward guy who's learning how to be funny by just saying as much weird stuff as possible and seeing what gets a reaction. And in the corporate environment, my experience has been that innovative people are lauded and praised while they're inventing a golden goose, but shortly afterwards are often demonized or kicked out. The problem is that they keep saying surprising things, and people don't like being surprised, especially if it's an unpleasant surprise of the form "saying something true but unsayable", e.g. "I don't want to work on product X because nobody is using it." What most people want is a machine that consistently generates pleasant surprises and is a personality-free cog otherwise, but that's hard for even very intelligent humans. It's often hard even to want to do that, because personality isn't something you can flick on and off like a light switch. A good example is how Mark Zuckerberg, one of the most successful executives of our era, would have been fired from his own company several times already if he didn't control the voting shares.
WORKFLOW
Every repository is personal and the reviewer merges, kernel style. Merging is taking ownership: the reviewer merges into their own tree when they are happy and not before. By implication there is always one primary code reviewer; there is never a situation where someone chooses three reviewers and they all wait for someone else to do the work. The primary reviewer is on the hook for the deliverable as much as the reviewee is.
There is no web-based review tool. Git is managed by a server configured with Gitolite. Everyone gets their own git repository under their own name, into which they clone the product repository. Everyone can push into everyone else's repos, but only to branches matching rr/{username}/something, and this is how you open a pull request. Hydraulic is an IntelliJ shop and the JetBrains git UI is really good, so it's easy to browse open RRs (review requests) and check them out locally.
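A sketch of the Gitolite side of such a setup (the repo naming scheme, group and usernames are illustrative, not the exact config; Gitolite's CREATOR and USER keywords do the heavy lifting):

    @devs = alice bob carol

    # Each developer creates a personal fork, e.g. dev/alice/product,
    # and pushes the product repository into it.
    repo dev/CREATOR/[a-zA-Z0-9].*
        C               =   @devs       # devs may create repos under their own name
        RW+             =   CREATOR     # full control over your own fork
        RW  rr/USER/    =   @devs       # others may push only rr/<their-username>/* branches
        R               =   @devs       # everyone can fetch and review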
Reviewing means pushing changes onto the rr branch. Either the reviewer makes the change directly (much faster than nitpicky comment roundtrips), or they add a //FIXME comment that IntelliJ is configured to render in lurid yellow and purple for visibility. It's up to the reviewee to clear all the FIXMEs before a change will be merged. Because IntelliJ is very good at refactoring, what you find is that reviewers are willing to make much bigger improvements to a change than you'd normally get via web-based review discussions. All the benefits the article discusses are there, except amplified 100x because IntelliJ is so good at static analysis. A lot of bugs that sneak past regular code review are caught this way because reviewers can see live static analysis results.
Sometimes during a review you want to ask questions. 90% of the time this is because the code isn't documented well enough, and the solution is to put the question in a //FIXME that's cleared by adding more comments. Sometimes that would be inappropriate because the conversation would have no value to others, and then it can be resolved via chat.
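For illustration, here is what the question-as-FIXME pattern looks like in practice (the class, constant and wording are hypothetical, not from a real codebase):

    // During review, the reviewer pushed this onto the rr branch:
    //
    //   // FIXME: why three retries? Is this papering over a flaky backend?
    //
    // The reviewee clears the FIXME by answering the question in the code itself:

    /** Fetches release manifests over HTTP for the packaging pipeline. */
    class ManifestFetcher {
        /**
         * The CDN intermittently returns 503 while its caches refresh, so a small,
         * deliberate number of retries is used; this is not papering over a product bug.
         */
        static final int MAX_RETRIES = 3;
    }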
Both reviewee and reviewer are expected to properly squash and rebase things. It's usually easier to let commits pile up during the review so both sides have state on the changes, and the reviewer then squashes the code review commits into the work before merging. To keep this easy, most review requests should turn into one or two commits at most; there shouldn't be cases where people submit an RR with 25 "WIP" commits that are all tangled up. So it does require discipline, but this isn't much different from normal development.
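Roughly what the reviewer's half of that looks like in plain git (remote and branch names are made up for the example):

    # Fetch the review request from the author's repo (remote "alice").
    git fetch alice
    git checkout -b review/manifest-cache alice/rr/alice/manifest-cache

    # Review in the IDE; push direct fixes and //FIXME questions back as extra commits.
    git push alice HEAD:rr/alice/manifest-cache

    # When both sides are happy, squash the review noise into one or two clean commits
    # and merge into your own tree (merging is taking ownership).
    git rebase -i main
    git checkout main
    git merge --ff-only review/manifest-cache
    git push origin main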
RATIONALE
1. Conventional code review can be an exhausting experience, especially for junior developers who make more mistakes. Every piece of work comes back with dozens of nitpicky comments that don't seem important and are a lot of drudge work to apply. It leads to frustration, burnout and interpersonal conflicts. Reviewees may not understand what is being asked of them, resulting in wasted time. So latency is often much lower if the reviewer just makes the changes directly in their IDE and pushes. People can then study the commits and learn from them.
2. Conventional projects can struggle to scale up because the codebase becomes a commons. Like in a communist state, things degrade and litter piles up because nobody is fully responsible. Junior developers or devs under time pressure quickly work out who will give them the easiest code review experience and send all the reviews to them. CODEOWNERS files are the next step, but it's rare that the structure of your source tree matches the hierarchy of technical management in your organization, so this can be a bad fit. Instead of improving widely shared code people end up copy/pasting it to avoid bringing in more mandatory reviewers. It's also easy for important but rarely changed directories to be left out, resulting in changes to core code slowing down because a trivial refactoring PR would need the founder of the company to approve it.
FINDINGS
Well, it worked well for me at small scale (decent-sized codebase but a small team). I never scaled it up to a big team, although it was inspired by problems seen managing a big team.
Because most questions are answered by improving code comments rather than replying in a web UI, the answers can help LLMs. LLMs work really well in my codebase and I think it's partly due to the plentiful documentation.
Sometimes the lack of a web UI for browsing code was an issue. I experimented with using IntelliJ link format, but of course not everyone wants to use IntelliJ. I could have set up a web UI over git just for source browsing, without the full GitHub experience, but in the end never bothered.
Gitolite is a very UNIXy set of Perl scripts. You need a gray beard to use it well. I thought about SaaSifying this workflow but it never seemed worth it.