There's a lot of love for monorepos nowadays, but after more than a decade of writing software, I still strongly believe they are an antipattern.
1. The single version dependencies are asinine. We are migrating to a monorepo at work, and someone bumped the version of an open source JS package that introduced a regression. The next deploy took our service down. Monorepos mean loss of isolation of dependencies between services, which is absolutely necessary for the stability of mission-critical business services.
2. It encourages poor API contracts because it lets anyone import any code in any service arbitrarily. Shared functionality should be exposed as a standalone library with a clear, well-defined interface boundary. There are entire packaging ecosystems like npmjs and pypi for exactly this purpose.
3. It encourages a ton of code churn with very low signal. I see at least one PR every week to code owned by my team that changes some trivial configuration, library call, or build directive, simply because some shared config or code changed in another part of the repo and now the entire repo needs to be migrated in lockstep for things to compile.
I've read this paper, as well as watched the talk on this topic, and am absolutely stunned that these problems are not magnified by 100x at Google scale. Perhaps it's simply organizational inertia that prevents them from trying a more reasonable solution.
1) This is solved by 2 interlocking concepts: comprehensive tests & pre-submit checks of those tests. Upgrading a version shouldn’t break anything because any breaking changes should be dealt with in the same change as the version bump.
2) Google’s monorepo allows for visibility restrictions, and publicly-visible build targets are not common & are reserved for truly public interfaces & packages.
3) “Code churn” is a very uncharitable description of day-to-day maintenance of an active codebase.
Google has invested heavily in infrastructural systems to facilitate the maintenance and execution of tests & code at scale. Monorepos are an organizational design choice which may not work for other teams. It does work at Google.
It’s not really even a true monorepo. Little-known feature: there is a versions map which pins major components like base or cfs. This breaks the monorepo abstraction and makes full-repo changes difficult, but it keeps devs of individual components sane.
>> 3. It encourages a ton of code churn with very low signal.
> 3) “Code churn” is a very uncharitable description of day-to-day maintenance of an active codebase.
Also implicit in the discussion is the fact that Google and other big tech companies base performance reviews on "impact" rather than arbitrary metrics like "number of PRs/LOCs per month". This provides a check on spending too much engineer time on maintenance PRs, since they have no (or very little) impact on your performance rating.
> The single version dependencies are asinine. We are migrating to
> a monorepo at work, and someone bumped the version of an open
> source JS package that introduced a regression.
There's no requirement to have single versions of dependencies in a monorepo. Google allows[0] multiple versions of third-party dependencies such as jQuery or MySQL, and internal code is expected to specify which version it depends on.
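To make the version-pinning point concrete, here is a minimal sketch of what per-version third-party targets can look like in a Bazel-style BUILD file. Everything in it is hypothetical (the package, the versions, the legacy_v3_0_users group), and js_library is assumed to come from whatever JS ruleset the repo uses; the idea is only that each consumer names the version it builds against, and that the old version can be visibility-restricted:

    # third_party/myjslib/BUILD (hypothetical)
    load("@aspect_rules_js//js:defs.bzl", "js_library")  # or whichever JS ruleset is in use

    package_group(
        name = "legacy_v3_0_users",
        # Teams still on the old version are listed explicitly, so the
        # exception stays visible and easy to burn down.
        packages = ["//services/foo/..."],
    )

    alias(
        name = "myjslib",  # the default, one-version target
        actual = ":myjslib_v3_1_0",
        visibility = ["//visibility:public"],
    )

    js_library(
        name = "myjslib_v3_1_0",
        srcs = glob(["v3_1_0/**/*.js"]),
        visibility = ["//visibility:public"],
    )

    js_library(
        name = "myjslib_v3_0_0",
        srcs = glob(["v3_0_0/**/*.js"]),
        visibility = [":legacy_v3_0_users"],  # restricted to grandfathered callers
    )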
> It encourages poor API contracts because it lets anyone import any
> code in any service arbitrarily.
Not true at Google, and I would argue that if you have a repository that allows arbitrary cross-module dependencies then it's not really a monorepo. It's just an extremely large single-project repo with poor structure. The defining feature of a monorepo is that it contains multiple unrelated projects. At Google, this principle was so important that Blaze/Bazel has built-in support for controlling cross-package dependencies.
> I see at least one PR every week [...] because some shared config
> or code changed in another part of the repo and now the entire repo
> needs to be migrated in lockstep for things to compile.
That really doesn't sound like a monorepo to me. If all the code has to be migrated "in lockstep", then that implies a single PR might change code across different parts of the company. At which point it's not independent projects in a monorepo, it's (merely) a single giant project.

[0] Or allowed -- I last worked there in 2017.
I never worked at Google, but this post sums up everything I had to say about the matter. GP has a sh-tty monorepo experience at one company and decides to make a statement about another company where they never worked (so I presume). HN absurdism at its best!
I second your point about monorepo versus ball of mud. They are so different. And managing all of this is about social/culture, less science-y. If you don't have good culture around maintenance, well then, yeah, duh, it will fall apart pretty quickly. It sounds like Google spends crazy money to develop tools to enforce the culture. Hats off.
There's always been a very strong one-version policy; multiple versions are usually only allowed to coexist for weeks or months, and are usually visibility-restricted.
This prevents situations where "Gmail" ends up bundling 4 different, mildly incompatible versions of MySQL or whatever, and the aggravation that would cause. Or worse, in C++ you get ODR violations due to a function being used from two versions of the same library.
I think the catch is that it isn't just third-party dependencies that are of concern. In particular, at a certain size, you are best off treating every project in the company as a third-party item. But, that is typically not what you want with source dependencies.
You can see this some with how obnoxious Guava was, back in the day. It seems a sane strategy where you can deprecate things quickly by getting all callers to migrate. This is fantastic for the cases where it works. But, it is mind numbingly frustrating in the cases where it doesn't. Worse, it is the kind of work that burns out employees and causes them to not care about the product you are trying to make. "What did you do last month?" "I managed to roll out an upgrade that had no bearing on what we do."
> There's no requirement to have single versions of dependencies in a monorepo. Google allows[0] multiple versions of third-party dependencies such as jQuery or MySQL, and internal code is expected to specify which version it depends on.
Sure, but this is unsustainable. If service Foo depends on myjslib v3.0.0, but service Bar needs to pull in myjslib v3.1.0, in order to make sure Foo is entirely unchanged, you'd have to add a new dependency @myjslib_v3_1_0 used only by Bar. After two years you'd have 10 unique dependencies for 10 versions of myjslib in the monorepo.
At this point you've basically replicated the dependency semantics of a multi-repo world to a monorepo, with extra cruft. This problem is already implicitly solved in a multi-repo world because each service simply declares its own dependencies.
After more than a decade of having tiny repos, I strongly believe that monorepos are the right way to go.
When you're pinning on old versions of software it quickly turns into a depsolving mess.
Software developers have difficulty figuring out which version of code is actually being deployed and used.
Dealing with major version bumps and semver pins across different repositories creates a massive amount of make-work and configuration churn, and it creates entire FTE roles practically dedicated to that job (or else it grinds away at the time available for devs to do actual work and not just bump pins and deal with depsolving).
In any successful team which is using many dozens of repos, there's probably one dev running around like fucking nuts making sure everything is up to date and in sync who is keeping the whole thing going. If they leave because they're not getting career advancement, then the pain is going to get surfaced.
The ability to pin also creates and encourages tech debt and stale library code with security vulnerabilities. All that pinning flexibility is engineering that makes tech debt really easy to start generating and pushes all that maintenance into the future.
How would multi-repo change this? A dependency updated, and code broke, and the new version was broken—but you update dependencies in multi-repo anyway, and deployments can be broken anyway. I don’t see how multi-repo mitigates this.
> It encourages poor API contracts because it lets anyone import any code in any service arbitrarily.
This has nothing at all to do with monorepos. Google’s own software is built with a tool called Bazel, and Meta has something similar called Buck. These tools let you build the same kind of fine-grained boundaries that you would expect from packaged libraries. In fact, I’d say that the boundaries and API contracts are better when you use tools like Bazel or Buck—instead of just being stuck with something like a private/public distinction, you basically have the freedom to define ACLs on your packages. This is often way too much power for common use cases but it is nice to have it around when you need it, and it’s very easy to work with.
A common way to use this—suppose you have a service. The service code is private, you can’t depend on it. The client library is public, you can import it. The client library may have some internal code which has an ACL so it can only be imported from the client library front-end.
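A rough BUILD-file sketch of that layout, with every package and target name made up (cc_library just stands in for whatever language the service uses):

    # myservice/server/BUILD -- service internals; nothing outside may depend on them
    cc_library(
        name = "server_lib",
        srcs = ["server.cc"],
        visibility = ["//visibility:private"],
    )

    # myservice/client/BUILD -- the supported way in
    cc_library(
        name = "client",
        srcs = ["client.cc"],
        hdrs = ["client.h"],
        deps = ["//myservice/client/internal:wire_format"],
        visibility = ["//visibility:public"],  # anyone may import the client
    )

    # myservice/client/internal/BUILD -- shared guts of the client
    cc_library(
        name = "wire_format",
        srcs = ["wire_format.cc"],
        hdrs = ["wire_format.h"],
        # ACL: only the client front-end package may depend on this.
        visibility = ["//myservice/client:__pkg__"],
    )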
Here’s how we updated services—first add new functionality to the service. Then make the corresponding changes to the client. Finally, push any changes downstream. The service may have to work with multiple versions of the client library at any time, so you have to test with old client libraries. But we also have a “build horizon”—binaries older than some threshold, like 90 days or 180 days or something, are not permitted in production. Because of the build horizon, we know that we only have to support versions of the client library made within the last 90 or 180 days or whatever.
This is for services with “thick clients”—you could cut out the client library and just make RPCs directly, if that was appropriate for your service.
> It encourages a ton of code churn with very low signal.
The places I worked at that had monorepos, you might filter out the automated code changes there to do automated migrations to new APIs. One PR per week sounds pretty manageable, when spread across a team.
Then again, I’ve also worked at places where I had a high meeting load, and barely enough time to get my work done, so maybe one PR per week is burdensome if you are scheduled to death in meetings.
> How would multi-repo change this? A dependency updated, and code broke, and the new version was broken—but you update dependencies in multi-repo anyway, and deployments can be broken anyway. I don’t see how multi-repo mitigates this.
In a multi-repo world, I control the repo for my own service. For a business-critical service in maintenance mode (with no active feature development), there's no reason for me to upgrade the dependencies. Code changes are the #1 cause of incidents; why fix something that isn't broken?
We would have avoided this problem had we not migrated to the monorepo simply because, well, we would have never pulled in the dependency upgrade in the first place.
> In fact, I’d say that the boundaries and API contracts are better when you use tools like Bazel or Buck
I'm familiar with both of these tools, and I agree with this point. However, you are making an implicit assumption that 1. the monorepo in question is built with a tool like Bazel that can enforce code visibility, and 2. that there exists a team or group of volunteers to maintain such a build system across the entire repo. I suspect both of these are not true for the vast majority of codebases outside of FAANG.
> The places I worked at that had monorepos, you might filter out the automated code changes there to do automated migrations to new APIs
Sure, this solves a logistical problem, but not the underlying technical problem of low-signal PRs. I would argue that doing this is an antipattern because it desensitizes service owners from reviewing PRs.
You’re describing bad habits as if they’re a foregone conclusion. Repository-level separation between code makes certain bad habits impossible, so a sloppy team will be more effective with many repos because they physically can’t perform an entire class of fuck-ups. But there are lots of organisations where these fuck-ups… just don’t happen, and so co-locating code in a monorepo isn’t a concern.
If your organisation can’t work effectively within a monorepo then you should absolutely address the problem, either by fixing the problematic behaviour or by switching away from a monorepo. The problem isn’t monorepos, the problem is monorepos in your organisation.
While the 2nd and 3rd points are not really unique to monorepos, the first point is actually valid. This is why a monorepo usually should be packaged with a bunch of other development practices, especially comprehensive tests combined with presubmit hooks.
IMO, it's more of a development paradigm than a mere technology. You cannot simply use a monorepo in isolation, since its trade-offs are strongly coupled with many other tools and workflows. For this reason, I usually don't recommend migrating toward a monorepo unless there's strong organizational-level support.
> 1. The single version dependencies are asinine. We are migrating to a monorepo at work, and someone bumped the version of an open source JS package that introduced a regression
Is it the convention for monorepos to all share the same dependencies? Does a monorepo imply a monolith? Surely one could have dependencies per "service", for example a Python app with its own Pipfile per directory.
> It encourages poor API contracts because it lets anyone import any code in any service arbitrarily.
Perhaps that might be the default case, but the build system has a visibility system[1] that means that you can carefully control who depends on what parts of your code.
Separately, while some might build against your code directly, a lot of code just gets built into services, and then folk write their code against your published API, i.e. your protobuf specification.

[1]: https://bazel.build/concepts/visibility
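For anyone who hasn't used it, a small hypothetical example of what that visibility control looks like in a BUILD file (the package names are made up, and py_library is an arbitrary rule choice):

    # payments/BUILD (hypothetical)
    package(default_visibility = ["//visibility:private"])  # nothing is exposed by accident

    py_library(
        name = "api",
        srcs = ["api.py"],
        # Only these parts of the repo may depend on the published surface.
        visibility = [
            "//billing:__subpackages__",  # everything under //billing/...
            "//frontend/app:__pkg__",     # exactly one other package
        ],
    )

    py_library(
        name = "ledger_impl",
        srcs = ["ledger_impl.py"],
        # No visibility attribute: falls back to the private package default,
        # so a stray dep from another service fails at build time.
    )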
I agree with every single point you made. Unfortunately, it's one of those discussions that is never going to be resolved because like so much else, it's difficult to find common ground when there are competing priorities.
My point is that in reality, we use what best matches our knowledge, experience and perception and prioritisation of the problems. I, for one, believe that a monorepo is dangerous for small teams because it encourages coupling - not only do I believe it, but I saw it with my own eyes. It also creates unnecessary dependency chains. Monorepos contribute to the fallacy that every dependent of a piece of code must be immediately updated or tech debt happens. But that's not even remotely a given.
In any case, companies like Google and Amazon have more than enough resources to deal systematically with the problems of a monorepo. I'm sure they have entire teams whose job it is to fix problems in the VCS. But for small teams I remain unconvinced that it is a good idea. We shouldn't even be trying to do the things the big guys do, unless we want to spend all our time working on the tools instead of our businesses.
Personally, I am looking forward to switching to a monorepo, as it makes things a lot easier. Testing is a lot easier when you don’t need to deal with 70 repositories to test something. It’s also easier to ensure dependencies such as API libraries are up to date in each service. You get quicker feedback on whether code changes break things. Right now I have to wait at least 24 hours to find out whether a PR I merged breaks things.
I've been saying this for half a decade. The solution to having to constantly update dependency version numbers is to ensure that dependencies are more generic than the logic which uses them. If a module is generic and can handle a lot of use cases in a flexible way, then you won't need to update it too often.
One problem is that a lot of developers at big companies code business logic into their modules/dependencies... So whenever the business domain requirements change, they need to update many dependencies... Sometimes they depend on each other and so it's like a tangled web of dependencies which need to be constantly updated whenever requirements change.
Instead of trying to design modules properly to avoid everything becoming a giant tangled web, they prefer to just facilitate it with a monorepo which makes it easier to create and work with the mess (until the point when nobody can make sense of it anymore)... But for sure, this approach introduces vulnerabilities into the system. I don't know how most of the internet still functions.
> The single version dependencies are asinine. We are migrating to a monorepo at work, and someone bumped the version of an open source JS package that introduced a regression. The next deploy took our service down
You're doing it wrong.
The point of a monorepo is that if someone breaks something, it breaks right away, at build time, not at deployment time.

You're not really using a monorepo.
I find 1) to be a good property, assuming you have some safeguards or a rollback procedure. At a cultural/code-ownership level it moves the effort of shared-code changes onto the person making them rather than onto the ones depending on the shared code, which reduces communication, frustration points and increases responsibility.
For instance, in multi-repo environments I've often seen this pattern: own some code, bump an internal dependency to a new version, see it break, ask the person maintaining it what's up, realize this case wasn't taken into account, then a few back-and-forths before finding an agreement.
On the other hand, in mono-repo environments it's usually more difficult to introduce wide changes as you face all the consequences immediately, but the difficulty is mainly a technical/engineering difficulty rather than a social one, and the outcome is better than the series of compromises made left and right after a big multi-repo change.
Those sound like good arguments for monorepos. Bumping a JS package that is used in several places should break the build; that's how you test it. It sounds like the fallout of the version bump was already caught on the next build, so hopefully it didn't make it into the master-equivalent branch.
Compare that with hundreds of tiny repos, each with their own little dependency system. Testing a version bump across the board before mainlining it is much more involved and you are more likely to hit stuff in production which should have been caught in test.
The other two points sound more like cultural issues which may touch on branch strategies, code review, and what's expected of a developer. Those mostly-cultural issues that overlap with the technical are hard in a way that repository strategy isn't.
> 2. It encourages poor API contracts because it lets anyone import any code in any service arbitrarily. Shared functionality should be exposed as a standalone library with a clear, well-defined interface boundary. There are entire packaging ecosystems like npmjs and pypi for exactly this purpose.
I don't believe this is true, except in the short term. Unless the writing party is guaranteeing you forward compatibility, your consuming code will break when you update.
This is (almost) the only reason API contracts are worth having; the reason doesn't go away just because you can technically see all the code.
1, 2 and 3: Use separate dependencies for each package, so this doesn't happen. Use e.g. GitHub Actions or another CI/CD system's file filtering wisely: if a file is needed by two packages, tests for both packages need to run whenever it's changed, before merging, in addition to the usual end-to-end tests. Have alerting for vulnerable dependencies and make sure to upgrade them everywhere they occur.
2: Also have some guidelines on that and enforce them either automatically or manually in PRs.
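If the packages are already targets in a graph-aware build tool (Bazel-style, as discussed elsewhere in the thread), the "tests for both packages need to run" rule can be derived from dependency edges instead of hand-maintained path filters. A hedged sketch, with hypothetical names:

    # libs/config/BUILD -- the shared piece
    py_library(
        name = "config",
        srcs = ["config.py"],
        visibility = ["//visibility:public"],
    )

    # services/foo/BUILD (services/bar/BUILD declares the same kind of edge)
    py_test(
        name = "foo_test",
        srcs = ["foo_test.py"],
        deps = ["//libs/config"],
    )

    # CI can then ask the graph which tests a change to the shared code affects,
    # e.g. something along the lines of:
    #   bazel query 'kind(test, rdeps(//..., //libs/config))'
    # and run exactly that set before merging.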
1 and 2 could be solved by using proper Gradle multi-module projects and tests. So I would say this is a problem with the tooling of the language you're using. This is one of the reasons why I still can't understand how people operate with inferior ecosystems like Node on the backend, and I also wish Go had these things.
Code monoliths make just about as much sense as runtime monoliths; that is to say, if you are splitting your project into different micro-services, you can split your code base into different repositories too.
Honestly their systems are almost identical. Amazon just creates a monotonically increasing watermark outside the “repo”. Google uses “the repo” to create the monotonically increasing watermark.
Otherwise, Google calls it “merge into g3” Amazon calls it “merge into live”.
Amazon has the extra vocabulary of VersionSets/Packages/Build files. Google has all the same concepts, but just calls them Dependencies/Folders/Build files.
Amazon’s workflows are “git-like”, Google is migrating to “git-like” workflows (but has a lot of unnecessary vocabulary around getting there - Piper/Fig/Workspace/etc).
I really can’t tell if the specific difference between “mono-repo” or “multi-repo” makes much practical difference to the devs working on either system.
There are no presubmits that prevent breaking changes from "going into live". If some shared infra updates are released, the merge from live breaks for multiple individual teams rather than preventing the code from getting submitted in the first place.
“Merging to live” builds and tests all packages that depend on the update.
So for example, building the new JDK to live will build and test all Java packages in previous live, all of them need to pass their package’s tests, only then will the JDK update be “committed into live”.
The only difference is that Google runs all the presubmits / “dry run to live checks” in the CL workflow. Amazon runs them post CL in the “merge VersionSet” workflow.
With an appropriately configured CI pipeline, submitted / pushed code does not go live anyway, unless all tests and other checks pass. Unless a test case is missing, which can happen in a mono repo just as well, the code is always checked for the defect.
One thing I remember from my time at Amazon that didn’t exist at Google is the massive waste of time trying to fix dependency issues.
Every week our pipeline would get stuck and some poor college grad would spend a few days poking around at Brazil trying to get it to build. Usually took 3 commits to find a working pattern. The easy path was always to pin all indirect dependencies you relied on - but that was brittle and it’d inevitably break until another engineer wiped the whole list of pins out and discovered it built. Then the cycle repeats. I worked on very old services that had years of history. I’ve often discovered that packages had listed dependencies that went unused, but no one spent time pruning them, even when they were the broken dependency.
At Google, I have no memory of ever tinkering with dependency issues outside of library visibility changes.
Amazon pipelines and versionsets and all that are impressive engineering feats, but I think a version-set was a solution to a problem of their own creation.
I haven’t worked at google but I think there is one other difference. At amazon teams “merge from live” and have control of their own service’s CD pipeline. They might manually release the merged changes or have full integ test coverage. The Amazon workflow offers more flexibility to teams (whether or not that might be desirable).
Not sure how deployments and CD work at google but I think the picture is different at google for unit tests, integ tests etc. Amazon teams have more control over their own codebase and development practices whereas, based on what I know, google has standardized many parts of their development process.
Monorepos are great... but only if you can invest in the tooling scale to handle them, and most companies can't invest in that like Google can. Hyrum Wright class tooling experts don't grow on trees.
You don't need Google-scale tooling to work with a monorepo until you are actually at Google scale. Gluing together a bunch of separate repos isn't exactly free either. See, for example, the complicated disaster Amazon has with Brazil.
In the limit, there are only two options:
1. All code lives in one repo
2. Every function/class/entity lives in its own repo
with a third state in between
3. You accept code duplication
This compromise state where some code duplication is (maybe implicitly) acceptable is what most people have in mind with a poly-repo.
The problem though is that (3) is not a stable equilibrium.
Most engineers have such a kneejerk reaction against code duplication that (3) is practically untenable. Even if your engineers are more reasonable, (3) style compromise means they constantly have to decide "should this code from package A be duplicated in package B, or split off into a new smaller package C, which A and B depend on". People will never agree on the right answer, which generates discussion and wastes engineering time. In my experience, the trend is almost never to combine repos, but always to generate more and more repos.
The limiting case of a monorepo (which is basically its natural state) is far more palatable than the limiting case of a poly-repo.
I don't understand why this was downvoted. Your list of three states is important to the debate. I never saw it that way. Another, more hostile way to put it: "What is a better or worse alternative and why?" Pretty much everything fits into one of those three states -- with warts.
This mostly seems like a problem for pure library code. If some bit of logic is only needed by a single independently-released service, then there's no reason not to put it in that service's repo.
I completely agree, and I think 2 is partially the forcing function behind a push for “serverless functions” as a unit of computing instead of some larger unit.
> You don't need google scale tooling to work with a mono repo until you are actually at google scale.
I really don't see how that would work for most companies in practice. Most of the off-the-shelf tooling used by companies with hundreds or thousands of developers assumes working with polyrepos. It's good we're seeing simpler alternatives to Bazel, but that's just one piece of the puzzle.
i’ve made this argument before, but you can run a 1k engineering company in a monorepo with the tools and services that exist today. between improvements to bazel (and alternatives) and adjacent tooling like build caching/target diffs, core git scalability, merge queues, and other services you can just plug things together over a few days/as needed and it will just work.
all of the stuff that you can’t do easily yet (vfs for repo, remote builds) just isn’t relevant enough at this scale.
Using Bazel is a nontrivial amount of effort (most of the open-source rules don't really work in a standard way, due to the fact that Google doesn't work in a standard way).
I guess with a 1K engineering company you can afford a substantial build team.
You can get better tools now though, like Turbo Repo or NX. They don’t require the same level of investment as Bazel but they don’t always have the same hermetic build guarantees, though for most it’s “good enough”.
I love monorepos. I feel like they are even more helpful for small teams and smaller scale. The productivity of being able to add libraries by creating a new folder or refactor across services is unbeatable.
Just because Google does something doesn't mean it's a good thing for anyone else to do. This kind of infrastructure is very expensive to maintain, and it suffers from many flaws, like (almost) everyone being stuck using SDKs that are several versions behind the latest production one, even for the internal GCP ones.
> 1) This is solved by 2 interlocking concepts: comprehensive tests & pre-submit checks of those tests. Upgrading a version shouldn’t break anything because any breaking changes should be dealt with in the same change as the version bump.
Does this mean that some things will never get updated, as the effort required is impossibly high?
> Google allows[0] multiple versions of third-party dependencies such as jQuery or MySQL, and internal code is expected to specify which version it depends on.
The third-party documentation is public; the one-version policy exists, but there are exemptions.
1. Have some concept of visibility restriction, e.g. the Go language has internal packages (see the layout sketch after this list).
2. Ensure that every single package has a command to build the code.
3. Ensure that CI builds all the packages that changed or were impacted by the change in a given pull request.
These three steps are mostly sufficient in having a monorepo. What you get in return is high code consistency and code visibility for the whole team.
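For point 1, a hypothetical layout showing the Go convention mentioned above: anything under an internal/ directory can only be imported by code rooted at that directory's parent, so the toolchain itself enforces the boundary.

    repo/
      services/
        billing/
          main.go                 # may import repo/services/billing/internal/ledger
          internal/
            ledger/
              ledger.go           # import path contains /internal/, so only code under
                                  # repo/services/billing/ may import this package
      libs/
        httputil/
          httputil.go             # ordinary package, importable from anywhere in the repo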
2) Private/public/internal modifiers
3) Independent builds/project in a monorepo
> Amazon’s workflows are “git-like”, Google is migrating to “git-like” workflows (but has a lot of unnecessary vocabulary around getting there - Piper/Fig/Workspace/etc).
It’s then “merged into g3” from that workspace.
A good article to reference when this topic gets raised: http://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-yo...
Why Google Stores Billions of Lines of Code in a Single Repository (2016) - https://news.ycombinator.com/item?id=22019827 - Jan 2020 (121 comments)
Why Google Stores Billions of Lines of Code in a Single Repository (2016) - https://news.ycombinator.com/item?id=17605371 - July 2018 (281 comments)
Why Google stores billions of lines of code in a single repository (2016) - https://news.ycombinator.com/item?id=15889148 - Dec 2017 (298 comments)
Why Google Stores Billions of Lines of Code in a Single Repository - https://news.ycombinator.com/item?id=11991479 - June 2016 (218 comments)
> The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google’s entire 18-year existence.

Wait, that's an average of nearly 30 new files per commit. Not 30 files changed per commit, but whatever changes are happening to existing files, plus 30 brand new files. For every single commit.
Although...
> The total number of files also includes source files copied into release branches, files that are deleted at the latest revision, [...]
I'm not quite sure what this is saying.
Is it saying that if `main` contains 1,000 files, and then someone creates a branch called `release`, then the repo now contains 2,000 files? And if someone then deletes 500 files from `main` in the next commit, the repo still contains 2,000 files, not 1,500?
If that's the case, why not just call every different version of every file in the repo a different file? If I have a new repo and in the first commit I create a single 100-line file called `foo.c`, and then I change one line of `foo.c` for the second commit, do I now have a repo with two files?
I mean, if you look at the plumbing for e.g. `git`, yes, the repo is storing two file objects for the repo history. But I don't think I've ever seen someone discuss the Linux git repo and talk about the total number of file objects in the repo object store. And when the linked paper itself mentions Linux, it says "The Linux kernel is a prominent example of a large open source software repository containing approximately 15 million lines of code in 40,000 files" - and in that case it's definitely not talking about the total number of file objects in the store.
I don't think it's entirely clear what the paper even means when it talks about "a file" in a source code repository, or if it even means the same thing consistently. I'm not sure it's using the most obvious interpretation, but I can't understand why it would pick a non-obvious interpretation. Especially if it's not going to explain what it means, let alone explain why it chose one meaning over another.
> The total number of files also includes source files copied into release branches
I guess you haven't used Perforce or similar. a branch is a sparse copy of just the changed files/directories. they are not used very much.
> files that are deleted at the latest revision
so it means "one billion files have existed in the history repo, some are currently deleted".
> I don't think it's entirely clear what the paper even means when it talk about "a file" in a source code repository,
seems pretty clear - a source code repo has lots of files. at the most recent revision, some exist, some were deleted in some past revision. more will be added (and deleted) in later revisions.
it's very much not the same model as git.
hope that clears things up.
It certainly feels that way :-)
> > The total number of files also includes source files copied into release branches
> I guess you haven't used Perforce or similar. a branch is a sparse copy of just the changed files/directories.
Still not sure I see the distinction. Surely "sparse" or "not sparse" is an implementation detail. If I create a new branch in git, the files that are unchanged from its parent branch share the same storage, but the files that have changed use their own storage.
> so it means "one billion files have existed in the history repo, some are currently deleted".
I guess I'm struggling to understand what the point of this metric is? I get why "Total number of commits", "Total storage size of repo in GB/TB/PB", "Number of files in current head/main/trunk", or even "total number of distinct file revisions in repo history", could be useful metrics.
But why "number of files (including ones that have been deleted)"? What can we do with this number?
> hope that clears things up.
It's helping. Thanks.