There's a lot of love for monorepos nowadays, but after more than a decade of writing software, I still strongly believe they are an antipattern.
1. The single version dependencies are asinine. We are migrating to a monorepo at work, and someone bumped the version of an open source JS package that introduced a regression. The next deploy took our service down. Monorepos mean loss of isolation of dependencies between services, which is absolutely necessary for the stability of mission-critical business services.
2. It encourages poor API contracts because it lets anyone import any code in any service arbitrarily. Shared functionality should be exposed as a standalone library with a clear, well-defined interface boundary. There are entire packaging ecosystems like npmjs and pypi for exactly this purpose.
3. It encourages a ton of code churn with very low signal. I see at least one PR every week to code owned by my team that changes some trivial configuration, library call, or build directive, simply because some shared config or code changed in another part of the repo and now the entire repo needs to be migrated in lockstep for things to compile.
I've read this paper, as well as watched the talk on this topic, and am absolutely stunned that these problems are not magnified by 100x at Google scale. Perhaps it's simply organizational inertia that prevents them from trying a more reasonable solution.
1) This is solved by 2 interlocking concepts: comprehensive tests & pre-submit checks of those tests. Upgrading a version shouldn’t break anything because any breaking changes should be dealt with in the same change as the version bump.
2) Google’s monorepo allows for visibility restrictions, and publicly-visible build targets are not common & are reserved for truly public interfaces & packages.
3) “Code churn” is a very uncharitable description of day-to-day maintenance of an active codebase.
Google has invested heavily in infrastructural systems to facilitate the maintenance and execution of tests & code at scale. Monorepos are an organizational design choice which may not work for other teams. It does work at Google.
It’s not really even a true monorepo. Little-known feature: there is a versions map which pins major components like base or cfs. This breaks the monorepo abstraction and makes full-repo changes difficult, but it keeps devs of individual components sane.
>> 3. It encourages a ton of code churn with very low signal.
> 3) “Code churn” is a very uncharitable description of day-to-day maintenance of an active codebase.
Also implicit in the discussion is the fact that Google and other big tech companies base performance reviews on "impact" rather than arbitrary metrics like "number of PRs/LOCs per month". This provides a check on spending too much engineer time on maintenance PRs, since they have no (or very little) impact on your performance rating.
> The single version dependencies are asinine. We are migrating to
> a monorepo at work, and someone bumped the version of an open
> source JS package that introduced a regression.
There's no requirement to have single versions of dependencies in a monorepo. Google allows[0] multiple versions of third-party dependencies such as jQuery or MySQL, and internal code is expected to specify which version it depends on.
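To make the version-pinning point concrete, here is a minimal sketch of what per-version third-party targets can look like in a Bazel-style BUILD file. Everything in it is hypothetical (the package, the versions, the legacy_v3_0_users group), and js_library is assumed to come from whatever JS ruleset the repo uses; the idea is only that each consumer names the version it builds against, and that the old version can be visibility-restricted:

    # third_party/myjslib/BUILD (hypothetical)
    load("@aspect_rules_js//js:defs.bzl", "js_library")  # or whichever JS ruleset is in use

    package_group(
        name = "legacy_v3_0_users",
        # Teams still on the old version are listed explicitly, so the
        # exception stays visible and easy to burn down.
        packages = ["//services/foo/..."],
    )

    alias(
        name = "myjslib",  # the default, one-version target
        actual = ":myjslib_v3_1_0",
        visibility = ["//visibility:public"],
    )

    js_library(
        name = "myjslib_v3_1_0",
        srcs = glob(["v3_1_0/**/*.js"]),
        visibility = ["//visibility:public"],
    )

    js_library(
        name = "myjslib_v3_0_0",
        srcs = glob(["v3_0_0/**/*.js"]),
        visibility = [":legacy_v3_0_users"],  # restricted to grandfathered callers
    )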
> It encourages poor API contracts because it lets anyone import any
> code in any service arbitrarily.
Not true at Google, and I would argue that if you have a repository that allows arbitrary cross-module dependencies then it's not really a monorepo. It's just an extremely large single-project repo with poor structure. The defining feature of a monorepo is that it contains multiple unrelated projects. At Google, this principle was so important that Blaze/Bazel has built-in support for controlling cross-package dependencies.
> I see at least one PR every week [...] because some shared config
> or code changed in another part of the repo and now the entire repo
> needs to be migrated in lockstep for things to compile.
That really doesn't sound like a monorepo to me. If all the code has to be migrated "in lockstep", then that implies a single PR might change code across different parts of the company. At which point it's not independent projects in a monorepo, it's (merely) a single giant project.

[0] Or allowed -- I last worked there in 2017.
I never worked at Google, but this post sums up everything I had to say about the matter. GP has a sh-tty monorepo experience at one company and decides to make a statement about another company where they never worked (so I presume). HN absurdism at its best!
I second your point about monorepo versus ball of mud. They are so different. And managing all of this is about social/culture, less science-y. If you don't have good culture around maintenance, well then, yeah, duh, it will fall apart pretty quickly. It sounds like Google spends crazy money to develop tools to enforce the culture. Hats off.
There's always been a very strong one-version policy; multiple versions are usually only allowed to coexist for weeks or months, and are usually visibility-restricted.
This prevents situations where "Gmail" ends up bundling 4 different, mildly incompatible versions of MySQL or whatever, and the aggravation that would cause. Or worse, in C++ you get ODR violations due to a function being used from two versions of the same library.
I think the catch is that it isn't just third-party dependencies that are of concern. In particular, at a certain size, you are best off treating every project in the company as a third-party item. But, that is typically not what you want with source dependencies.
You can see this some with how obnoxious Guava was, back in the day. It seems a sane strategy where you can deprecate things quickly by getting all callers to migrate. This is fantastic for the cases where it works. But, it is mind numbingly frustrating in the cases where it doesn't. Worse, it is the kind of work that burns out employees and causes them to not care about the product you are trying to make. "What did you do last month?" "I managed to roll out an upgrade that had no bearing on what we do."
> There's no requirement to have single versions of dependencies in a monorepo. Google allows[0] multiple versions of third-party dependencies such as jQuery or MySQL, and internal code is expected to specify which version it depends on.
Sure, but this is unsustainable. If service Foo depends on myjslib v3.0.0, but service Bar needs to pull in myjslib v3.1.0, in order to make sure Foo is entirely unchanged, you'd have to add a new dependency @myjslib_v3_1_0 used only by Bar. After two years you'd have 10 unique dependencies for 10 versions of myjslib in the monorepo.
At this point you've basically replicated the dependency semantics of a multi-repo world to a monorepo, with extra cruft. This problem is already implicitly solved in a multi-repo world because each service simply declares its own dependencies.
After more than a decade of having tiny repos, I strongly believe that monorepos are the right way to go.
When you're pinning on old versions of software it quickly turns into a depsolving mess.
Software developers have difficulty figuring out which version of code is actually being deployed and used.
Dealing with major version bumps and semver pins across different repositories creates a massive amount of make-work and configuration churn, and it creates entire FTE roles practically dedicated to that job (or else it grinds away at the time available for devs to do actual work and not just bump pins and deal with depsolving).
In any successful team which is using many dozens of repos, there's probably one dev running around like fucking nuts making sure everything is up to date and in sync who is keeping the whole thing going. If they leave because they're not getting career advancement, then the pain is going to get surfaced.
The ability to pin also creates and encourages tech debt and stale library code with security vulnerabilities. All that pinning flexibility is engineering that makes tech debt really easy to start generating and pushes all that maintenance into the future.
How would multi-repo change this? A dependency updated, and code broke, and the new version was broken—but you update dependencies in multi-repo anyway, and deployments can be broken anyway. I don’t see how multi-repo mitigates this.
> It encourages poor API contracts because it lets anyone import any code in any service arbitrarily.
This has nothing at all to do with monorepos. Google’s own software is built with a tool called Bazel, and Meta has something similar called Buck. These tools let you build the same kind of fine-grained boundaries that you would expect from packaged libraries. In fact, I’d say that the boundaries and API contracts are better when you use tools like Bazel or Buck—instead of just being stuck with something like a private/public distinction, you basically have the freedom to define ACLs on your packages. This is often way too much power for common use cases but it is nice to have it around when you need it, and it’s very easy to work with.
A common way to use this—suppose you have a service. The service code is private, you can’t depend on it. The client library is public, you can import it. The client library may have some internal code which has an ACL so it can only be imported from the client library front-end.
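A rough BUILD-file sketch of that layout, with every package and target name made up (cc_library just stands in for whatever language the service uses):

    # myservice/server/BUILD -- service internals; nothing outside may depend on them
    cc_library(
        name = "server_lib",
        srcs = ["server.cc"],
        visibility = ["//visibility:private"],
    )

    # myservice/client/BUILD -- the supported way in
    cc_library(
        name = "client",
        srcs = ["client.cc"],
        hdrs = ["client.h"],
        deps = ["//myservice/client/internal:wire_format"],
        visibility = ["//visibility:public"],  # anyone may import the client
    )

    # myservice/client/internal/BUILD -- shared guts of the client
    cc_library(
        name = "wire_format",
        srcs = ["wire_format.cc"],
        hdrs = ["wire_format.h"],
        # ACL: only the client front-end package may depend on this.
        visibility = ["//myservice/client:__pkg__"],
    )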
Here’s how we updated services—first add new functionality to the service. Then make the corresponding changes to the client. Finally, push any changes downstream. The service may have to work with multiple versions of the client library at any time, so you have to test with old client libraries. But we also have a “build horizon”—binaries older than some threshold, like 90 days or 180 days or something, are not permitted in production. Because of the build horizon, we know that we only have to support versions of the client library made within the last 90 or 180 days or whatever.
This is for services with “thick clients”—you could cut out the client library and just make RPCs directly, if that was appropriate for your service.
> It encourages a ton of code churn with very low signal.
The places I worked at that had monorepos, you might filter out the automated code changes there to do automated migrations to new APIs. One PR per week sounds pretty manageable, when spread across a team.
Then again, I’ve also worked at places where I had a high meeting load, and barely enough time to get my work done, so maybe one PR per week is burdensome if you are scheduled to death in meetings.
> How would multi-repo change this? A dependency updated, and code broke, and the new version was broken—but you update dependencies in multi-repo anyway, and deployments can be broken anyway. I don’t see how multi-repo mitigates this.
In a multi-repo world, I control the repo for my own service. For a business-critical service in maintenance mode (with no active feature development), there's no reason for me to upgrade the dependencies. Code changes are the #1 cause of incidents; why fix something that isn't broken?
We would have avoided this problem had we not migrated to the monorepo simply because, well, we would have never pulled in the dependency upgrade in the first place.
> In fact, I’d say that the boundaries and API contracts are better when you use tools like Bazel or Buck
I'm familiar with both of these tools, and I agree with this point. However, you are making an implicit assumption that 1. the monorepo in question is built with a tool like Bazel that can enforce code visibility, and 2. that there exists a team or group of volunteers to maintain such a build system across the entire repo. I suspect both of these are not true for the vast majority of codebases outside of FAANG.
> The places I worked at that had monorepos, you might filter out the automated code changes there to do automated migrations to new APIs
Sure, this solves a logistical problem, but not the underlying technical problem of low-signal PRs. I would argue that doing this is an antipattern because it desensitizes service owners from reviewing PRs.
You’re describing bad habits as if they’re a foregone conclusion. Repository-level separation between code makes certain bad habits impossible, so a sloppy team will be more effective with many repos because they physically can’t perform an entire class of fuck-ups. But there are lots of organisations where these fuck-ups… just don’t happen, and so co-locating code in a monorepo isn’t a concern.
If your organisation can’t work effectively within a monorepo then you should absolutely address the problem, either by fixing the problematic behaviour or by switching away from a monorepo. The problem isn’t monorepos, the problem is monorepos in your organisation.
While the 2nd and 3rd points are not really unique to monorepos, the first point is actually valid. This is why a monorepo usually should be packaged with a bunch of other development practices, especially comprehensive tests combined with presubmit hooks.
IMO, it's more of a development paradigm than a mere technology. You cannot simply use a monorepo in isolation, since its trade-offs are strongly coupled with many other tools and workflows. For this reason, I usually don't recommend migrating toward a monorepo unless there's strong organizational-level support.
> 1. The single version dependencies are asinine. We are migrating to a monorepo at work, and someone bumped the version of an open source JS package that introduced a regression
Is it the convention for monorepos to all share the same dependencies? Does a monorepo imply a monolith? Surely one could have dependencies per "service", for example a Python app with its own Pipfile per directory.
> It encourages poor API contracts because it lets anyone import any code in any service arbitrarily.
Perhaps that might be the default case, but the build system has a visibility system[1] that means that you can carefully control who depends on what parts of your code.
Separately, while some might build against your code directly, a lot of code just gets built into services, and then folk write their code against your published API, i.e. your protobuf specification.

[1]: https://bazel.build/concepts/visibility
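For anyone who hasn't used it, a small hypothetical example of what that visibility control looks like in a BUILD file (the package names are made up, and py_library is an arbitrary rule choice):

    # payments/BUILD (hypothetical)
    package(default_visibility = ["//visibility:private"])  # nothing is exposed by accident

    py_library(
        name = "api",
        srcs = ["api.py"],
        # Only these parts of the repo may depend on the published surface.
        visibility = [
            "//billing:__subpackages__",  # everything under //billing/...
            "//frontend/app:__pkg__",     # exactly one other package
        ],
    )

    py_library(
        name = "ledger_impl",
        srcs = ["ledger_impl.py"],
        # No visibility attribute: falls back to the private package default,
        # so a stray dep from another service fails at build time.
    )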
I agree with every single point you made. Unfortunately, it's one of those discussions that is never going to be resolved because like so much else, it's difficult to find common ground when there are competing priorities.
My point is that in reality, we use what best matches our knowledge, experience and perception and prioritisation of the problems. I, for one, believe that a monorepo is dangerous for small teams because it encourages coupling - not only do I believe it, but I saw it with my own eyes. It also creates unnecessary dependency chains. Monorepos contribute to the fallacy that every dependent of a piece of code must be immediately updated or tech debt happens. But that's not even remotely a given.
In any case, companies like Google and Amazon have more than enough resources to deal systematically with the problems of a monorepo. I'm sure they have entire teams whose job it is to fix problems in the VCS. But for small teams I remain unconvinced that it is a good idea. We shouldn't even be trying to do the things the big guys do, unless we want to spend all our time working on the tools instead of our businesses.
Personally, I am looking forward to switching to a monorepo, as it makes things a lot easier. Testing is a lot easier when you don’t need to deal with 70 repositories to test something. It’s also easier to ensure dependencies such as API libraries are up to date in each service. You get quicker feedback on whether code changes break things. Right now I have to wait at least 24 hours to find out whether a PR I merged breaks things.
I've been saying this for half a decade. The solution to having to constantly update dependency version numbers is to ensure that dependencies are more generic than the logic which uses them. If a module is generic and can handle a lot of use cases in a flexible way, then you won't need to update it too often.
One problem is that a lot of developers at big companies code business logic into their modules/dependencies... So whenever the business domain requirements change, they need to update many dependencies... Sometimes they depend on each other and so it's like a tangled web of dependencies which need to be constantly updated whenever requirements change.
Instead of trying to design modules properly to avoid everything becoming a giant tangled web, they prefer to just facilitate it with a monorepo which makes it easier to create and work with the mess (until the point when nobody can make sense of it anymore)... But for sure, this approach introduces vulnerabilities into the system. I don't know how most of the internet still functions.
> The single version dependencies are asinine. We are migrating to a monorepo at work, and someone bumped the version of an open source JS package that introduced a regression. The next deploy took our service down
You're doing it wrong.
The point of a monorepo is that if someone breaks something, it breaks right away, at build time, not at deployment time.

You're not really using a monorepo.
I find 1) to be a good property, assuming you have some safeguards or a rollback procedure. At a cultural/code-ownership level it moves the effort of shared-code changes onto the person making them rather than onto the ones depending on the shared code, which reduces communication, frustration points and increases responsibility.
For instance, in multi-repo environments I've often seen this pattern: own some code, bump an internal dependency to a new version, see it break, ask the person maintaining it what's up, realize this case wasn't taken into account, then a few back-and-forths before finding an agreement.
On the other hand, in mono-repo environments it's usually more difficult to introduce wide changes as you face all the consequences immediately, but the difficulty is mainly a technical/engineering difficulty rather than a social one, and the outcome is better than the series of compromises made left and right after a big multi-repo change.
Those sound like good arguments for monorepos. Bumping a JS package that is used in several places should break the build; that's how you test it. It sounds like the fallout of the version bump was already caught on the next build, so hopefully it didn't make it into the master-equivalent branch.
Compare that with hundreds of tiny repos, each with their own little dependency system. Testing a version bump across the board before mainlining it is much more involved and you are more likely to hit stuff in production which should have been caught in test.
The other two points sound more like cultural issues which may touch on branch strategies, code review, and what's expected of a developer. Those mostly-cultural issues that overlap with the technical are hard in a way that repository strategy isn't.
> 2. It encourages poor API contracts because it lets anyone import any code in any service arbitrarily. Shared functionality should be exposed as a standalone library with a clear, well-defined interface boundary. There are entire packaging ecosystems like npmjs and pypi for exactly this purpose.
I don't believe this is true, except in the short term. Unless the writing party is guaranteeing you forward compatibility, your consuming code will break when you update.
This is (almost) the only reason API contracts are worth having; the reason doesn't go away just because you can technically see all the code.
1, 2 and 3: Use separate dependencies for each package, so this doesn't happen. Use e.g. GitHub Actions or another CI/CD system's file filtering wisely: if a file is needed by two packages, tests for both packages need to run whenever it's changed, before merging, in addition to the usual end-to-end tests. Have alerting for vulnerable dependencies and make sure to upgrade them everywhere they occur.
2: Also have some guidelines on that and enforce them either automatically or manually in PRs.
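If the packages are already targets in a graph-aware build tool (Bazel-style, as discussed elsewhere in the thread), the "tests for both packages need to run" rule can be derived from dependency edges instead of hand-maintained path filters. A hedged sketch, with hypothetical names:

    # libs/config/BUILD -- the shared piece
    py_library(
        name = "config",
        srcs = ["config.py"],
        visibility = ["//visibility:public"],
    )

    # services/foo/BUILD (services/bar/BUILD declares the same kind of edge)
    py_test(
        name = "foo_test",
        srcs = ["foo_test.py"],
        deps = ["//libs/config"],
    )

    # CI can then ask the graph which tests a change to the shared code affects,
    # e.g. something along the lines of:
    #   bazel query 'kind(test, rdeps(//..., //libs/config))'
    # and run exactly that set before merging.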
1 and 2 could be solved by using proper Gradle multi-module projects and tests. So I would say this is a problem with the tooling of the language you're using. This is one of the reasons why I still can't understand how people operate with inferior ecosystems like Node on the backend, and I also wish Go had these things.
Code monoliths make just about as much sense as runtime monoliths; that is to say, if you are splitting your project into different micro-services, you can split your code base into different repositories too.
Honestly their systems are almost identical. Amazon just creates a monotonically increasing watermark outside the “repo”. Google uses “the repo” to create the monotonically increasing watermark.
Otherwise, Google calls it “merge into g3” Amazon calls it “merge into live”.
Amazon has the extra vocabulary of VersionSets/Packages/Build files. Google has all the same concepts, but just calls them Dependencies/Folders/Build files.
Amazon’s workflows are “git-like”, Google is migrating to “git-like” workflows (but has a lot of unnecessary vocabulary around getting there - Piper/Fig/Workspace/etc).
I really can’t tell if the specific difference between “mono-repo” or “multi-repo” makes much practical difference to the devs working on either system.
There are no presubmits that prevent breaking changes from "going into live". If some shared infra updates are released, the merge from live breaks for multiple individual teams rather than preventing the code from getting submitted in the first place.
“Merging to live” builds and tests all packages that depend on the update.
So for example, building the new JDK to live will build and test all Java packages in previous live, all of them need to pass their package’s tests, only then will the JDK update be “committed into live”.
The only difference is that Google runs all the presubmits / “dry run to live checks” in the CL workflow. Amazon runs them post CL in the “merge VersionSet” workflow.
With an appropriately configured CI pipeline, submitted / pushed code does not go live anyway, unless all tests and other checks pass. Unless a test case is missing, which can happen in a mono repo just as well, the code is always checked for the defect.
One thing I remember from my time at Amazon that didn’t exist at Google is the massive waste of time trying to fix dependency issues.
Every week our pipeline would get stuck and some poor college grad would spend a few days poking around at Brazil trying to get it to build. Usually took 3 commits to find a working pattern. The easy path was always to pin all indirect dependencies you relied on - but that was brittle and it’d inevitably break until another engineer wiped the whole list of pins out and discovered it built. Then the cycle repeats. I worked on very old services that had years of history. I’ve often discovered that packages had listed dependencies that went unused, but no one spent time pruning them, even when they were the broken dependency.
At Google, I have no memory of ever tinkering with dependency issues outside of library visibility changes.
Amazon pipelines and versionsets and all that are impressive engineering feats, but I think a version-set was a solution to a problem of their own creation.
I haven’t worked at google but I think there is one other difference. At amazon teams “merge from live” and have control of their own service’s CD pipeline. They might manually release the merged changes or have full integ test coverage. The Amazon workflow offers more flexibility to teams (whether or not that might be desirable).
Not sure how deployments and CD work at google but I think the picture is different at google for unit tests, integ tests etc. Amazon teams have more control over their own codebase and development practices whereas, based on what I know, google has standardized many parts of their development process.
Monorepos are great... but only if you can invest in the tooling scale to handle them, and most companies can't invest in that like Google can. Hyrum Wright class tooling experts don't grow on trees.
You don't need Google-scale tooling to work with a monorepo until you are actually at Google scale. Gluing together a bunch of separate repos isn't exactly free either. See, for example, the complicated disaster Amazon has with Brazil.
In the limit, there are only two options:
1. All code lives in one repo
2. Every function/class/entity lives in its own repo
with a third state in between
3. You accept code duplication
This compromise state where some code duplication is (maybe implicitly) acceptable is what most people have in mind with a poly-repo.
The problem though is that (3) is not a stable equilibrium.
Most engineers have such a kneejerk reaction against code duplication that (3) is practically untenable. Even if your engineers are more reasonable, (3) style compromise means they constantly have to decide "should this code from package A be duplicated in package B, or split off into a new smaller package C, which A and B depend on". People will never agree on the right answer, which generates discussion and wastes engineering time. In my experience, the trend is almost never to combine repos, but always to generate more and more repos.
The limiting case of a monorepo (which is basically its natural state) is far more palatable than the limiting case of a poly-repo.
I don't understand why this was downvoted. Your list of three states is important to the debate. I never saw it that way. Another, more hostile way to put it: "What is a better or worse alternative and why?" Pretty much everything fits into one of those three states -- with warts.
This mostly seems like a problem for pure library code. If some bit of logic is only needed by a single independently-released service, then there's no reason not to put it in that service's repo.
I completely agree, and I think 2 is partially the forcing function behind a push for “serverless functions” as a unit of computing instead of some larger unit.
> You don't need google scale tooling to work with a mono repo until you are actually at google scale.
I really don't see how that would work for most companies in practice. Most of the off-the-shelf tooling used by companies with hundreds or thousands of developers assumes working with polyrepos. It's good we're seeing simpler alternatives to Bazel, but that's just one piece of the puzzle.
i’ve made this argument before, but you can run a 1k engineering company in a monorepo with the tools and services that exist today. between improvements to bazel (and alternatives) and adjacent tooling like build caching/target diffs, core git scalability, merge queues, and other services you can just plug things together over a few days/as needed and it will just work.
all of the stuff that you can’t do easily yet (vfs for repo, remote builds) just isn’t relevant enough at this scale.
Using Bazel is a nontrivial amount of effort (most of the open-source rules don't really work in a standard way, due to the fact that Google doesn't work in a standard way).
I guess with a 1K engineering company you can afford a substantial build team.
You can get better tools now though, like Turbo Repo or NX. They don’t require the same level of investment as Bazel but they don’t always have the same hermetic build guarantees, though for most it’s “good enough”.
I love monorepos. I feel like they are even more helpful for small teams and smaller scale. The productivity of being able to add libraries by creating a new folder or refactor across services is unbeatable.
Just because Google does something doesn't mean it's a good thing for anyone else to do. This kind of infrastructure is very expensive to maintain, and it suffers from many flaws, like (almost) everyone being stuck using SDKs that are several versions behind the latest production one, even for the internal GCP ones.
> 1) This is solved by 2 interlocking concepts: comprehensive tests & pre-submit checks of those tests. Upgrading a version shouldn’t break anything because any breaking changes should be dealt with in the same change as the version bump.
Does this mean that some things will never get updated, as the effort required is impossibly high?
> Google allows[0] multiple versions of third-party dependencies such as jQuery or MySQL, and internal code is expected to specify which version it depends on.
The third-party documentation is public; the one-version policy exists, but there are exemptions.
1. Have some concept of visibility restriction, e.g. the Go language has internal packages (see the layout sketch after this list).
2. Ensure that every single package has a command to build the code.
3. Ensure that CI builds all the packages that changed or were impacted by the change in a given pull request.
These three steps are mostly sufficient in having a monorepo. What you get in return is high code consistency and code visibility for the whole team.
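For point 1, a hypothetical layout showing the Go convention mentioned above: anything under an internal/ directory can only be imported by code rooted at that directory's parent, so the toolchain itself enforces the boundary.

    repo/
      services/
        billing/
          main.go                 # may import repo/services/billing/internal/ledger
          internal/
            ledger/
              ledger.go           # import path contains /internal/, so only code under
                                  # repo/services/billing/ may import this package
      libs/
        httputil/
          httputil.go             # ordinary package, importable from anywhere in the repo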
2) Private/public/internal modifiers
3) Independent builds/project in a monorepo
> Amazon’s workflows are “git-like”, Google is migrating to “git-like” workflows (but has a lot of unnecessary vocabulary around getting there - Piper/Fig/Workspace/etc).
It’s then “merged into g3” from that workspace.
A good article to reference when this topic gets raised: http://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-yo...
Why Google Stores Billions of Lines of Code in a Single Repository (2016) - https://news.ycombinator.com/item?id=22019827 - Jan 2020 (121 comments)
Why Google Stores Billions of Lines of Code in a Single Repository (2016) - https://news.ycombinator.com/item?id=17605371 - July 2018 (281 comments)
Why Google stores billions of lines of code in a single repository (2016) - https://news.ycombinator.com/item?id=15889148 - Dec 2017 (298 comments)
Why Google Stores Billions of Lines of Code in a Single Repository - https://news.ycombinator.com/item?id=11991479 - June 2016 (218 comments)
> The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google’s entire 18-year existence.

Wait, that's an average of nearly 30 new files per commit. Not 30 files changed per commit, but whatever changes are happening to existing files, plus 30 brand new files. For every single commit.
Although...
> The total number of files also includes source files copied into release branches, files that are deleted at the latest revision, [...]
I'm not quite sure what this is saying.
Is it saying that if `main` contains 1,000 files, and then someone creates a branch called `release`, then the repo now contains 2,000 files? And if someone then deletes 500 files from `main` in the next commit, the repo still contains 2,000 files, not 1,500?
If that's the case, why not just call every different version of every file in the repo a different file? If I have a new repo and in the first commit I create a single 100-line file called `foo.c`, and then I change one line of `foo.c` for the second commit, do I now have a repo with two files?
I mean, if you look at the plumbing for e.g. `git`, yes, the repo is storing two file objects for the repo history. But I don't think I've ever seen someone discuss the Linux git repo and talk about the total number of file objects in the repo object store. And when the linked paper itself mentions Linux, it says "The Linux kernel is a prominent example of a large open source software repository containing approximately 15 million lines of code in 40,000 files" - and in that case it's definitely not talking about the total number of file objects in the store.
I don't think it's entirely clear what the paper even means when it talks about "a file" in a source code repository, or if it even means the same thing consistently. I'm not sure it's using the most obvious interpretation, but I can't understand why it would pick a non-obvious interpretation. Especially if it's not going to explain what it means, let alone explain why it chose one meaning over another.
> The total number of files also includes source files copied into release branches
I guess you haven't used Perforce or similar. a branch is a sparse copy of just the changed files/directories. they are not used very much.
> files that are deleted at the latest revision
so it means "one billion files have existed in the history repo, some are currently deleted".
> I don't think it's entirely clear what the paper even means when it talk about "a file" in a source code repository,
seems pretty clear - a source code repo has lots of files. at the most recent revision, some exist, some were deleted in some past revision. more will be added (and deleted) in later revisions.
it's very much not the same model as git.
hope that clears things up.
It certainly feels that way :-)
> > The total number of files also includes source files copied into release branches
> I guess you haven't used Perforce or similar. a branch is a sparse copy of just the changed files/directories.
Still not sure I see the distinction. Surely "sparse" or "not sparse" is an implementation detail. If I create a new branch in git, the files that are unchanged from its parent branch share the same storage, but the files that have changed use their own storage.
> so it means "one billion files have existed in the history repo, some are currently deleted".
I guess I'm struggling to understand what the point of this metric is? I get why "Total number of commits", "Total storage size of repo in GB/TB/PB", "Number of files in current head/main/trunk", or even "total number of distinct file revisions in repo history", could be useful metrics.
But why "number of files (including ones that have been deleted)"? What can we do with this number?
> hope that clears things up.
It's helping. Thanks.