We shrunk our Javascript monorepo git size

tux3 · 10 months ago

For those wondering where this new git-survey command is, it's actually not in git.git yet!

The author is using Microsoft's git fork, which added this new command just this summer: https://github.com/microsoft/git/pull/667

masklinn · 10 months ago

I assume full-name-hash and path-walk are also only in the fork (or in git HEAD)? I can't see them in the man pages or in the 2.47 changelog.

tux3 · 10 months ago

Yep. Path-walk is currently pending review here: https://lore.kernel.org/all/pull.1813.git.1728396723.gitgitg...

It more or less replaces the --full-name-hash option (again with a very good cover letter that explains the differences and the pros and cons of each!)

yunusabd · 10 months ago

> For many reasons, that's just too big, we have folks in Europe that can't even clone the repo due to it's size.

What's up with folks in Europe that they can't clone a big repo, but others can? Also it sounds like they still won't be able to clone, until the change is implemented on the server side?

> This meant we were in many occasions just pushing the entire file again and again, which could be 10s of MBs per file in some cases, and you can imagine in a repo

The sentence seems to be cut off.

Also, the gifs are incredibly distracting while trying to read the article, and they are there even in reader mode.

anon-3988 · 10 months ago

> For many reasons, that's just too big, we have folks in Europe that can't even clone the repo due to it's size.

I read that as an anecdote. A more complete sentence would be: "We had a story where someone from Europe couldn't clone the whole repo onto his laptop to use on a journey across Europe because his disk was full at the time. He has since cleared up the disk and was able to clone the repo".

I don't think it points to a larger issue with Europe not being able to handle 180GB files... at least I surely hope not.

peebeebee · 10 months ago

The European Union doesn't like it when a file gets too big and powerful. It needs to be broken apart in order to give smaller files a chance of success.

acdha · 10 months ago

My guess is that “Europe” is being used as a proxy for “high latency, low bandwidth” – especially if the person in question uses a VPN (especially one of those terrible “SSL VPN” kludges). It’s still surprisingly common to encounter software with poor latency handling or servers with broken window scaling because most of the people who work on them are relatively close and have high-bandwidth connections.

jerf · 10 months ago

And given the way of internal corporate networks, probably also "high failure rate", not because of "the internet", but the pile of corporate infrastructure needed for auditability, logging, security access control, intrusion detection, maxed out internal links... it's amazing any of this ever functions.

sroussey · 10 months ago

Or high packet loss.

Every once in a while, my router used to go crazy with what looked like packet loss (I think a memory issue).

Normal websites would become super slow for any pc or phone in the house.

But git… git would fail to clone anything not really small.

My fix was to unplug the modem and router and plug back in. :)

It took a long time to discover the router was reporting packet loss, that the slowness the browsers were experiencing had to do with retries, and that git just crapped out.

Eventually, whenever git started misbehaving, I restarted the router to fix it.

And now I have a new router. :)

hinkley · 10 months ago

Sounds, based on other responders, like high latency, high bandwidth, which is a problem many of us have trouble wrapping our heads around. Maybe complicated by packet loss.

After COVID I had to set up a compressing proxy for Artifactory and file a bug with JFrog about it because some of my coworkers with packet loss were getting request timeouts that npm didn’t handle well at all. Npm of that era didn’t bother to check bytes received versus content-length and then would cache the wrong answer. One of my many, many complaints about what total garbage npm was prior to ~8 when the refactoring work first started paying dividends.

benkaiser · 10 months ago

I can actually weigh in here. I work from Australia for another team inside Microsoft with a large monorepo on Azure DevOps. I pretty much cannot do a full (unshallow) clone of our repo because Azure DevOps cloning gets nowhere close to saturating my gigabit wired connection, and eventually, due to the sheer time it takes, something will hang up on either my end or the Azure DevOps end, to the point that I would just give up.

Thankfully, we do our work almost entirely in shallow clones inside codespaces so it's not a big deal. I hope the problems presented in the 1JS repo from this blog post are causing a similar size blowout in our repo and can be fixed.

thrance · 10 months ago

The repo is probably hosted on the west coast, meaning it has to cross the Atlantic whenever you clone it from Europe?

tazjin · 10 months ago

> What's up with folks in Europe that they can't clone a big repo, but others can?

They might be in a country with underdeveloped internet infrastructure, e.g. Germany))

avianlyric · 10 months ago

I don’t think there’s any country in Europe with internet infrastructure as underdeveloped as the US. Most of Europe has fibre-to-the-premises, and all of Europe has consumer internet packages that are faster and cheaper than you’re gonna find anywhere in the U.S.

eviks · 10 months ago

upd: silly mistake - file name does not include its full path

The explanation probably got lost among all the gifs, but the last 16 chars here are different:

> was actually only checking the last 16 characters of a filename

> For example, if you changed repo/packages/foo/CHANGELOG.md, when git was getting ready to do the push, it was generating a diff against repo/packages/bar/CHANGELOG.md!

tux3 · 10 months ago

Derrick provides a better explanation in this cover letter: https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitg...

(See also the path-walk API cover letter: https://lore.kernel.org/all/pull.1786.git.1725935335.gitgitg...)

The example in the blog post isn't super clear, but Git was essentially taking all the versions of all the files in the repo, putting the last 16 bytes of the path (not filename) in a hash table, and using that to group what they expected to be different versions of the same file together for delta compression.

Indeed, the example in the blog doesn't work as written, because the common suffix of foo/CHANGELOG.md and bar/CHANGELOG.md ("/CHANGELOG.md") is only 13 chars; you have to imagine the paths have a longer common suffix. That part is fixed by the --full-name-hash option: now you compare the full path instead of just 16 bytes.

Then they talk about increasing the window size. That's kind of a hack to work around bad file grouping, but it's not the real fix. You're still giving terrible inputs to the compressor and working around it by consuming huge amounts of memory. So it was a bit confusing to present that as the solution. The path walk API and/or --full-name-hash are the real interesting parts here =)
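
To make the grouping problem concrete, here is a minimal Python sketch. It is not Git's implementation: the bucket key is simply the last 16 bytes of the path (as described above), the full-path key is only a stand-in for what --full-name-hash aims for, and the package names are made up so that they share a suffix longer than 16 bytes.

```python
from collections import defaultdict


def group_by_key(paths, key):
    """Bucket paths by key(path), loosely mimicking how delta candidates are grouped."""
    buckets = defaultdict(list)
    for path in paths:
        buckets[key(path)].append(path)
    return dict(buckets)


def last_16_bytes(path):
    # Simplified stand-in for the old behaviour: only the tail of the path matters.
    return path.encode("utf-8")[-16:]


def full_path(path):
    # Stand-in for a key derived from the whole path.
    return path.encode("utf-8")


# Hypothetical package names that all end in "ils/CHANGELOG.md".
paths = [
    "packages/foo-utils/CHANGELOG.md",
    "packages/bar-utils/CHANGELOG.md",
    "packages/baz-utils/CHANGELOG.md",
]

by_tail = group_by_key(paths, last_16_bytes)
by_full = group_by_key(paths, full_path)

print(len(by_tail))  # 1: unrelated changelogs land in the same bucket
print(len(by_full))  # 3: each file only competes with its own history
```

With the tail key, every version of every one of these changelogs ends up in one bucket, so a small delta window is mostly filled with versions of the wrong file.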

lastdong · 10 months ago

Thank you! I ended up having to look at the PR to make any sense of the blog post, but your explanation and links make things much clearer.

derriz · 10 months ago

I wish they had provided an actual explanation of what exactly was happening and skipped all the “color” in the story. By filename do they mean path? Or is it that git will just pick any file with a matching name to generate a diff? Is there any pattern to the choice of other file to use?

snthpy · 10 months ago

+1

js2 · 10 months ago

> file name does not include its full path

No, it is the full path that's considered. Look at the commit message on the first commit in the `--full-name-hash` PR:

https://github.com/git-for-windows/git/pull/5157/commits/d5c...

Excerpt: "/CHANGELOG.json" is 15 characters, and is created by the beachball [1] tool. Only the final character of the parent directory can differentiate different versions of this file, but also only the two most-significant digits. If that character is a letter, then this is always a collision. Similar issues occur with the similar "/CHANGELOG.md" path, though there is more opportunity for differences in the parent directory.

The grouping algorithm puts less weight on each character the further it is from the right-side of the name:

  hash = (hash >> 2) + (c << 24)

Hash is 32 bits. Each 8-bit char (from the full path) in turn is added to the 8 most-significant bits of hash, after shifting any previous hash bits to the right by two bits (which is why only the final 16 chars affect the final hash). Look at what happens in practice:

https://go.dev/play/p/JQpdUGXdQs7

Here I've translated it to Go and compared the final value of "aaa/CHANGELOG.md" to "zzz/CHANGELOG.md". Plug in various values for "aaa" and "zzz" and see how little they influence the final value.
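
For readers who prefer Python, the same experiment can be sketched as below. It transcribes only the one-line recurrence quoted above; Git's real function may do more than this (for example around special characters), and the `name_hash` name and the example paths are just for illustration. The `& 0xFFFFFFFF` mask emulates 32-bit unsigned wraparound.

```python
def name_hash(path: str) -> int:
    """Fold a path into 32 bits using hash = (hash >> 2) + (c << 24)."""
    h = 0
    for c in path.encode("utf-8"):
        # Older characters slide right by two bits; the new one lands in the top byte.
        h = ((h >> 2) + (c << 24)) & 0xFFFFFFFF
    return h


a = name_hash("packages/foo/CHANGELOG.md")
b = name_hash("packages/bar/CHANGELOG.md")
print(f"{a:08x} {b:08x} differ by {abs(a - b)}")
```

The two values differ by at most 1 out of a 32-bit range, so anything that sorts or buckets by this hash treats the two changelogs as near-identical names.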

rurban · 10 months ago

Sounds like it needs to be fixed to FNV1a

eviks · 10 months ago

Thanks for the deep dive!

daenney · 10 months ago

File name doesn’t necessarily include the whole path. The last 16 characters of CHANGELOG.md is the full file name.

If we interpret it that way, that also explains why the path-walk solution solves the problem.

But if it’s really based on the last 16 characters of just the file name, not the whole path, then it feels like this problem should be a lot more common. At least in monorepos.

floam · 10 months ago

It did shrink Chromium’s repo quite a bit!

eviks · 10 months ago

yes, this makes sense, thanks for pointing it out, silly confusion on my part

p4bl0 · 10 months ago

I was also bugged by that. I imagine that the meta variables foo and bar are at fault here, and that probably the actual package names had a common suffix like firstPkg and secondPkg. A common suffix of length three is enough in this case to get 16 chars in common as "/CHANGELOG.md" is already 13 chars long.

jonathancreamer · 10 months ago

Sorry about the gifs. Haha. And yeah I guess my understanding wasn't quite right either reading the reply to this thread, I'll try to clean it up in the post.

tazjin · 10 months ago

I just tried this on nixpkgs (~5GB when cloned straight from Github).

The first option mentioned in the post (--window 250) reduced the size to 1.7GB. The new --path-walk option from the Microsoft git fork was less effective, resulting in 1.9GB total size.

Both of these are less than half of the initial size. Would be great if there was a way to get Github to run these, and even greater if people started hosting stuff in a way that gives them control over this ...
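
For anyone who wants to repeat this kind of measurement, here is a rough Python sketch. It assumes `git` is on the PATH and that the first experiment was a full repack with a larger window (`git repack -a -d -f --window=250`); the exact command used above isn't stated, so treat the flags as an assumption. The helper names are made up. `git count-objects -v` reports `size-pack` in KiB.

```python
import subprocess


def parse_size_pack_kib(count_objects_output: str) -> int:
    """Extract the 'size-pack' value (KiB) from `git count-objects -v` output."""
    for line in count_objects_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "size-pack":
            return int(value.strip())
    raise ValueError("no size-pack line found")


def pack_size_kib(repo: str) -> int:
    """Return the total on-disk size of the repo's packfiles, in KiB."""
    out = subprocess.run(
        ["git", "-C", repo, "count-objects", "-v"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_size_pack_kib(out)


def repack_with_window(repo: str, window: int = 250) -> None:
    # Assumed flags: -a (all objects), -d (drop old packs), -f (recompute deltas).
    subprocess.run(
        ["git", "-C", repo, "repack", "-a", "-d", "-f", f"--window={window}"],
        check=True,
    )
```

Calling `pack_size_kib` before and after `repack_with_window` on a scratch clone gives a before/after comparison; a full repack of a large repo can take a long time and a lot of memory.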

jakub_g · 10 months ago

The article mentions Derrick Stolee, who did the digging and shipped the necessary changes. If you're interested in git internals, shrinking git clone sizes locally and in CI, etc., Derrick wrote some amazing posts on the GitHub blog:

https://github.blog/author/dstolee/