> For many reasons, that's just too big, we have folks in Europe that can't even clone the repo due to it's size.
What's up with folks in Europe that they can't clone a big repo, but others can?
Also it sounds like they still won't be able to clone, until the change is implemented on the server side?
> This meant we were in many occasions just pushing the entire file again and again, which could be 10s of MBs per file in some cases, and you can imagine in a repo
The sentence seems to be cut off.
Also, the gifs are incredibly distracting while trying to read the article, and they are there even in reader mode.
> For many reasons, that's just too big, we have folks in Europe that can't even clone the repo due to it's size.
I read that as an anecdote; a more complete sentence would be "We had a case where someone from Europe couldn't clone the whole repo onto his laptop to use on a journey across Europe because his disk was full at the time. He has since cleared up the disk and was able to clone the repo".
I don't think it points to a larger issue with Europe not being able to handle 180GB files... I surely hope not.
My guess is that “Europe” is being used as a proxy for “high latency, low bandwidth” – especially if the person in question uses a VPN (especially one of those terrible “SSL VPN” kludges). It’s still surprisingly common to encounter software with poor latency handling or servers with broken window scaling because most of the people who work on them are relatively close and have high-bandwidth connections.
And given the way of internal corporate networks, probably also "high failure rate", not because of "the internet", but the pile of corporate infrastructure needed for auditability, logging, security access control, intrusion detection, maxed out internal links... it's amazing any of this ever functions.
Sounds, based on other responders, like high latency high bandwidth, which is a problem many of us have trouble wrapping our heads around. Maybe complicated by packet loss.
After COVID I had to set up a compressing proxy for Artifactory and file a bug with JFrog about it because some of my coworkers with packet loss were getting request timeouts that npm didn’t handle well at all. Npm of that era didn’t bother to check bytes received versus content-length and then would cache the wrong answer. One of my many, many complaints about what total garbage npm was prior to ~8 when the refactoring work first started paying dividends.
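The missing check described above is small. Here is a minimal Python sketch of it, using hypothetical names (`is_complete_response`, `cache_if_complete`, a plain `dict` of headers); it is not meant to reflect how npm or Artifactory actually implement anything:

```python
def is_complete_response(headers: dict, body: bytes) -> bool:
    """Return False when the byte count does not match Content-Length.

    Responses without a Content-Length header (e.g. chunked transfers)
    cannot be checked this way, so they are treated as complete here.
    """
    declared = None
    for name, value in headers.items():
        if name.lower() == "content-length":  # header names are case-insensitive
            declared = value
            break
    if declared is None:
        return True
    try:
        expected = int(declared)
    except ValueError:
        return False  # unparseable header: do not trust the payload
    return expected == len(body)


def cache_if_complete(cache: dict, url: str, headers: dict, body: bytes) -> bool:
    """Store the payload only when it passed the length check."""
    if not is_complete_response(headers, body):
        return False
    cache[url] = body
    return True
```

The point of the second function is the ordering: a truncated download is rejected before it can be cached, so a flaky connection cannot poison later installs.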
I can actually weigh in here. I work from Australia for another team inside Microsoft with a large monorepo on Azure DevOps. I pretty much cannot do a full (unshallow) clone of our repo because Azure DevOps cloning gets nowhere close to saturating my gigabit wired connection, and eventually, due to the sheer time it takes, something will hang up on either my end or the Azure DevOps end, to the point that I just give up.
Thankfully, we do our work almost entirely in shallow clones inside codespaces so it's not a big deal. I hope the problems presented in the 1JS repo from this blog post are causing similar size blowout in our repo and can be fixed.
I don’t think there’s any country in Europe with internet infrastructure as underdeveloped as the US. Most of Europe has fibre-to-the-premises, and all of Europe has consumer internet packages that are faster and cheaper than you’re gonna find anywhere in the U.S.
upd: silly mistake - file name does not include its full path
The explanation probably got lost among all the gifs, but the last 16 chars here are different:
> was actually only checking the last 16 characters of a filename
> For example, if you changed repo/packages/foo/CHANGELOG.md, when git was getting ready to do the push, it was generating a diff against repo/packages/bar/CHANGELOG.md!
The example in the blog post isn't super clear, but Git was essentially taking all the versions of all the files in the repo, putting the last 16 bytes of the path (not filename) in a hash table, and using that to group what they expected to be different versions of the same file together for delta compression.
Indeed, in the blog example it doesn't work, because the shared suffix "/CHANGELOG.md" of foo/CHANGELOG.md and bar/CHANGELOG.md is only 13 chars, but you have to imagine the paths have a longer common suffix. That part is fixed by the --full-name-hash option: now you compare the full path instead of just the last 16 bytes.
Then they talk about increasing the window size. That's kind of a hack to work around bad file grouping, but it's not the real fix. You're still giving terrible inputs to the compressor and working around it by consuming huge amounts of memory. So it was a bit confusing to present that as the solution. The path walk API and/or --full-name-hash are the real interesting parts here =)
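To see why a larger window only masks the grouping problem, here is a toy model in Python. It assumes, as a simplification, that delta candidates are sorted by a grouping key and then by size (largest first), and that each object is only compared with the `window` objects sorted just before it. The helper `count_same_path_candidates`, the paths, and the sizes are all made up for illustration; this is not Git's actual code.

```python
def name_hash(path: str) -> int:
    """Toy 32-bit name hash: hash = (hash >> 2) + (c << 24), skipping whitespace."""
    h = 0
    for byte in path.encode("utf-8"):
        if byte in b" \t\n\r\x0b\x0c":
            continue
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


def count_same_path_candidates(objects, key, window):
    """Count objects that find another version of the same path in their window."""
    # Sort by grouping key, then by size with the largest object first.
    ordered = sorted(objects, key=lambda o: (key(o[0]), -o[1]))
    hits = 0
    for i, (path, _size) in enumerate(ordered):
        neighbours = ordered[max(0, i - window):i]
        if any(p == path for p, _ in neighbours):
            hits += 1
    return hits


# Two hypothetical paths that share their last 16 characters ("-ui/CHANGELOG.md").
FOO = "packages/foo-ui/CHANGELOG.md"
BAR = "packages/bar-ui/CHANGELOG.md"
objects = [(FOO, 100), (BAR, 99), (FOO, 98), (BAR, 97), (FOO, 96), (BAR, 95)]

by_hash_w1 = count_same_path_candidates(objects, name_hash, window=1)
by_hash_w2 = count_same_path_candidates(objects, name_hash, window=2)
by_path_w1 = count_same_path_candidates(objects, lambda p: p, window=1)
print(by_hash_w1, by_hash_w2, by_path_w1)  # prints "0 4 4"
```

In this toy case, grouping by full path with a window of 1 finds as many same-path pairs as the truncated hash does with a window of 2. That mirrors the point above: a bigger window compensates for poor grouping, but only by doing more comparisons.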
I wish they had provided an actual explanation of what exactly was happening and skipped all the “color” in the story. By filename do they mean path? Or is it that git will just pick any file with a matching name to generate a diff? Is there any pattern to the choice of other file to use?
Excerpt: "/CHANGELOG.json" is 15 characters, and is created by the beachball [1] tool. Only the final character of the parent directory can differentiate different versions of this file, but also only the two most-significant digits. If that character is a letter, then this is always a collision. Similar issues occur with the similar "/CHANGELOG.md" path, though there is more opportunity for differences in the parent directory.
The grouping algorithm puts less weight on each character the further it is from the right-side of the name:
hash = (hash >> 2) + (c << 24)
Hash is 32-bits. Each 8-bit char (from the full path) in turn is added to the 8-most significant bits of hash, after shifting any previous hash bits to the right by two bits (which is why only the final 16 chars affect the final hash). Look at what happens in practice:
Here I've translated it to Go and compared the final value of "aaa/CHANGELOG.md" to "zzz/CHANGELOG.md". Plug in various values for "aaa" and "zzz" and see how little they influence the final value.
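The same experiment can be sketched in Python rather than Go. This is an illustrative re-implementation of the recurrence quoted above, not the commenter's code or Git's C source; the function name `name_hash`, the whitespace skipping, and the package names in the second comparison are assumptions made for the example.

```python
def name_hash(path: str) -> int:
    """32-bit name hash following hash = (hash >> 2) + (c << 24).

    Assumptions: ASCII whitespace bytes are skipped and the sum wraps
    at 32 bits, like unsigned arithmetic in C.
    """
    h = 0
    for byte in path.encode("utf-8"):
        if byte in b" \t\n\r\x0b\x0c":
            continue
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


a = name_hash("aaa/CHANGELOG.md")
z = name_hash("zzz/CHANGELOG.md")
# Only the lowest bits differ: the three leading characters carry almost no weight.
print(f"{a:032b}")
print(f"{z:032b}")

# Hypothetical package names that share their last 16 characters
# ("ton/CHANGELOG.md") collide outright.
left = name_hash("repo/packages/alpha-button/CHANGELOG.md")
right = name_hash("repo/packages/other-button/CHANGELOG.md")
print(left == right)  # prints True
```

Each character's contribution is shifted right by two bits for every character that follows it, so after 16 further characters it has been shifted out of the 32-bit value entirely.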
File name doesn’t necessarily include the whole path. The last 16 characters of CHANGELOG.md is the full file name.
If we interpret it that way, that also explains why the filepathwalk solution solves the problem.
But if it’s really based on the last 16 characters of just the file name, not the whole path, then it feels like this problem should be a lot more common. At least in monorepos.
I was also bugged by that. I imagine that the meta variables foo and bar are at fault here, and that probably the actual package names had a common suffix like firstPkg and secondPkg. A common suffix of length three is enough in this case to get 16 chars in common as "/CHANGELOG.md" is already 13 chars long.
Sorry about the gifs. Haha. And yeah I guess my understanding wasn't quite right either reading the reply to this thread, I'll try to clean it up in the post.
I just tried this on nixpkgs (~5GB when cloned straight from GitHub).
The first option mentioned in the post (--window 250) reduced the size to 1.7GB. The new --path-walk option from the Microsoft git fork was less effective, resulting in 1.9GB total size.
Both of these are less than half of the initial size. Would be great if there was a way to get GitHub to run these, and even greater if people started hosting stuff in a way that gives them control over this ...
The article mentions Derrick Stolee, who did the digging and shipped the necessary changes. If you're interested in git internals, shrinking git clone sizes locally and in CI, etc., Derrick wrote some amazing posts on the GitHub blog:
> Large blobs happens when someone accidentally checks in some binary, so, not much you can do
> Retroactively, once the file is there though, it's semi stuck in history.
Arguably, the fix for that is to run filter-branch, remove the offending binary, teach everyone and get them set up to use git-lfs for binaries, force push, and help everyone get their workstation to a good place.
Far from ideal, but better than having a large not-even-used file in git.
As someone else noted, this is about small, frequently changing files, so you could remove old versions from the history to save space, and use LFS going forward.
Code line count tends to grow exponentially. The bigger the code base, the more unreasonable it is to expect people not to reinvent an existing wheel, due to ignorance of the code or fear of breaking what exists by altering it to handle your use case (ignorance of the uses of the code).
IME it takes less time to go from 100 modules to 200 than it takes to go from 50 to 100.
Can’t we split the packages into logical groups and maybe have 20 or 30 monorepos of 70-100 packages? I doubt that all the devs involved in that monorepo have to deal with all the 2500 packages. And I doubt that there is a circular dependency that requires all of these packages to be managed in a single monorepo.
Changing 100 CI pipelines is a giant pain in the ass. The third time I split the work with two other people. The 4th time someone wrote a tool and switched to a config file in the repo. 2500 is nuts. How do you even track red builds?
When you have hundreds of developers you’re going to get millions of lines of code. That’s partly Parkinson’s Law, but also we have not fully perfected the three-way merge, which encourages devs to spread out more than intrinsically necessary in order to avoid tripping over each other.
If you really dig down into why we code the way we do, the “best practices” in software development, about half of them are heavily influenced by merge conflicts, if those aren’t the primary cause.
If I group like functions together in a large file, then I (probably) won’t conflict with another person doing an unrelated ticket that touches the same file. But if we both add new functions at the bottom of the file, we’ll conflict. As long as one of us does the right thing everything is fine.
Thanks for this post. Really interesting and a great win for OSS!
I've been watching all the recent GitMerge talks put up by GitButler and following the monorepo / scaling developments - lots of great things being put out there by Microsoft, GitHub, and GitLab.
I'd like to understand this last 16 char vs full path check issue better. How does this fit in with delta compression, pack indexes, multi-pack indexes etc ... ?
The author is using Microsoft's git fork; they added this new command just this summer: https://github.com/microsoft/git/pull/667
It more or less replaces the --full-name-hash option (again a very good cover letter that explains the differences and pros/cons of each very well!)
Every once in a while, my router used to go crazy with what seemed to be packet loss (I think a memory issue).
Normal websites would become super slow for any PC or phone in the house.
But git… git would fail to clone anything that wasn't really small.
My fix was to unplug the modem and router and plug them back in. :)
It took a long time to discover that the router was reporting packet loss, that the slowness the browsers were experiencing had to do with retries, and that git just crapped out.
Eventually, whenever git started misbehaving, I restarted the router to fix it.
And now I have a new router. :)
They might be in a country with underdeveloped internet infrastructure, e.g. Germany))
(See also the path-walk API cover letter: https://lore.kernel.org/all/pull.1786.git.1725935335.gitgitg...)
No, it is the full path that's considered. Look at the commit message on the first commit in the `--full-name-hash` PR:
https://github.com/git-for-windows/git/pull/5157/commits/d5c...
Go playground for the hash comparison: https://go.dev/play/p/JQpdUGXdQs7
Derrick Stolee's posts on the GitHub blog: https://github.blog/author/dstolee/
See also his website:
https://stolee.dev/
Kudos to Derrick, I learnt so much from those!
https://github.com/newren/git-filter-repo
Are they going to be opening a merge request to get their custom git command back in git proper then?