tux3 · 10 months ago
For those wondering where this new git-survey command is, it's actually not in git.git yet!

The author is using microsoft's git fork, they've added this new command just this summer: https://github.com/microsoft/git/pull/667

masklinn · 10 months ago
I assume full-name-hash and path-walk are also only in the fork (or in git HEAD)? Can't see them in the man pages, or in the 2.47 changelog.
tux3 · 10 months ago
Yep. Path-walk is currently pending review here: https://lore.kernel.org/all/pull.1813.git.1728396723.gitgitg...

It more or less replaces the --full-name-hash option (again a very good cover letter that explains the differences and pros/cons of each very well!)


yunusabd · 10 months ago
> For many reasons, that's just too big, we have folks in Europe that can't even clone the repo due to it's size.

What's up with folks in Europe that they can't clone a big repo, but others can? Also it sounds like they still won't be able to clone, until the change is implemented on the server side?

> This meant we were in many occasions just pushing the entire file again and again, which could be 10s of MBs per file in some cases, and you can imagine in a repo

The sentence seems to be cut off.

Also, the gifs are incredibly distracting while trying to read the article, and they are there even in reader mode.

anon-3988 · 10 months ago
> For many reasons, that's just too big, we have folks in Europe that can't even clone the repo due to it's size.

I read that as an anecdote; a more complete sentence would be "We had a case where someone from Europe couldn't clone the whole repo onto his laptop to use on a journey across Europe, because his disk was full at the time. He has since cleared up the disk and is able to clone the repo."

I don't think it points to a larger issue with Europe not being able to handle 180GB files... at least I surely hope not.

peebeebee · 10 months ago
The European Union doesn't like when a file get too big and powerful. It needs to be broken apart in order to give smaller files a chance of success.
acdha · 10 months ago
My guess is that “Europe” is being used as a proxy for “high latency, low bandwidth” – especially if the person in question uses a VPN (especially one of those terrible “SSL VPN” kludges). It’s still surprisingly common to encounter software with poor latency handling or servers with broken window scaling because most of the people who work on them are relatively close and have high bandwidth connection.
jerf · 10 months ago
And given the way of internal corporate networks, probably also "high failure rate", not because of "the internet", but the pile of corporate infrastructure needed for auditability, logging, security access control, intrusion detection, maxed out internal links... it's amazing any of this ever functions.
sroussey · 10 months ago
Or high packet loss.

Every once in a while, my router used to go crazy with what seemed like packet loss (I think a memory issue).

Normal websites would become super slow for any pc or phone in the house.

But git… git would fail to clone anything not really small.

My fix was to unplug the modem and router and plug back in. :)

It took a long time to discover that the router was reporting packet loss, that the slowness the browsers were experiencing had to do with retries, and that git just crapped out.

Eventually when git started misbehaving I restarted the router to fix.

And now I have a new router. :)

hinkley · 10 months ago
Sounds, based on other responders, like high latency high bandwidth, which is a problem many of us have trouble wrapping our heads around. Maybe complicated by packet loss.

After COVID I had to set up a compressing proxy for Artifactory and file a bug with JFrog about it because some of my coworkers with packet loss were getting request timeouts that npm didn’t handle well at all. Npm of that era didn’t bother to check bytes received versus content-length and then would cache the wrong answer. One of my many, many complaints about what total garbage npm was prior to ~8 when the refactoring work first started paying dividends.

benkaiser · 10 months ago
I can actually weigh in here. Working from Australia for another team inside Microsoft with a large monorepo on Azure devops. I pretty much cannot do a full (unshallow) clone of our repo because Azure devops cloning gets nowhere close to saturating my gigabit wired connection, and eventually, due to the sheer time it takes, something will hang up on either my end or the Azure devops end, to the point I would just give up.

Thankfully, we do our work almost entirely in shallow clones inside codespaces so it's not a big deal. I hope the problems presented in the 1JS repo from this blog post are causing similar size blowout in our repo and can be fixed.

thrance · 10 months ago
The repo is probably hosted on the west coast, meaning it has to cross the Atlantic whenever you clone it from Europe?
tazjin · 10 months ago
> What's up with folks in Europe that they can't clone a big repo, but others can?

They might be in a country with underdeveloped internet infrastructure, e.g. Germany))

avianlyric · 10 months ago
I don't think there's any country in Europe with internet infrastructure as underdeveloped as the US. Most of Europe has fibre-to-the-premises, and all of Europe has consumer internet packages that are faster and cheaper than you're gonna find anywhere in the U.S.
eviks · 10 months ago
upd: silly mistake - file name does not include its full path

The explanation probably got lost among all the gifs, but the last 16 chars here are different:

> was actually only checking the last 16 characters of a filename

> For example, if you changed repo/packages/foo/CHANGELOG.md, when git was getting ready to do the push, it was generating a diff against repo/packages/bar/CHANGELOG.md!

tux3 · 10 months ago
Derrick provides a better explanation in this cover letter: https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitg...

(See also the path-walk API cover letter: https://lore.kernel.org/all/pull.1786.git.1725935335.gitgitg...)

The example in the blog post isn't super clear, but Git was essentially taking all the versions of all the files in the repo, putting the last 16 bytes of the path (not filename) in a hash table, and using that to group what they expected to be different versions of the same file together for delta compression.

Indeed the blog's example doesn't quite work as written: the common suffix /CHANGELOG.md in foo/CHANGELOG.md and bar/CHANGELOG.md is only 13 chars, so you have to imagine the paths having a longer common suffix. That part is fixed by the --full-name-hash option: now you compare the full path instead of just 16 bytes.

Then they talk about increasing the window size. That's kind of a hack to work around bad file grouping, but it's not the real fix. You're still giving terrible inputs to the compressor and compensating by consuming huge amounts of memory, so it was a bit confusing to present it as the solution. The path-walk API and/or --full-name-hash are the real interesting parts here =)

lastdong · 10 months ago
Thank you! I ended up having to look at the PR to make any sense of the blog post, but your explanation and links makes things much clearer
derriz · 10 months ago
I wish they had provided an actual explanation of what exactly was happening and skipped all the “color” in the story. By filename do they mean path? Or is it that git will just pick any file with a matching name to generate a diff? Is there any pattern to the choice of other file to use?
snthpy · 10 months ago
+1
js2 · 10 months ago
> file name does not include its full path

No, it is the full path that's considered. Look at the commit message on the first commit in the `--full-name-hash` PR:

https://github.com/git-for-windows/git/pull/5157/commits/d5c...

Excerpt: "/CHANGELOG.json" is 15 characters, and is created by the beachball [1] tool. Only the final character of the parent directory can differentiate different versions of this file, but also only the two most-significant digits. If that character is a letter, then this is always a collision. Similar issues occur with the similar "/CHANGELOG.md" path, though there is more opportunity for differences in the parent directory.

The grouping algorithm puts less weight on each character the further it is from the right-side of the name:

  hash = (hash >> 2) + (c << 24)
Hash is 32 bits. Each 8-bit char (from the full path) is in turn added into the most-significant 8 bits of the hash, after the previous hash bits are shifted right by two (which is why only the final 16 chars affect the final hash). Look at what happens in practice:

https://go.dev/play/p/JQpdUGXdQs7

Here I've translated it to Go and compared the final value of "aaa/CHANGELOG.md" to "zzz/CHANGELOG.md". Plug in various values for "aaa" and "zzz" and see how little they influence the final value.

rurban · 10 months ago
Sounds like it needs to be switched to FNV-1a
eviks · 10 months ago
Thanks for the deep dive!
daenney · 10 months ago
File name doesn’t necessarily include the whole path. CHANGELOG.md is shorter than 16 characters, so its last 16 characters are the full file name.

If we interpret it that way, that also explains why the path-walk solution solves the problem.

But if it’s really based on the last 16 characters of just the file name, not the whole path, then it feels like this problem should be a lot more common. At least in monorepos.

floam · 10 months ago
It did shrink Chromium’s repo quite a bit!
eviks · 10 months ago
yes, this makes sense, thanks for pointing it out, silly confusion on my part
p4bl0 · 10 months ago
I was also bugged by that. I imagine that the meta variables foo and bar are at fault here, and that probably the actual package names had a common suffix like firstPkg and secondPkg. A common suffix of length three is enough in this case to get 16 chars in common as "/CHANGELOG.md" is already 13 chars long.
jonathancreamer · 10 months ago
Sorry about the gifs. Haha. And yeah, I guess my understanding wasn't quite right either, reading the replies to this thread; I'll try to clean it up in the post.
tazjin · 10 months ago
I just tried this on nixpkgs (~5GB when cloned straight from Github).

The first option mentioned in the post (--window 250) reduced the size to 1.7GB. The new --path-walk option from the Microsoft git fork was less effective, resulting in 1.9GB total size.

Both of these are less than half of the initial size. Would be great if there was a way to get Github to run these, and even greater if people started hosting stuff in a way that gives them control over this ...
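For anyone wanting to reproduce this, the commands would look something like the following (flag names as described in the post; --path-walk only exists in the microsoft/git fork at the time of writing, so the second repack fails on stock git):

```shell
# Inside an existing clone (nixpkgs in this case, ~5GB from GitHub):
git repack -adf --window=250   # stock git: larger delta search window
du -sh .git/objects/pack

# Microsoft fork only:
git repack -adf --path-walk    # group delta candidates by path
du -sh .git/objects/pack
```

Note that -f recomputes all deltas from scratch, so either run can take a long time on a repo this size.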

jakub_g · 10 months ago
The article mentions Derrick Stolee, who did the digging and shipped the necessary changes. If you're interested in git internals, shrinking git clone sizes locally and in CI, etc., Derrick wrote some amazing posts on the GitHub blog:

https://github.blog/author/dstolee/

See also his website:

https://stolee.dev/

Kudos to Derrick, I learnt so much from those!

fragmede · 10 months ago
> Large blobs happens when someone accidentally checks in some binary, so, not much you can do

> Retroactively, once the file is there though, it's semi stuck in history.

Arguably, the fix for that is to run filter-branch, remove the offending binary, teach and get everyone set up to use git-lfs for binaries, force push, and help everyone get their workstation to a good place.

Far from ideal, but better than having a large not-even-used file in git.

abound · 10 months ago
There's also BFG (https://rtyley.github.io/bfg-repo-cleaner/) for people like me who are scared of filter-branch.

As someone else noted, this is about small, frequently changing files, so you could remove old versions from the history to save space, and use LFS going forward.

larusso · 10 months ago
The main issue is not a binary file that never changes. It’s the small binary file that changes often.
cocok · 10 months ago
filter-repo is the recommended way these days:

https://github.com/newren/git-filter-repo
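A hedged sketch of that cleanup flow (big-asset.bin, the *.bin pattern, and the branch name are all placeholders; filter-repo rewrites every commit, so everyone has to re-clone or hard-reset afterwards):

```shell
# Rewrite history to drop the blob everywhere it appears.
git filter-repo --invert-paths --path big-asset.bin

# Track the pattern with LFS going forward.
git lfs install
git lfs track "*.bin"
git add .gitattributes
git commit -m "Track binaries with Git LFS"

# Publish the rewritten history (coordinate with the whole team first).
git push --force-with-lease origin main
```

For moving existing history into LFS rather than deleting it, `git lfs migrate import --include="*.bin" --everything` is the usual alternative.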


lastdong · 10 months ago
It’s easier to blame Linus.
develatio · 10 months ago
Hacking Git sounds fun, but isn't there a way to just not have 2.500 packages in a monorepo?
hinkley · 10 months ago
Code line count tends to grow exponentially. The bigger the code base, the more unreasonable it is to expect people not to reinvent an existing wheel, due to ignorance of the code or fear of breaking what exists by altering it to handle your use case (ignorance of the uses of the code).

IME it takes less time to go from 100 modules to 200 than it takes to go from 50 to 100.

Cthulhu_ · 10 months ago
Yeah, have 2500 separate Git repos with all the associated overhead.
develatio · 10 months ago
Can’t we split the packages into logical groups and maybe have 20 or 30 monorepos of 70-100 packages? I doubt that all the devs involved in that monorepo have to deal with all the 2500 packages. And I doubt that there is a circular dependency that requires all of these packages to be managed in a single monorepo.
hinkley · 10 months ago
Changing 100 CI pipelines is a giant pain in the ass. The third time I split the work with two other people. The 4th time someone wrote a tool and switched to a config file in the repo. 2500 is nuts. How do you even track red builds?
lopkeny12ko · 10 months ago
This was exactly my first thought as well. This seems like an entirely self-manufactured problem.
hinkley · 10 months ago
When you have hundreds of developers you’re going to get millions of lines of code. That’s partly Parkinson’s Law, but also we have not fully perfected the three-way merge, encouraging devs to spread out more than intrinsically necessary in order to avoid tripping over each other.

If you really dig down into why we code the way we do, about half of the “best practices” in software development are heavily influenced by merge conflicts, if not primarily caused by them.

If I group like functions together in a large file, then I (probably) won’t conflict with another person doing an unrelated ticket that touches the same file. But if we both add new functions at the bottom of the file, we’ll conflict. As long as one of us does the right thing everything is fine.

snthpy · 10 months ago
Thanks for this post. Really interesting and a great win for OSS!

I've been watching all the recent GitMerge talks put up by GitButler and following the monorepo / scaling developments - lots of great things being put out there by Microsoft, Github, and Gitlab.

I'd like to understand this last 16 char vs full path check issue better. How does this fit in with delta compression, pack indexes, multi-pack indexes etc ... ?

_joel · 10 months ago
> Really interesting and a great win for OSS!

Are they going to be opening a merge request to get their custom git command back in git proper then?