theli0nheart · 8 days ago
I wrote git-bigstore [0] almost 10 (!) years ago to solve this problem—even before Git LFS—and as far as I know, bigstore still works perfectly.

You specify the files you want to store in your storage backend via .gitattributes, and use two separate commands to sync files. I have not touched this code in years but the general implementation should still work.

GitHub launched LFS not too long after I wrote this, so I kind of gave up on the idea thinking that no one would want to use it in lieu of GitHub's solution, but based on the comments I think there's a place for it.

It needs some love but the idea is solid. I wrote a little description on the wiki about the low-level implementation if you want to check it out. [1]

Also, all of the metadata is stored using git notes, so is completely portable and is frontend agnostic—doesn't lock you into anything (except, of course, the storage backend you use).

[0]: https://github.com/lionheart/git-bigstore

[1]: https://github.com/lionheart/git-bigstore/wiki
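A rough sketch of what that setup looks like; the filter name and the two sync commands below are illustrative, based on the description above rather than the actual README:

    # .gitattributes: route matching files through the bigstore filter (pattern is illustrative)
    echo '*.psd filter=bigstore' >> .gitattributes

    # then sync large files in two explicit steps (command names are assumptions)
    git bigstore push    # upload large files to the configured storage backend
    git bigstore pull    # download the large files needed by the current checkout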

bob1029 · 8 days ago
> Large object promisors are special Git remotes that only house large files.

I like this approach. If I could configure my repos to use something like S3, I would switch away from using LFS. S3 seems like a really good synergy for large blobs in a VCS. The intelligent tiering feature can move data into colder tiers of storage as history naturally accumulates and old things are forgotten. I wouldn't mind a historical checkout taking half a day (i.e., restored from a robotic tape library) if I am pulling in stuff from a decade ago.

riedel · 8 days ago
The article mentions alternatives to Git LFS, like git-annex, that already support S3 (which IMHO is still a bit of a pain in the ass on Windows due to the symlink workflow). Also, dvc plays nicely with git and S3. GitLab, btw, also simply offloads Git LFS to S3. All have their quirks. I typically opt for LFS as a no-brainer but use the others when it fits the workflow and the infrastructure requirements.

Edit: In particular, the hash algorithm and the change detection (and when it happens) make a difference when you have 2 GB files and not just the 25 MB file from the OP.

a_t48 · 8 days ago
At my current job I've started caching all of our LFS objects in a bucket, for cost reasons. Every time a PR is run, I get the list of objects via `git lfs ls-files`, sync them from GCP, run `git lfs checkout` to actually populate the repo from the object store, and then `git lfs pull` to pick up anything not cached. If there were uncached objects, I push them back up via `gcloud storage rsync`. Simple, doesn't require any configuration for developers (who only ever have to pull new objects), and keeps the GitHub UI unconfused about the state of the repo.

I'd initially looked at spinning up an LFS backend, but this solves the main pain point, for now. GitHub was charging us an arm and a leg for pulling LFS files in CI: each checkout is fresh and the caching model is non-ideal (max 10GB cache, impossible to share between branches), so we end up pulling a bunch of data that is unfortunately in LFS, every commit, possibly multiple times. They happily charge us for all that bandwidth, and they don't provide tools to make it easy to reduce it (let me pay for more cache size, or warm workers with an entire cache disk, or better cache control, or...).

...and if I want to enable this for developers it's relatively easy, just add a new git hook to do the same set of operations locally.
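A minimal sketch of that flow, assuming the default local LFS object store under `.git/lfs/objects` (the bucket name and paths are placeholders):

    BUCKET="gs://example-lfs-cache"            # placeholder cache bucket

    git lfs ls-files > lfs-manifest.txt        # which objects does this checkout need?

    # Pull whatever is already cached into git's local LFS object store
    gcloud storage rsync --recursive "$BUCKET/objects" .git/lfs/objects

    git lfs checkout                           # materialize files from the local object store
    git lfs pull                               # fetch anything still missing from the LFS server

    # Push newly fetched objects back to the cache for the next run
    gcloud storage rsync --recursive .git/lfs/objects "$BUCKET/objects"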

tagraves · 8 days ago
We use a somewhat similar approach in RWX when pulling LFS files[1]. We run `git lfs ls-files` to get a list of the LFS files, then pass that list into a task which pulls each file from the LFS endpoint using curl. Since in RWX the outputs of tasks are cached as long as their inputs don't change, the LFS files just stay in the RWX cache and are pulled from there on future clones in CI. In addition to saving on GitHub's LFS bandwidth costs, the RWX cache is also _much_ faster to restore from than `git lfs pull`.

[1] https://github.com/rwx-cloud/packages/blob/main/git/clone/bi...
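For the curious, pulling an object this way without the LFS client boils down to the LFS Batch API plus a plain download; the repo URL, OID, and size below are placeholders:

    # Ask the LFS endpoint where to download an object (Batch API)
    curl -s -X POST "https://example.com/org/repo.git/info/lfs/objects/batch" \
      -H "Accept: application/vnd.git-lfs+json" \
      -H "Content-Type: application/vnd.git-lfs+json" \
      -d '{"operation":"download","transfers":["basic"],
           "objects":[{"oid":"<sha256-from-pointer-file>","size":12345}]}'

    # The response lists objects[].actions.download.href (plus any required headers);
    # fetch that URL with a second curl and write it to the path named in the pointer file.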

cyberax · 8 days ago
> let me pay for more cache size

Apparently, this is coming in Q3 according to their public roadmap: https://github.com/github/roadmap/issues/1029

gmm1990 · 8 days ago
Why not run some open-source CI locally, or on EC2 or the Google equivalent, if you're already going to the trouble of this much customization with GitHub CI?
nullwarp · 8 days ago
Same, and I never understood why it wasn't the default from the get-go, but maybe S3 wasn't so synonymous with object storage when LFS first came out.

I run a small git LFS server because of this and will be happy to switch away the second I can get git to natively support S3.

_bent · 8 days ago
I'm currently running https://github.com/datopian/giftless to store the LFS files belonging to repos I have on GitHub on my homelab MinIO instance.

There are a couple other projects that bridge S3 and LFS, though I had the most success with this setup.

account42 · 6 days ago
Why does the git client need specific support for this, though? What's stopping a git host from redirecting requests for certain objects to a different host and refusing to pack them into bundles today?

Deleted Comment

johnisgood · 8 days ago
Is S3 related to Amazon?
bayindirh · 8 days ago
You can install your own S3-compatible storage system on premises. It can be anything from a simple daemon (Scality, JuiceFS) to a small appliance (TrueNAS) to a full-blown storage cluster (Ceph). OpenStack has its own object storage service (Swift).

If you fancy it for your datacenter, big players (Fujitsu, Lenovo, Huawei, HPE) will happily sell you "object storage" systems which also support S3 at very high speeds.
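"S3-compatible" in practice usually just means the standard clients work if you point them at a different endpoint, e.g. with the stock AWS CLI (the host is a placeholder):

    # List buckets on a self-hosted S3-compatible store (MinIO, Ceph RGW, etc.)
    aws s3 ls --endpoint-url https://s3.example.internal:9000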

StopDisinfo910 · 8 days ago
Yes, S3 is the name of Amazon's object storage service. Various players in the industry have started offering solutions with a compatible API, which some people loosely call S3 too.
flohofwoe · 8 days ago
Yeah it's AWS's 'cloud storage service'.
jauer · 9 days ago
TFA asserts that Git LFS is bad for several reasons, including that it is proprietary with vendor lock-in, which I don't think is a fair claim. GitHub provided an open client and server, which negates that.

LFS does break disconnected/offline/sneakernet operations which wasn't mentioned and is not awesome, but those are niche workflows. It sounds like that would also be broken with promisors.

The `git partial clone` examples are cool!

The description of Large Object Promisors makes it sound like they take the client-side complexity in LFS, move it server-side, and then increase the complexity? Instead of the client uploading to a git server and to an LFS server, it uploads to a git server which in turn uploads to an object store, but the client will download directly from the object store? Obviously different tradeoffs there. I'm curious how often people will get bit by uploading to public git servers which upload to hidden promisor remotes.
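For reference, the partial-clone invocations being praised look roughly like this (the repo URL is a placeholder):

    # Clone without any blobs; they are fetched lazily at checkout time
    git clone --filter=blob:none https://example.com/big-repo.git

    # Or keep small blobs locally and only defer the large ones
    git clone --filter=blob:limit=1m https://example.com/big-repo.git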

IshKebab · 9 days ago
LFS is bad. The server implementations suck. It conflates object contents with the storage method. It's opt-in, in a terrible way - if you do the obvious thing you get tiny text files instead of the files you actually want.

I dunno if their solution is any better but it's fairly unarguable that LFS is bad.

jayd16 · 8 days ago
It does seem like this proposal has exactly the same issue. Unless this new method blocks cloning when unable to access the promisors, you'll end up with similar problems of broken large files.
ozim · 8 days ago
I think the answer is maybe not storing large files in the repo but managing them separately.

Mostly I haven't run into such a use case, but in general I don't see any upside in trying to shove big files together with code in repositories.

AceJohnny2 · 8 days ago
Another way that LFS is bad, as I recently discovered, is that the migration will pollute the `.gitattributes` of ancestor commits that do not contain the LFS objects.

In other words, if you migrate a repo that has commits A->B->C, and C adds the large files, then commits A & B will gain a `.gitattributes` referring to the large files that do not exist in A & B.

This is because the migration function carries its `.gitattributes` structure backwards as it walks the history, for caching purposes, without cross-referencing it against the current commit.
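For context, the rewrite in question is the one `git lfs migrate` performs; a typical invocation (the pattern is illustrative) is:

    # Rewrite every ref so matching files become LFS pointers
    git lfs migrate import --include="*.bin" --everything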

actinium226 · 8 days ago
That doesn't sound right. There's no way it's adding a file to previous commits, that would change the hash and thereby break a lot of things.
gradientsrneat · 8 days ago
> LFS does break disconnected/offline/sneakernet operations which wasn't mentioned and is not awesome

Yea, I had the same thought. And TBD on large object promisors.

Git annex is somewhat more decentralized as it can track the presence of large files across different remotes. And it can pull large files from filesystem repos such as USB drives. The downside is that it's much more complicated and difficult to use. Some code forges used to support it, but support has since been dropped.

cma · 8 days ago
Git LFS didn't work with SSH; you had to get an SSL cert, which GitHub knew was a barrier for people self-hosting at home. I think GitLab finally got it patched for SSH, though.
remram · 8 days ago
letsencrypt launched 3 years before git-lfs
Ferret7446 · 8 days ago
This article treats LFS unfairly. It does not in any way lock you into GitHub; the protocol is open. The downsides of LFS are unavoidable for a Git extension. Promisors are basically the same concept as LFS, except that, being built into Git, they can provide a better UX than is possible with an extension.
andrewmcwatters · 8 days ago
Using LFS once in a repository locks you in permanently. You actually have to delete the repository from GitHub to remove the space consumed. It’s entirely a non-starter.

Nowhere is this behavior explicitly stated.

I used to use Git LFS on GitHub to do my company’s study on GitHub statistics because we stored large compressed databases on users and repositories.

throwaway290 · 8 days ago
This conflates Git and Github. Github is crap, news at 11. Git itself is fine and LFS is an extension for Git. There is nothing in LFS spec that discusses storage billing. Anyone can write a better server
KronisLV · 8 days ago
> And the problems are significant:

> High vendor lock-in – When GitHub wrote Git LFS, the other large file systems—Git Fat, Git Annex, and Git Media—were agnostic about the server-side. But GitHub locked users to their proprietary server implementation and charged folks to use it.

Is this a current issue?

I used Git LFS with a GitLab instance this week, seemed to work fine.

https://docs.gitlab.com/topics/git/lfs/

I also used Git LFS with my Gitea instance a week before that, it was fine too.

https://docs.gitea.com/administration/git-lfs-setup

At the same time it feels odd to hear mentions of LFS being deprecated in the future, while I've seldom seen anyone even use it: people just don't seem to care and shove images and such into regular Git, which puzzles me.

technoweenie · 8 days ago
I'm really happy to see large file support in Git core. Any external solution would have similar opt-in procedures. I really wanted it to work seamlessly with as few extra commands as possible, so the API was constrained to the smudge and clean filters in the '.gitattributes' file.

Though I did work hard to remove any vendor lock-in by working directly with Atlassian and Microsoft pretty early in the process. It was a great working relationship, with a lot of help from Atlassian in particular on the file locking API. LFS shipped open source with compatible support in 3 separate git hosts.
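For context, this is the clean/smudge wiring in question; roughly what the current client writes when you run `git lfs track` and `git lfs install`:

    $ git lfs track "*.psd"
    # .gitattributes gains:
    #   *.psd filter=lfs diff=lfs merge=lfs -text

    $ git lfs install
    # the global git config gains:
    #   [filter "lfs"]
    #       clean = git-lfs clean -- %f
    #       smudge = git-lfs smudge -- %f
    #       process = git-lfs filter-process
    #       required = true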

glitchc · 9 days ago
No. This is not a solution.

While git LFS is just a kludge for now, writing a filter argument during the clone operation is not the long-term solution either.

Git clone is the very first command most people will run when learning how to use git. Emphasized for effect: the very first command.

Will they remember to write the filter? Maybe, if the tutorial to the cool codebase they're trying to access mentions it. Maybe not. What happens if they don't? It may take a long time without any obvious indication. And if they do? The cloned repo might not be compilable/usable since the blobs are missing.

Say they do get it right. Will they understand it? Most likely not. We are exposing the inner workings of git on the very first command they learn. What's a blob? Why do I need to filter on it? Where are blobs stored? It's classic abstraction leakage.

This is a solved problem: Rsync does it. Just port the bloody implementation over. It does mean supporting alternative representations or moving away from blobs altogether, which git maintainers seem unwilling to do.

IshKebab · 9 days ago
I totally agree. This follows a long tradition of Git "fixing" things by adding a flag that 99% of users won't ever discover. They never fix the defaults.

And yes, you can fix defaults without breaking backwards compatibility.

Jenk · 8 days ago
> They never fix the defaults

Not strictly true. They did change the default push behaviour from "matching" to "simple" in Git 2.0.

TGower · 8 days ago
> The cloned repo might not be compilable/usable since the blobs are missing.

Only the histories of the blobs are filtered out.

ks2048 · 8 days ago
> This is a solved problem: Rsync does it.

Can you explain what the solution is? I don't mean the details of the rsync algorithm, but rather what it would look like from the user's perspective. What files are on your local filesystem when you do a "git clone"?

hinkley · 8 days ago
When you do a shallow clone, no files would be present. However, when doing a full clone you'll get a full copy of each version of each blob, and what is being suggested is to treat each revision as an rsync operation on the last. And the more you muck with a file, which can happen a lot both with assets and if you check in your deps to get exact snapshotting of code, the more big-file churn you get.
bogwog · 8 days ago
Maybe a manual filter isn't the right solution, but this does seem to add a lot of missing pieces.

The first time you try to commit on a new install, git nags you to set your email address and name. I could see something similar happen the first time you clone a repo that hits the default global filter size, with instructions on how to disable it globally.

> The cloned repo might not be compilable/usable since the blobs are missing.

Maybe I misunderstood the article, but isn't the point of the filter to prevent downloading the full history of big files, and instead only check out the required version (like LFS does)?

So a filter of 1 byte will always give you a working tree, but trying to check out a prior commit will require a full download of all files.

spyrja · 8 days ago
Would it be incorrect to say that most of the bloat relates to historical revisions? If so, maybe an rsync-like behavior starting with the most current version of the files would be the best starting point. (Which is all most people will need anyhow.)
pizza234 · 8 days ago
> Would it be incorrect to say that most of the bloat relates to historical revisions?

Based on my experience (YMMV), I think it is incorrect, yes, because any time I've performed a shallow clone of a repository, the saving wasn't as much as one would intuitively imagine (in other words: history is stored very efficiently).

expenses3 · 8 days ago
Exactly. If large files suck in git then that's because the git backend and cloning mechanism sucks for them. Fix that and then let us move on.
krupan · 7 days ago
That's exactly what these changes do, but they don't become the default because a lot of people only store text in Git, so they don't want the downsides of these changes.
matheusmoreira · 8 days ago
It is a solution. The fact beginners might not understand it doesn't really matter, solutions need not perish on that alone. Clone is a command people usually run once while setting up a repository. Maybe the case could be made that this behavior should be the default and that full clones should be opt-in but that's a separate issue.
TZubiri · 8 days ago
"Will they remember to write the filter? Maybe, "

Nothing wrong with "forgetting" to write the filter, and then if it's taking more than 10 minutes, write the filter.

Too · 8 days ago
What? Why would you want to expose a beginner to waiting 10 minutes unnecessarily? How would they even know what they did wrong or what a reasonable time to wait is? Ask ChatGPT "why is my git clone taking 10 minutes"?!

Is this really the best we can do in terms of user experience? No. Git needs to step up.

Deleted Comment

wbillingsley · 8 days ago
What I used to recommend to my software engineering classes is that instead of putting large files (media, etc.) into Git, they put them into the artifact repository (Artifactory or something like it). That lets you, for instance, publish them as a snapshot dependency that the build system will automatically fetch for you, but control how much history you keep and only require your colleagues to fetch the latest version. Even better, a simple clean of their build system cache will free up the space used by old versions on their machines.
StopDisinfo910 · 8 days ago
People like storing everything in git because it significantly simplifies configuration management. A build can be cleanly linked to a git hash instead of being a hash and a bunch of artifacts versions especially if you vendor your dependencies in your source control and completely stop using an artifact repository.

With a good build system using a shared cache, it makes for a very pleasant development environment.

cyberax · 8 days ago
This has its own issues. Now you need to provision additional credentials into your CI/CD and to your developers.

Commits become multi-step, as you need to first commit the artifacts to get their artifact IDs to put in the repo. You can automate that via git hooks, but then you're back at where you started: git-lfs.

firesteelrain · 8 days ago
Do you teach CI/CD systems architecture in your classes? Because I am finding that is what the junior engineers that we have hired seem to be missing.

Tying it all in with GitLab, Artifactory, CodeSonar, Anchore etc

TZubiri · 8 days ago
I think the OP refers to assets that truly belong in Git because they are source code but large, like 3d models.

Release artifacts like a .exe would NOT belong in Git because it is not source code.

wbillingsley · 8 days ago
Yes
astrobe_ · 8 days ago
It sounds like a submodule... But certainly, if the problem could be solved with a submodule, people would have found out long ago. Git's submodules also support shallow cloning already [1]. I can only guess what the issues are with large files since I haven't faced it myself; I deal with pure source code most of the time. I'm interested to know why it would be a bad idea to do that, just in case. The caveats pointed out in the second SO answer don't seem to be a big deal.

[1] https://stackoverflow.com/questions/2144406/how-to-make-shal...
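A sketch of the shallow-submodule setup from [1]; the submodule name, path, and URL are placeholders:

    # .gitmodules entry for a large-assets submodule that should stay shallow
    #   [submodule "assets"]
    #       path = assets
    #       url = https://example.com/assets.git
    #       shallow = true

    # or request shallowness explicitly at update time:
    git submodule update --init --depth 1 assets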

pnt12 · 8 days ago
It sounds different to me - a regular git submodule would keep all history, unlike a file storage with occasional snapshotting.