Unlike Git LFS, we don’t just store the files. We use content-defined chunking and Merkle Trees to dedupe against everything in history. This allows small changes in large files to be stored compactly. Read more here: https://xethub.com/assets/docs/how-xet-deduplication-works
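The general idea, as a toy sketch (not the production algorithm; the rolling hash and chunk-size parameters here are made up): a content-derived hash decides chunk boundaries, so an edit in the middle of a large file only changes the chunks it touches, and unchanged chunks dedupe by hash. The list of chunk hashes is what the Merkle tree is built over.

```python
import hashlib, os

MASK = (1 << 13) - 1  # ~8 KiB average chunk size (illustrative value)

def chunks(data: bytes):
    """Yield content-defined chunks using a toy rolling hash (real CDC uses Gear/Rabin-style hashes)."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF
        if (h & MASK) == MASK or i - start >= 64 * 1024:  # boundary hit, or max chunk size reached
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def store(data: bytes, cas: dict) -> list:
    """Content-address each chunk; identical chunks are stored only once (the dedup)."""
    ids = []
    for c in chunks(data):
        cid = hashlib.sha256(c).hexdigest()
        cas.setdefault(cid, c)
        ids.append(cid)
    return ids

cas = {}
original = os.urandom(10 * 1024 * 1024)                                 # stand-in for a large file
edited = original[:5_000_000] + b"small patch" + original[5_000_005:]   # edit in the middle
v1, v2 = store(original, cas), store(edited, cas)
print("chunks reused:", len(set(v1) & set(v2)), "| new chunks:", len(set(v2) - set(v1)))
```

Because boundaries are decided by content rather than fixed offsets, the chunking resynchronizes right after the edit, which is why only a handful of new chunks need to be stored or transferred.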
Today, XetHub works for 1 TB repositories, and we plan to scale to 100 TB in the next year. Our client, cache, and storage layers are implemented in Rust, and our web application is written in Go. XetHub includes a GitHub-like web interface that provides automatic CSV summaries and allows custom visualizations using Vega. Even at 1 TB, we know downloading an entire repository is painful, so we built git-xet mount, which provides a user-mode filesystem view over the repo in seconds.
XetHub is available today on Linux and Mac (Windows coming soon), and we would love your feedback!
If you're interested in something you can self-host... I work on Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't have a Git-like interface but also implements data versioning. Our approach deduplicates between files (even very small files), and our storage algorithm doesn't create a number of objects proportional to directory nesting depth, as Xet appears to. (Xet is very much like Git in that respect.)
The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.
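In spirit it looks something like this (not Pachyderm's actual API, just a sketch of datum-level incrementality; the helper names are made up): only inputs whose content changed since the last commit get reprocessed, and results for unchanged inputs are carried forward.

```python
# Sketch of incremental processing over versioned data: only files whose content
# hash changed since the previous snapshot are reprocessed; other results are reused.
import hashlib

def snapshot(files: dict) -> dict:
    """Map filename -> content hash for one 'commit' of the input data."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}

def process(name: str, data: bytes) -> str:
    return f"result-for-{name}({len(data)} bytes)"        # stand-in for real work

def run_pipeline(prev_snap: dict, prev_results: dict, files: dict):
    snap, results = snapshot(files), {}
    for name, digest in snap.items():
        if prev_snap.get(name) == digest:
            results[name] = prev_results[name]             # unchanged: reuse old output
        else:
            results[name] = process(name, files[name])     # new or changed: recompute
    return snap, results

day1 = {"a.csv": b"1,2,3", "b.csv": b"4,5,6"}
snap, results = run_pipeline({}, {}, day1)
day2 = {**day1, "b.csv": b"4,5,7"}                         # only b.csv changed
snap, results = run_pipeline(snap, results, day2)          # a.csv's result is reused as-is
```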
Xet's system for mounting a remote repo as a filesystem is a good idea. We do that too :)
...isn't that just parsing `git diff --name-only A..B` though? "Process only files that changed since the last commit" is an extremely simple problem to solve.
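i.e. something like this (a rough sketch; the repo path and refs are placeholders):

```python
# The naive "process only changed files" approach: parse `git diff --name-only A..B`
# between two refs and feed the changed paths to your job.
import subprocess

def changed_files(repo: str, a: str, b: str) -> list:
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-only", f"{a}..{b}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]

for path in changed_files(".", "HEAD~1", "HEAD"):
    print("would reprocess:", path)
```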
How about cleaning up old versions?
[1] When you run `git annex add`, it hashes the file and moves the original into a `.git/annex/data` folder laid out content-addressably by hash, much like git's own object store. It then replaces the original file with a symlink to that hashed path. The file is marked read-only, so any command in any language that tries to write to it will error (you can always `git annex unlock` it when you need to write). Duplicated files simply point to the same hashed location. As long as you git push normally and back up `.git/annex/data`, you're fully version controlled, and you can share subsets of files as needed.
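A toy sketch of that mechanism (not git-annex's real key format or directory layout, just the shape of it; the path argument is hypothetical):

```python
# Toy content-addressed "annex add": hash the file, move it into a store keyed by
# its hash, mark it read-only, and leave a symlink behind. Duplicate files map to
# the same stored object. NOT git-annex's actual key format or layout.
import hashlib, os, stat

def annex_add(path: str, store_dir: str = ".git/annex/data") -> str:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    dest = os.path.join(store_dir, digest[:2], digest)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):                    # duplicates point at the same object
        os.rename(path, dest)
        os.chmod(dest, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # read-only
    else:
        os.remove(path)
    os.symlink(os.path.relpath(dest, os.path.dirname(path)), path)
    return digest

# annex_add("data/bigfile.bin")  # hypothetical path
```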
Imagine you have a 500MB file (lastmonth.csv) where every day 1MB is changed.
With file-based deduplication, 500MB is uploaded every day, and every clone of the repo has to download that 500MB.
With block-based deduplication, only around the 1MB that changed is uploaded and downloaded.
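Rough numbers for that example over a 30-day month (assuming the daily 1MB edit doesn't disturb the surrounding chunk boundaries):

```python
# Back-of-the-envelope transfer math for the lastmonth.csv example.
FILE_MB, CHANGED_MB, DAYS = 500, 1, 30

file_level = FILE_MB * DAYS                 # whole file re-uploaded each day
block_level = FILE_MB + CHANGED_MB * DAYS   # initial upload, then only changed blocks

print(f"file-level dedup:  ~{file_level} MB uploaded")   # ~15000 MB
print(f"block-level dedup: ~{block_level} MB uploaded")  # ~530 MB
```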
I am not a user of git annex but I do know that it works perfectly with an rsync.net account as a target:
https://git-annex.branchable.com/forum/making_good_use_of_my...
... which means that you could do a dumb mirror of your repo(s), perhaps just using rsync, and then let ZFS snapshots handle the versioning/rotation, which would give you the benefits of block-level diffs.
One additional benefit, beyond more efficient block-level diffs, is that the ZFS snapshots are immutable/read-only, as opposed to your 'git' or 'git annex' produced versions, which could be destroyed by Mallory ...
When you have checked something out and fetched it, it consumes space on disk, but that is true of git-lfs and most other tools like it. It does NOT consume any space in the git object files.
I regularly use a git-annex repo that contains about 60G of files, works with GitHub or any git host, uses about 6G in its annex, and takes only about 1M in the actual git repo itself. I chain git-annex to an internal .bup repo, so I can keep track of locations and benefit from dedup.
I honestly have not found anything that comes close to the versatility of git-annex.
[1]: https://news.ycombinator.com/item?id=33976418
[2]: https://git-annex.branchable.com/special_remotes/
Dolt hasn't come up here yet, probably because we're focused on OLTP use cases, not MLOps, but we do have some customers using Dolt as the backing store for their training data.
https://github.com/dolthub/dolt
Dolt also scales to the 1TB range and offers you full SQL query capabilities on your data and diffs.
As someone who'd love to put their data into a git-like system, this sounds pretty interesting. Aside from there not being a tier for someone like me, who would maybe have a couple of repositories of size O(250GB), it's unclear how e.g. bandwidth would work, and whether other people could simply mount and clone the full repo for free if they wanted to.
In general, we are thinking about usage-based pricing (which would include bandwidth and storage) - what are your thoughts on that?
Also, where would you be mounting your repos from? We have local caching options that can greatly reduce the overall bandwidth needed to support data center workloads.
Generally, usage-based pricing sounds fair. In the end, for cases like mine, where it's "read rarely, but should be available publicly long term", it would need to compete with the pricing offered by the big cloud providers.
I'm about to leave my academic career and I'm thinking about how to make sure all my detector data will be available to other researchers in my field in the future. Aside from the obvious candidate https://zenodo.org it's an annoying problem, as most universities I'm familiar with only archive data internally, which is hard to access for researchers from other institutions. Since I don't want to rely on a single place to keep that data available, I'm looking for an additional alternative (one that I'm willing to pay for out of my own pocket; it just shouldn't be a financial burden).
In particular, while still taking data a couple of years ago, I would have loved being able to commit each day's data taking the same way I commit code. Having things timestamped and backed up, with any notes that came up that day going straight into the commit message, would have been very nice.
Regarding mounting I don't have any specific needs there anymore. Just thinking about how other researchers would be able to clone the repo to access the data.
First, it's all open-source, so I can take it and run it. Second, you provide a hosted service, and by virtue of being the author, you're the default SaaS host. You charge a premium over AWS fees for self-hosting, which works out to:
1. Enough to sustain you.
2. Less than the cost of doing dev-ops myself (AWS fees + engineer).
3. A small premium over potential cut-rate competitors.
You offer value-added premium services too. Whether that's economically viable, I don't know.
Is the Merkle tree used because it brings something other than deduplication, like chunk integrity verification or something like that?
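For reference, here's the generic property I mean, as a sketch (nothing Xet-specific; the chunk values are made up): a single root hash commits to every chunk, so corruption anywhere is detectable without re-downloading the whole file.

```python
# Generic Merkle-root construction over chunk hashes: the root commits to every
# chunk, so any corrupted chunk changes the root and can be detected cheaply.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(chunks: list) -> bytes:
    level = [h(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

chunks = [b"chunk-0", b"chunk-1", b"chunk-2", b"chunk-3"]
root = merkle_root(chunks)
assert merkle_root(chunks) == root
assert merkle_root([b"chunk-0", b"CORRUPT", b"chunk-2", b"chunk-3"]) != root
```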
And the datalad project has neuro imaging repos that are tens of TB in size.
Consider whether you actually need to track differences in all of your files. Honestly git-annex is one of the most powerful tools I have ever used. You can use git for tracking changes in text, but use a different system for tracking binaries.
I love how satisfying it is to be able to store the index for hundreds of gigs of files on a floppy disk if I wanted to.