Posted by u/reverius42 3 years ago
Show HN: We scaled Git to support 1 TB repos (xethub.com/user/login...)
I’ve been in the MLOps space for ~10 years, and data is still the hardest open problem. Code is versioned using Git, data is stored somewhere else, and context often lives in a third location like Slack or GDocs. This is why we built XetHub, a platform that enables teams to treat data like code, using Git.

Unlike Git LFS, we don’t just store the files. We use content-defined chunking and Merkle Trees to dedupe against everything in history. This allows small changes in large files to be stored compactly. Read more here: https://xethub.com/assets/docs/how-xet-deduplication-works
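
To make that concrete, here is a toy sketch of content-defined chunking (not our actual algorithm; see the doc above for the real details). Because chunk boundaries depend only on nearby bytes, an edit in the middle of a large file only produces a handful of new chunks, and everything else dedupes against what is already stored:

  import hashlib, os

  def chunks(data, window=48, mask=0x3FF):
      # Toy content-defined chunker: cut wherever a rolling sum over the
      # last `window` bytes matches a bit pattern (~1 KB average chunks here;
      # real systems use stronger rolling hashes and larger target chunks).
      start, rolling = 0, 0
      for i, b in enumerate(data):
          rolling += b
          if i >= window:
              rolling -= data[i - window]
          if (rolling & mask) == mask or i == len(data) - 1:
              yield data[start:i + 1]
              start = i + 1

  def chunk_hashes(data):
      return {hashlib.sha256(c).hexdigest() for c in chunks(data)}

  v1 = os.urandom(200_000)
  v2 = v1[:100_000] + b"a small edit" + v1[100_000:]   # edit mid-file
  h1, h2 = chunk_hashes(v1), chunk_hashes(v2)
  print(f"new chunks to store for v2: {len(h2 - h1)} of {len(h2)}")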

Today, XetHub works for 1 TB repositories, and we plan to scale to 100 TB in the next year. Our implementation is in Rust (client, cache, and storage) and our web application is written in Go. XetHub includes a GitHub-like web interface that provides automatic CSV summaries and allows custom visualizations using Vega. Even at 1 TB, we know downloading an entire repository is painful, so we built git-xet mount, which provides a user-mode filesystem view over the repo in seconds.

XetHub is available today (Linux & Mac today, Windows coming soon) and we would love your feedback!

Read more here:

- https://xetdata.com/blog/2022/10/15/why-xetdata

- https://xetdata.com/blog/2022/12/13/introducing-xethub

jrockway · 3 years ago
There are a couple of other contenders in this space. DVC (https://dvc.org/) seems most similar.

If you're interested in something you can self-host... I work on Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't have a Git-like interface, but also implements data versioning. Our approach de-duplicates between files (even very small files), and our storage algorithm doesn't create a number of objects proportional to directory nesting depth, as Xet appears to (Xet is very much like Git in that respect).

The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.

Xet's system for mounting a remote repo as a filesystem is a good idea. We do that too :)

ylow · 3 years ago
We have found pointer files to be surprisingly efficient as long as you don't have to actually materialize those files. (Git's internals are actually very well done.) Our mount mechanism avoids materializing pointer files, which makes it pretty fast even for repos with a very large number of files.
unqueued · 3 years ago
For bigger annex repos with lots of pointer files, I just disable the git-annex smudge filters. Consider whether smudge filters are a requirement or a convenience. The smudge filter interface does not scale well at all.
ylow · 3 years ago
By the way, our mount mechanism has one very interesting novelty. It does not depend on a FUSE driver on Mac :-)
jrockway · 3 years ago
That's smart! I think users have to install a kext still?
ilyt · 3 years ago
> The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.

...isn't that just parsing git diff --name-only A..B though? "Process only files that changed since the last commit" is an extremely simple problem to solve.
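
Something like this is the whole naive version (a rough sketch with a hypothetical process_file function and a hypothetical "last-processed" tag; everything hard lives outside this loop):

  import subprocess

  def changed_files(old_rev, new_rev):
      # Ask git which paths differ between the two commits.
      out = subprocess.run(
          ["git", "diff", "--name-only", old_rev, new_rev],
          capture_output=True, text=True, check=True,
      ).stdout
      return [p for p in out.splitlines() if p]

  def process_file(path):
      # Stand-in for the real (expensive) per-file job.
      print(f"reprocessing {path}")

  # Reprocess only what changed since the last successful run,
  # assuming that run is marked with a "last-processed" tag.
  for path in changed_files("last-processed", "HEAD"):
      process_file(path)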

jrockway · 3 years ago
Yeah, that's the basics of it. Just make sure all the output is atomic, scale up the workers, handle inputs that are joins, retry when the workers get rescheduled, etc.
chubot · 3 years ago
Is DVC useful/efficient at storing container images (Docker)? As far as I remember they are just compressed tar files. Does the compression defeat its chunking / differential compression?

How about cleaning up old versions?

persedes · 3 years ago
Wouldn't any container registry be more suitable for that task than DVC?
JZL003 · 3 years ago
I also have a lot of issues with versioning data. But look at git-annex: it's free, self-hosted, and has a very simple underlying data structure [1]. I don't even use the magic commands it has for remote data mounting/multi-device coordination; I just back up using basic S3 commands and can use rclone mounting. Very robust, open source, and useful.

[1] When you run `git annex add`, it hashes the file and moves the original into a content-addressed store under `.git/annex/objects`, much like git's own object store. It then replaces the original file with a symlink to that hashed path. The file is marked read-only, so any command in any language that tries to write to it will error (you can always `git annex unlock` to write to it). Duplicated files simply point to the same hashed location. As long as you git push normally and back up `.git/annex/objects`, you're fully version controlled, and you can share subsets of files as needed.
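
The whole mechanism is small enough to sketch in a few lines (this is an illustration of the idea only, not git-annex's actual code, key format, or directory layout):

  import hashlib, os, stat

  def annex_add(path, store=".git/annex/objects"):
      # Content-addressable key: hash of the file's bytes.
      with open(path, "rb") as f:
          key = hashlib.sha256(f.read()).hexdigest()
      dest = os.path.join(store, key)
      os.makedirs(store, exist_ok=True)
      if os.path.exists(dest):
          os.remove(path)                  # duplicate content: stored once
      else:
          os.replace(path, dest)           # move the content into the store
          os.chmod(dest, stat.S_IREAD)     # read-only, so stray writes fail
      # Leave a symlink where the file used to be.
      target = os.path.relpath(dest, os.path.dirname(path) or ".")
      os.symlink(target, path)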

kspacewalk2 · 3 years ago
Sounds like `git annex` is file-level deduplication, whereas this tool is block-level, but with some intelligent, context-specific way of defining how to split up the data (i.e. Content-Defined Chunking). For data management/versioning, that's usually a big difference.
rajatarya · 3 years ago
XetHub Co-founder here. Yes, one illustrative example of the difference is:

Imagine you have a 500MB file (lastmonth.csv) where every day 1MB is changed.

With file-based deduplication every day 500MB will be uploaded, and all clones of the repo will need to download 500MB.

With block-based deduplication, only around the 1MB that changed is uploaded and downloaded.
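
Back of the envelope, over a 30-day month that's about 30 x 500 MB = 15 GB transferred per clone with file-level deduplication, versus the initial 500 MB plus roughly 1 MB per day (around 530 MB total) with block-level deduplication, assuming the daily edits don't scatter across every chunk.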

rsync · 3 years ago
"Sounds like `git annex` is file-level deduplication, whereas this tool is block-level ..."

I am not a user of git annex but I do know that it works perfectly with an rsync.net account as a target:

https://git-annex.branchable.com/forum/making_good_use_of_my...

... which means that you could do a dumb mirror of your repo(s) - perhaps just using rsync - and then let the ZFS snapshots handle the versioning/rotation which would give you the benefits of block level diffs.

One additional benefit, beyond more efficient block-level diffs, is that the ZFS snapshots are immutable/read-only, as opposed to your git- or git-annex-produced versions, which could be destroyed by Mallory ...

unqueued · 3 years ago
No, that is not correct: git-annex supports a variety of special remotes [2], some of which support deduplication (mentioned in another comment [1]).

When you have checked something out and fetched it, it consumes space on disk, but that is true of git-lfs and most other tools like it. It does NOT consume any space in the git object files.

I regularly use a git-annex repo that contains about 60G of files, which I can use with GitHub or any git host; it uses about 6G in its annex and 1M in the actual git repo itself. I chain git-annex to an internal .bup repo, so I can keep track of file locations and benefit from dedup.

I honestly have not found anything that comes close to the versatility of git-annex.

[1]: https://news.ycombinator.com/item?id=33976418

[2]: https://git-annex.branchable.com/special_remotes/

cma · 3 years ago
If git annex stores large files uncompressed, you could use filesystem block-level deduplication in combination with it.
timbotron · 3 years ago
If you like git annex check out [datalad](http://handbook.datalad.org/en/latest/), it provides some useful wrappers around git annex oriented towards scientific computing.
timsehn · 3 years ago
Founder of DoltHub here. One of my team pointed me at this thread. Congrats on the launch. Great to see more folks tackling the data versioning problem.

Dolt hasn't come up here yet, probably because we're focused on OLTP use cases, not MLOps, but we do have some customers using Dolt as the backing store for their training data.

https://github.com/dolthub/dolt

Dolt also scales to the 1TB range and offers you full SQL query capabilities on your data and diffs.

ylow · 3 years ago
CEO/Cofounder here. Thanks! Agreed, we think data versioning is an important problem and we are at related, but opposite parts of the space. (BTW we really wanted gitfordata.com. Or perhaps we can split the domain? OLTP goes here, Unstructured data goes there :-) Shall we chat? )
V1ndaar · 3 years ago
You say you support up to 1TB repositories, but on your pricing page all I see is the free tier for up to 20GB and one for teams. The latter doesn't have a price, only a contact option, and I assume it will likely be too expensive for an individual.

As someone who'd love to put their data into a git-like system, this sounds pretty interesting. Aside from there being no tier for someone like me, who would maybe have a couple of repositories of size O(250GB), it's unclear how e.g. bandwidth would work and whether other people could simply mount and clone the full repo for free if desired, etc.

rajatarya · 3 years ago
XetHub Co-founder here. We are still trying to figure out pricing and would love to understand what sort of pricing tier would work for you.

In general, we are thinking about usage-based pricing (which would include bandwidth and storage) - what are your thoughts on that?

Also, where would you be mounting your repos from? We have local caching options that can greatly reduce the overall bandwidth needed to support data center workloads.

V1ndaar · 3 years ago
Thanks for the reply!

Generally, usage-based pricing sounds fair. In the end, for cases like mine where it's "read rarely, but should be available publicly long term", it would need to compete with pricing offered by the big cloud providers.

I'm about to leave my academic career and I'm thinking about how to make sure all my detector data will be available to other researchers in my field in the future. Aside from the obvious candidate https://zenodo.org, it's an annoying problem, as most universities I'm familiar with only archive data internally, which is hard to access for researchers from other institutions. As I don't want to rely on a single place to keep that data available, I'm looking for an additional alternative (that I'm willing to pay for out of my own pocket; it just shouldn't be a financial burden).

In particular, while still taking data a couple of years ago, I would have loved being able to commit each day's data taking in the same way as I commit code. Having things timestamped, backed up, and any notes that came up that day attached straight to the commit message would have been very nice.

Regarding mounting, I don't have any specific needs there anymore. I'm just thinking about how other researchers would be able to clone the repo to access the data.

blagie · 3 years ago
My preferences on pricing.

First, it's all open-source, so I can take it and run it myself. Second, you provide a hosted service, and by virtue of being the author, you're the default SaaS host. You charge a premium over the AWS fees I'd pay to self-host, a premium that works out to:

1. Enough to sustain you.

2. Less than the cost of doing dev-ops myself (AWS fees + engineer).

3. A small premium over potential cut-rate competitors.

You offer value-added premium services too. Whether that's economically viable, I don't know.

TacticalCoder · 3 years ago
What does a Merkle tree bring here? (Honest question.) I mean: for content-based addressing of chunks (and hence deduplication of those chunks), a regular tree works too, if I'm not mistaken. (I may be wrong, but I literally wrote a "deduper" that splits files into chunks and uses content-based addressing to dedupe them, and I just used a dumb tree.)

Is the Merkle tree used because it brings something other than deduplication, like chunk integrity verification?
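
For concreteness, the textbook construction I'm asking about: each interior node's hash is the hash of its children's hashes, so identical subtrees collapse to a single node and two roots can be compared in one step. A rough sketch (not necessarily how XetHub builds theirs):

  import hashlib

  def h(data: bytes) -> str:
      return hashlib.sha256(data).hexdigest()

  def merkle_root(leaf_hashes, fanout=2):
      # Repeatedly hash groups of child hashes until one root remains.
      level = list(leaf_hashes)
      while len(level) > 1:
          level = [h("".join(level[i:i + fanout]).encode())
                   for i in range(0, len(level), fanout)]
      return level[0]

  chunks_a = [h(bytes([i])) for i in range(8)]
  chunks_b = chunks_a[:7] + [h(b"changed")]              # one chunk differs
  print(merkle_root(chunks_a) == merkle_root(chunks_b))  # False: roots differ

The part I'm unsure about is whether those hashed interior nodes buy anything beyond dedup, e.g. verifying a single chunk from just the hashes along its path to the root.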

dandigangi · 3 years ago
One monorepo to rule them all and in the darkness pull them. - Gandalf, probably
irrational · 3 years ago
And in the darkness merge conflicts.
Izmaki · 3 years ago
If I had to "version control" a 1 TB repo - and assuming I wouldn't quit in anger - I would use a tool built for this kind of need that has been used in the industry for decades: Perforce.
mentos · 3 years ago
I work in gamedev and think Perforce is good but far from great. Would love to see someone bring some competition to the space; maybe XetHub can.
tinco · 3 years ago
So, you wouldn't consider using a new tool that someone developed to solve the same problem despite an older solution already existing? Your advice to that someone is to just use the old solution?
TylerE · 3 years ago
When the new solution involves voluntary use of git? Not just yea, but hell yes. I hate git.
ryneandal · 3 years ago
This was my thought as well. Perforce has its own issues, but is an industry standard in game dev for a reason: it can handle immense amounts of data.
Phelinofist · 3 years ago
What does immense mean in the context of game dev?
unqueued · 3 years ago
I have a 1.96 TB git repo: https://github.com/unqueued/repo.macintoshgarden.org-fileset (it is a mirror of a Macintosh abandonware site)

  git annex info .
Of course, it uses pointer files for the binary blobs that are not going to change much anyway.

And the datalad project has neuro imaging repos that are tens of TB in size.

Consider whether you actually need to track differences in all of your files. Honestly git-annex is one of the most powerful tools I have ever used. You can use git for tracking changes in text, but use a different system for tracking binaries.

I love how satisfying it is to be able to store the index for hundreds of gigs of files on a floppy disk if I wanted to.