Content-defined chunking is one of my favorite algorithms because it has some of the "magic" of HyperLogLog, Bloom filters, etc. It's a good algorithm to explain to people to get them inspired by computer science. I usually explain the simplest variant, with rolling hashes.
It would be interesting to see the result (average savings from deduplication) if it were applied globally to a large-scale blob storage system such as Amazon S3 or Google Drive (it requires metadata storage for the chunks, and the chunks can be deduplicated).
PS. I don't use this algorithm in ClickHouse, but it always remains tempting.
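The "simplest variant with rolling hashes" can be sketched in a few lines of Python. This is an illustrative toy, not code from ClickHouse or any real system; the gear table, mask, and chunk sizes are arbitrary choices:

```python
import random

# One random 64-bit value per possible byte value (a "gear" table).
_rnd = random.Random(42)
GEAR = [_rnd.getrandbits(64) for _ in range(256)]

MASK = (1 << 13) - 1  # boundary when low 13 bits are zero -> ~8 KiB average chunks

def chunks(data: bytes):
    """Yield chunks whose boundaries depend only on the local content."""
    h, start = 0, 0
    for i, b in enumerate(data):
        # Rolling update: each byte's contribution shifts left once per step,
        # so the low 13 bits depend only on the last 13 bytes seen.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        if h & MASK == 0:  # the "magic pattern": cut a chunk here
            yield data[start : i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]  # trailing chunk
```

Because the cut decision looks only at nearby bytes, inserting data early in a file shifts the first few chunks but leaves later chunk contents (and hence their hashes) unchanged, which is what makes deduplication work.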
Do you have a suggestion on what to read on the topic since then?
I don't keep up with these things. A quick search turned up the following, but I haven't read it yet.
Fan Ni and Song Jiang, "RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems", in Proceedings of the 2019 ACM Symposium on Cloud Computing (ACM SoCC '19), Santa Cruz, CA, November 2019.
https://ranger.uta.edu/~sjiang/index.html
> It would be interesting to see the result (average savings from deduplication) if it were applied globally to a large-scale blob storage system such as Amazon S3 or Google Drive (it requires metadata storage for the chunks, and the chunks can be deduplicated).
Yes, this is truly promising, but beware of dragons. Under current legal doctrine, blobs need some form of chain of custody. You can’t just deliver chunks to whoever has a hash (unless you’re decentralized and can move this problem to your users). Why? Because this is how BitTorrent works, and we all know the legal dangers there. Encryption helps against eavesdropping, but not against an adversary who already has the hash and simply wants to prove you are distributing pirated material or even CSAM. You may be able to circumvent this and shift blame back onto the user in some cases. For instance, if you are re-syncing dangerous goods that you initially uploaded over Dropbox, then Dropbox can probably blame you, even though they are technically the one distributing. But that requires Dropbox to be reasonably confident that “you” (i.e. the same legal entity) had those chunks in the first place.
That's an interesting extension of the illegal-numbers or coloured-bits theories, but we don't really see it used that way in practice. When governments or media industry groups crack down on this stuff, they don't go after everybody who ever had those bits in memory. Maybe that's just for practical reasons, but we've never seen every router between a buyer and a seller get confiscated as somehow tainted. Honestly, this doesn't seem like more than a dystopian mental exercise.
https://en.wikipedia.org/wiki/Illegal_number
https://ansuz.sooke.bc.ca/entry/23
https://shkspr.mobi/blog/2022/11/illegal-hashes/
Isn't the main use of BitTorrent for ML and research data? Academic Torrents is a wonderful resource, and it's what every developer should be using if they need to distribute their neural network weights, training data, etc.
How is there any legal problem with using BitTorrent? It's simply much better tailored to this problem than HTTP. It doesn't make sense to talk about 'legal problems' of the torrent protocol.
If you’re distributing CSAM on your blob storage, and someone lets you know, you should probably remove it. This is independent of whether you distribute chunks or the whole file.
15 years ago, when I built a deduplicating file storage system, rolling hash was on the table during design, but there were some patents on it. We ended up using fixed-size chunking, which worked less well but still gave incredible storage savings.
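A minimal sketch of that fixed-size approach (hypothetical parameters, not the actual system described above) shows both why it worked and why it worked less well than CDC:

```python
import hashlib
import random

BLOCK = 4096

def block_hashes(data: bytes):
    """One hash per fixed-size block; blocks with equal hashes are stored once."""
    return [hashlib.sha256(data[i : i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

rnd = random.Random(0)
original = bytes(rnd.randrange(256) for _ in range(8 * BLOCK))
resaved = original        # an unchanged re-upload deduplicates perfectly...
edited = b"!" + original  # ...but one inserted byte shifts every block

full_dedup = block_hashes(original) == block_hashes(resaved)  # True
shared = set(block_hashes(original)) & set(block_hashes(edited))  # empty set
```

Identical files still dedup completely, which is where the big savings come from; content-defined boundaries additionally survive the shifted-insert case, which fixed-size blocks cannot.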
Hah! I also built a similar storage system, optimized for whole disk images, for work, around 2007! I used fixed-size chunks as well. I called it "data block coalescion"; having never heard of anyone else doing it, we figured I had invented it, and we were granted the patent(!). I used it to cram disk images for, I think, 6 different fresh-install configurations onto a single DVD. :D
Later on I used it and VMware to build a proof of concept for a future model of computing where data was never lost. (It would snapshot the VM and add it to the storage system daily or hourly. Look ma, infinite undo at a system level!)
The next version of the algorithm was going to use what I now know is essentially a rolling hash (I called it "self-delimiting data blocks"), but then the company went under.
> but there were some patents on it.
The patent system is quite silly and internally inconsistent. I'm older now, and I suspect someone thought of saving disk space by coalescing identical blocks before I did in 06/07, but not according to the USPTO!
I take heart reading about valuable parts being recovered from such expensive ashes.
But I firmly believe that no project is wasted, especially one that pushed the boundaries of what is commonly done. All the people working there learnt something. They might apply their newfound knowledge to other fields or future careers. The money invested is not lost but converted into bigger brains.
I look at VC money in much the same way. It's not great when a startup fails. But a lot is learnt.
1. I thought this was going to be something about the Centers for Disease Control
2. This is a really neat project, and super expensive ashes to rise from Stadia.
3. This is targeted at Windows to Linux, but given the speed advantages that it has over rsync couldn’t this be Linux to Linux? That said, I’d be afraid to use this over rsync just from the body of experience and knowledge that exists about rsync.
Slightly OT, but I like the schematic gifs used in the Readme.md (pretty amazing doc overall!) like this one [0]. Does anyone have suggestions what tools they might have used (or might be used in general) to create those?
We used an internal Chrome extension to capture the gifs. There also seem to be some externally available GIF capturing tools, like https://chrome.google.com/webstore/detail/chrome-capture-scr... (disclaimer: I haven't tried those). We then used https://ezgif.com/optimize to remove dupe/unwanted frames, tweak timings and compress the gif. The actual content was done in Google Slides.
This might be a dumb question.. but when I read this
"scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression."
my thinking was: If you want to send diffs.. why not just use git?
It does compression and chunks and all that. Maybe the defaults perform poorly for binary files? But couldn't you fix that with a custom diff algo? (if it's somehow not appropriate, it'd still be nice to port whatever secret sauce they use here to git..)
They are separate problems and not really comparable, git is a higher level tool. The diffing in git is for humans to compare code, and as such it is actually computed on the fly and not stored anywhere (unless you use git patch). Similarly, git does not provide any mechanism for sending/receiving files, it re-uses existing solutions like HTTP and SSH.
This tool exists to send/receive files, and the diffing is an implementation detail used to achieve a high level of performance. It would make more sense for git to use this library under the hood as you mentioned.
I'm sort of in the same boat, but with the sentence,
> To help this situation, we developed two tools, cdc_rsync
Why not use rsync?
(It does seem they ended up faster than rsync, so perhaps that's "why", but that seems more like a post-hoc justification.)
The "we use variable chunk windows" bit is intriguing, but the example GIF sort of just pre-supposes that the local & remote chunks match up. That could have happened in the rsync case/GIF, but that wasn't the case considered, so it's an oranges/apples comparison. (Or, how is it that the local manages to be clairvoyant enough to choose the same windows?)
>> Or, how is it that the local manages to be clairvoyant enough to choose the same windows?
Yes! In content defined chunking, the chunk boundaries depend on the content, in our case a 64 byte window. If the local and the remote files have the same 64 byte sequence anywhere, and that 64 byte sequence has some magic pattern of 0s and 1s, they will both have chunk boundaries there. A chunk is the range of data between two chunk boundaries, so if N consecutive chunk boundaries match, then N-1 consecutive chunks match.
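A toy demonstration of that property (the window size matches the 64 bytes mentioned above, but the hash and mask are arbitrary stand-ins): a file that shares a long region with another ends up sharing almost all of its chunk hashes, even though an insertion shifted every byte offset.

```python
import hashlib
import random

W = 64          # boundary decision looks at a 64-byte window, as above
MASK = 0x3FF    # boundary when the low 10 bits of the window hash are zero

def chunk_hashes(data: bytes) -> set:
    """Split at content-defined boundaries and return the set of chunk hashes."""
    out, start = set(), 0
    for i in range(W, len(data) + 1):
        h = int.from_bytes(hashlib.sha256(data[i - W : i]).digest()[:4], "big")
        if h & MASK == 0:  # same 64 bytes anywhere -> same boundary decision
            out.add(hashlib.sha256(data[start:i]).hexdigest())
            start = i
    if start < len(data):
        out.add(hashlib.sha256(data[start:]).hexdigest())
    return out

rnd = random.Random(1)
remote = bytes(rnd.randrange(256) for _ in range(50_000))
local = b"edited prefix!" + remote  # same content, shifted by an insertion

shared = chunk_hashes(remote) & chunk_hashes(local)
# Only the chunks touching the edit differ: boundaries re-align as soon
# as one unchanged 64-byte window appears after the insertion.
```

This is the clairvoyance from the question above: neither side chooses windows; the content itself does, so matching content forces matching boundaries.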
I was wondering the same thing. It isn't explicit in the write-up but it's because rsync does not have a native Windows implementation, only rsync under Cygwin, so they developed this to achieve the same thing except it turned out faster than rsync.
> However, this was impractical, especially with the shift to working from home during the pandemic with sub-par internet connections. scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression.
I didn't appreciate the scope of this problem until a friend of mine visited from the valley. I live in semirural Canada, and they were floored by the speed and low latency of the fibre connection I have.
I sort of took it for granted that people would spend the extra twenty bucks or so to make work from home a painless experience, or at least ask their employer to fund a better connection.
It caused me to reach out, and I found many WFH peers with terrible connections, low-powered laptops, and hardly any second monitors. And proper desks? The exception.
My family is mostly in trades, and so spending a little cash to improve my tools felt like common sense. Apparently it isn't.
You're not wrong. It's amazing how poor access can be.
I live in a tiny town on Vancouver Island and I have gigabit symmetric, for a very reasonable price. When I lived in Vancouver that simply wasn't an option.
That's wild. My parents, in a rural French town of ~1,000 people, get 2 Gbit fiber; the next average-sized town (60k inhabitants) is 30 km away, and the next big city (Toulouse, pop. 460k) is 120 km away.
Yep! I lived all over the Bay Area for the last 20 years and I never had the option for fiber until I moved to Oakland two months ago. It’s only $40 a month for symmetric gigabit fiber. But before I simply had no option for it.
You don't appreciate the complete lack of incentive for companies to roll out truly high-speed internet in the U.S. There is minimal government backing, little competition, most people don't know any better, and at the end of the day those companies would still have to charge the same amount per month as they do now.
I spend part of my time in a rural area. Only in the last couple years was fiber even available near my house, and then it's a matter of $300/mo on a 3-year contract for 250mbps symmetric, plus the cost of running last-mile from the road to my house.
Thankfully, Starlink came out, and I was able to get like 50/20 for $110/mo on a month-to-month + $500 fixed.
I thought of the Cult of the Dead Cow: a tool for transferring t-files.
The README walks through, in order:
- A 1-2 sentence summary of the project
- History (why we built this and didn't just use existing tools)
- What we've built, including what it does and how it compares to other tools (rsync in this case).
- How to install and use it
And the whole thing is full of images and animations showing the tool in action, and explaining enough of its internals.
I consider myself good at documentation, but I'm taking notes. This is excellent.
https://github.com/google/cdc-file-transfer/blob/main/fastcd...
[0] https://github.com/google/cdc-file-transfer/blob/main/docs/l...
Damn..
Then I guess even using this under the hood would essentially mean rewriting how git fundamentally works internally.
I don't know whether git sends or receives deltas, though. I think it does?
https://github.com/mkmik/imsy
It's less sophisticated but it uses the same core idea and the implementation is super simple
I think you took for granted the mere availability of fiber as an option.
30Mbps DSL is the best option I have, other than Starlink. And I live in San Jose!
My family lives in old South Seattle neighborhoods and has excellent fiber.
I had something like 100 Mbps coaxial 21 years ago in West Seattle.
What is the problem with San Jose?
I would pay $200/month extra for fiber.