Content-defined chunking is one of my favorite algorithms because it has some of the "magic" of HyperLogLog, Bloom filters, etc. It's a good algorithm to explain to people to get them inspired by computer science. I usually explain the simplest variant, with rolling hashes.
It would be interesting to see the result (average savings from deduplication) if it were applied globally to a large-scale blob storage system such as Amazon S3 or Google Drive (it requires metadata storage for the chunks, and the chunks can be deduplicated).
PS. I don't use this algorithm in ClickHouse, but it always remains tempting.
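The "simplest variant with rolling hashes" can be sketched in a few lines of Python. This is an illustrative toy, not code from ClickHouse or any real system; the gear table, mask, and chunk sizes are arbitrary choices:

```python
import random

# One random 64-bit value per possible byte value (a "gear" table).
_rnd = random.Random(42)
GEAR = [_rnd.getrandbits(64) for _ in range(256)]

MASK = (1 << 13) - 1  # boundary when low 13 bits are zero -> ~8 KiB average chunks

def chunks(data: bytes):
    """Yield chunks whose boundaries depend only on the local content."""
    h, start = 0, 0
    for i, b in enumerate(data):
        # Rolling update: each byte's contribution shifts left once per step,
        # so the low 13 bits depend only on the last 13 bytes seen.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        if h & MASK == 0:  # the "magic pattern": cut a chunk here
            yield data[start : i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]  # trailing chunk
```

Because the cut decision looks only at nearby bytes, inserting data early in a file shifts the first few chunks but leaves later chunk contents (and hence their hashes) unchanged, which is what makes deduplication work.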
Do you have a suggestion on what to read on the topic since then?
I don't keep up with these things. A quick search turned up the following, but I haven't read it yet.
Fan Ni and Song Jiang, "RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems", in Proceedings of the 2019 ACM Symposium on Cloud Computing (ACM SoCC '19), Santa Cruz, CA, November 2019.
https://ranger.uta.edu/~sjiang/index.html
> It would be interesting to see the result (average savings from deduplication) if it were applied globally to a large-scale blob storage system such as Amazon S3 or Google Drive (it requires metadata storage for the chunks, and the chunks can be deduplicated).
Yes, this is truly promising, but beware of dragons. Under current legal doctrine, blobs need some form of chain of custody. You can’t just deliver chunks to whoever has a hash (unless you’re decentralized and can move this problem to your users). Why? Because this is how BitTorrent works, and we all know the legal dangers there. Encryption helps against eavesdropping, but not against an adversary who already has the hash and simply wants to prove you are distributing pirated material or even CSAM. You may be able to circumvent this and shift blame back onto the user in some cases. For instance, if you are re-syncing dangerous goods that you initially uploaded over Dropbox, then Dropbox can probably blame you, even though they are technically the one distributing. But that requires Dropbox to be reasonably confident that “you” (i.e. the same legal entity) had those chunks in the first place.
That's an interesting extension of the illegal-numbers or coloured-bits theories, but we don't really see it used that way in practice. When governments or media industry groups crack down on this stuff, they don't go after everybody who ever had those bits in memory. Maybe that's just for practical reasons, but we've never seen every router between a buyer and a seller get confiscated as somehow tainted. Honestly, this doesn't seem like more than a dystopian mental exercise.
https://en.wikipedia.org/wiki/Illegal_number
https://ansuz.sooke.bc.ca/entry/23
https://shkspr.mobi/blog/2022/11/illegal-hashes/
Isn't the main use of BitTorrent for ML and research data? Academic Torrents is a wonderful resource, and it's what every developer should be using if they need to distribute their neural network weights, training data, etc.
How is there any legal problem with using BitTorrent? It's simply much better tailored to this problem than HTTP. It doesn't make sense to talk about 'legal problems' of the torrent protocol.
If you’re distributing CSAM on your blob storage, and someone lets you know, you should probably remove it. This is independent of whether you distribute chunks or the whole file.
15 years ago, when I built a deduplicating file storage system, rolling hash was on the table during design, but there were some patents on it. We ended up using fixed-size chunking, which worked less well but still gave incredible storage savings.
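A minimal sketch of that fixed-size approach (hypothetical parameters, not the actual system described above) shows both why it worked and why it worked less well than CDC:

```python
import hashlib
import random

BLOCK = 4096

def block_hashes(data: bytes):
    """One hash per fixed-size block; blocks with equal hashes are stored once."""
    return [hashlib.sha256(data[i : i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

rnd = random.Random(0)
original = bytes(rnd.randrange(256) for _ in range(8 * BLOCK))
resaved = original        # an unchanged re-upload deduplicates perfectly...
edited = b"!" + original  # ...but one inserted byte shifts every block

full_dedup = block_hashes(original) == block_hashes(resaved)  # True
shared = set(block_hashes(original)) & set(block_hashes(edited))  # empty set
```

Identical files still dedup completely, which is where the big savings come from; content-defined boundaries additionally survive the shifted-insert case, which fixed-size blocks cannot.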
Hah! I also built a similar storage system, optimized for whole disk images, for work, around 2007! I used fixed-size chunks as well. I called it "data block coalescion"; having never heard of anyone else doing it, we figured I had invented it, and we were granted the patent(!). I used it to cram disk images for, I think, 6 different fresh-install configurations onto a single DVD. :D
Later on I used it and VMware to build a proof of concept for a future model of computing where data was never lost. (It would snapshot the VM and add it to the storage system daily or hourly. Look ma, infinite undo at a system level!)
The next version of the algorithm was going to use what I now know is essentially a rolling hash (I called it "self-delimiting data blocks"), but then the company went under.
> but there were some patents on it.
The patent system is quite silly and internally inconsistent. I'm older now, and I suspect someone thought of saving disk space by coalescing identical blocks before I did in 06/07, but not according to the USPTO!
I take heart reading about valuable parts being recovered from such expensive ashes.
But I firmly believe that no project is wasted, especially one that pushed the boundaries of what is commonly done. All the people working there learnt something. They might apply their newfound knowledge to other fields or future careers. The money invested is not lost but converted into bigger brains.
I look at VC money in much the same way. It's not great when a startup fails. But a lot is learnt.
1. I thought this was going to be something about the Centers for Disease Control
2. This is a really neat project, and super expensive ashes to rise from Stadia.
3. This is targeted at Windows to Linux, but given the speed advantages that it has over rsync couldn’t this be Linux to Linux? That said, I’d be afraid to use this over rsync just from the body of experience and knowledge that exists about rsync.
Slightly OT, but I like the schematic gifs used in the Readme.md (pretty amazing doc overall!) like this one [0]. Does anyone have suggestions what tools they might have used (or might be used in general) to create those?
We used an internal Chrome extension to capture the gifs. There also seem to be some externally available GIF capturing tools, like https://chrome.google.com/webstore/detail/chrome-capture-scr... (disclaimer: I haven't tried those). We then used https://ezgif.com/optimize to remove dupe/unwanted frames, tweak timings and compress the gif. The actual content was done in Google Slides.
This might be a dumb question.. but when I read this
"scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression."
my thinking was: If you want to send diffs.. why not just use git?
It does compression and chunks and all that. Maybe the defaults perform poorly for binary files? But couldn't you fix that with a custom diff algo? (if it's somehow not appropriate, it'd still be nice to port whatever secret sauce they use here to git..)
They are separate problems and not really comparable, git is a higher level tool. The diffing in git is for humans to compare code, and as such it is actually computed on the fly and not stored anywhere (unless you use git patch). Similarly, git does not provide any mechanism for sending/receiving files, it re-uses existing solutions like HTTP and SSH.
This tool exists to send/receive files, and the diffing is an implementation detail used to achieve a high level of performance. It would make more sense for git to use this library under the hood as you mentioned.
I'm sort of in the same boat, but with the sentence,
> To help this situation, we developed two tools, cdc_rsync
Why not use rsync?
(It does seem they ended up faster than rsync, so perhaps that's "why", but that seems more like a post-hoc justification.)
The "we use variable chunk windows" bit is intriguing, but the example GIF sort of just pre-supposes that the local & remote chunks match up. That could have happened in the rsync case/GIF, but that wasn't the case considered, so it's an oranges/apples comparison. (Or, how is it that the local manages to be clairvoyant enough to choose the same windows?)
>> Or, how is it that the local manages to be clairvoyant enough to choose the same windows?
Yes! In content defined chunking, the chunk boundaries depend on the content, in our case a 64 byte window. If the local and the remote files have the same 64 byte sequence anywhere, and that 64 byte sequence has some magic pattern of 0s and 1s, they will both have chunk boundaries there. A chunk is the range of data between two chunk boundaries, so if N consecutive chunk boundaries match, then N-1 consecutive chunks match.
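A toy demonstration of that property (the window size matches the 64 bytes mentioned above, but the hash and mask are arbitrary stand-ins): a file that shares a long region with another ends up sharing almost all of its chunk hashes, even though an insertion shifted every byte offset.

```python
import hashlib
import random

W = 64          # boundary decision looks at a 64-byte window, as above
MASK = 0x3FF    # boundary when the low 10 bits of the window hash are zero

def chunk_hashes(data: bytes) -> set:
    """Split at content-defined boundaries and return the set of chunk hashes."""
    out, start = set(), 0
    for i in range(W, len(data) + 1):
        h = int.from_bytes(hashlib.sha256(data[i - W : i]).digest()[:4], "big")
        if h & MASK == 0:  # same 64 bytes anywhere -> same boundary decision
            out.add(hashlib.sha256(data[start:i]).hexdigest())
            start = i
    if start < len(data):
        out.add(hashlib.sha256(data[start:]).hexdigest())
    return out

rnd = random.Random(1)
remote = bytes(rnd.randrange(256) for _ in range(50_000))
local = b"edited prefix!" + remote  # same content, shifted by an insertion

shared = chunk_hashes(remote) & chunk_hashes(local)
# Only the chunks touching the edit differ: boundaries re-align as soon
# as one unchanged 64-byte window appears after the insertion.
```

This is the clairvoyance from the question above: neither side chooses windows; the content itself does, so matching content forces matching boundaries.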
I was wondering the same thing. It isn't explicit in the write-up but it's because rsync does not have a native Windows implementation, only rsync under Cygwin, so they developed this to achieve the same thing except it turned out faster than rsync.
> However, this was impractical, especially with the shift to working from home during the pandemic with sub-par internet connections. scp always copies full files, there is no "delta mode" to copy only the things that changed, it is slow for many small files, and there is no fast compression.
I didn't appreciate the scope of this problem until a friend of mine visited from the valley. I live in semirural Canada, and they were floored by the speed and low latency of the fibre connection I have.
I sort of took it for granted that people would spend the extra twenty bucks or so to make work from home a painless experience, or at least ask their employer to fund a better connection.
It caused me to reach out, and I found many WFH peers with terrible connections, low-powered laptops, and hardly any second monitors. And proper desks? The exception.
My family is mostly in trades, and so spending a little cash to improve my tools felt like common sense. Apparently it isn't.
You're not wrong. It's amazing how poor access can be.
I live in a tiny town on Vancouver Island and I have gigabit symmetric, for a very reasonable price. When I lived in Vancouver that simply wasn't an option.
That's wild. My parents, in a rural French town of ~1,000 people, get 2 Gbit fiber; the next average-sized town (60k inhabitants) is 30 km away, and the next big city (Toulouse, pop. 460k) is 120 km away.
Yep! I lived all over the Bay Area for the last 20 years and I never had the option for fiber until I moved to Oakland two months ago. It’s only $40 a month for symmetric gigabit fiber. But before I simply had no option for it.
You don't appreciate the complete lack of incentive for companies to roll out truly high-speed internet in the U.S. There is minimal government backing, little competition, most people don't know any better, and at the end of the day those companies would still have to charge the same amount per month as they do now.
I spend part of my time in a rural area. Only in the last couple years was fiber even available near my house, and then it's a matter of $300/mo on a 3-year contract for 250mbps symmetric, plus the cost of running last-mile from the road to my house.
Thankfully, Starlink came out, and I was able to get like 50/20 for $110/mo on a month-to-month + $500 fixed.
I thought of the Cult of the Dead Cow: a tool for transferring t-files.
The README walks through, in order:
- A 1-2 sentence summary of the project
- History (why we built this and didn't just use existing tools)
- What we've built, including what it does and how it compares to other tools (rsync in this case).
- How to install and use it
And the whole thing is full of images and animations showing the tool in action, and explaining enough of its internals.
I consider myself good at documentation, but I'm taking notes. This is excellent.
https://github.com/google/cdc-file-transfer/blob/main/fastcd...
[0] https://github.com/google/cdc-file-transfer/blob/main/docs/l...
Damn..
Then I guess even using this under the hood would essentially mean rewriting how git fundamentally works internally.
I don't know whether git sends or receives deltas, though. I think it does?
https://github.com/mkmik/imsy
It's less sophisticated but it uses the same core idea and the implementation is super simple
I think you took for granted the mere availability of fiber as an option.
30Mbps DSL is the best option I have, other than Starlink. And I live in San Jose!
My family lives in old South Seattle neighborhoods and has excellent fiber.
I had something like 100 Mbps coaxial 21 years ago in West Seattle.
What is the problem with San Jose?
I would pay $200/month extra for fiber.