Readit News
ljusten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/cglong
xearl · 3 years ago
Is there any chance to get this faster sync algorithm into rsync itself?
ljusten · 3 years ago
ljusten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/cglong
worldsavior · 3 years ago
But why not make it cross-platform? If you're already creating a file transfer program, make it cross-platform. What are the complications?
ljusten · 3 years ago
Short answer: we didn't need it. While the code is largely cross-platform, there is some work involved when it gets down to the details.

We are currently working on supporting Windows to Windows. Linux to Linux has lower priority, as rsync already provides all the functionality; it's just a bit slower on fast connections. On slow connections, rsync and cdc_rsync perform very similarly, as the sync speed is dominated by the network.

ljusten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/cglong
greatgib · 3 years ago
I don't understand: you can have rsync use an SSH tunnel directly. Easily. Isn't that enough?
ljusten · 3 years ago
Note that cdc_rsync runs on Windows and syncs to Linux. rsync is a Linux-only tool where you'd have to jump through some hoops to make it work on Windows.
ljusten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/cglong
deanCommie · 3 years ago
> At Stadia, game developers had access to Linux cloud instances to run games. Most developers wrote their games on Windows, though. Therefore, they needed a way to make them available on the remote Linux instance.

Am I reading this right that onboarding your game to Stadia as a developer involved essentially rsyncing data directly to a Linux cloud instance?

That's.....

ljusten · 3 years ago
Just the compiled and baked game, not your sources. You still developed on Windows or wherever, the cloud instance was just used for running the game.
ljusten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/cglong
skanga · 3 years ago
Thanks! Does it work Windows to Windows also?
ljusten · 3 years ago
Windows to Windows is being worked on, see https://github.com/google/cdc-file-transfer/compare/main...s....

Linux to Linux is also an option if there is demand, but currently it's Windows to Linux only.

ljusten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/cglong
RyanShook · 3 years ago
This looks like a very useful tool with a wide range of applications. Is it Windows to Linux only? It would be so nice if it was system agnostic.
ljusten · 3 years ago
We're currently adding support for Windows to Windows to cdc_rsync. If there is demand, Linux to Linux would also be possible.
ljusten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/cglong
AnonC · 3 years ago
I skimmed through the readme, which explains the concepts quite well, but am unclear on what needs to be installed on each machine (assuming Windows as the source and Linux as the destination). There’s a mention of copying the Linux build output, cdc_rsync_server, to the Windows machine. Why is this needed? And is there something on the Linux machine that needs to be (newly) added in the PATH?
ljusten · 3 years ago
Just uncompress the binaries on the Windows machine and run cdc_rsync. The Linux component, cdc_rsync_server, is deployed automatically on first run. It is scp'ed to ~/.cache/cdc-file-transfer/bin. So nothing has to be installed on the Linux machine.
ljusten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/cglong
dark-star · 3 years ago
Comparing it with rsync running on Cygwin is a bit unfair, as Cygwin is known to be terribly inefficient. I don't doubt that their CDC-based algorithm is faster, but probably not by the margin they claim if Cygwin is taken out of the equation
ljusten · 3 years ago
I should also note that we used a fairly fast 100 MB/sec connection to upload the data, so the rsync diffing algorithm running at 50 MB/sec is actually a bottleneck. The difference would be smaller on a slower connection, where the network overhead would dominate the results.
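To put rough numbers on that, here's a toy pipeline model: the diffing stage and the network run concurrently, so end-to-end throughput is limited by the slower of the two. The function name is mine, and the model deliberately ignores that diffing shrinks the bytes actually put on the wire:

```cpp
#include <algorithm>

// Toy pipeline model: sync throughput is bounded by the slowest stage,
// either the diffing pass or the network. Simplification: we ignore that
// the diff reduces how many bytes actually cross the network.
double SyncSeconds(double gigabytes, double diff_mb_per_sec,
                   double net_mb_per_sec) {
  double bottleneck_mb_per_sec = std::min(diff_mb_per_sec, net_mb_per_sec);
  return gigabytes * 1024.0 / bottleneck_mb_per_sec;
}
```

With a 100 MB/sec link, a 50 MB/sec diff pass is the bottleneck; with a 10 MB/sec link, the network dominates and the diff speed no longer matters, which is why the two tools converge on slow connections.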
ljusten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/cglong
dark-star · 3 years ago
Comparing it with rsync running on Cygwin is a bit unfair, as Cygwin is known to be terribly inefficient. I don't doubt that their CDC-based algorithm is faster, but probably not by the margin they claim if Cygwin is taken out of the equation
ljusten · 3 years ago
IIUC, rsync computes a relatively expensive Rabin-Karp rolling hash (https://librsync.github.io/rabinkarp_8c_source.html) and performs a hash map lookup for every byte. Hash map lookups might not be very cache friendly for larger data sets. In comparison, cdc_rsync only computes

  hash = (hash << 1) + random_table[data[n]];
  bool chunk_boundary = (hash & magic_pattern) == 0;
per byte. That's only a few ops and very cache friendly. The random table only has 256 entries, 8 bytes each, so it easily fits into L1.

ljusten commented on CDC File Transfer   github.com/google/cdc-fil... · Posted by u/cglong
eps · 3 years ago
The gotcha of "inserting or deleting a few bytes" is not in detection, it's in replicating this discovery to the target copy.

Say we have a 1 GB file and we detect an extra byte at the head of our local copy. Great, what next? We can't replicate this on the receiving end without recopying the file, which is exactly what happens: rsync recreates the target file from pieces of its old copy and the differences received from the source. Every byte is copied; it's just that some of them are copied locally.

In that light, sync tools that operate with fixed-size blocks have one very big advantage - they allow updating target files in-place and limiting per-sync IO to writes of modified blocks only. This works exceptionally well for DBs, VMs, VHDs, file system containers, etc. It doesn't work well for archives (tars, zips), compressed images (jpgs, resource packs in games) and huge executables.

In other words - know your tools and know your data. Then match them appropriately.

ljusten · 3 years ago
For reference, here's the paper that describes the in-place update algo in rsync: https://www.usenix.org/legacy/events/usenix03/tech/freenix03.... I haven't looked into it more deeply, but I think it's possible to apply the same idea to variable sized chunks.

Also, most modern compression tools have an "rsyncable" option that makes the archives play more nicely with rsync.
