I wish checksumming filesystems had some interface to expose the internal checksums. Maybe it wouldn't be useful for rsync though as filesystems should have the freedom to pick the best algorithm (so filesystem checksum of a file on different machines would be allowed to differ e.g. if filesystem block size was different). But so that e.g. git and build systems could use it to tell 'these 2 files under a same directory tree are definitely identical'.
Sorry to break it to you, but that's not about luck. You've asked for something that is nonsense if the goal is to "recycle" the compute already spent checksumming records.
If you want them to store the checksum of the POSIX object as an attribute (we can argue about performance later) great, but using the checksums intrinsic to the zfs technology to avoid bitflips directly is a bad call.
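As a rough sketch of that "checksum as an attribute" idea, assuming a Linux filesystem with user xattrs enabled (the attribute name below is made up, not any standard):

```python
import hashlib
import os

XATTR_NAME = "user.sha256"  # hypothetical attribute name, not a standard

def file_sha256(path):
    """Hash the whole file in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def store_checksum(path):
    """Stash the file's digest in a user xattr (Linux only)."""
    digest = file_sha256(path)
    os.setxattr(path, XATTR_NAME, digest.encode())
    return digest

def verify_checksum(path):
    """Recompute and compare against the stored xattr, if present."""
    try:
        stored = os.getxattr(path, XATTR_NAME).decode()
    except OSError:
        return None  # no stored checksum
    return stored == file_sha256(path)
```

The obvious catch is that the attribute goes stale the moment the file changes, which is exactly the bookkeeping the filesystem-native checksums handle for you.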
If they expose it, that ties them to a particular hash algo. Hash algos are as much art as science, and the opportunities for hardware acceleration vary from chip to chip, so maintaining the leeway to move from one algo to another is kind of important.
Maybe NixOS spoiled me, but I would not copy system files at all. The only data worth backing up is app state and your own files, usually under /home or /var/lib/whatever.
Yeah. But the system files add so little to the backup size that I just back up everything except tmpfs and the XDG cache anyway. I also reduce backup size by simply not backing up ~/{Pictures,Videos}. I manually put those in the Google Photos account my phone backs up to, since they very rarely change.
There are a lot of reasons why just making a copy of the files you need to another FS is not sufficient as a backup, clearly this is one of those. We need more checks to ensure integrity and robustness.
Once you enable checksums with rsync, doesn't Borg have the same issue? I believe Borg then needs to do the same rolling checksum over all the data as well?
ZFS sounds like the better option -- just take the last local snapshot transaction, then compare to the transaction of the last sent snapshot, and send everything in between.
And the problem re: Borg and rsync isn't just the cost of reading back and checksumming the data -- for 100,000s of small files (1000s of home directories on spinning rust), it is the speed of those many metadata ops too.
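A rough sketch of that incremental-send approach (dataset, snapshot, and host names below are made up; error handling kept minimal):

```python
import subprocess

LAST_SENT = "tank/home@backup-2024-01-01"   # last snapshot already on the backup host
LATEST = "tank/home@backup-2024-01-08"      # newest local snapshot
REMOTE = "backuphost"
REMOTE_DATASET = "backup/home"

def send_incremental():
    """Pipe `zfs send -i <last-sent> <latest>` into `zfs receive` on the remote."""
    send = subprocess.Popen(
        ["zfs", "send", "-i", LAST_SENT, LATEST],
        stdout=subprocess.PIPE,
    )
    recv = subprocess.run(
        ["ssh", REMOTE, "zfs", "receive", "-F", REMOTE_DATASET],
        stdin=send.stdout,
    )
    send.stdout.close()
    send.wait()
    if send.returncode or recv.returncode:
        raise RuntimeError("incremental zfs send/receive failed")
```

ZFS only has to walk the blocks that changed between the two snapshots, rather than re-reading and checksumming every file.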
As with rsync, Borg does not read files whose timestamp/length haven't changed since the last backup. And for a million files on a modern SSD it takes just a few seconds to read their metadata.
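For illustration, a minimal sketch of that metadata-only quick check, in the spirit of what rsync/Borg do before deciding to read a file (not their actual code):

```python
import os

def changed_since(root, previous):
    """Yield paths whose size or mtime differs from a previous scan.

    `previous` maps path -> (size, mtime_ns) recorded on the last run;
    only paths that don't match need to be read and checksummed.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path, follow_symlinks=False)
            if previous.get(path) != (st.st_size, st.st_mtime_ns):
                yield path
```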
No tool's defaults are always enough. Otherwise they'd be fixed settings not changeable defaults.
> "ooh, bit rot" and other things where one of the files has actually become corrupted while "at rest" for whatever reason. Those observers are right!
Yep. This is why you verify your backups occasionally. And perhaps your local “static” resources too depending on your accident/attack threat models and “time to recovery” requirements in the event of something going wrong.
> the first time you do a forced-checksum run, --dry-run will let you see the changes before it blows anything away, so you can make the call as to which version is the right one!
That reads like someone is not doing snapshot backups, or at least not doing backups in a way that protects past snapshots from being molested by the systems being backed up. This is a mistake, and not one rsync or any other tool can reliably save you from.
But yes, --dry-run is a good idea before any config change, or just generally in combination with a checksum-based run as part of a sanity-check procedure (though as rsync is my backup tool, I prefer to verify my backups with a different method: checksums generated by another tool, or direct comparison between snapshots and current by another tool, just in case a bug in rsync causes an issue that verification using rsync cannot detect because the bug affects said verification in the same way).
> Author doesn’t explain what happened or why the proposed flags will solve the problem.
Probably because she/he doesn't know. Could be lots of things, because FYI mtime can be modified by the user. Go `touch` a file.
In all likelihood, it happens because of a package installation, where the package install sets the same mtime on a file which has the same size but different contents. That's where I usually see it.
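That failure mode is easy to reproduce by hand; a small demo (file names and contents are arbitrary):

```python
import os

# Two files, same length, different bytes.
with open("a.bin", "wb") as f:
    f.write(b"hello world!")
with open("b.bin", "wb") as f:
    f.write(b"HELLO WORLD!")

# Force identical mtimes, much like a package manager preserving timestamps.
st = os.stat("a.bin")
os.utime("b.bin", ns=(st.st_atime_ns, st.st_mtime_ns))

sa, sb = os.stat("a.bin"), os.stat("b.bin")
# rsync's default quick check looks at exactly these two fields,
# so it would consider the files already in sync.
print(sa.st_size == sb.st_size and sa.st_mtime_ns == sb.st_mtime_ns)  # True
```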
`httm` allows one to dedup snapshot versions by size, then hash the contents of identically sized versions, for this very reason.
--dedup-by[=<DEDUP_BY>] comparing file versions solely on the basis of size and modify time (the default "metadata" behavior) may return what appear to be "false positives". This is because metadata, specifically modify time and size, is not a precise measure of whether a file has actually changed. A program might overwrite a file with the same contents, and/or a user can simply update the modify time via 'touch'. When specified with the "contents" option, httm compares the actual file contents of same-sized file versions, overriding the default "metadata" only behavior...
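The general pattern described there (group by size first, hash only the same-sized candidates) looks roughly like this; a sketch of the idea, not httm's actual implementation:

```python
import hashlib
import os
from collections import defaultdict

def file_sha256(path):
    """Hash the whole file in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dedup_by_contents(paths):
    """Keep one path per distinct content; only same-sized files get hashed."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    unique = []
    for group in by_size.values():
        if len(group) == 1:
            unique.extend(group)  # unique size, no need to read the file at all
            continue
        seen = {}
        for p in group:
            seen.setdefault(file_sha256(p), p)
        unique.extend(seen.values())
    return unique
```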
It is worth noting that rsync doesn't compare just by size and mtime but also by (relative) path - i.e. it normally compares an old copy of a file with the current version of the same file. So the likelihood of "collisions" is much smaller than with a file de-duplicating tool that compares random files.
there is also some weirdness, or used to be some weirdness, with linux and modifying shared libraries. for example, if a process is using a shared library and the contents of the file are modified in place (same inode), what behaviour is expected? i think there are two main problems
1) pages from the shared library are lazily loaded into memory, so if you access a not-yet-loaded page you are going to get it from the new binary, which is likely to cause problems
2) pages from the shared library might be 'swapped' back to disk due to memory pressure. not sure whether the pager will just throw the page away and fault it back in from the new file contents, or if it will notice the on-disk page has changed and use swap for write-back to preserve the original page.
also, i remember it used to be possible to trigger some error if you tried to open a shared library for writing while it was in use but I can't seem to trigger that error anymore.
Author explained that, by default, if two files are the same size and have the same modification date/time, rsync will assume they're identical, WITHOUT CHECKING THAT.
Author clarifies there are flags to change that behaviour, to make rsync actually compare file contents, and then names those flags.
Short of checking every single byte against each other, you need to do some sort of shorthand.
“Assume these two files are the same unless either system says it has modified them, or the size has changed, and call it a day” is a pretty fair assumption, and something even I knew rsync was doing, and I’ve only used it once in a project 10 years ago. I am sure Rachel also knows this.
So, what is the problem? Is data not being synced? Is data being synced too often? And why do these assumptions lead to either happening? What horrors is the author expecting the reader to see when running the suggested command?
They also have to have the same name though. The actual chances of this situation happening and persisting long enough to matter are pretty damn small.
Maybe someone else will have better luck than me.
TIL that btrfs's checksums are per block, not per file. There's a dump-csum command, but doesn't seem likely to be very useful. https://unix.stackexchange.com/questions/191754/how-do-i-vie...
I find "-apHAX" a sane default for most use and memorable enough. (think wardriving)
Very common contextuals:
-c (as mentioned, when you care about integrity)
-n (dryrun)
-v (verbose)
-z (compress when remoting)
Where it applies I usually do the actual syncing without -c and then follow up with -cnv to see if anything is off.
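In script form that workflow might look like this (source and destination are placeholders; the flags are the ones listed above):

```python
import subprocess

SRC = "/home/"           # placeholder source
DST = "backup:/home/"    # placeholder destination

# Normal sync: rsync's default quick check (size + mtime) decides what to copy.
subprocess.run(["rsync", "-apHAX", SRC, DST], check=True)

# Follow-up verification: checksum everything, but dry-run and verbose,
# so mismatches are reported rather than silently rewritten.
subprocess.run(["rsync", "-apHAX", "-cnv", SRC, DST], check=True)
```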
Or, since installing is such a pain, perhaps it's better to consider everything user files ;) ;)
BorgBackup is clearly quite good as an option.
It seems like you didn't read the article.
That is what is not explained in the article.
This should never be a surprise to people unless this is their first time using Unix.