We're talking about long-term archiving here. That means centuries.
My brother, the archaeological archivist of ancient (~2000 BCE) Mesopotamian artifacts, has a lot to say about archival formats. His raw material is mostly fired clay tablets. Those archives keep working, partially, even if broken. That's good, because many of them are in fact broken when found.
But their ordering and other metadata about where they were found is written in archaeologists' notebooks, and many of those notebooks are now over a century old. Paper deteriorates. If a lost flake of paper from a page in the notebook rendered the whole notebook useless, that would be a disastrous outcome for that archive.
A decade ago I suggested digitizing the notebooks and storing the bits on CD-ROMs. He laughed, saying "we don't know enough about the long-term viability of CD-ROMs and their readers."
Now, when they get around to it, they're digitizing the notebooks and storing them as PDFs on backed-up RAID10 volumes. But they're also printing them on acid-free paper and putting them with the originals in the vaults where they store old books.
My point: planning for centuries-long archiving is difficult. Formats with redundancy, or at least forward error correction codes, are very helpful. Formats that can be rendered useless by a few bit-flips, not so much.
I would be equally concerned about the stability of the file formats for the data stored inside the archives. Even plain ASCII text files have not been around very long - about 60 years since standardisation, but it took a while for the standard to become largely universal. And ASCII is pretty restricted in what it can represent. Note that I'm talking about plain text files, not things like Markdown which might use ASCII.
Most more complex file formats suffer from variants. Some, like Markdown and RTF, just have multiple versions. Some, like TIFF and PDF, are envelope formats, so the possible contents of the envelope change over time, introducing incompatibility. Then there is bit-rot as formats go out of use, e.g. .DOC (as opposed to .DOCX).
My own objectives are simple compared to your brother's. I want to preserve simple formatted text files until about 40 years from now, in a way that is likely to allow cut and paste. I started accumulating them about 20 years back. Note that this is before Markdown (which is in any case poor for recording formatting). LaTeX was around and seemed OK in terms of expected lifetime, but is poor for cut and paste because the rendering of a chunk of text depends on instructions which are not local to it. I settled on RTF, which carries significant long-term risk for both compatibility and availability, but is documented well enough that migrating out may be possible.
That's just formatted text. Images have been worse, particularly if you are handling meta-data such as camera characteristics, satellite orientation, etc.
I very much doubt the bit representation matters very much as long as it is simple. Even if ASCII text viewers were lost, they would be extremely simple to reimplement. That is a counterpoint to things like LaTeX, which would be hard to recreate.
Not to sound like the usual evangelist, and I am sure everyone has already thought of these points, but regarding RAID-10, is that the best call?
My thought process is that the majority of modern hardware/software RAID solutions don’t do error correction properly. And even assuming something that does actually do checksumming and such semi-properly is employed, I think if we are talking tens or hundreds of years from now, it’ll be nearly impossible to find a compatible hardware card should the ones in use die.
I’m aware it already sucks trying to build a project six months in the future, let alone six or sixty years, but perhaps something purely software-based that does care a ton about integrity, like ZFS, would be the best bet in terms of long-term compatibility of a hard-drive storage solution.
So long as the drives can still be plugged into a system, any system, and even if OpenZFS eventually drops backwards compatibility with the version used to create the pool, it’ll likely still be possible to virtualize whatever version of Linux/BSD/Illumos is compatible with that particular version of ZFS and then import the pool.
I think QR codes are actually great for this (so long as you are storing basic data like plain text that is likely to be recoverable for a long time). They have built-in error correction, and software for reading them is extremely widespread.
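For the plain-text case, a rough sketch of generating such a code with the third-party "qrcode" Python package (package name, API, and the catalog text are my assumptions; error-correction level H is the highest the standard offers, tolerating roughly 30% damage):

    import qrcode  # third-party package: pip install qrcode[pil]

    qr = qrcode.QRCode(error_correction=qrcode.constants.ERROR_CORRECT_H)  # highest level
    qr.add_data("catalog 1234: tablet fragment, square D5, stratum III")   # hypothetical entry
    qr.make(fit=True)
    qr.make_image().save("catalog-1234.png")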
However, more than just error correction, we should really try to make formats that are resilient to data corruption. For example, zip files seem much more resilient than gzipped tarballs because each file is compressed separately.
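A small stdlib-only illustration of that property: each zip member carries its own compressed stream and CRC, so members can be read or verified individually, whereas a corrupted .tar.gz is one continuous stream. Just a sketch with made-up file names:

    import io, zipfile

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i in range(3):
            zf.writestr(f"notes/page{i}.txt", f"page {i} contents\n" * 100)

    with zipfile.ZipFile(buf) as zf:
        print(zf.testzip())                     # first member with a bad CRC, or None
        print(zf.read("notes/page1.txt")[:16])  # members decompress independently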
Digital copies are really not that durable, we just sometimes confuse ease of copying with durability. This sometimes helps, but only if you can have distributed copies.
> Now, when they get around to it, they're digitizing the notebooks and storing them as PDFs on backed-up RAID10 volumes. But they're also printing them on acid-free paper and putting them with the originals in the vaults where they store old books.
I actually have a personal digitization project for some stuff I've inherited, and I'm glad to get a little validation for my strategy.
Basically my plan is to scan the documents/photos, create some kind of printed book with the most important/interesting ones and an index, and have an M-DISC with all the scans in the back.
The hardest part is digitizing those notes. Once digitized, the best approach is to make copies every few years. Copying of digital data is easy and lossless, and the cost of digital storage is constantly going down.
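If you go that route, a checksum manifest makes the periodic re-copying verifiable; a minimal sketch (the file layout and manifest name are just my assumptions):

    import hashlib, json, pathlib

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def write_manifest(root, out="manifest.json"):
        root = pathlib.Path(root)
        entries = {str(p.relative_to(root)): sha256(p)
                   for p in sorted(root.rglob("*")) if p.is_file()}
        pathlib.Path(out).write_text(json.dumps(entries, indent=2))

    def verify_manifest(root, manifest="manifest.json"):
        root = pathlib.Path(root)
        entries = json.loads(pathlib.Path(manifest).read_text())
        return [name for name, digest in entries.items()
                if not (root / name).is_file() or sha256(root / name) != digest]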
Correct, xz is no longer particularly useful, mostly annoying.
For read-only long-term filesystem-like archives, use squashfs with whatever compression option is convenient. The cool thing about squashfs is that you can efficiently list the contents and extract single files, which is why it's my top pick for long-term filesystem-esque archives.
https://en.m.wikipedia.org/wiki/SquashFS
For everything else, or if you want (very) fast compression/decompression and/or general high-ratio compression, use .zstd.
https://github.com/facebook/zstd
I've used this combination professionally to great effect. You're welcome :)
I still think there is a place for a streamable, concatenable archive format with no builtin compression, plus an index sidecar for it when you want to trade streamability for seeking (PDF does something like this within a single file, in case you don’t value your sanity). People have made a number of those on top of tar, but they are all limited in various ways by the creators’ requirements, and hardly ubiquitous; and, well, tar is nuts—both because of how difficult and dubiously compatible it is to store some aspects of filesystem metadata in it, and because of how impossible it is to not store others except by convention. Not to mention the useless block structure compared to e.g. cpio or even ar. (Seek indexes for gzip and zstd are also a solved problem in that annoying essentially-but-not-in-practice way, but at least the formats themselves are generally sane.)
Incidentally, the author of the head article also disapproves of the format (not compression) design of zstd[1] on the same corruption-resistance grounds (although they e.g. prohibit concatenability), even though to me the format[2] seems much less crufty than xz.
[1] https://lists.gnu.org/archive/html/lzip-bug/2016-10/msg00002...
[2] https://www.rfc-editor.org/rfc/rfc8878.html
> I still think there is a place for a streamable, concatenable archive format with no builtin compression, plus an index sidecar for it when you want to trade streamability for seeking (PDF does something like this within a single file).
I wholeheartedly agree! That use-case is not currently covered by any widely known OSS project AFAIK.
I'm actually working on just such a thing! It's definitely a low-priority side project at this point, but I think the general technique has legs. It was born out of a desire for easier to use streaming container formats that can save arbitrary data streams.
I call it SITO and it's based on a stream of messagepack objects. If folks are interested in such a thing, I can post what I have currently of the spec, and I'd love to get some feedback on it.
I agree that TAR is problematic, compression is problematic, and there needs to be a mind towards ECC from the get-go.
I could really use some technical discussion to hammer out some of the design decisions that I'm iffy about, and also home in on what is essential for the spec. For example, how to handle seek-indexing, and whether I should use named fields vs a fixed schema, or allow both.
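To be clear about the general shape I mean (this is NOT the actual SITO spec, just a toy sketch: a stream of self-contained MessagePack records plus a sidecar index of byte offsets so you can seek without rescanning; assumes the third-party "msgpack" package):

    import msgpack  # third-party: pip install msgpack

    def write_stream(path, records):
        offsets = []
        with open(path, "wb") as f:
            for rec in records:
                offsets.append(f.tell())       # remember where each record starts
                f.write(msgpack.packb(rec))    # records are self-delimiting
        with open(path + ".idx", "wb") as f:   # optional sidecar: seekability when you want it
            f.write(msgpack.packb(offsets))

    def read_stream(path):
        with open(path, "rb") as f:
            return list(msgpack.Unpacker(f, raw=False))

    write_stream("demo.stream", [{"name": "a.txt", "data": b"hello"},
                                 {"name": "b.txt", "data": b"world"}])
    print(read_stream("demo.stream"))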
This is about long-term archiving though. For that you want a wide-spread format that is well-documented and has many independent implementations.
Like zip/7z (with external error recovery), or maybe RAR (with error-recovery records)
Fast compression or decompression is almost entirely meaningless in that context, and compression ratio is also only of secondary importance.
This is why PDF is still considered the best format for long-term archiving of documents, even though there might be things that compress better (djvu/jp2).
Personally I use 7zip for compression and par2[0] for parity.
[0] https://en.wikipedia.org/wiki/Parchive
> The cool thing about squashfs is you can efficiently list the contents and extract single files.
What’s the story look like around reading files out of squashfs archives stored in an S3 compatible storage system? Can what you mention above be done with byte range requests versus retrieving the entire object?
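I haven't seen a turnkey tool for that, but the primitive is there: S3 (and plain HTTP) supports byte-range GETs, so you can at least pull the 96-byte superblock without fetching the whole object; walking the inode and directory tables the same way is more work and better left to a real library. A sketch, assuming the squashfs 4.0 superblock layout and a hypothetical URL:

    import struct, urllib.request

    def read_superblock(url):
        req = urllib.request.Request(url, headers={"Range": "bytes=0-95"})
        sb = urllib.request.urlopen(req).read()
        magic, inodes, mkfs_time, block_size, fragments, compression = \
            struct.unpack_from("<IIIIIH", sb, 0)
        if magic != 0x73717368:                 # b"hsqs"
            raise ValueError("not a squashfs image")
        return {"inodes": inodes, "block_size": block_size,
                "compression_id": compression}  # e.g. 1=gzip, 4=xz, 6=zstd

    # print(read_superblock("https://bucket.example.com/image.squashfs"))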
"DwarFS compression is an order of magnitude better than SquashFS compression, it's 6 times faster to build the file system, it's typically faster to access files on DwarFS and it uses less CPU resources."
DwarFS may be good, but it's not in the Linux kernel (depends on FUSE). That makes it less universal, potentially significantly slower for some use cases, and also less thoroughly tested. SquashFS is used by a lot of embedded Linux distros among other use cases, so we can have pretty high confidence in its correctness.
Is there a way to extract the files without mounting the filesystem?
I believe 7z can do that as well.
I have never really seen it in the Linux world. There are several alternatives installed in most distros, all except zstd discussed in the article.
Are you recommending zstd based on compression speed and ratio only? Because as the linked article explains, those are not the only criteria. How does zstd rate with everything else?
Some comment here says that the author of the article disapproves of zstd with similar arguments as for xz. Have not verified the claim.
Zstd is still just as bad when it comes to the most important point:
>Xz does not provide any data recovery means
For the common use case of compressing a tar archive, this is a critical flaw. One small area of corruption will render the remainder of the archive unreadable.
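Easy to demonstrate with the stdlib lzma module: flip one byte in an .xz stream and everything from the corruption onward is gone. (Depending on where the flip lands, decoding may abort immediately or only fail the end-of-block check, but either way the tail is not trustworthy.)

    import lzma

    payload = b"".join(f"member {i}\n".encode() * 5000 for i in range(10))  # stand-in for a tar
    blob = bytearray(lzma.compress(payload))   # .xz container
    blob[len(blob) // 4] ^= 0xFF               # one corrupted byte, early in the stream

    dec, recovered = lzma.LZMADecompressor(), b""
    try:
        for i in range(0, len(blob), 4096):
            recovered += dec.decompress(bytes(blob[i:i + 4096]))
    except lzma.LZMAError as exc:
        print("stream unreadable past this point:", exc)
    print(f"got {len(recovered)} of {len(payload)} bytes back (and the tail may be garbage)")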
I don’t know how we found ourselves with the best formats for data recovery/robustness fading into obscurity - i.e. lzip and, to a lesser extent, bzip2.
The only compressed format left in common use that handles corruption is the hardware compression in LTO tape drives.
> I don’t know how we found ourselves with the best formats for data recovery/robustness fading into obscurity - i.e. lzip and, to a lesser extent, bzip2.
Because monotonically increasing technological progress is a commonly-believed fairy tale. Nowadays, capabilities are lost as often as they're gained. Either because the people who designed the successor are being opinionated or just focusing on something else.
Zstd is terrible for archiving since it doesn't even detect corruption. The --check switch, which the manpage describes (in a super-confusing way) as enabling checksums, seems to do absolutely nothing.
You can test by intentionally corrupting a .zstd file that was created with checksums enabled and then watch as zstd happily proceeds to decompress it, without any sort of warning. This is the stuff of nightmares.
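For what it's worth, here's one way to reproduce that experiment from Python using the third-party "zstandard" bindings (the CLI may behave differently, and behavior can vary by version, so treat the outcome as something to verify rather than assume):

    import zstandard

    data = b"important archive contents " * 100000
    frame = bytearray(zstandard.ZstdCompressor(write_checksum=True).compress(data))
    frame[len(frame) // 2] ^= 0x01             # flip one bit inside the frame

    try:
        out = zstandard.ZstdDecompressor().decompress(bytes(frame))
        print("no error raised; output matches original:", out == data)
    except zstandard.ZstdError as exc:
        print("corruption detected:", exc)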
After all these years, RAR remains the best option for archiving.
Sure, if you look at the Pareto frontier for xz and zstd, zstd does not seem like a “replacement” for xz. It’s not a replacement for PPMd.
It can also compress more than xz with tweaks, though I don't know the compute/memory tradeoffs.
[0] https://archlinux.org/news/now-using-zstandard-instead-of-xz...
The problem is that xz has kind of horrible performance when you crank it up to the high settings. On the medium settings, you can get the same ratio for much, much less CPU (round-trip) by switching to zstd.
YMMV, use your own corpus and CPU to test it.
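For the stdlib codecs that kind of test is a few lines; swap in the zstd/lz4 bindings the same way if you have them installed (a rough sketch, not a rigorous benchmark):

    import lzma, sys, time, zlib

    def bench(name, compress, decompress, data):
        t0 = time.perf_counter(); blob = compress(data)
        t1 = time.perf_counter(); decompress(blob)
        t2 = time.perf_counter()
        print(f"{name:8s} ratio={len(data) / len(blob):5.2f} "
              f"comp={t1 - t0:6.2f}s decomp={t2 - t1:6.2f}s")

    data = open(sys.argv[1], "rb").read()      # your own corpus
    bench("zlib -6", lambda d: zlib.compress(d, 6), zlib.decompress, data)
    bench("xz -6", lambda d: lzma.compress(d, preset=6), lzma.decompress, data)
    bench("xz -9", lambda d: lzma.compress(d, preset=9), lzma.decompress, data)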
Do you have any data to support this claim? In my experience, zstd is way better in every way compared to gzip. Additionally, xz compresses well but is horribly slow to decompress. Xz also only operates on a single file at a time, which is annoying.
In my experience the normal squashfs kernel driver is quite slow at listing/traversing very large archives (several GB+). For some reason squashfuse is MUCH faster for just looking around inside.
I recently came to the same determination looking for a better way to package sosreports (diagnostic files from Linux machines). The pieces are there for indexed file lists and also seekable compression, but basically nothing implements them in a combined fashion with a modern compression format (mainly zstd).
Zstd can achieve good compression ratio for many data patterns and is super fast to create, and decompresses at multiple GB/s with a halfway decent CPU, often hitting disk bandwidth limits.
I've never tried lzop or met someone who advocated to use it.
Needs research, perhaps. Until then I'm healthily skeptical.
When does it make sense to use lzop (or the similar but more widely recommended LZ4) for static storage? My impression was that it was a compressor for when you want to put fewer bytes onto the transmission / storage medium at negligible CPU cost (in both directions) because your performance is limited by that medium (fast network RPC, blobs in databases), not because you want to take up as little space as possible on your backup drive. And it does indeed lose badly in compression ratio even to zlib on default settings (gzip -6), let alone Zstandard or LZMA.
I used lzop in the past when speed was more of a concern than compressed size (big disk images etc.)
For a couple of years now I have switched to zstd. It is both fast and well compressed by default in that use case, no need to remember any options.
No, I have NOT done any deeper analysis except a couple of comparisons which I haven't even documented. But they ended up in slight favor of zstd, so I switched. Nothing dramatic, though, that would make me say forget lzop.
It was really disappointing when dpkg actively deprecated support (which I feel they should never do) for lzma-format archives and went all-in on xz. The decompressor now needs the annoying flexibility mentioned in this article, and the only benefit of the format--the ability to do random access on the file--is almost entirely defeated by dpkg using it to compress a tar file, which barely supports any form of accelerated access even when uncompressed; the best you can do is kind of attempt to skip through file headers, which only helps if the files in the archive are large enough. And, to add insult to injury, the files are now all slightly larger to account for the extra headers :/
Regardless, this is a pretty old article and if you search for it you will find a number of discussions that have already happened about it that all have a bunch of comments.
https://news.ycombinator.com/item?id=20103255
https://news.ycombinator.com/item?id=16884832
https://news.ycombinator.com/item?id=12768425
pixz (https://github.com/vasi/pixz) is a nice parallel xz that additionally creates an index of tar files so you can decompress individual files. I wonder if dpkg could be extended to do something similar.
I disagree with the premise of the article. Archive formats are all inadequate for long-term resilience and making them adequate would be a violation of the “do one thing and do it right” principle.
To support resilience, you don’t need an alternative to xz, you need hashes and forward error correction. Specifically, compress your file using xz for high compression ratio, optionally encrypt it, then take a SHA-256 hash to be used for detecting errors, then generate parity files using PAR[1] or zfec[2] to be used for correcting errors.
[1] https://wiki.archlinux.org/title/Parchive
[2] https://github.com/tahoe-lafs/zfec
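A sketch of that pipeline (assumes the xz and par2cmdline binaries are on PATH; the flag spellings are from memory, so double-check them against your versions):

    import hashlib, pathlib, subprocess

    def archive(path, redundancy=10):
        subprocess.run(["xz", "-9", "-k", path], check=True)     # writes path.xz, keeps original
        xz = path + ".xz"
        digest = hashlib.sha256(pathlib.Path(xz).read_bytes()).hexdigest()
        pathlib.Path(xz + ".sha256").write_text(f"{digest}  {xz}\n")
        subprocess.run(["par2", "create", f"-r{redundancy}", xz], check=True)  # parity files

    def verify_and_repair(path):
        xz = path + ".xz"
        stored = pathlib.Path(xz + ".sha256").read_text().split()[0]
        actual = hashlib.sha256(pathlib.Path(xz).read_bytes()).hexdigest()
        if stored != actual:
            subprocess.run(["par2", "repair", xz + ".par2"], check=True)       # FEC to the rescue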
Folks seem to be comparing xz to zstd, but if I am understanding correctly the true competitor to xz is the article author’s “lzip” format, which uses the same LZMA compression as xz but with a much better designed container format (at least according to the author).
I'd not say it's necessarily better designed - it's just simpler. A few bytes of headers, the LZMA-compressed data, a checksum, done. No support for seeks and stuff like that.
The vast majority of the discussion is around xz’s inability to deal with corrupted data. That said, couldn’t you argue that that needs to be solved at a lower level (storage, transport)? I’m not convinced the compression algorithm is the right place to tackle this.
Just use a file system that does proper integrity checking/resilvering. Also use TLS to transfer data over the network.
The article is about usage of xz for long term archival, so transport is not relevant, the concern seems to be bitrot and forward compatibility.
Storage with integrity checking would be the solution to bitrot, but TFA also seems concerned with "how do you unarchive/recover a random file you found?" which seems a somewhat valid concern.
And xz does have support for integrity checking, so it seems reasonable to have a discussion on whether that support is any good, rather than on whether it should be there at all.
Archives need not only a way to check their integrity but also error correction, which xz does not have.
However, you can easily combine xz with par2, which does provide error correction.
> couldn’t you argue that that needs to be solved at a lower level
Self-referentially incorruptible data is the standard to beat here, moving the concern to a different layer doesn't increase the efficiency or integrity of the data itself.
It is arguably less efficient, as you now rely on some lower layer of protection in addition to whatever is built into the standard itself.
It is less flexible - a properly protected archive format could be scrawled onto the side of a hill, or more reasonably onto an archival medium (BD disc), and should be able to survive any file-system change, upgrade, etc. Self-repairing hard drives with multiple redundancies are nice, but not cheap, and not widespread.
It also does nothing for actually protecting the data - I don't care how advanced the lower-level storage format is: if you overwrite data, e.g. with random zeros (conceivable due to a badly behaving program with too much memory access, e.g. a misbehaving virus, or a bad program or kernel driver someone ran with root access; also conceivable due to EM interference or solar radiation causing a program to misbehave), the file system will dutifully overwrite the correct data with incorrect data, including updating whatever relevant checks it has for the file. The only way around this is to maintain a historical archive of all writes ever made, which is evidently both absurdly impractical (how do you maintain the integrity of this archive? With a self-referentially incorruptible data archive perhaps?) and expensive.
Compared to a single file, which can be backed up, agnostic to the filesystem/hardware/transport/major world-ending events, which can be simply read/recovered, far into the future. There's a pretty clear winner here.
> Self-referentially incorruptible data is the standard to beat here, moving the concern to a different layer doesn't increase the efficiency or integrity of the data itself.
I respectfully disagree. By putting it in the layer below, there is the ability to do repairs.
For example, consider storing XZ files on a Ceph storage cluster. Ceph supports Reed-Solomon coding. This means that if data corruption occurs, Ceph is capable of automatically repairing the data corruption by recomputing the original file and writing it back to disk once more.
Even if XZ were able to recover from some forms of data corruption, is it realistic that such repairs propagate back to the underlying data store? Likely not.
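The intuition behind that kind of repair, in its simplest possible form (XOR parity, which tolerates one lost block; the Reed-Solomon codes Ceph uses generalize this to multiple losses):

    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = b"0123456789abcdef"
    blocks = [data[i:i + 4] for i in range(0, len(data), 4)]
    parity = xor_blocks(blocks)                 # stored alongside the data blocks

    survivors = blocks[:2] + blocks[3:]         # pretend block 2 was lost
    rebuilt = xor_blocks(survivors + [parity])  # XOR of everything else restores it
    assert rebuilt == blocks[2]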
The article spends a lot of time discussing XZ's behavior when reading corrupt archives, but in practice this is not a case that will ever happen.
Say you have a file `m4-1.4.19.tar.xz` in your source archive collection. The version I just downloaded from ftp.gnu.org has the SHA-256 checksum `63aede5c6d33b6d9b13511cd0be2cac046f2e70fd0a07aa9573a04a82783af96`. The GNU site doesn't have checksums for archives, but it does have a PGP signature file, so it's possible to (1) verify the archive signature and (2) store the checksum of the file.
If that file is later corrupted, the corruption will be detected via the SHA-256 checksum. There's no reason to worry about whether the file itself has an embedded CRC-whatever, or the behavior with regards to variable-length integer fields. If a bit gets flipped then the checksum won't match, and that copy of the file will be discarded (= replaced with a fresh copy from replicated storage) before it ever hits the XZ code.
If this is how the lzip format was designed -- optimized for a use case that doesn't exist -- it's no wonder it sees basically no adoption in favor of more pragmatic formats like XZ or Zstd.
I have no idea why you think that use case does not exist. Your whole idea of an archive seems to be ensuring a blob doesn't change (same hash). But that's far from the only use of an archive. (Hell, even with that, you are assuming you know the correct hash of the file to begin with, which isn't guaranteed.)
"Repairing" corrupt archives, as in to get as much as usable data from that archive is a pretty useful thing and I have done it multiple times. For example, an archive can have hundreds of files inside and if you can recover any of them that's better than nothing. It is also one of the reason I still use WinRAR occasionally due to its great recovery record (RR) feature.
>replaced with a fresh copy from replicated storage
Lots of times you don't have another copy.
The process of long-term archival starts with replication. A common approach is two local copies on separate physical media, and one remote copy in a cloud storage service with add-only permissions. This protects against hardware failure, bad software (accidental deletion, malware), natural disasters (flood, fire) and other 99th-percentile disaster conditions. The cloud storage providers will have their own level of replication (AWS S3 has a 99.999999999% durability SLA).
If you have only one copy of some important file and you discover it no longer matches the stored checksum, then that's not a question of archival, but of data recovery. There's no plausible mechanism by which a file might suffer a few bitflips and be saved by careful application of a CRC -- bitrot often zeros out entire sectors.
Internet connections are so much better now, and almost 100% of downloads complete successfully.
I've had a few experiences trying to recover data from old hard drives or even tape drives. The general experience was that either it works perfectly or the drive is covered in bad sectors and large chunks are unreadable. I don't dispute that bitrot exists, but there does seem to be an awful lot of discussion on the internet about an issue that is not the most likely failure mode.
* At the sector level in physical media (tapes, disk drives, flash). The file will be largely intact, but 4- or 8-KiB chunks of it will be zero'd out.
* At the bit level, when copying goes wrong. Usually this is bad RAM, sometimes a bad network device, very occasionally a bad CPU. You'll see patterns like "every 64th byte has had its high bit set to 1".
In both cases, the only practical option is to have multiple copies of the file, plus some checksum to decide whether a copy is "good". Files on disk can be restored from backup, bad copies over the network can be detected by software and re-transmitted.
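In code, "multiple copies plus a checksum to decide whether a copy is good" is about this much work (paths and policy are obviously just placeholders):

    import hashlib, pathlib, shutil

    def restore_from_replicas(replicas, expected_sha256):
        good = None
        for path in replicas:
            if hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest() == expected_sha256:
                good = path
                break
        if good is None:
            raise RuntimeError("no intact replica left; this is now a data-recovery problem")
        for path in replicas:
            if path != good:
                shutil.copyfile(good, path)    # re-seed the bad copies from the good one
        return good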
It is a case that will happen, when you long-term archive files. Which is exactly what the article discusses. Bit-rot is a real thing that really happens. The argument makes the case that XZ is a poor choice for such a case where you possibly have only one copy and can't just download a new un-corrupted copy.
If you want to archive data, you need multiple copies.
XZ or not doesn't matter here. Even if you have a completely uncompressed tar file, if you only have one copy of it and lose access to that copy (whether bitrot, or disaster, or software error) then you've lost the data.
Archival formats have always been of interest to me, given the very practical need to store a large amount of backups across any number of storage mediums - documents, pictures, music, sometimes particularly good movies, even the occasional software or game installer.
Right now, I've personally settled on using the 7z format: https://en.wikipedia.org/wiki/7z
The decompression speeds feel good, the compression ratios also seem better than ZIP and somehow it still feels like a widely supported format, with the 7-Zip program in particular being nice to use: https://en.wikipedia.org/wiki/7-Zip
Of course, various archivers on *nix systems also seem to support it, so so far everything feels good. Though the chance of an archive getting corrupted and no longer being properly decompressible, taking all of those files with it (versus just using the filesystem and having something like that perhaps occur to a single file), still sometimes bothers me.
Then again, on a certain level, I guess nothing is permanent and at least it's possible to occasionally test the archives for any errors and look into restoring them from backups, should something like that ever occur. Might just have to automate those tests, though.
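Automating it is pleasantly boring, assuming the 7-Zip/p7zip CLI is installed as "7z" ("7z t" verifies an archive's integrity, CRCs included); the paths here are hypothetical:

    import pathlib, subprocess

    def test_archives(root):
        failed = []
        for archive in sorted(pathlib.Path(root).rglob("*.7z")):
            result = subprocess.run(["7z", "t", str(archive)], capture_output=True)
            if result.returncode != 0:
                failed.append(str(archive))    # candidates to restore from backup
        return failed

    print(test_archives("/backups"))           # hypothetical path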
Yet, for the most part, going with an exceedingly boring option like that seems like a good idea, though the space could definitely use more projects and new algorithms for even better compression ratios, so at the very least it's nice to see attempts to do so!