We're talking about long-term archiving here. That means centuries.
My brother, the archaeological archivist of ancient (~2000 BCE) Mesopotamian artifacts, has a lot to say about archival formats. His raw material is mostly fired clay tablets. Those archives keep working, partially, even if broken. That's good, because many of them are in fact broken when found.
But their ordering and other metadata about where they were found is written in archaeologists' notebooks, and many of those notebooks are now over a century old. Paper deteriorates. If a lost flake of paper from a page in the notebook rendered the whole notebook useless, that would be a disastrous outcome for that archive.
A decade ago I suggested digitizing the notebooks and storing the bits on CD-ROMs. He laughed, saying "we don't know enough about the long-term viability of CD-ROMs and their readers."
Now, when they get around to it, they're digitizing the notebooks and storing them as PDFs on backed-up RAID10 volumes. But they're also printing them on acid-free paper and putting them with the originals in the vaults where they store old books.
My point: planning for centuries-long archiving is difficult. Formats with redundancy, or at least forward error correction codes, are very helpful. Formats that can be rendered useless by a few bit-flips, not so much.
I would be equally concerned about the stability of the file formats for the data stored inside the archives. Even plain ASCII text files have not been around very long - about 60 years since standardisation, but it took a while for the standard to become largely universal. And ASCII is pretty restricted in what it can represent. Note that I'm talking about plain text files, not things like Markdown which might use ASCII.
Most more complex file formats suffer from variants. Some, like Markdown and RTF, just have multiple versions. Some, like TIFF and PDF, are envelope formats, so the possible contents of the envelope change over time, introducing incompatibility. Then there is bit-rot as formats go out of use, e.g. .DOC (as opposed to .DOCX).
My own objectives are simple compared to your brother's. I want to preserve simple formatted text files until about 40 years from now, in a way that is likely to allow cut and paste. I started accumulating them about 20 years back. Note that this is before Markdown (which is in any case poor for recording formatting). LaTeX was around and seemed OK in terms of expected lifetime, but is poor for cut and paste because the rendering of a chunk of text depends on instructions which are not local to it. I settled on RTF, which carries significant long-term risk for both compatibility and availability, but is documented well enough that migrating out may be possible.
That's just formatted text. Images have been worse, particularly if you are handling meta-data such as camera characteristics, satellite orientation, etc.
I very much doubt the bit representation matters very much as long as it is simple. Even if ASCII text viewers were lost, they would be extremely simple to reimplement. That is a counterpoint to things like LaTeX, which would be hard to recreate.
Not to sound like the usual evangelist, and I am sure everyone has already thought of these points, but regarding RAID-10, is that the best call?
My thought process is that the majority of modern hardware/software RAID solutions don’t do error correction properly. And even assuming something that does actually do checksumming and such semi-properly is employed, I think if we are talking tens or hundreds of years from now, it’ll be nearly impossible to find a compatible hardware card should the ones in use die.
I’m aware it already sucks trying to build a project six months in the future, let alone six or sixty years, but perhaps something purely software-based that does care a ton about integrity, like ZFS, would be the best bet in terms of long-term compatibility of a hard-drive storage solution.
So long as the drives can still be plugged into a system, any system, and even if OpenZFS eventually drops backwards compatibility with the version used to create the pool, it’ll likely still be possible to virtualize whatever version of Linux/BSD/Illumos is compatible with that particular version of ZFS and then import the pool.
I think QR codes are actually great for this (so long as you are storing basic data like plain text that is likely to be recoverable for a long time). They have built-in error correction, and software for reading them is extremely widespread.
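For the plain-text case, a rough sketch of generating such a code with the third-party "qrcode" Python package (package name, API, and the catalog text are my assumptions; error-correction level H is the highest the standard offers, tolerating roughly 30% damage):

    import qrcode  # third-party package: pip install qrcode[pil]

    qr = qrcode.QRCode(error_correction=qrcode.constants.ERROR_CORRECT_H)  # highest level
    qr.add_data("catalog 1234: tablet fragment, square D5, stratum III")   # hypothetical entry
    qr.make(fit=True)
    qr.make_image().save("catalog-1234.png")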
However, more than just error correction, we should really try to make formats that are resilient to data corruption. For example, zip files seem much more resilient than gzipped tarballs because each file is compressed separately.
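A small stdlib-only illustration of that property: each zip member carries its own compressed stream and CRC, so members can be read or verified individually, whereas a corrupted .tar.gz is one continuous stream. Just a sketch with made-up file names:

    import io, zipfile

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i in range(3):
            zf.writestr(f"notes/page{i}.txt", f"page {i} contents\n" * 100)

    with zipfile.ZipFile(buf) as zf:
        print(zf.testzip())                     # first member with a bad CRC, or None
        print(zf.read("notes/page1.txt")[:16])  # members decompress independently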
Digital copies are really not that durable, we just sometimes confuse ease of copying with durability. This sometimes helps, but only if you can have distributed copies.
> Now, when they get around to it, they're digitizing the notebooks and storing them as PDFs on backed-up RAID10 volumes. But they're also printing them on acid-free paper and putting them with the originals in the vaults where they store old books.
I actually have a personal digitization project for some stuff I've inherited, and I'm glad to get a little validation for my strategy.
Basically my plan is to scan the documents/photos, create some kind of printed book with the most important/interesting ones and an index, and have an M-DISC with all the scans in the back.
The hardest part is digitizing those notes. Once digitized, the best approach is to make copies every few years. Copying of digital data is easy and lossless, and the cost of digital storage is constantly going down.
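If you go that route, a checksum manifest makes the periodic re-copying verifiable; a minimal sketch (the file layout and manifest name are just my assumptions):

    import hashlib, json, pathlib

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def write_manifest(root, out="manifest.json"):
        root = pathlib.Path(root)
        entries = {str(p.relative_to(root)): sha256(p)
                   for p in sorted(root.rglob("*")) if p.is_file()}
        pathlib.Path(out).write_text(json.dumps(entries, indent=2))

    def verify_manifest(root, manifest="manifest.json"):
        root = pathlib.Path(root)
        entries = json.loads(pathlib.Path(manifest).read_text())
        return [name for name, digest in entries.items()
                if not (root / name).is_file() or sha256(root / name) != digest]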
Correct, xz is no longer particularly useful, mostly annoying.
For read-only long-term filesystem-like archives, use squashfs with whatever compression option is convenient. The cool thing about squashfs is that you can efficiently list the contents and extract single files, which is why it's my top pick for long-term filesystem-esque archives.
https://en.m.wikipedia.org/wiki/SquashFS
For everything else, or if you want (very) fast compression/decompression and/or general high-ratio compression, use .zstd.
https://github.com/facebook/zstd
I've used this combination professionally to great effect. You're welcome :)
I still think there is a place for a streamable, concatenable archive format with no builtin compression, plus an index sidecar for it when you want to trade streamability for seeking (PDF does something like this within a single file, in case you don’t value your sanity). People have made a number of those on top of tar, but they are all limited in various ways by the creators’ requirements, and hardly ubiquitous; and, well, tar is nuts—both because of how difficult and dubiously compatible it is to store some aspects of filesystem metadata in it, and because of how impossible it is to not store others except by convention. Not to mention the useless block structure compared to e.g. cpio or even ar. (Seek indexes for gzip and zstd are also a solved problem in that annoying essentially-but-not-in-practice way, but at least the formats themselves are generally sane.)
Incidentally, the author of the head article also disapproves of the format (not compression) design of zstd[1] on the same corruption-resistance grounds (although they e.g. prohibit concatenability), even though to me the format[2] seems much less crufty than xz.
[1] https://lists.gnu.org/archive/html/lzip-bug/2016-10/msg00002...
[2] https://www.rfc-editor.org/rfc/rfc8878.html
> I still think there is a place for a streamable, concatenable archive format with no builtin compression, plus an index sidecar for it when you want to trade streamability for seeking (PDF does something like this within a single file).
I wholeheartedly agree! That use-case is not currently covered by any widely known OSS project AFAIK.
I'm actually working on just such a thing! It's definitely a low-priority side project at this point, but I think the general technique has legs. It was born out of a desire for easier to use streaming container formats that can save arbitrary data streams.
I call it SITO and it's based on a stream of messagepack objects. If folks are interested in such a thing, I can post what I have currently of the spec, and I'd love to get some feedback on it.
I agree that TAR is problematic, compression is problematic, and there needs to be a mind towards ECC from the get-go.
I could really use some technical discussion to hammer out some of the design decisions that I'm iffy about, and also home in on what is essential for the spec. For example, how to handle seek-indexing, and whether I should use named fields vs a fixed schema, or allow both.
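To be clear about the general shape I mean (this is NOT the actual SITO spec, just a toy sketch: a stream of self-contained MessagePack records plus a sidecar index of byte offsets so you can seek without rescanning; assumes the third-party "msgpack" package):

    import msgpack  # third-party: pip install msgpack

    def write_stream(path, records):
        offsets = []
        with open(path, "wb") as f:
            for rec in records:
                offsets.append(f.tell())       # remember where each record starts
                f.write(msgpack.packb(rec))    # records are self-delimiting
        with open(path + ".idx", "wb") as f:   # optional sidecar: seekability when you want it
            f.write(msgpack.packb(offsets))

    def read_stream(path):
        with open(path, "rb") as f:
            return list(msgpack.Unpacker(f, raw=False))

    write_stream("demo.stream", [{"name": "a.txt", "data": b"hello"},
                                 {"name": "b.txt", "data": b"world"}])
    print(read_stream("demo.stream"))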
This is about long-term archiving though. For that you want a wide-spread format that is well-documented and has many independent implementations.
Like zip/7z (with external error recovery), or maybe RAR (with error-recovery records)
Fast compression or decompression is almost entirely meaningless in that context, and compression ratio is also only of secondary importance.
This is why PDF is still considered the best format for long-term archiving of documents, even though there might be things that compress better (djvu/jp2).
Personally I use 7zip for compression and par2[0] for parity.
[0] https://en.wikipedia.org/wiki/Parchive
> The cool thing about squashfs is you can efficiently list the contents and extract single files.
What’s the story look like around reading files out of squashfs archives stored in an S3 compatible storage system? Can what you mention above be done with byte range requests versus retrieving the entire object?
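I haven't seen a turnkey tool for that, but the primitive is there: S3 (and plain HTTP) supports byte-range GETs, so you can at least pull the 96-byte superblock without fetching the whole object; walking the inode and directory tables the same way is more work and better left to a real library. A sketch, assuming the squashfs 4.0 superblock layout and a hypothetical URL:

    import struct, urllib.request

    def read_superblock(url):
        req = urllib.request.Request(url, headers={"Range": "bytes=0-95"})
        sb = urllib.request.urlopen(req).read()
        magic, inodes, mkfs_time, block_size, fragments, compression = \
            struct.unpack_from("<IIIIIH", sb, 0)
        if magic != 0x73717368:                 # b"hsqs"
            raise ValueError("not a squashfs image")
        return {"inodes": inodes, "block_size": block_size,
                "compression_id": compression}  # e.g. 1=gzip, 4=xz, 6=zstd

    # print(read_superblock("https://bucket.example.com/image.squashfs"))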
"DwarFS compression is an order of magnitude better than SquashFS compression, it's 6 times faster to build the file system, it's typically faster to access files on DwarFS and it uses less CPU resources."
DwarFS may be good, but it's not in the Linux kernel (depends on FUSE). That makes it less universal, potentially significantly slower for some use cases, and also less thoroughly tested. SquashFS is used by a lot of embedded Linux distros among other use cases, so we can have pretty high confidence in its correctness.
Is there a way to extract the files without mounting the filesystem?
I believe 7z can do that as well.
I have never really seen it in the Linux world. There are several alternatives installed in most distros, all except zstd discussed in the article.
Are you recommending zstd based on compression speed and ratio only? Because as the linked article explains, those are not the only criteria. How does zstd rate with everything else?
Some comment here says that the author of the article disapproves of zstd with similar arguments as for xz. Have not verified the claim.
Zstd is still just as bad when it comes to the most important point:
>Xz does not provide any data recovery means
For the common use case of compressing a tar archive, this is a critical flaw. One small area of corruption will render the remainder of the archive unreadable.
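Easy to demonstrate with the stdlib lzma module: flip one byte in an .xz stream and everything from the corruption onward is gone. (Depending on where the flip lands, decoding may abort immediately or only fail the end-of-block check, but either way the tail is not trustworthy.)

    import lzma

    payload = b"".join(f"member {i}\n".encode() * 5000 for i in range(10))  # stand-in for a tar
    blob = bytearray(lzma.compress(payload))   # .xz container
    blob[len(blob) // 4] ^= 0xFF               # one corrupted byte, early in the stream

    dec, recovered = lzma.LZMADecompressor(), b""
    try:
        for i in range(0, len(blob), 4096):
            recovered += dec.decompress(bytes(blob[i:i + 4096]))
    except lzma.LZMAError as exc:
        print("stream unreadable past this point:", exc)
    print(f"got {len(recovered)} of {len(payload)} bytes back (and the tail may be garbage)")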
I don’t know how we found ourselves with the best formats for data recovery/robustness fading into obscurity - i.e. lzip and, to a lesser extent, bzip2.
The only compressed format left in common use that handles corruption is the hardware compression in LTO tape drives.
> I don’t know how we found ourselves with the best formats for data recovery/robustness fading into obscurity - i.e. lzip and, to a lesser extent, bzip2.
Because monotonically increasing technological progress is a commonly-believed fairy tale. Nowadays, capabilities are lost as often as they're gained. Either because the people who designed the successor are being opinionated or just focusing on something else.
Zstd is terrible for archiving since it doesn't even detect corruption. The --check switch, which the manpage describes (in a super-confusing way) as enabling checksums, seems to do absolutely nothing.
You can test by intentionally corrupting a .zstd file that was created with checksums enabled and then watch as zstd happily proceeds to decompress it, without any sort of warning. This is the stuff of nightmares.
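For what it's worth, here's one way to reproduce that experiment from Python using the third-party "zstandard" bindings (the CLI may behave differently, and behavior can vary by version, so treat the outcome as something to verify rather than assume):

    import zstandard

    data = b"important archive contents " * 100000
    frame = bytearray(zstandard.ZstdCompressor(write_checksum=True).compress(data))
    frame[len(frame) // 2] ^= 0x01             # flip one bit inside the frame

    try:
        out = zstandard.ZstdDecompressor().decompress(bytes(frame))
        print("no error raised; output matches original:", out == data)
    except zstandard.ZstdError as exc:
        print("corruption detected:", exc)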
After all these years, RAR remains the best option for archiving.
Sure, if you look at the Pareto frontier for xz and zstd, zstd does not seem like a “replacement” for xz. It’s not a replacement for PPMd.
It can also compress more than xz with tweaks, though I don't know the compute/memory tradeoffs.
[0] https://archlinux.org/news/now-using-zstandard-instead-of-xz...
The problem is that xz has kind of horrible performance when you crank it up to the high settings. On the medium settings, you can get the same ratio for much, much less CPU (round-trip) by switching to zstd.
YMMV, use your own corpus and CPU to test it.
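For the stdlib codecs that kind of test is a few lines; swap in the zstd/lz4 bindings the same way if you have them installed (a rough sketch, not a rigorous benchmark):

    import lzma, sys, time, zlib

    def bench(name, compress, decompress, data):
        t0 = time.perf_counter(); blob = compress(data)
        t1 = time.perf_counter(); decompress(blob)
        t2 = time.perf_counter()
        print(f"{name:8s} ratio={len(data) / len(blob):5.2f} "
              f"comp={t1 - t0:6.2f}s decomp={t2 - t1:6.2f}s")

    data = open(sys.argv[1], "rb").read()      # your own corpus
    bench("zlib -6", lambda d: zlib.compress(d, 6), zlib.decompress, data)
    bench("xz -6", lambda d: lzma.compress(d, preset=6), lzma.decompress, data)
    bench("xz -9", lambda d: lzma.compress(d, preset=9), lzma.decompress, data)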
Do you have any data to support this claim? In my experience, zstd is way better in every way compared to gzip. Additionally, xz compresses well but is horribly slow to decompress. Xz also only operates on a single file at a time, which is annoying.
In my experience the normal squashfs kernel driver is quite slow at listing/traversing very large archives (several GB+). For some reason squashfuse is MUCH faster for just looking around inside.
I recently came to the same determination looking for a better way to package sosreports (diagnostic files from Linux machines). The pieces are there for indexed file lists and also seekable compression, but basically nothing implements them in a combined fashion with a modern compression format (mainly zstd).
Zstd can achieve good compression ratio for many data patterns and is super fast to create, and decompresses at multiple GB/s with a halfway decent CPU, often hitting disk bandwidth limits.
I've never tried lzop or met someone who advocated to use it.
Needs research, perhaps. Until then I'm healthily skeptical.
When does it make sense to use lzop (or the similar but more widely recommended LZ4) for static storage? My impression was that it was a compressor for when you want to put fewer bytes onto the transmission / storage medium at negligible CPU cost (in both directions) because your performance is limited by that medium (fast network RPC, blobs in databases), not because you want to take up as little space as possible on your backup drive. And it does indeed lose badly in compression ratio even to zlib on default settings (gzip -6), let alone Zstandard or LZMA.
I used lzop in the past when speed was more of a concern than compressed size (big disk images etc.)
For a couple of years now I have switched to zstd. It is both fast and well compressed by default in that use case, no need to remember any options.
No, I have NOT done any deeper analysis except a couple of comparisons which I haven't even documented. But they ended up in slight favor of zstd, so I switched. Nothing dramatic, though, that would make me say forget lzop.
It was really disappointing when dpkg actively deprecated support (which I feel they should never do) for lzma-format archives and went all-in on xz. The decompressor now needs the annoying flexibility mentioned in this article, and the only benefit of the format--the ability to do random access on the file--is almost entirely defeated by dpkg using it to compress a tar file, which barely supports any form of accelerated access even when uncompressed; the best you can do is kind of attempt to skip through file headers, which only helps if the files in the archive are large enough. And, to add insult to injury, the files are now all slightly larger to account for the extra headers :/
Regardless, this is a pretty old article and if you search for it you will find a number of discussions that have already happened about it that all have a bunch of comments.
https://news.ycombinator.com/item?id=20103255
https://news.ycombinator.com/item?id=16884832
https://news.ycombinator.com/item?id=12768425
pixz (https://github.com/vasi/pixz) is a nice parallel xz that additionally creates an index of tar files so you can decompress individual files. I wonder if dpkg could be extended to do something similar.
I disagree with the premise of the article. Archive formats are all inadequate for long-term resilience and making them adequate would be a violation of the “do one thing and do it right” principle.
To support resilience, you don’t need an alternative to xz, you need hashes and forward error correction. Specifically, compress your file using xz for high compression ratio, optionally encrypt it, then take a SHA-256 hash to be used for detecting errors, then generate parity files using PAR[1] or zfec[2] to be used for correcting errors.
[1] https://wiki.archlinux.org/title/Parchive
[2] https://github.com/tahoe-lafs/zfec
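A sketch of that pipeline (assumes the xz and par2cmdline binaries are on PATH; the flag spellings are from memory, so double-check them against your versions):

    import hashlib, pathlib, subprocess

    def archive(path, redundancy=10):
        subprocess.run(["xz", "-9", "-k", path], check=True)     # writes path.xz, keeps original
        xz = path + ".xz"
        digest = hashlib.sha256(pathlib.Path(xz).read_bytes()).hexdigest()
        pathlib.Path(xz + ".sha256").write_text(f"{digest}  {xz}\n")
        subprocess.run(["par2", "create", f"-r{redundancy}", xz], check=True)  # parity files

    def verify_and_repair(path):
        xz = path + ".xz"
        stored = pathlib.Path(xz + ".sha256").read_text().split()[0]
        actual = hashlib.sha256(pathlib.Path(xz).read_bytes()).hexdigest()
        if stored != actual:
            subprocess.run(["par2", "repair", xz + ".par2"], check=True)       # FEC to the rescue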
Folks seem to be comparing xz to zstd, but if I am understanding correctly the true competitor to xz is the article author’s “lzip” format, which uses the same LZMA compression as xz but with a much better designed container format (at least according to the author).
I'd not say it's necessarily better designed - it's just simpler. A few bytes of headers, the LZMA-compressed data, a checksum, done. No support for seeks and stuff like that.
The vast majority of the discussion is around xz’s inability to deal with corrupted data. That said, couldn’t you argue that that needs to be solved at a lower level (storage, transport)? I’m not convinced the compression algorithm is the right place to tackle this.
Just use a file system that does proper integrity checking/resilvering. Also use TLS to transfer data over the network.
The article is about usage of xz for long term archival, so transport is not relevant, the concern seems to be bitrot and forward compatibility.
Storage with integrity checking would be the solution to bitrot, but TFA also seems concerned with "how do you unarchive/recover a random file you found?" which seems a somewhat valid concern.
And xz does have support for integrity checking, so it seems reasonable to have a discussion on whether that support is any good, rather than on whether it should be there at all.
Archives need not only a way to check their integrity but also error correction, which xz does not have.
However, you can easily combine xz with par2, which does provide error correction.
> couldn’t you argue that that needs to be solved at a lower level
Self-referentially incorruptible data is the standard to beat here, moving the concern to a different layer doesn't increase the efficiency or integrity of the data itself.
It is arguably less efficient, as you now rely on some lower layer of protection in addition to whatever is built into the standard itself.
It is less flexible - a properly protected archive format could be scrawled onto the side of a hill, or more reasonably onto an archival medium (BD disc), and should be able to survive any file-system change, upgrade, etc. Self-repairing hard drives with multiple redundancies are nice, but not cheap, and not widespread.
It also does nothing for actually protecting the data - I don't care how advanced the lower-level storage format is: if you overwrite data, e.g. with random zeros (conceivable due to a badly behaving program with too much memory access, e.g. a misbehaving virus, or a bad program or kernel driver someone ran with root access; also conceivable due to EM interference or solar radiation causing a program to misbehave), the file system will dutifully overwrite the correct data with incorrect data, including updating whatever relevant checks it has for the file. The only way around this is to maintain a historical archive of all writes ever made, which is evidently both absurdly impractical (how do you maintain the integrity of this archive? With a self-referentially incorruptible data archive perhaps?) and expensive.
Compared to a single file, which can be backed up, agnostic to the filesystem/hardware/transport/major world-ending events, which can be simply read/recovered, far into the future. There's a pretty clear winner here.
> Self-referentially incorruptible data is the standard to beat here, moving the concern to a different layer doesn't increase the efficiency or integrity of the data itself.
I respectfully disagree. By putting it in the layer below, there is the ability to do repairs.
For example, consider storing XZ files on a Ceph storage cluster. Ceph supports Reed-Solomon coding. This means that if data corruption occurs, Ceph is capable of automatically repairing the data corruption by recomputing the original file and writing it back to disk once more.
Even if XZ were able to recover from some forms of data corruption, is it realistic that such repairs propagate back to the underlying data store? Likely not.
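The intuition behind that kind of repair, in its simplest possible form (XOR parity, which tolerates one lost block; the Reed-Solomon codes Ceph uses generalize this to multiple losses):

    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = b"0123456789abcdef"
    blocks = [data[i:i + 4] for i in range(0, len(data), 4)]
    parity = xor_blocks(blocks)                 # stored alongside the data blocks

    survivors = blocks[:2] + blocks[3:]         # pretend block 2 was lost
    rebuilt = xor_blocks(survivors + [parity])  # XOR of everything else restores it
    assert rebuilt == blocks[2]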
The article spends a lot of time discussing XZ's behavior when reading corrupt archives, but in practice this is not a case that will ever happen.
Say you have a file `m4-1.4.19.tar.xz` in your source archive collection. The version I just downloaded from ftp.gnu.org has the SHA-256 checksum `63aede5c6d33b6d9b13511cd0be2cac046f2e70fd0a07aa9573a04a82783af96`. The GNU site doesn't have checksums for archives, but it does have a PGP signature file, so it's possible to (1) verify the archive signature and (2) store the checksum of the file.
If that file is later corrupted, the corruption will be detected via the SHA-256 checksum. There's no reason to worry about whether the file itself has an embedded CRC-whatever, or the behavior with regards to variable-length integer fields. If a bit gets flipped then the checksum won't match, and that copy of the file will be discarded (= replaced with a fresh copy from replicated storage) before it ever hits the XZ code.
If this is how the lzip format was designed -- optimized for a use case that doesn't exist -- it's no wonder it sees basically no adoption in favor of more pragmatic formats like XZ or Zstd.
I have no idea why you think that use case does not exist. Your whole idea of an archive seems to be ensuring a blob doesn't change (same hash). But that's far from the only use of an archive. (Hell, even with that, you are assuming you know the correct hash of the file to begin with, which isn't guaranteed.)
"Repairing" corrupt archives, as in to get as much as usable data from that archive is a pretty useful thing and I have done it multiple times. For example, an archive can have hundreds of files inside and if you can recover any of them that's better than nothing. It is also one of the reason I still use WinRAR occasionally due to its great recovery record (RR) feature.
>replaced with a fresh copy from replicated storage
Lots of times you don't have another copy.
The process of long-term archival starts with replication. A common approach is two local copies on separate physical media, and one remote copy in a cloud storage service with add-only permissions. This protects against hardware failure, bad software (accidental deletion, malware), natural disasters (flood, fire) and other 99th-percentile disaster conditions. The cloud storage providers will have their own level of replication (AWS S3 has a 99.999999999% durability SLA).
If you have only one copy of some important file and you discover it no longer matches the stored checksum, then that's not a question of archival, but of data recovery. There's no plausible mechanism by which a file might suffer a few bitflips and be saved by careful application of a CRC -- bitrot often zeros out entire sectors.
Internet connections are so much better now, and almost 100% of downloads complete successfully.
I've had a few experiences trying to recover data from old hard drives or even tape drives. The general experience was that either it works perfectly or the drive is covered in bad sectors and large chunks are unreadable. I don't dispute that bitrot exists, but there does seem to be an awful lot of discussion on the internet about an issue that is not the most likely failure mode.
* At the sector level in physical media (tapes, disk drives, flash). The file will be largely intact, but 4- or 8-KiB chunks of it will be zero'd out.
* At the bit level, when copying goes wrong. Usually this is bad RAM, sometimes a bad network device, very occasionally a bad CPU. You'll see patterns like "every 64th byte has had its high bit set to 1".
In both cases, the only practical option is to have multiple copies of the file, plus some checksum to decide whether a copy is "good". Files on disk can be restored from backup, bad copies over the network can be detected by software and re-transmitted.
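In code, "multiple copies plus a checksum to decide whether a copy is good" is about this much work (paths and policy are obviously just placeholders):

    import hashlib, pathlib, shutil

    def restore_from_replicas(replicas, expected_sha256):
        good = None
        for path in replicas:
            if hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest() == expected_sha256:
                good = path
                break
        if good is None:
            raise RuntimeError("no intact replica left; this is now a data-recovery problem")
        for path in replicas:
            if path != good:
                shutil.copyfile(good, path)    # re-seed the bad copies from the good one
        return good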
It is a case that will happen, when you long-term archive files. Which is exactly what the article discusses. Bit-rot is a real thing that really happens. The argument makes the case that XZ is a poor choice for such a case where you possibly have only one copy and can't just download a new un-corrupted copy.
If you want to archive data, you need multiple copies.
XZ or not doesn't matter here. Even if you have a completely uncompressed tar file, if you only have one copy of it and lose access to that copy (whether bitrot, or disaster, or software error) then you've lost the data.
Archival formats have always been of interest to me, given the very practical need to store a large amount of backups across any number of storage mediums - documents, pictures, music, sometimes particularly good movies, even the occasional software or game installer.
Right now, I've personally settled on using the 7z format: https://en.wikipedia.org/wiki/7z
The decompression speeds feel good, the compression ratios also seem better than ZIP and somehow it still feels like a widely supported format, with the 7-Zip program in particular being nice to use: https://en.wikipedia.org/wiki/7-Zip
Of course, various archivers on *nix systems also seem to support it, so so far everything feels good. Though the chance of an archive getting corrupted and no longer being properly decompressible, taking all of those files with it (versus just using the filesystem and having something like that perhaps occur to a single file), still sometimes bothers me.
Then again, on a certain level, I guess nothing is permanent and at least it's possible to occasionally test the archives for any errors and look into restoring them from backups, should something like that ever occur. Might just have to automate those tests, though.
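Automating it is pleasantly boring, assuming the 7-Zip/p7zip CLI is installed as "7z" ("7z t" verifies an archive's integrity, CRCs included); the paths here are hypothetical:

    import pathlib, subprocess

    def test_archives(root):
        failed = []
        for archive in sorted(pathlib.Path(root).rglob("*.7z")):
            result = subprocess.run(["7z", "t", str(archive)], capture_output=True)
            if result.returncode != 0:
                failed.append(str(archive))    # candidates to restore from backup
        return failed

    print(test_archives("/backups"))           # hypothetical path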
Yet, for the most part, going with an exceedingly boring option like that seems like a good idea, though the space could definitely use more projects and new algorithms for even better compression ratios, so at the very least it's nice to see attempts to do so!