I've studied the Burrows-Wheeler Transform, I understand the transformation, and I've re-implemented it countless times for kicks. I see how it improves compressibility, but for the life of me the intuition of _why_ it works has never really clicked.
It's a fantastic bit of algorithmic magic that will always impress me to see it.
The Burrows-Wheeler transform has been described as a unique algorithmic idea, in that there are no non-trivial variations or related algorithms, unlike more conventional compression algorithms, which can be tweaked and improved in so many ways. There is no general compression theory in which BWT could be described as a special case.
It looks to me that the above still holds: Bzip2 and Bzip3 are simply combining more conventional compression algorithms with the BWT, which itself is still the same old transform. Bzip2 does Huffman coding after BWT, and Bzip3 does arithmetic coding.
> There is no general compression theory in which BWT could be described as a special case.
I would not really say this is true. BWT is spiritually similar to the various Prediction by Partial Matching (PPM) algorithms, except that instead of needing to decide how much context (i.e., preceding symbols) to use to model the next symbol, and carefully learning the probabilities for each unique context, it naturally sorts symbols with the same context (of _any_ length) so they appear next to each other, and relies on adaptation to update your learned statistics as you move from one context to the next, without ever explicitly tracking context boundaries [0].
It definitely is related to prediction by partial match (PPM).
BWT sorts the rotated data, and what this achieves is that identical suffixes group together:
...
"Bzip2 and Bzip3 are simply combining more"
"Bzip3 are simply combining moreBzip2 and "
The character preceding the suffix goes to the end and is then output. This is much like PPM running backward. There is a PPM* algorithm (unbounded context length) whose authors considered reconstructing contexts from the data, using something like an LZSS search. The same idea is in BWT: the context is reconstructed from the data.
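The grouping is easy to see in a toy implementation. Below is a minimal sketch of the forward transform (hypothetical helper name; real encoders such as bzip3 build a suffix array with SA-IS rather than literally sorting all rotations):

```python
def bwt(s: str, sentinel: str = "\x00") -> str:
    """Return the last column of the sorted rotation matrix.

    A unique sentinel is appended so rotations sort unambiguously.
    Purely illustrative: O(n^2 log n), not how production BWT encoders work.
    """
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(repr(bwt("banana")))  # → 'annb\x00aa'
```

Note how the three a's of "banana" pull their preceding n's and b together in the output: characters sharing a following context end up adjacent.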
BWT also breaks up nearby dependencies in the data, which is why move-to-front followed by Huffman or arithmetic coding works well on its output.
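Move-to-front is what turns that grouping into something an entropy coder loves: runs of identical symbols become runs of zeros, and locally common symbols get small indices. A minimal sketch (hypothetical helper name, byte alphabet):

```python
def mtf_encode(data: bytes) -> list[int]:
    """Replace each byte by its position in a recency list, then move it
    to the front. BWT output, being run-heavy, maps to lots of zeros."""
    alphabet = list(range(256))
    out = []
    for b in data:
        i = alphabet.index(b)
        out.append(i)
        alphabet.pop(i)
        alphabet.insert(0, b)
    return out

print(mtf_encode(b"aaabbbaaa"))  # → [97, 0, 0, 98, 0, 0, 1, 0, 0]
```

The skewed distribution of small values is exactly what a Huffman or arithmetic coder then exploits.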
Thank you for the reference. I learned something new today. That algorithm is wild. If you had shown me the transform and asked if it had an inverse, I would have said of course it doesn't, it's too weird.
I always understood it as working because of the predictability of a symbol/letter/token given the previous one.
Sorting all the shifts of a string puts all the characters in order, then looking at the last column shows you all the _preceding_ characters. If there's any predictability there (which there often is), it's now easier to compress. It's sorta like an entropy coder in that way.
I've never thought of it as being that deep, and I've understood it since I was a kid -- building an intuition for "why" the FFT works is much harder -- but that being said, I clicked quickly to reply thinking "that's easy! I can explain this!" and then struggled for a while trying to get the picture in my mind into text. :)
Understanding why increasing predictability helps with compression is not the hard part though. What's hard to grasp is why the transform is reversible.
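For the reversibility half, one low-tech demonstration is the classic "prepend and re-sort" inversion: because sorting the rotations is deterministic, the last column alone is enough to rebuild the sorted rotation table one column at a time. A sketch under the assumption that the forward transform appended a unique NUL sentinel (hypothetical helper names, O(n^2 log n), purely illustrative):

```python
def bwt(s: str, sentinel: str = "\x00") -> str:
    # Forward transform: last column of the sorted rotations.
    s += sentinel
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def inverse_bwt(last: str, sentinel: str = "\x00") -> str:
    # Each pass prepends the last column to the table and re-sorts;
    # after n passes the table holds every rotation, fully reconstructed.
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The row ending in the sentinel is the original string.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]

print(inverse_bwt(bwt("banana")))  # → banana
```

The trick is that the last column is simultaneously "the set of all characters" (so sorting it gives the first column) and "the character preceding each sorted rotation", which pins down every rotation uniquely.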
I remember the lecturer commenting on what sort of sick and twisted mind could come up with such a ridiculous convoluted notion when I was taught it at university.
Wheeler was also one of the inventors of the "closed subroutine" AKA function, which had to be implemented via a hack as machines of the time did not include ISA support for "return":
I feel exactly the same, and have also implemented it backwards and forwards. I've thought about it in my sleep, trying to recall how it REALLY works. Happens every few years ;-) I always thought it was probably obvious to everyone else what the "magic" is.
Isn't it basically run-length encoding, but on patterns? Sorting all rotations lexicographically means blocks of patterns get grouped together, which means you can do compression more easily, right?
Public-key cryptography is magic. Zero-knowledge proofs are a consequence that's difficult to find on your own but easy to understand once you've seen it.
I remember seeing someone (probably Avi Wigderson) demonstrating a zero-knowledge proof for graph coloring on an overhead projector when I was starting my second year studying CS. He had a slide with a graph, multiple slides with different permutations of the same valid coloring of the vertices, and a piece of paper with "doors" over the vertices to cover everything. The audience could ask him to open the doors over the vertices connected by any edge, and he would show that the coloring is valid, as far as those two vertices are concerned. And then he would replace the coloring with another permutation for the next iteration. The idea felt clever but kind of obvious in retrospect.
As you'll notice, my sanity check actually produces a slightly different size; I'm not sure why. The benchmark is a bit underspecified because new Perl versions were released in the interim. I used all releases up to perl-5.37.1 to get to the correct number of files. Just treat all numbers as having about 2% uncertainty to account for this difference.
I can't provide compression/decompression times, but the --long or --long=31 arguments should not have major impact on speed, they mostly impact used memory. --long=31 requires setting the same in the decompressor, making that option mostly useful for internal use, not archives meant for public consumption.
As you can see, the benchmark chosen by the author mostly comes down to finding similar data that's far away. I wonder if bzip3 can do this better than other algorithms (especially in less memory) or simply chose default parameters that use more memory.
As you may be aware, different compression tools fill different data-type niches. In particular, less specialised statistical methods (bzip2, bzip3, PPMd) generally perform poorly on vaguely defined binary data due to the unnatural distribution of the underlying data, which, at least in bzip3's case, does not lend itself well to suffix sorting.
Conversely, Lempel-Ziv methods usually perform suboptimally on vaguely defined "textual data", because the later entropy-coding stages cannot make good use of the information encoded in match offsets while maintaining fast decompression performance - it's a long story that I could definitely go into detail about if you'd like, but I want to keep this reply short.
All things considered, data compression is more of an art than a science, trying to hit an acceptable spot on the time-to-compression-ratio curve. I created bzip3 as an improvement over the original algorithm, hoping that we can replace some uses of bzip2 with a more modern and worthwhile technology as of 2022. I have included benchmarks against LZMA, zstandard, etc. mostly as a formality; in reality, if you were to choose a compression method, it'd be very dependent on what exactly you're trying to compress, but my personal stance is that bzip3 would likely be strictly better than bzip2 in all of those cases.
bzip3 usually operates on bigger block sizes, up to 16 times bigger than bzip2's. Additionally, bzip3 supports parallel compression/decompression out of the box. For fairness, the benchmarks have been performed in single-thread mode, though even then they somewhat favour bzip3, as it uses a much bigger block size. What bzip3 aims to be is a replacement for bzip2 on modern hardware: what used to not be viable decades ago (arithmetic coding, context mixing, SA-IS algorithms for BWT construction) has become viable nowadays, as CPU frequencies don't tend to change while cache and RAM keep getting bigger and faster.
Given that it's BWT, the difference should be the most prominent on codebases with huge amounts of mostly equivalent files. Most compression algorithms won't help if you get an exact duplicate of some block when it's past the compression window (and will be less efficient if near the end of the window).
But here's a practical trick: sort files by extension and then by name before putting them into an archive, and then use any conventional compression. It will very likely put the similar-looking files together, and save you space. Done that in practice, works like a charm.
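The trick amounts to nothing more than choosing the archive order before handing files to tar. A sketch (hypothetical helper name; sorting by extension first, then by name, so similar files land within a typical LZ match window):

```python
from pathlib import Path

def archive_order(paths: list[str]) -> list[str]:
    """Order files so that similar-looking ones sit next to each other
    in the tar stream: group by extension, then sort by path."""
    return sorted(paths, key=lambda p: (Path(p).suffix, str(p)))

print(archive_order(["src/b.c", "src/a.h", "src/a.c", "src/b.h"]))
# → ['src/a.c', 'src/b.c', 'src/a.h', 'src/b.h']
```

With GNU tar you can then feed the sorted list on stdin via `tar -cf out.tar -T -` before running the compressor of your choice.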
I've experimented a bit with bzip3, and I think the results in the README are not representative. I think it's a hand-picked benchmark, with an uncommon input and unfair parameter choices. And it was run on an HDD, which skews the results even more.
For instance, with an 800 MB SQL file, for the same compression time and optimal parameters (within my capacity), bzip3 produced a smaller file (5.7 % compression ratio) than zstd (6.1 % with `--long -15`). But the decompression was about 20× slower (with all cores or just one).
I'm not claiming my stupid benchmark is better or even right. It's just that my results were very different from bzip3's README. So I'm suspicious.
A 4x improvement over lzma is an extraordinary claim. I see the author has also given a result after applying lrzip (which removes long-range redundancies in large files), and the difference isn’t so great (but bzip3 still wins). Does the amazing result without lrzip mean bzip3 is somehow managing to exploit some of that long-range redundancy natively?
I’d be astonished if such a 4x result generalized to tarballs that aren’t mostly duplicated files.
Currently running my own benchmarks, my preliminary results are that zstd becomes competitive again once you add the --long option (so `zstd --long -16 all.tar` instead of `zstd -16 all.tar`). Which is an option that not everyone might be aware of, but whose usefulness should be intuitive for this benchmark of >200 very similar files.
I'd argue that's actually the lowlight of the README since that is a very poor choice of benchmark. Combining a multitude of versions of the same software massively favors an algorithm good at dealing with this kind of repetitiveness in a way that will not be seen in typical applications.
The "Corpus benchmarks" further down in the README are IMHO much more practically relevant. The compression ratio of bzip3 is not significantly better, but the runtime seems quite a bit lower than lzma at least.
The author is super cool. They are one of very few people to write any substantial programs in Malbolge, a programming language designed to be cryptographically hard to use (or something like that)
> bzip2 -9 -k all.tar 981.78s user 9.77s system 95% cpu 8M memory 17:16.64 total
> bzip3 -e -b 256 -j 12 all.tar 2713.81s user 16.28s system 634% cpu 18301M memory 7:10.10 total
The difference in memory usage might be worth noting: 8M v 18301M
Probably worth noting that bzip2 also did by far the worst in this. ~7x larger files than the best bzip3. Large memory use is generally required for good compression.
I'm curious how well gzip would have handled this though, as it's generally low memory too, and all the others in that list have FAR more than 8M memory used.
In my view, improving "long range" compression has the biggest potential. There are many, many algorithms and implementations for very short range (Huffman, arithmetic coding, ANS) and short range (LZ, BWT), but not that much research has gone into "long range" yet. There's deduplication, and large-window LZ / BWT... but not much more yet. What is missing is efficiently (and with little memory) finding similarities in multi-GB data sets. I think sorting by similarity would help there. Or did I miss research in this area?
Small request: write a header or tail block which records the compression efficiency. Bzip2 doesn't. Gzip does. Knowing the uncompressed size can be vital. Yes, there is a risk of lying and making zip bombs.
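For reference, gzip's version of this is the ISIZE field: the uncompressed size modulo 2^32, stored little-endian in the last four bytes of a member (RFC 1952). A sketch of reading it (hypothetical helper name; only meaningful for single-member gzip files, and, as noted, the field can lie, which is exactly the zip-bomb risk):

```python
import struct

def gzip_isize(path: str) -> int:
    """Read gzip's ISIZE trailer: uncompressed length mod 2**32,
    little-endian, in the final 4 bytes (RFC 1952). Untrusted metadata:
    a malicious file can state any value here."""
    with open(path, "rb") as f:
        f.seek(-4, 2)  # 2 = os.SEEK_END
        return struct.unpack("<I", f.read(4))[0]
```

A bzip3 trailer could carry the same information (ideally 64-bit, to avoid the mod-2^32 ambiguity), as long as readers treat it as a hint rather than a promise.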
That only works if the standard actually describes what you're supposed to do with extra data at the end, and everyone agrees.
In practice, there have been antivirus bypasses that made use of AV scanners treating the additional data differently from common extraction software (I believe it was winrar?).
One could argue that a text document with a corrupt file-size byte should still be decodable. One could also argue that the file is completely corrupt and should be treated as such. Knowing that there will be tools that take the first approach regardless, I'd stick to just decoding the extra data and marking the file as potentially damaged, rather than send the user down a data-recovery rabbit hole for a file that'll decompress just fine.
[0] N.J. Larsson, "The Context Trees of Block Sorting Compression," 1998. https://ieeexplore.ieee.org/abstract/document/672147
What I don't get isn't the benefits of BWT on its own. It's why BWT should add any additional benefit if you're already doing Huffman.
https://en.m.wikipedia.org/wiki/Wheeler_Jump
In terms of why it aids compression, or why it's reversible?
As far as the first goes: it transforms n-grams into repeated characters.
(edited to add 12 thread runs of bzip3 and remove superfluous filenames)
Thank you for your benchmark!
Using the results from the readme, seems like bzip3 performs competitively with zstd on both counts.
What makes Perl source benchmark special? Deduplication?
Here's the program as talked about on HN:
https://news.ycombinator.com/item?id=38850961
https://github.com/kspalaiologos/malbolge-lisp
The cursed language: https://en.wikipedia.org/wiki/Malbolge
And the variant used by the program: https://esolangs.org/wiki/Malbolge_Unshackled
And then I remember discovering, several years later, that bzip (the first one) is an obsolete format that is now difficult to even decompress.
I learned my lesson and now use horribly sub-optimal formats that I'm sure will stick around for a long time, if not forever.
New frameworks, new languages, new cars, new heating systems.
If you want to have a stable solution, use "boring" things. They are boring for a reason; not much stuff is changing in them anymore. They are stable.
I use boring stuff all the time.
I agree with the other commenters that lzip would be a good baseline to compare it to, besides bzip2.
Nice job though on testing it on so many CPU architectures!