The most magical part of this transform is the search! First learned about this in a bioalgorithms course, and the really cool property is that for a string length l, you can search for the string in O(l) time. It has the same search time complexity of a suffix tree with O(n) space complexity (with a very low constant multiple). To this day it may be the coolest algorithm I've encountered.
I encountered the search version of this, which is turned suffix arrays, in grad school and was so taken by them I incorporated them as the primary search mechanism for the pi searcher. 25 years later it's still the best way to do it. Incredible insight behind bwt and suffix arrays.
I remember randomly reading about this in the Hugi demoscene diskmag as a young teen, and it completely blew my mind (notably the reverse transform, apparently not covered in OP's article): https://hugi.scene.org/online/coding/hugi%2013%20-%20cobwt.h...
The author later wrote many now-famous articles, including A Trip Through the Graphics Pipeline.
Oh wow, I clicked that link without reading the last section of your comment, and was like “Fabian Giesen!”, an absolute HN favourite: https://hn.algolia.com/?q=fgiesen
Noteworthy that the paper that was published describing this technique came at a time when IP lawyers had begun to destroy the world wrt to small business innovation. And that they released it unencumbered is a huge debt of gratitude that we haven’t really honored.
An interesting bit of trivia about the Burrows-Wheeler transform is that it was rejected when submitted to a conference for publication, and so citations just point to a DEC technical report, rather than a journal or conference article.
I only know the story from word-of-mouth. My understanding is that the submitted paper wasn't written/presented well, and reviewers had trouble understanding the significance of what was being proposed. But take that story with a big grain of salt.
For some reason I was really getting lost on step 2, “sort” it. They mean lexicographically or “sort it like each rotation is a word in the dictionary, where $ has the lowest character value”. Now that I see it, I’m a little embarrassed I got stuck on that step, but hopefully it helps someone else.
This is a good article and a fun algorithm. Google made a great series of compression videos a while ago - albeit sometimes silly. For the Burrows-Wheeler one they interviewed Mike Burrows for some of the history. https://youtu.be/4WRANhDiSHM
Most BWT descriptions: now draw the rest of the owl.
Once upon a time I found one that did the entire round trip, I neglected to bookmark it. So when I eventually forgot how it worked, or wanted to show others, I was stuck. Never did find it.
This always seemed magical to me too, but Gemini recently gave me an explanation that clicked.
The seemingly magical part is that it seems like you "lose" information during the transformation. After all if I give you the sorted string, it's not recoverable. But the key is that the the first and last columns of the BWT matrix end up being the lookup table, from any character to the preceding character. And of course given the BWT encoded string, you can get back the sorted string which is the first column. And with this info of which character precedes which character, it shouldn't be too magical that you can work backwards to reconstruct the original string.
Still, the really clever part is that the BWT encoded string embeds the decoding key directly into the character and positional information. And it just so happens that the way it does this captures character pair dependencies in a way that makes it a good preprocessor for "symbol at a time" encoders.
>It sorts letters in a text by their [surrounding] context, so that letters with similar context appear close to each other. In a strange vaguely DFT-like way, it switches long-distance and short-distance patterns around. The result is, in a typical text, long runs of the same letter, which can be easily compressed.
Yes, agreed. Presumably if you just compressed the sorted string it would compress even better, though not reversibly. So compressing the preceding column (preceding since rows are rotations) seems the next best thing
I think what did it for me is to recognize that the last column has characters that, for each row, come before the characters in the first column (since they are rotations). So we can view the last column as coming "before" the first one.
1. We sorted the first column to get the BWT column. Thereby we created the structure where the BWT column comes before the sorted column.
2. Therefore if we insert the BWT column before a sorted column, the order of row elements is preserved
3. If we now sort again, the order of characters across individual rows is again preserved
4. Going to step 2 again preserves row order
5. Once all columns are populated, therefore all rows are in the correct original order. And thanks to the end marker we can get the original string from any row.
The algorithm given in this page for the reverse transform is not usably efficient; the algorithm given in the original paper is, and I illustrated it in my explorable explanation of it linked in my top-level comment here. It was a pretty surprising insight!
The author later wrote many now-famous articles, including A Trip Through the Graphics Pipeline.
By chance I once sat near him and the rest of Farbrausch at a demoparty, but I was too shy to say hi.
Once upon a time I found one that did the entire round trip, I neglected to bookmark it. So when I eventually forgot how it worked, or wanted to show others, I was stuck. Never did find it.
The seemingly magical part is that it seems like you "lose" information during the transformation. After all if I give you the sorted string, it's not recoverable. But the key is that the the first and last columns of the BWT matrix end up being the lookup table, from any character to the preceding character. And of course given the BWT encoded string, you can get back the sorted string which is the first column. And with this info of which character precedes which character, it shouldn't be too magical that you can work backwards to reconstruct the original string.
Still, the really clever part is that the BWT encoded string embeds the decoding key directly into the character and positional information. And it just so happens that the way it does this captures character pair dependencies in a way that makes it a good preprocessor for "symbol at a time" encoders.
https://news.ycombinator.com/item?id=35738839 has a nice pithy way of phrasing it
>It sorts letters in a text by their [surrounding] context, so that letters with similar context appear close to each other. In a strange vaguely DFT-like way, it switches long-distance and short-distance patterns around. The result is, in a typical text, long runs of the same letter, which can be easily compressed.
1. Put the BWT string in the right-most empty column
2. Sort the rows of the matrix such that the strings read along the columns of the matrix are in lexicographical order starting from the top-row????
3. Repeat step 1 and 2 until matrix is full
4. Extract the row of the matrix that has the end-delimiter in the final column
It's the "sort matrix" step that seems under-explained to me.
1. We sorted the first column to get the BWT column. Thereby we created the structure where the BWT column comes before the sorted column.
2. Therefore if we insert the BWT column before a sorted column, the order of row elements is preserved
3. If we now sort again, the order of characters across individual rows is again preserved
4. Going to step 2 again preserves row order
5. Once all columns are populated, therefore all rows are in the correct original order. And thanks to the end marker we can get the original string from any row.