One reason that uuencode lost out to Base64 was that uuencode used spaces in its encoding. It was fairly common for Internet protocols in those days to mess with whitespace, so it was often necessary to patch up corrupted uuencode files by hand.
Base64, on the other hand, was carefully designed to survive everything from whitespace corruption to being passed through non-ASCII character sets. And then it became widely used as part of MIME.
And yet, Internet protocols (HTTP, at least) sometimes don't play well with the equals signs that are part of base64. That little issue has caused lots of intermittent bugs for me over the years, either from forgetting to urlencode them or from not urldecoding them at the right time.
And slashes as well, which are magic characters in both URLs and file systems. That means you can't reliably use normal base64 for filenames, for instance. It might seem like a niche use case, but it really isn't, because you can use it for content-based addressing. Git does this: it names all the blobs in the .git folder after their hash, but you can't encode the hash with regular base64.
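As a quick illustration of the usual workaround: Python's standard base64 module has a URL- and filename-safe variant that swaps '+' and '/' for '-' and '_', and the '=' padding can be dropped and restored from the length. (The blob here is just made-up example data.)

    import base64

    blob = b"\xfb\xef\xff some arbitrary binary, e.g. a content hash"

    standard = base64.b64encode(blob)          # may contain '+' and '/', plus '=' padding
    urlsafe  = base64.urlsafe_b64encode(blob)  # same data, but with '-' and '_' instead of '+' and '/'

    # Dropping the '=' padding as well gives a token that is safe in URLs and
    # filenames; the padding is put back from the length before decoding.
    token = urlsafe.rstrip(b"=")
    assert base64.urlsafe_b64decode(token + b"=" * (-len(token) % 4)) == blob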
Having lived through the transition, I can say personally it comes down to "packaging" - if MIME had adopted the UUENCODE format, I probably would have used it, but as material came my way that depended on base64 decoding, it became compelling to use it. Once it was ubiquitously available in e.g. SSL, it became trivial to decode a base64-encoded thing, no matter what. Not all systems had a functioning uudecode all the time; on DOS, for instance, you had to find one. If you're given base64 content, you install a base64 encode/decode package and then it's what you have.
There was also an extended period when people did uux much as they did shar: both of which invite somebody else's hands into your execution state and filestore.
We were also obsessed with efficiency. base64 was "sold" as denser encoding. I can't say if it was true overall, but just as we discussed Lempel-Ziv and gzip tuning on usenet news, we discussed uuencode/base64 and other text wrapping.
Ned Freed, Nathaniel Borenstein, Patrik Fältström and Robert Elz, amongst others, come to mind as people who worked on the baseXX encodings and discussed this on the lists at the time. Other alphabets were discussed.
uu* was the product of Mike Lesk a decade earlier, and he was a lot quieter on the lists: he'd moved into different circles, was doing other things, and wasn't really that interested in the chatter around line-encoding issues.
From https://www.usenetarchives.com/view.php?id=comp.mail.mime&mi... :

> Some of the characters used by uuencode cannot be represented in some of the mail systems used to carry rfc 822 (and therefore MIME) mail messages. Using uuencode in these environments causes corruption of encoded data. The working group that developed MIME felt that reliability of the encoding scheme was more important than compatibility with uuencode.
In a followup (same link):
> "The only character translation problem I have encountered is that the back-quote (`) does not make it through all mailers and becomes a space ( )."
A followup from that at https://www.usenetarchives.com/view.php?id=comp.mail.mime&mi... says:

> The back-quote problem is only one of many. Several of the characters used by uuencode are not present in (for example) the EBCDIC character set. So a message transmitted over BITNET could get mangled -- especially for traffic between two different countries where they use different versions of EBCDIC, and therefore different translate tables between EBCDIC and ASCII. There are other character sets used by 822-based mail systems that impose similar restrictions, but EBCDIC is the most obvious one.
> We didn't use uuencode because several members of our working group had experience with cases where uuencoded files were garbaged in transit. It works fine for some people, but not for "everybody" (or even "nearly everybody").
> The "no standards for uuencode" wasn't really a problem. If we had wanted to use uuencode, we would have documented the format in the MIME RFC.
That last comment was from Keith Moore, "the author and co-author of several IETF RFCs related to the MIME and SMTP protocols for electronic mail, among others", according to https://en.wikipedia.org/wiki/Keith_Moore .
After a given point usenet was nearly 8-bit clean, and thus https://en.wikipedia.org/wiki/YEnc was also developed: it offsets every octet, O = (I + 42) mod 256, and escapes the few results that still collide with reserved characters (CR, LF, 0x00, and '=', the yEnc escape itself). If the result falls in that set, an '=' is emitted and the output becomes (O + 64) mod 256 instead.
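A rough sketch of that encoding step in Python, going only by the description above (real yEnc also wraps lines and adds header/trailer lines with a CRC, which this ignores):

    CRITICAL = {0x00, 0x0A, 0x0D, 0x3D}  # NUL, LF, CR, and '=' itself

    def yenc_encode(data: bytes) -> bytes:
        out = bytearray()
        for i in data:
            o = (i + 42) % 256
            if o in CRITICAL:
                out.append(0x3D)       # emit the '=' escape...
                o = (o + 64) % 256     # ...and shift the offending byte again
            out.append(o)
        return bytes(out)

    def yenc_decode(enc: bytes) -> bytes:
        out = bytearray()
        it = iter(enc)
        for o in it:
            if o == 0x3D:              # escape: undo the extra +64 on the next byte
                o = (next(it) - 64) % 256
            out.append((o - 42) % 256)
        return bytes(out)

    assert yenc_decode(yenc_encode(bytes(range(256)))) == bytes(range(256))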
> We were also obsessed with efficiency. base64 was "sold" as denser encoding. I can't say if it was true overall
uuencode has file headers/footers, like MIME. But the actual content encoding is basically base64 with a different alphabet; both add precisely 1/3 overhead (plus up to 2 padding bytes at the end).
uuencode has some additional overhead, namely 2 extra bytes per line, so its efficiency varies from roughly 60% to 70%, the latter being the best case, while base64 is 75% efficient in all cases (ignoring line breaks).
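If you want to measure it rather than argue about it, both encoders are still in the Python standard library (binascii.b2a_uu takes at most 45 bytes, i.e. one uuencode line, per call); the exact ratios depend on line length and newline conventions:

    import base64, binascii, os

    data = os.urandom(45_000)

    # uuencode body: one line per 45-byte chunk (length byte + 60 chars + newline)
    uu = b"".join(binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45))

    # base64 body, unwrapped and wrapped to 76-character lines
    b64 = base64.b64encode(data)
    b64_wrapped = base64.encodebytes(data)   # inserts a newline every 76 characters

    for name, enc in [("uuencode", uu), ("base64", b64), ("base64 + line breaks", b64_wrapped)]:
        print(f"{name:>22}: {len(data) / len(enc):.1%} efficient")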
On a related note, I'm getting flashbacks to being on the web in the late 1990s, back when "Downloads!" was a reason to visit a particular website, and noticing that Windows users like myself could just download-and-run an .exe file, while the same downloads for Macintosh users would be a BinHex file that'd also be much larger than the Windows equivalent - and this wasn't over FTP or Telnet, but an in-browser HTTP download, just like today.
Can anyone explain why BinHex remained "popular" in online Mac communities through to the early 2000s? Why couldn't Macs download "real" binary files back then?
Classic Macintosh files were basically 2 separate files with the same name (a data fork and a resource fork). Additionally, there was important metadata (Finder Info, most importantly the file type and creator type). Since other file systems couldn't handle forks or Finder Info, it all had to be encapsulated in some other format like BinHex, MacBinary, AppleSingle, or StuffIt. The other 3 were binary so they would have been smaller. Why not them... shrug
I wasn't a Macintosh user back in the day, but for the file archives I frequented (Apple II), files were sometimes in BinSCII, which was a similar text encoding. The advantages were that they could be emailed inline, posted to usenet, didn't require an 8-bit connection (important back in the 80s), and could be transferred by screen scraping if there wasn't a better alternative.
What I saw most of the time was a file that was compressed with StuffIt, and then encoded with BinHex. They were usually about inverse in terms of efficiency, so what you saved with StuffIt you would then turn around and lose with BinHex. But the resulting file was roughly the same size as the original file set.
The most common Macintosh archive format was (eventually) StuffIt, but StuffIt Expander couldn't open a .sit file which was missing its resource fork, and when you downloaded a file from the internet, it only came with a data fork.
So a common hack was to binhex the .sit file. Binhex was originally designed to make files 7-bit clean, but had the side effect that it bundled the resource fork and the data fork together.
Later versions of StuffIt could open .sit files which lacked the resource fork just fine, but by then .zip was starting to become more common.
I could be remembering wrong, but didn't later versions of stuffit compress to a .sit file that had no resource fork, so it would stay fully-intact on any filesystem? I may be imagining that, but I remember hitting a certain version where "copying to Windows" would no longer ruin my .sit files... haha
Funny, because today I find the install process on the Mac much simpler. Most installs are "drag this .app file to your Applications folder"; meanwhile, on Windows you download an installer that downloads another installer that does who-knows-what to your system and leaves ambiguously-named files and registry modifications all over the place.
There are plenty of portable windows applications (distributed as a zipped directory) and there are plenty of pkg macOS installers.
I don't really understand why macOS users like this "simple" installation, because when you "uninstall" the app, it leaves all its trash in your system without a chance to clean up. And implying that a macOS application somehow won't do "who-knows-what" to your system is just wrong. Docker Desktop is "simple", yet the first thing it does after launch is install "who-knows-what".
If the installer on Windows is properly done, you actually know exactly what it does to your system (including registry modifications). This includes the ability to remove the application completely.
Whereas on macOS, installation is trivial, but then the application sets up stuff upon first run, and that is really opaque, with no way of properly uninstalling the app unless there is a dedicated uninstaller.

But yeah, the simple case is quite nice.
The one annoying thing macOS apps do is pollute /Library. Even apps that don’t explicitly write to this area end up with dozens of permafiles. Tons of stuff is spewed in there when you install an application that actually uses it. It’s like a directory version of a registry kitchen sink.
A thing I wonder: why is using = padding required in the most common base64 variant?
It's redundant since this info can be fully inferred from the length of the stream.
Even for concatenations it is not necessary to require it, since you must still know the length of each sub stream (and = does not always appear so is not a separator).
There's no way that using the = instead of per-byte length checking gains any speed, since to prevent reading out of bounds you must check the length per byte anyway; you can't trust the input to be a multiple of 4 in length.
It could only make sense if it were somehow required to read 4 bytes at once, with no way to read fewer, but what platform is like that?
The padding character is not essential for decoding, since the number of missing bytes can be inferred from the length of the encoded text. In some implementations, the padding character is mandatory, while for others it is not used. An exception in which padding characters are required is when multiple Base64 encoded files have been concatenated.
I think so too. It feels similar to how many specifications from the 90s use big-endian 4-byte integers for many things (like PNG, RIFF, JPEG, ...) despite little-endian CPUs having been most common since the 80s, with those specifications seemingly assuming that you would want to read those 4-byte values straight in with fread, without any bounds checking or endianness conversion.
Without padding, how would you encode, for example, a message with just a single zero? To be more precise, how do you distinguish it from two zeroes and three zeroes?
Both for encoding and decoding the padding is not needed. Without ='s, you get a uniquely different base64 encoding for NULL, 2 NULLs and 3 NULLs.
This shows the binary, base64 without padding and base64 with padding:
NULL --> AA --> AA==
NULL NULL --> AAA --> AAA=
NULL NULL NULL --> AAAA --> AAAA
As you can see, all the padding does is make the base64 length a multiple of 4. You already get uniquely distinguishable symbols for the 3 cases (one, two or three NULL symbols) without the ='s, so they are unnecessary
The output padding is only relevant for decoding. For encoding, since the alphabet of Base64 is 6 bits wide, zero bits are padded on when the input is not a multiple of 6 bits (e.g. encoding two bytes (16 bits) needs two more bits to reach a multiple of 6 (18)).
Refer to the "Examples" section of the Wikipedia page.
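Circling back to the claim that the '=' is redundant for a single stream: a small Python check that you can always put the padding back from the length alone before handing the text to a strict decoder (the helper name is just for illustration):

    import base64

    def b64decode_unpadded(s: str) -> bytes:
        # A trailing length of 1 mod 4 can never occur in valid base64;
        # otherwise the number of missing '=' is implied by len(s) % 4.
        if len(s) % 4 == 1:
            raise ValueError("not a valid unpadded base64 length")
        return base64.b64decode(s + "=" * (-len(s) % 4))

    for n in (1, 2, 3):
        padded   = base64.b64encode(b"\x00" * n).decode()   # 'AA==', 'AAA=', 'AAAA'
        unpadded = padded.rstrip("=")
        assert b64decode_unpadded(unpadded) == b"\x00" * n
        print(n, "NUL ->", unpadded, "->", padded)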
The = does not appear if the base64 data is a multiple of 4 length. So you wouldn't know if aGVsbG8I is one or two streams. The = is not a separator, only padding to make the base64 stream a multiple of 4 length for some reason.
I only mentioned the concatenation because Wikipedia claims this use case requires padding while in reality it doesn't.
There is still one sort-of efficient way of embedding binary content in an HTML file. You must save the file as UTF-16. A JavaScript string in a UTF-16 HTML file can contain anything except these: \0, \r, \n, \\, ", and unmatched surrogates (0xD800-0xDFFF).
If you escape any disallowed character in the usual way for a string ("\0", "\r", "\n", "\\", "\"", "\uD800") then there is no decoding process, all the data in the string will be correct.
If you throw data that is compressed in there, you're unlikely to get very many zeroes, so you can just hope that there aren't too many unmatched surrogate pairs in your binary data, because those get inflated to 6 times their size.
Note that this operates on 16-bit values. In order to see a null, \r, \n, \\ and ", the most significant byte must also be zero, and in order for your data to contain a surrogate pair, you're looking at the two bytes taken together. When the data is compressed, the patterns are less likely.
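A sketch of that escaping in Python rather than JavaScript (the logic is the same); it assumes the input has an even number of bytes and that the page is saved as UTF-16LE, neither of which the original trick spells out:

    import struct

    # ASCII escape sequences for the disallowed code units
    ESCAPES = {0x00: rb"\0", 0x0A: rb"\n", 0x0D: rb"\r", 0x22: rb'\"', 0x5C: rb"\\"}

    def to_js_string_utf16le(data: bytes) -> bytes:
        """Wrap raw bytes in a double-quoted JS string literal, emitted as UTF-16LE text."""
        assert len(data) % 2 == 0, "pad the input to an even length first"
        units = struct.unpack("<%dH" % (len(data) // 2), data)
        out = bytearray()

        def ascii_units(s: bytes):          # emit ASCII chars as UTF-16LE code units
            for c in s:
                out += bytes((c, 0))

        ascii_units(b'"')
        i = 0
        while i < len(units):
            u = units[i]
            if u in ESCAPES:
                ascii_units(ESCAPES[u])
            elif 0xD800 <= u <= 0xDBFF and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                out += struct.pack("<2H", u, units[i + 1])   # matched surrogate pair: keep raw
                i += 1
            elif 0xD800 <= u <= 0xDFFF:
                ascii_units(b"\\u%04X" % u)                  # lone surrogate: the 6x blow-up case
            else:
                out += struct.pack("<H", u)                  # everything else passes through untouched
            i += 1
        ascii_units(b'"')
        return bytes(out)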
Not listed was a clever encoding for MS-DOS files, XXBUG[1]. DOS had a rudimentary debugger and memory editor. (It even stuck around all the way to Windows XP but didn't survive the transition to 64-bit.) Because it could write to disk, you could convert any file to hexadecimal bytes and sprinkle some control commands about to create a script for DEBUG.EXE. The text-encoded file could then be sent anywhere without needing to download a decoder program first.
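I don't have an XXBUG encoder to hand, but the general idea can be sketched in Python: emit a DEBUG.EXE script that enters the bytes with E, sets CX to the file length, names the output with N, and writes it with W. This is only a rough illustration for small files; the output file name and the 16-bytes-per-line layout are arbitrary choices here, not the actual XXBUG format (see [1] below).

    def debug_script(data: bytes, filename: str = "OUT.BIN") -> str:
        """Emit a DEBUG.EXE script that recreates `data` on disk (small files only)."""
        assert len(data) + 0x100 <= 0x10000, "W writes CX bytes from offset 0100 of one segment"
        lines = [f"N {filename}"]
        for off in range(0, len(data), 16):                  # enter 16 bytes per E command
            chunk = " ".join(f"{b:02X}" for b in data[off:off + 16])
            lines.append(f"E {0x100 + off:04X} {chunk}")
        lines += ["RCX", f"{len(data):04X}", "W", "Q"]       # set CX = length, write, quit
        return "\r\n".join(lines) + "\r\n"

    # On the receiving DOS machine: DEBUG < HELLO.SCR
    print(debug_script(b"Hello, DOS!\r\n", "HELLO.TXT"))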
> Base64, on the other hand, was carefully designed to survive everything from whitespace corruption to being passed through non-ASCII character sets. And then it became widely used as part of MIME.
Still more robust than uuencode though.
.-_ would have been a better choice than +/=
But I think it's likely just poor design taste.
> So you wouldn't know if aGVsbG8I is one or two streams.

I'm not sure I understand this part. You can decode aGVsbG8=IHdvcmxk, what do you need to know?
> Can anyone explain why BinHex remained "popular" in online Mac communities through to the early 2000s?

Now, 25+ years later, I have some answers - thanks!
[1] http://justsolve.archiveteam.org/wiki/XXBUG