How about assume utf-8, and if someone has some binary file they'd rather a program interpret as some other format, they turn it into utf-8 using a standalone program first. Instead of burning this guess-what-bytes-they-might-like nonsense into all the software.
We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
I doubt you can handle UTF-8 properly with that attitude.
The problem is, there is one very popular OS where it's very hard to enforce UTF-8 everywhere: Microsoft Windows.
It's very hard to ensure the entire software stack you depend on uses the Unicode versions of the Win32 APIs. Actually, the native character encoding in Windows is UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back. Even if you don't, you have to ensure all the low-level code you depend on does that conversion for you.
Oh, and don't forget about Unicode normalization. There is no THE UTF-8: there are a bunch of UTF-8s with different Unicode normalizations. Apple's macOS uses NFD while others mostly use NFC.
These are just some examples. When people living in the ASCII world casually say "I just assume UTF-8", in reality they're still assuming ASCII.
> Actually, the native character encoding in Windows is UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back.
Yes. You should convert your strings. Thankfully, UTF-16 is very difficult to confuse with UTF-8 because they're completely incompatible encodings. Conversion is (or should be) a relatively simple process in basically any modern language or environment. And personally, I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?). The different forms are (or should be) visually completely identical for the user - at least on modern computers with decent unicode fonts.
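Conversion really is a one-liner in most environments. A minimal Python sketch of the round trip (the helper names here are made up for illustration):

```python
# Round-tripping between UTF-8 (files, the network) and UTF-16-LE (the
# layout that Win32 "wide" APIs expect). Decoding strictly means invalid
# input raises instead of being silently mangled.

def utf8_to_utf16le(data: bytes) -> bytes:
    return data.decode("utf-8").encode("utf-16-le")

def utf16le_to_utf8(data: bytes) -> bytes:
    return data.decode("utf-16-le").encode("utf-8")

original = "naïve café".encode("utf-8")
assert utf16le_to_utf8(utf8_to_utf16le(original)) == original  # lossless
```

On Windows itself, this is what MultiByteToWideChar/WideCharToMultiByte do for you.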
The largest problem with UTF-8 (and its biggest strength) is how similar it is to ASCII. It is for this reason we should consider emoji to be a wonderful gift to software correctness everywhere. Correctly handling emoji requires that your software can handle unicode correctly - because they need multi-unit encoding with both UTF-16 and UTF-8. And emoji won't render correctly unless your software can also handle grapheme clusters.
> When people living in the ASCII world casually say "I just assume UTF-8", in reality they're still assuming ASCII.
Check! If your application deals with text, throw your favorite multi-codepoint emoji into your unit testing data. (Mine is the polar bear). Users love emoji, and your software should handle them correctly. There's no excuse! Even the Windows filesystem passes this test today.
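For the record, the polar bear is a ZWJ sequence, which is exactly what makes it such good test data: every naive way of counting its "length" gives a different answer. A quick Python check:

```python
# Polar bear = bear face + zero-width joiner + snowflake + variation selector.
polar_bear = "\U0001F43B\u200D\u2744\uFE0F"  # 🐻‍❄️

assert len(polar_bear) == 4                           # 4 code points
assert len(polar_bear.encode("utf-8")) == 13          # 13 UTF-8 bytes
assert len(polar_bear.encode("utf-16-le")) // 2 == 5  # 5 UTF-16 code units
# And it is still ONE grapheme cluster: rendering and cursor movement must
# treat those 4 code points as a single user-visible character (the stdlib
# won't tell you that; a grapheme/ICU library is needed).
```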
Older versions of macOS did enforce NFD for file names, but more recent versions don't, at least at the OS level. But many Apple programs, such as Finder, _will_ use NFD. Except that it isn't even Unicode-standardized NFD; it's Apple's own modified version of it. And this can cause issues when, for example, you create a file in Finder, then search for it using `find` and type the name of the file the exact same way, but it can't find the file because `find` got an NFC form while the actual file name is in NFD.
OTOH, in many applications, you don't really care about the normalization form used. For example, if you are parsing a CSV, you probably don't need to worry about whether one of the cells is using a single code point or two code points to represent that accented e.
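The single-vs-two-code-point situation is easy to see with the standard library:

```python
import unicodedata

# "café" in NFC ends with U+00E9; in NFD it ends with "e" + combining acute.
nfc = unicodedata.normalize("NFC", "caf\u00e9")
nfd = unicodedata.normalize("NFD", "caf\u00e9")

assert nfc != nfd        # byte-for-byte different strings...
assert len(nfc) == 4
assert len(nfd) == 5
# ...so normalize both sides before comparing, e.g. for filename lookups:
assert unicodedata.normalize("NFC", nfd) == nfc
```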
We make some B2B software running on Windows, integrating with customer systems. We get a lot of interesting files.
About a decade ago I wrote some utility code for reading files, where it'll try to detect a BOM first, and if there's none, scan for invalid UTF-8 sequences. If none are found, assume UTF-8; else assume Windows-1252. It's worked well for us so far.
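A sketch of that same detection logic in Python (the fallback choice of Windows-1252 is, as above, just what works for Western-European-leaning data):

```python
import codecs

def sniff_and_decode(data: bytes) -> str:
    # 1. Honor a BOM if present.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")
    # 2. Try strict UTF-8; any invalid byte sequence raises.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # 3. Assume Windows-1252. A few bytes (0x81, 0x8D, ...) are undefined
        #    even there, hence errors="replace" rather than a hard failure.
        return data.decode("windows-1252", errors="replace")
```

For example, `sniff_and_decode("héllo".encode("cp1252"))` comes back as `"héllo"`, because 0xE9 followed by an ASCII letter is not valid UTF-8, so the fallback kicks in.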
Still get the occasional flat file in Windows-1252 with one random field containing UTF-8, so some special handling is needed for those cases. But that's rare.
Fortunately we don't have to worry about normalization for the most part. If we're parsing then any delimiters will be one of the usual suspects and the rest data.
Microsoft Windows is a source of many a headache for me as almost every other client I write code for has to deal with data created by humans using MS Office. Ordinary users could be excused, because they are not devs but even devs don't see a difference between ASCII and UTF-8 and continue to write code today as if it was 1986 and nobody needed to support accented characters.
I got a ticket about some "folders with Chinese characters" showing up on an SMB share at work. My first thought was a Unicode issue, and sure enough: when you combine two ASCII bytes into one UTF-16 code unit, it will usually wind up in the CJK Unified Ideographs range of Unicode. Some crappy software had evidently bypassed the appropriate Windows APIs and just directly written a C-style ASCII string onto the filesystem without realizing that NTFS file names are UTF-16.
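That failure mode is easy to reproduce: any pair of ASCII bytes, read as one little-endian UTF-16 code unit, usually lands in the CJK block.

```python
# "ab" is bytes 0x61 0x62; read as UTF-16-LE that's the single
# code point U+6261 -- a CJK ideograph, not two Latin letters.
mojibake = b"ab".decode("utf-16-le")

assert mojibake == "\u6261"
assert len(mojibake) == 1
assert 0x4E00 <= ord(mojibake) <= 0x9FFF   # CJK Unified Ideographs range
```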
Do you know of a resource that explains character encoding in greater detail? Just for my own curiosity. I am learning web development and boy, they brow beat UTF-8 upon us which okay, I'll make sure that call is in my meta data, but none bother to explain how or why we got to that point, or why it seems so splintered.
Non-technical users don't want to do that, and won't understand any of that. That's the unfortunate reality of developing software for people.
If Excel generates CSV files with some Windows-1234 encoding, then my "import data from CSV" function needs to handle that, in one way or another. A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding, they won't care that Microsoft is using obsolete or weird defaults. They will see it as a bug in my program and demand that I fix my software. Even if Excel offers them a choice in encoding, they won't understand any of that and more importantly they don't want to deal with that right now, they just want the thing to work.
> then my "import data from CSV" function needs to handle that, in one way or another.
It doesn't. Well, maybe "another".
Your function or even app doesn't need to handle it. Here's what we did on a bookkeeping app: remove all the heuristics, edge cases, broken-CSV handling and validation from all the CSV ingress points.
Add one service that did one thing: receive a CSV, and normalize it. All ingress points could now assume a clean, valid, UTF8 input.
It removed thousands of LOCs, hundreds of issues from the backlog. It made debugging easy and it greatly helped the clients even.
At some point, when we offered their import runs for download, we added our normalized versions as [original name]_clean.csv. Got praise for that. Clients loved it, as they were often well aware of their internal mess of broken CSVs.
The point is to separate the cleaning stage from the import stage. Having a clean UTF-8 CSV makes debugging import issues so much easier. And there are several well-working CSV tools, such as the Python ones, that can detect not only character encodings but also record separators and the various quotation idiosyncrasies you also need to be aware of when dealing with Microsoft Office files. Other people have already thought long and hard about that stuff so you don't have to.
Plus, Excel really likes to use semicolons instead of commas for comma-separated files (in locales where the comma is the decimal separator). That's another idiosyncrasy that programmers need to take into account.
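The stdlib `csv.Sniffer` can guess the delimiter from a sample, which covers the semicolon case without hardcoding either choice:

```python
import csv
import io

# A semicolon-delimited "CSV" like Excel produces in many European locales.
sample = "name;amount\nwidget;3\ngadget;5\n"

# Restrict candidates to the two we expect and let the sniffer pick.
dialect = csv.Sniffer().sniff(sample, delimiters=";,")
assert dialect.delimiter == ";"

rows = list(csv.reader(io.StringIO(sample), dialect))
assert rows[0] == ["name", "amount"]
assert rows[1] == ["widget", "3"]
```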
>A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding...
> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
Whatever happened to the Robustness Principle[1]? I think the entire comment section of this article has forgotten it. IMO the best software accepts many formats and "deals with it," or at least attempts to, rather than just exiting with "Hahah, Error 19923 Wrong Input Format. Try again, loser."
We collectively discovered that we were underestimating the long term costs, by a lot, so its lustre has faded. This is in some sense relatively recent, so word is still getting around, as the programming world does not move as quickly as it fancies itself to.
If you'd like to see why, read the HTML 5 parsing portion of the spec. Slowly and carefully. Try to understand what is going on and why. A skim will not reveal the issue. You will come to a much greater understanding of the problem.
Some study of what happened when we tried to upgrade TCP (not the IPv4-to-IPv6 transition; that's its own thing) and why the only two protocols that can practically exist on the Internet anymore are TCP and UDP may also be of interest.
Being lenient is all well and good when the consequences are mild. When the consequences of misinterpreting or interpreting differently to a second implementation becomes costly, such as a security exploit, then the Robustness Principle becomes less obviously a win.
It's important to understand that every implementation will try to fix-up formatting problems in their own way unique to their particular implementation. From that you get various desync or reinterpretation attacks (eg. HTTP request smuggling).
That is fine in contexts where a wrong guess does no harm.
But that is not always the case, and e.g. silently "fixing" text encoding issues can often corrupt the data if you get it wrong.
By all means offer options if you want, but if you do, flag very clearly to the user that they're risking data corruption, unless any errors are very apparent and trivial to undo.
This basically loses data integrity if it's wrong though.
You might want to do that with human input if it's helpful to the user - e.g. a user enters a phone number and you strip dashes, etc. But if it's machine-to-machine, it should just follow the spec.
The article addresses this, that current thinking in many places is that the robustness principle / Postel's Law maybe wasn't the best idea.
If you reject malformed input, then the person who created it has to go back and fix it and try again. If you interpret malformed input the best you can (and get it right), then everyone else implementing the same thing in the future now also has to implement your heuristics and workarounds. The malformed input effectively becomes a part of the spec.
This is why HTML & CSS are the garbage dump they are today, and why different browsers still don't always display pages exactly alike. The reason HTML5 exists is because people finally just gave up and decided to standardize all the broken behavior that was floating around in the wild. Pre-HTML5, the web was an outright dumpster fire of browser compatibility issues (as opposed to the mere garbage dump we have today).
Anyway, it's not really important to try to convince you that Postel's Law is bad; what's important is that you know that many people are starting to think it's bad, and there's no longer any strong consensus that it was ever a good thing.
I've lived through dealing with non-UTF8 encoding issues and it was a truly gigantic pain in the ass. I'm much more on the side now of people who only want to deal with UTF8 and fully support software that tells any other encoding to go pound sand. The harder life gets for people who use other encodings (yes, particularly Microsoft) the more incentive they have to eventually get on board and stop costing everyone time and effort managing all this nonsense.
> they turn it into utf-8 using a standalone program first
I took the article to be for people who would be writing that "standalone program"?
I have certainly been in a position where I was the person who had to deal with input text files with unknown encodings. There was no-one else to hand off the problem to.
Not every encoding can make a round trip through Unicode without you writing ad hoc handling code for every single one. There's a number of reasons some of these are still in use and Unicode destroying information is one of them.
> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
Browsers used to have a menu option to choose the encoding you wanted to use to decode the page.
In Firefox, that's been replaced by the magic option "Repair Text Encoding". There is no justification for this.
They seem to be in the process of disabling that option too:
> Note: On most modern pages, the Repair Text Encoding menu item will be greyed-out because character encoding changes are not supported.
> Supporting the specific manually-selectable encodings caused significant complexity in the HTML parser when trying to support the feature securely (i.e. not allowing certain encodings to be overridden). With the current approach, the parser needs to know of one flag to force chardetng, which the parser has to be able to run in other situations anyway, to run.
> Elaborate UI surface for a niche feature risks the whole feature getting removed
> Telemetry [...] suggested that users aren’t that good at choosing correctly manually.
In other words, it's trying to protect users from themselves by dumbing down the browser. (Never mind that people who know what they are doing have probably also turned off telemetry...)
Probably because most websites now send a correct encoding header or meta tag, so the user changing it can only make things wrong. (Assuming the encoding header is never wrong, which it sometimes is in reality.)
If you give me a computer timestamp without a timezone, I can and will assume it's in UTC. It might not be, but if it's not and I process it as though it is, and the sender doesn't like the results, that's on them. I'm willing to spend approximately zero effort trying to guess what nonstandard thing they're trying to send me unless they're paying me or my company a whole lot of money, in which case I'll convert it to UTC upon import and continue on from there.
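That "convert on import and move on" policy is a couple of lines in most languages; a Python sketch:

```python
from datetime import datetime, timezone

def parse_assuming_utc(stamp: str) -> datetime:
    """Parse an ISO-ish timestamp; if it carries no zone info, pin it to UTC."""
    dt = datetime.fromisoformat(stamp)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # the documented assumption
    return dt.astimezone(timezone.utc)        # normalize offsets to UTC too

# Naive input: assumed UTC.
assert parse_assuming_utc("2024-04-29T14:03:06").isoformat() \
    == "2024-04-29T14:03:06+00:00"
# Offset input: converted to UTC on import.
assert parse_assuming_utc("2024-04-29T14:03:06-08:00").isoformat() \
    == "2024-04-29T22:03:06+00:00"
```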
Same with UTF-8. Life's too short for bothering with anything else today. I'll deal with some weird janky encoding for the right price, but the first thing I'd do is convert it to UTF-8. Damned if I'm going to complicate the innards of my code with special case code paths for non-UTF-8.
If there were some inherent issue with UTF-8 that made it significantly worse than some other encoding for a given task, I'd be sympathetic to that explanation and wouldn't be such a pain in the neck about this. For instance, if it were the case that it did a bad job of encoding Mandarin or Urdu or Xhosa or Persian, and the people who use those languages strongly preferred to use something else, I'd understand. However, I've never heard a viable explanation for not using UTF-8 other than legacy software support, and if you want to continue to use something ancient and weird, it's on you to adapt it to the rest of the world because they're definitely not going to adapt the world to you.
It depends on the domain. If you are writing calendar software, it is legitimate to have "floating time" i.e. your medication reminder is at 7pm every day, regardless of time zone, travel, or anything else.
Unfortunately Google and many other companies have decided UTC is the only way, so this causes issues with ICS files that use that format sometimes when they are generating their helpful popups in the GMail inbox.
> If you are writing calendar software, it is legitimate to have "floating time" i.e. your medication reminder is at 7pm every day, regardless of time zone, travel, or anything else.
If you have to take medication (for instance, an antibiotic) every 24 hours, it must be taken at the same UTC hour, even if you took a train to a town in another timezone. Keeping the same local time even when the timezone changes would be wrong for that use case.
> For instance, if it were the case that it did a bad job of encoding Mandarin
I don't know if you picked this example on purpose, but Chinese text encoded as UTF-8 is 50% larger than in the old encoding (GB2312). I remember people caring about this twenty years ago. I don't know of anyone who still cares about this encoding inefficiency: any compression algorithm can remove it while using negligible CPU to decompress.
That doesn't seem like the worst issue imaginable. I doubt there are too many cases where every byte counts, text uses a significant portion of the available space, and compression is unavailable or inefficient. If we were still cramming floppies full of text files destined for very slow computers, that'd be one thing. Web pages full of uncompressed text are still either so small that it's a moot point or so huge with JS, images, and fonts that the relative text size isn't that significant.
Which is all to say that you're right, but I can't imagine that it's more than a theoretical nuisance outside some extremely niche cases.
For Asian languages, UTF-8 is basically the same size as any other encoding when compressed[0] (and you should be using compression if you care about space) so in practice there is no data size advantage to using non-standard encodings.
A key aspect is that nowadays we rarely encode pure text - while other encodings are more efficient for encoding pure Mandarin, nowadays a "Mandarin document" may be an HTML or JSON or XML file where less than half of the characters are from CJK codespace, and the rest come from all the formatting overhead which is in the 7-bit ASCII range, and UTF-8 works great for such combined content.
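The mixed-content effect is easy to quantify:

```python
# Pure CJK text: UTF-8 needs 3 bytes per character, UTF-16 needs 2.
text = "你好世界"
assert len(text.encode("utf-8")) == 12
assert len(text.encode("utf-16-le")) == 8

# Wrap it in ASCII markup and the advantage flips:
html = "<p>你好</p>"
assert len(html.encode("utf-8")) == 13      # 7 ASCII bytes + 2 * 3
assert len(html.encode("utf-16-le")) == 18  # 2 bytes for every character
```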
> For instance, if it were the case that it did a bad job of encoding Mandarin
Please look up the issues caused by Han unification in Unicode. It’s an important reason why the Chinese and Japanese encodings are still used in their respective territories.
I can't help myself. The grandest of nitpicks is coming your way. I'm sorry.
> If you give me a computer timestamp without a timezone, I can and will assume it's in UTC.
Do you mean, give you an _offset_? In `2024-04-29T14:03:06.0000-08:00`, the `-08:00` is an offset. It only tells you what time this stamp occurred relative to standard time. It does not tell you anything about the region or zone itself. While I have consumed APIs that give me the timezone context as part of the response, none of them make it part of the timestamp itself.
The only time you should assume a timestamp is UTC is if it has the `Z` at the end (assuming ISO 8601) or is otherwise marked as UTC. Without that, you have absolutely no information about where or when the time occurred -- it is local time. And if your software assumes a local timestamp is UTC, then I argue it is not the sender of that timestamp's problem that your software is broken.
My desire to meet you at 4pm has no bearing on whether the DST switchover has happened, or my government decides to change the timezone rules, or {any other way the offset for a zone can change for future or past times}. My reminder to take my medicine at 7pm is not centered on UTC or my physical location on the planet. It's just at 7pm. Every day. If I go from New York to Paris then no, I do not want your software to tell me my medicine is actually supposed to be at midnight. It's 7pm.
But, assuming you aren't doing any future scheduling, calendar appointments, bookings, ticket sales, transportation departure, human-centric logs, or any of the other ways Local Time is incredibly useful -- ignore away.
As I mentioned in another reply, "remind me every day at 7PM" isn't a timestamp. It's a formula for how to determine when the next timestamp is going to occur. Even those examples are too narrow, because it's really closer to "remind me the next time you notice that it's after 7PM wherever I happen to be, including if that's when I cross a time zone and jump instantly from 6:30PM to 7:30PM".
Consider my statement more in the context of logs of past events. The only time you can reasonably assume a given file is in a particular non-UTC TZ is when it came from a person sitting in your same city, from data they collected manually, and you're confident that person isn't a time geek who uses UTC for everything. Otherwise there's no other sane default when lacking TZ/offset data. (I know they're not the same, but they're similar in the sense that they can let you convert timestamps from one TZ to another).
It's always nice to see someone who actually understands time.
"Convert to UTC and then throw away the time zone" only works when you need to record a specific moment in time so it's crazy how often it's recommended as the universal solution. It really isn't that hard to store (datetime, zone) and now you're not throwing away information if you ever need to do date math.
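Concretely, with Python's zoneinfo (3.9+; on Windows this also needs the tzdata package), "same wall time tomorrow" is not the same as "24 hours later" once DST intervenes:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
before = datetime(2024, 3, 9, 19, 0, tzinfo=ny)  # 7 PM, night before US DST
after = before + timedelta(days=1)               # "same wall time tomorrow"

assert after.hour == 19                          # still 7 PM locally
# Convert to UTC to measure actual elapsed time: only 23 hours, because
# DST started overnight and an hour of wall time vanished.
elapsed = after.astimezone(timezone.utc) - before.astimezone(timezone.utc)
assert elapsed == timedelta(hours=23)
```

Keeping (datetime, zone) lets you choose either interpretation later; a bare UTC instant can't recover the wall-clock intent.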
Developers should assume UTF-8 for text files going forward.
UTF-8 should have no BOM. It is the default, and it has no byte-order ambiguity that would need a mark in the first place. Requiring a UTF-8 BOM just destroys the happy planned property that ASCII-is-UTF-8. Why spoil that good work?
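For reference, the UTF-8 "BOM" is just U+FEFF serialized in UTF-8: three fixed bytes with no order to mark.

```python
import codecs

assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8

# A leading BOM means an ASCII-only file is no longer byte-identical to
# its UTF-8 form, which is exactly the property worth preserving.
data = codecs.BOM_UTF8 + "hello".encode("ascii")
assert data.decode("utf-8") == "\ufeffhello"   # strict utf-8 keeps the BOM
assert data.decode("utf-8-sig") == "hello"     # "utf-8-sig" strips it
```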
Others variants of Unicode have BOMs, e.g. UTF-16BE.
We know CJK languages need UTF-16 for compression.
The BOM is only a couple more bytes.
No problem, so far so good.
But there are old files, that are in 'platform encoding'.
Fine, let there be an OS 'locale', that has a default encoding.
That default can be overridden with another OS 'encoding' variable. And that can be overridden by an application arg. And there may be a final default for a specific application, that is only ever used with one language encoding. Then individual files can override all of the above ...
Text file (serialization) formats should define handling of an optional BOM followed by an ASCII header that defines the encoding of the body that follows. One can also imagine file extensions that imply a unique or default encoding, with values maintained by, say, IANA (like MIME types). XML got this right. JSON and CSV are woefully neglectful, almost to the point of criminal liability.
But in the absence of all of the above, the default-default-default-default-default is UTF-8.
We are talking about the future, not the past. Design for UTF-8 default, and BOMs for other Unicodes. Microsoft should have defined BOMs for Windows-Blah encodings, not for UTF-8!
When the whole world looks to the future, Microsoft will follow. Eventually. Reluctantly.
The specific use case the OP author was focusing on was CSV. (A format which has no place to signal the encoding inline). They noted that, to this day, Windows Excel will output CSV in Win-1252. (And the user doing the CSV export has probably never even heard of encodings).
If you assume UTF-8, you will have corrupted text.
I agree that I'm mad about Excel outputting Win-1252 CSV by default.
Programming languages have lumbered slowly towards UTF-8 by default but from time to time you find an environment with a corrupted character encoding.
I worked at an AI company that was ahead of its time (actually I worked at a few of those) where the data scientists had a special talent for finding Docker images with strange configurations so all the time I'd find out one particular container was running a Hungarian or other wrong charset.
(And that's the problem with Docker... People give up on being in control of their environment and just throw in five different kitchen sinks and it works... Sometimes)
If csv files bring criminal liability then I am guilty.
Sidenote: this particular criminal conspiracy is open to potential new members. Please join the Committee To Keep Csv Evil: https://discord.gg/uqu4BkNP5G
Jokes aside, talking about the future is grand but the problem is that data was written in the past and we need to read it in the present. That means that you do have to detect encoding and you can't just decide that the world runs on UTF-8. Even today, mainland China does not use UTF-8 and is not, as far as I know, in the process of switching.
I understand UTF-8 is mostly fine even for east asian languages though - and bytes are cheap
There is no proper CSV specification. That does bring opprobrium. RFC 4180 is from 2005, long after Unicode and XML, so people should have known better. The absence of a real standard points to disagreement, perhaps commercial disagreement, but the IETF should be independent, shouldn't it?
That failure to standardize encoding, and other issues (headers, etc.), has wasted an enormous amount of time for many creative hackers, who could have been producing value, rather than banging their head against a vague assembly of pseudo-spec-illusions. Me included, sadly.
> China does not use UTF-8 and is not, as far as I know, in the process of switching.
That’s … not true? Most Chinese software and websites are utf8 by default, and it’s been that way for a while. GBK and her sisters might linger around in legacy formats, but UTF-8 has certainly reached the point of mass adoption from what I can tell.
> We know CJK languages need UTF-16 for compression.
My understanding is that it is for the opposite of compression: it saves memory when uncompressed, versus UTF-8's multi-byte sequences needing more bytes. My understanding is that UTF-8 multi-byte sequences compress pretty well, as they have common patterns that form dictionary "words" just as easily as anything else. UTF-8 seems to be winning in the long run even for CJK and astral-plane scripts on disk, and the operating systems and applications that preferred UTF-16 in memory are mostly only doing so out of backwards compatibility, and are themselves often using more UTF-8 buffers internally, as those reflect the files at rest.
(.NET has backwards compatibility based on using UTF-16 code-unit strings by default, but has more and more UTF-8-only pathways and some interesting compile-time options now to go UTF-8 only today. Python 3 settled on Unicode strings throughout, with UTF-8 as the default encoding at the boundaries, even with input from CJK communities. UTF-8 really does seem to be slowly winning everything.)
> JSON and CSV are woefully neglectful,
As the article also points out, JSON probably got it right: UTF-8 only and BOM is an error (because UTF-8) (but parsers are allowed to gently ignore that error if they wish). https://www.rfc-editor.org/rfc/rfc8259#section-8.1
That seems to be the way forward for new text-based formats that only care about backward compatibility with low-byte ASCII: UTF-8 only, no BOM. UTF-8 (unlike UTF-16, which lacks reserved surrogate space for further extension) could be extended if we ever do find a reason to go past the "astral planes".
(Anyone still working in CSV by choice is maybe guilty of criminal liability though. I still think the best thing Excel could do to help murder CSV is give us a file extension to force Excel to open a JSON file, like .XLJSON. Every time I've resorted to CSV has been because "the user needs to double click the file and open in Excel". Excel has great JSON support, it just won't let you double click a file for it, which is the only problem, because no business executive wants the training on "Data > From JSON" no matter how prominent in the ribbon that tool is.)
> When the whole world looks to the future, Microsoft will follow.
That ship is turning slowly. Windows backward compatibility guarantees likely mean that Windows will always have some UTF-16, but the terminals in Windows now correctly default to UTF-8 (since Windows 10) and even .NET with its compatibility decrees is more "UTF-8 native" than ever (especially when compiling for running on Linux, which is several layers of surprise for anyone that was around in the era where Microsoft picked UCS-2 as its one true format in the first place).
You can fit Japanese comfortably in a 16 bit charset but Chinese needs more than that.
My take, though, is that CSV is not a good thing because the format isn't completely standardized: you just can't count on people having done the right thing with escaping, whether a particular column is intended to be handled as strings or numeric values, etc.
Where I work we publish data files in various formats, I'm planning on making a Jupyter notebook to show people how to process our data with Pandas -- one critical point is that I'm going to use one of the commercial statistics data formats (like Stata) because I can load the data right the first time and not look back. (e.g. CSV = good because it is "open" is wrong)
If I am exporting files for Excel users I export an Excel file. Good Excel output libraries have been around for at least 20 years and even if you don't have fun with formulas, formatting and all that it is easy to make a file that people will load right the first time and every time.
What you do, rather, is drop support for non-UTF-8.
Work with tech-stacks whose text handling is based strictly around Unicode and UTF-8, and find enough opportunities that way that you don't have to care about anything else.
Let the customers who cling to data in weird encodings go to someone who makes a niche out of supporting that.
Joel Spolsky spoke against this exact statistics-based approach when he wrote about Unicode[1]:
> What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle.
I don't think he was speaking against the statistics-based approach itself, just against Postel's Law in general.
Ideally people would see gibberish (or an error message) immediately if they don't provide an encoding; then they'll know something is wrong, figure it out, fix it, and never have the issue again.
But if we're in a situation where we already have lots and lots of text documents that don't have an encoding specified, and we believe it's not feasible to require everyone to fix that, then it's actually pretty amazing that we can often correctly guess the encoding.
There's the enca library (and CLI tool) which does that. I used it often before UTF-8 became overwhelming. The situation was especially dire with Russian encodings. There were three one-byte encodings in wide use: KOI8-R, mostly found on unixes; CP866, used in DOS; and CP1251, used in Windows. What's worse, with Windows you sometimes had to deal with both CP866 and CP1251, because it includes a DOS subsystem with a separate codepage.
Exactly. I used this technique at Mozilla in 2010 when processing Firefox add-ons, and it misidentified scripts as having the wrong encoding pretty frequently. There's far less weird encoding out there than there are false positives from statistics-based approaches.
If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding. But that should be a fallback, not the first thing you try.
> If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding.
Don't really even need to do that. There's only a handful other encodings still in common use, just try each of them as fallbacks and see which one works without errors, and you'll manage the vast majority of what's not UTF-8.
(We recently did just that for a system that handles unreliable input, I think I remember our fallback only has 3 additional encodings before it gives up and it's been working fine)
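For what it's worth, that whole fallback dance fits in a few lines of Python. A minimal sketch; the fallback list here is illustrative, not a recommendation, and order matters because some single-byte encodings (like KOI8-R) accept any byte sequence and so must come last:

```python
# Try UTF-8 first (strict decoding rejects any invalid sequence),
# then a short, ordered list of legacy fallbacks.
FALLBACKS = ["utf-8", "windows-1252", "koi8-r"]

def decode_best_effort(data: bytes) -> str:
    for enc in FALLBACKS:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: don't crash, make the damage visible instead.
    return data.decode("utf-8", errors="replace")
```

Windows-1252 can still fail here because a handful of its byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined, which is why a catch-all like KOI8-R sits at the end.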
Based on my past role, you can't even assume UTF-8 when the file says it's UTF-8.
Clients would constantly send CSV or other files with an explicit BOM or other marking indicating UTF-8, but the parser would choke since they had just dumped native Windows-1252 or similar into it. I think some programs emit the BOM unconditionally because it's considered standard.
The largest problem with UTF-8 (and its biggest strength) is how similar it is to ASCII. It is for this reason we should consider emoji to be a wonderful gift to software correctness everywhere. Correctly handling emoji requires that your software can handle unicode correctly - because they need multi-unit encoding with both UTF-16 and UTF-8. And emoji won't render correctly unless your software can also handle grapheme clusters.
> When people living in ASCII world casually said "I just assume UTF-8", in reality, you still assume it's ASCII.
Check! If your application deals with text, throw your favorite multi-codepoint emoji into your unit testing data. (Mine is the polar bear). Users love emoji, and your software should handle it correctly. There's no excuse! Even the windows filesystem passes this test today.
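For the record, here's what the polar bear looks like to Python. The counts differ in every encoding, which is exactly why emoji flush out broken code-unit handling:

```python
# One polar bear on screen; under the hood it is a grapheme cluster of four
# code points: bear, zero-width joiner, snowflake, variation selector 16.
bear = "\U0001F43B\u200D\u2744\uFE0F"

assert len(bear) == 4                           # code points
assert len(bear.encode("utf-8")) == 13          # UTF-8 code units (bytes)
assert len(bear.encode("utf-16-le")) // 2 == 5  # UTF-16 code units (one surrogate pair)
```

Software that truncates at a fixed number of "characters" will slice this bear in half somewhere, and the bug is immediately visible.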
It's actually worse than that.
Older versions of macOS did enforce NFD for file names, but more recent versions don't, at least at the OS level. But many Apple programs, such as Finder, _will_ use NFD. Except that it isn't even Unicode-standardized NFD; it's Apple's own modified version of it. And this can cause issues when, for example, you create a file in Finder, then search for it using `find` and type the name of the file the exact same way, but it can't find the file because `find` got an NFC form while the actual file name is in NFD.
OTOH, in many applications you don't really care about the normalization form used. For example, if you are parsing a CSV, you probably don't need to worry about whether one of the cells uses a single code point or two code points to represent that accented e.
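Python's unicodedata module makes the NFC/NFD difference easy to see, and to neutralize before comparing:

```python
import unicodedata

nfc = "\u00e9"                            # 'é' as one precomposed code point
nfd = unicodedata.normalize("NFD", nfc)   # 'e' followed by combining acute (U+0301)

assert nfc != nfd and len(nfd) == 2       # bytewise different, visually identical
assert unicodedata.normalize("NFC", nfd) == nfc   # normalize before comparing
```

If you do care (file-name matching is the classic case), normalizing both sides to the same form before comparison sidesteps the Finder-vs-find problem described above.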
About a decade ago I wrote some utility code for reading files, where it'll try to detect a BOM first, and failing that, scan for invalid UTF-8 sequences. If none are found, assume UTF-8; otherwise assume Windows-1252. Has worked well for us so far.
Still get the occasional flat file in Windows-1252 with one random field containing UTF-8, so some special handling is needed for those cases. But that's rare.
Fortunately we don't have to worry about normalization for the most part. If we're parsing then any delimiters will be one of the usual suspects and the rest data.
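That detection order can be sketched in a few lines of Python. This is a sketch of the idea, not the original utility; the 'replace' policy on the fallback is one choice among several:

```python
import codecs

def sniff_decode(data: bytes) -> str:
    # Honor a UTF-8 BOM if present; "utf-8-sig" strips it for us.
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig")
    # Strict UTF-8 decoding doubles as the "scan for invalid sequences" step.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to Windows-1252; 'replace' keeps its few undefined
        # byte values from crashing the import.
        return data.decode("windows-1252", errors="replace")
```

This works because non-ASCII Windows-1252 text is almost never coincidentally valid UTF-8, so the strict decode acts as a reliable discriminator.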
It depends. If you're writing an app, just add the necessary incantation to your manifest, and all the narrow char APIs start talking UTF-8 to you.
For a library, yeah.
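The incantation in question is the activeCodePage setting in the application manifest, documented by Microsoft for Windows 10 version 1903 and later; with it, the narrow-char (`-A`) Win32 APIs use UTF-8 as the process code page. A minimal sketch (the assembly name is a placeholder):

```xml
<?xml version="1.0" encoding="utf-8"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <!-- "YourApp" is a placeholder identity -->
  <assemblyIdentity type="win32" name="YourApp" version="1.0.0.0"/>
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```

Libraries can't assume their host process carries this manifest, which is why the "for a library, yeah" caveat stands.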
[1]Me, surely at least 1 other
If Excel generates CSV files with some Windows-1234 encoding, then my "import data from CSV" function needs to handle that, in one way or another. A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding, they won't care that Microsoft is using obsolete or weird defaults. They will see it as a bug in my program and demand that I fix my software. Even if Excel offers them a choice in encoding, they won't understand any of that and more importantly they don't want to deal with that right now, they just want the thing to work.
It doesn't. Well, maybe "another".
Your function or even app doesn't need to handle it. Here's what we did on a bookkeeping app: remove all the heuristics, edge cases, broken-CSV handling and validation from all the CSV ingress points.
Add one service that did one thing: receive a CSV, and normalize it. All ingress points could now assume a clean, valid, UTF8 input. It removed thousands of LOCs, hundreds of issues from the backlog. It made debugging easy and it greatly helped the clients even.
At some point, when we offered their import runs for download, we added our normalized versions as [original name]_clean.csv. Got praise for that. Clients loved it, as they were often well aware of their internal mess of broken CSVs.
What if you just gave them instructions?
It’s much easier to tell the people with old cassette tapes to rip them, rather than try to put a tape player in every car.
I assume you mean "rip them", as in transcode to a different format?
In that case, you need a tool that takes the old input format(s) and convert them to the new format.
For text files, you'd need a tool that takes the old text files with various encodings and converts them to UTF-8.
Isn't the point of the article to describe how an engineer would create such a tool?
Whatever happened to the Robustness Principle[1]? I think the entire comment section of this article has forgotten it. IMO the best software accepts many formats and "deals with it," or at least attempts to, rather than just exiting with "Hahah, Error 19923 Wrong Input Format. Try again, loser."
1: https://en.wikipedia.org/wiki/Robustness_principle
If you'd like to see why, read the HTML 5 parsing portion of the spec. Slowly and carefully. Try to understand what is going on and why. A skim will not reveal the issue. You will come to a much greater understanding of the problem. Some study of what had happened when we tried to upgrade TCP (not the 4->6 transition, that's its own thing) and why the only two protocols that can practically exist on the Internet anymore are TCP and UDP may also be of interest.
Being lenient is all well and good when the consequences are mild. When the consequences of misinterpreting or interpreting differently to a second implementation becomes costly, such as a security exploit, then the Robustness Principle becomes less obviously a win.
It's important to understand that every implementation will try to fix-up formatting problems in their own way unique to their particular implementation. From that you get various desync or reinterpretation attacks (eg. HTTP request smuggling).
But that is not always the case, and e.g. silently "fixing" text encoding issues can often corrupt the data if you get it wrong.
By all means offer options if you want, but if you do, flag very clearly to the user that they're risking data corruption, unless any errors are very apparent and trivial to undo.
This basically loses data integrity if it's wrong though.
You might want to do that with human input if it's helpful to the user - ie user enters a phone number and you strip dashes etc. But if it's machine to machine, it should just follow the spec.
If you reject malformed input, then the person who created it has to go back and fix it and try again. If you interpret malformed input the best you can (and get it right), then everyone else implementing the same thing in the future now also has to implement your heuristics and workarounds. The malformed input effectively becomes a part of the spec.
This is why HTML & CSS are the garbage dump they are today, and why different browsers still don't always display pages exactly alike. The reason HTML5 exists is because people finally just gave up and decided to standardize all the broken behavior that was floating around in the wild. Pre-HTML5, the web was an outright dumpster fire of browser compatibility issues (as opposed to the mere garbage dump we have today).
Anyway, it's not really important to try to convince you that Postel's Law is bad; what's important is that you know that many people are starting to think it's bad, and there's no longer any strong consensus that it was ever a good thing.
Bush hid the facts
https://en.wikipedia.org/wiki/Robustness_principle#Criticism
Postel's Law doesn't pass a software engineering smell test.
The idea that software should guess and repair bad inputs is deeply flawed. It is a security threat and a source of enshittification.
I took the article to be for people who would be writing that "standalone program"?
I have certainly been in a position where I was the person who had to deal with input text files with unknown encodings. There was no-one else to hand off the problem to.
Browsers do, kind of https://mimesniff.spec.whatwg.org/#rules-for-identifying-an-...
Browsers used to have a menu option to choose the encoding you wanted to use to decode the page.
In Firefox, that's been replaced by the magic option "Repair Text Encoding". There is no justification for this.
They seem to be in the process of disabling that option too:
> Note: On most modern pages, the Repair Text Encoding menu item will be greyed-out because character encoding changes are not supported.
( https://support.mozilla.org/en-US/kb/text-encoding-no-longer... )
This note is logical gibberish; encoding isn't something that has to be supported by the page. Decoding is a choice by the browser!
https://hsivonen.fi/no-encoding-menu/
> Supporting the specific manually-selectable encodings caused significant complexity in the HTML parser when trying to support the feature securely (i.e. not allowing certain encodings to be overridden). With the current approach, the parser needs to know of one flag to force chardetng, which the parser has to be able to run in other situations anyway, to run.
> Elaborate UI surface for a niche feature risks the whole feature getting removed
> Telemetry [...] suggested that users aren’t that good at choosing correctly manually.
In other words, it's trying to protect users from themselves by dumbing down the browser. (Never mind that people who know what they are doing have probably also turned off telemetry...)
Same with UTF-8. Life's too short for bothering with anything else today. I'll deal with some weird janky encoding for the right price, but the first thing I'd do is convert it to UTF-8. Damned if I'm going to complicate the innards of my code with special case code paths for non-UTF-8.
If there were some inherent issue with UTF-8 that made it significantly worse than some other encoding for a given task, I'd be sympathetic to that explanation and wouldn't be such a pain in the neck about this. For instance, if it were the case that it did a bad job of encoding Mandarin or Urdu or Xhosa or Persian, and the people who use those languages strongly preferred to use something else, I'd understand. However, I've never heard a viable explanation for not using UTF-8 other than legacy software support, and if you want to continue to use something ancient and weird, it's on you to adapt it to the rest of the world because they're definitely not going to adapt the world to you.
Unfortunately Google and many other companies have decided UTC is the only way, so this causes issues with ICS files that use that format sometimes when they are generating their helpful popups in the GMail inbox.
If you have to take medication (for instance, an antibiotic) every 24 hours, it must be taken at the same UTC hour, even if you took a train to a town in another timezone. Keeping the same local time even when the timezone changes would be wrong for that use case.
I don't know if you picked this example on purpose, but Chinese text encoded as UTF-8 is about 50% larger than in the old encoding (GB2312). I remember people caring about this around twenty years ago. I don't know of anyone that still cares about this encoding inefficiency: any compression algorithm removes it while using negligible CPU to decompress.
Which is all to say that you're right, but I can't imagine that it's more than a theoretical nuisance outside some extremely niche cases.
[0] https://utf8everywhere.org/#asian
Please look up the issues caused by Han unification in Unicode. It’s an important reason why the Chinese and Japanese encodings are still used in their respective territories.
> If you give me a computer timestamp without a timezone, I can and will assume it's in UTC.
Do you mean, give you an _offset_? In `2024-04-29T14:03:06.0000-08:00`, the `-08:00` is an offset. It only tells you when this stamp occurred relative to UTC. It does not tell you anything about the region or zone itself. While I have consumed APIs that give me the timezone context as part of the response, none of them made it part of the timestamp itself.
The only time you should assume a timestamp is UTC is if it has the `Z` at the end (assuming ISO 8601) or is otherwise marked as UTC. Without that, you have absolutely no information about where or when the time occurred; it is local time. And if your software assumes a local timestamp is UTC, then I argue it is not the sender of that timestamp's problem that your software is broken.
My desire to meet you at 4pm has no bearing on whether the DST switchover has happened, or my government decides to change the timezone rules, or {any other way the offset for a zone can change for future or past times}. My reminder to take my medicine at 7pm is not centered on UTC or my physical location on the planet. It's just at 7pm. Every day. If I go from New York to Paris then no, I do not want your software to tell me my medicine is actually supposed to be at midnight. It's 7pm.
But, assuming you aren't doing any future scheduling, calendar appointments, bookings, ticket sales, transportation departure, human-centric logs, or any of the other ways Local Time is incredibly useful -- ignore away.
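The offset-versus-zone distinction is easy to demonstrate with Python's zoneinfo; the dates and zones here are illustrative:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# An offset is just a number; a zone carries the rules that produce it.
stamped = datetime.fromisoformat("2024-04-29T14:03:06-08:00")
assert stamped.utcoffset() == timedelta(hours=-8)   # that's all we know

# Same wall time with a named zone: now DST rules apply, so the
# offset in late April is -7, not -8.
zoned = datetime(2024, 4, 29, 14, 3, 6, tzinfo=ZoneInfo("America/Los_Angeles"))
assert zoned.utcoffset() == timedelta(hours=-7)     # PDT
```

Note that the `-08:00` stamp can't even be Los Angeles local time on that date; the bare offset gives you no way to detect such mismatches, which is the point.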
Consider my statement more in the context of logs of past events. The only time you can reasonably assume a given file is in a particular non-UTC TZ is when it came from a person sitting in your same city, from data they collected manually, and you're confident that person isn't a time geek who uses UTC for everything. Otherwise there's no other sane default when lacking TZ/offset data. (I know they're not the same, but they're similar in the sense that they can let you convert timestamps from one TZ to another).
"Convert to UTC and then throw away the time zone" only works when you need to record a specific moment in time so it's crazy how often it's recommended as the universal solution. It really isn't that hard to store (datetime, zone) and now you're not throwing away information if you ever need to do date math.
You usually end up with having to deal with whatever eccentric sh!t that ultimately comes from the same source as the payment for the job.
UTF-8 should have no BOM. It is the default, and it has no byte-order ambiguity that needs marking. Requiring a UTF-8 BOM just destroys the carefully planned property that ASCII-is-UTF-8. Why spoil that good work?
Other variants of Unicode have BOMs, e.g. UTF-16BE. UTF-16 encodes CJK text more compactly, so those encodings have their place. The BOM is only a couple more bytes. No problem, so far so good.
But there are old files, that are in 'platform encoding'. Fine, let there be an OS 'locale', that has a default encoding. That default can be overridden with another OS 'encoding' variable. And that can be overridden by an application arg. And there may be a final default for a specific application, that is only ever used with one language encoding. Then individual files can override all of the above ...
Text file (serialization) formats should define handling of optional BOM followed by an ASCII header that defines the encoding of the body that follows. One can also imagine .3 filetypes that have a unique or default encoding, with values maintained by, say, IANA (like MIME types). XML got this right. JSON and CSV are woefully neglectful, almost to the point of criminal liability.
But in the absence of all of the above, the default-default-default-default-default is UTF-8.
We are talking about the future, not the past. Design for UTF-8 default, and BOMs for other Unicodes. Microsoft should have defined BOMs for Windows-Blah encodings, not for UTF-8!
When the whole world looks to the future, Microsoft will follow. Eventually. Reluctantly.
If you assume UTF-8, you will have corrupted text.
I agree that I'm mad about Excel outputting Win-1252 CSV by default.
Don't interpret user-supplied strings at all. Define max lengths as byte lengths.
Remain agnostic of encoding. Especially in libraries.
It's easier than people think it is thanks to some very clever people's work a long time ago.
I worked at an AI company that was ahead of its time (actually I worked at a few of those) where the data scientists had a special talent for finding Docker images with strange configurations, so I'd constantly discover that one particular container was running with a Hungarian or other wrong charset.
(And that's the problem with Docker... People give up on being in control of their environment and just throw in five different kitchen sinks and it works... Sometimes)
Sidenote: this particular criminal conspiracy is open to potential new members. Please join the Committee To Keep Csv Evil: https://discord.gg/uqu4BkNP5G
Jokes aside, talking about the future is grand but the problem is that data was written in the past and we need to read it in the present. That means that you do have to detect encoding and you can't just decide that the world runs on UTF-8. Even today, mainland China does not use UTF-8 and is not, as far as I know, in the process of switching.
I understand UTF-8 is mostly fine even for east asian languages though - and bytes are cheap
There is no authoritative CSV specification, and that deserves opprobrium. RFC 4180 is from 2005, long after Unicode and XML, so people should have known better. The absence of a real standard points to disagreement, perhaps commercial disagreement, but the IETF should be independent, should it not?
That failure to standardize encoding, and other issues (headers, etc.), has wasted an enormous amount of time for many creative hackers, who could have been producing value, rather than banging their head against a vague assembly of pseudo-spec-illusions. Me included, sadly.
Let me reframe it as a Schelling Point [1] - the uncoordinated coordination problem.
You arrange to meet your file on a certain day in New York, no place or time were mentioned. When and where will you go? It seems impossible.
But perhaps you go at noon, to the UTF-8 Building in midtown Manhattan. Are you there now?
[1] https://en.wikipedia.org/wiki/Focal_point_(game_theory)
That’s … not true? Most Chinese software and websites are utf8 by default, and it’s been that way for a while. GBK and her sisters might linger around in legacy formats, but UTF-8 has certainly reached the point of mass adoption from what I can tell.
My understanding is that it is for the opposite of compression: it saves memory when uncompressed, versus the UTF-8 multi-byte sequences needing more bytes. My understanding is that UTF-8 multi-byte sequences compress pretty well, as they have common patterns that form dictionary "words" just as easily as anything else. UTF-8 seems to be winning in the long run even for CJK and astral-plane languages on disk, and the operating systems and applications that preferred UTF-16 in memory are mostly only doing so out of backwards compatibility and are themselves often using more UTF-8 buffers internally, as those reflect the files at rest.
(.NET is stuck with UTF-16 code-unit strings by default for backwards compatibility, but has more and more UTF-8-only pathways and some interesting compile-time options now to use UTF-8 only. Python 3 settled on UTF-8 as its default encoding, even with input from CJK communities. UTF-8 really does seem to be slowly winning everything.)
> JSON and CSV are woefully neglectful,
As the article also points out, JSON probably got it right: UTF-8 only and BOM is an error (because UTF-8) (but parsers are allowed to gently ignore that error if they wish). https://www.rfc-editor.org/rfc/rfc8259#section-8.1
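CPython's json module is an example of a parser taking exactly that posture: it quietly tolerates a BOM on bytes input (where it may be an artifact of the transport) but rejects it in an already-decoded str (where it can only be a mistake):

```python
import json

# Bytes input: the UTF-8 BOM is detected and skipped, not treated as data.
assert json.loads(b'\xef\xbb\xbf{"a": 1}') == {"a": 1}

# str input: a leading BOM is an error per RFC 8259.
try:
    json.loads('\ufeff{"a": 1}')
    raise AssertionError("expected a parse error")
except ValueError:
    pass  # rejected, with a hint to decode using utf-8-sig
```

That split is a reasonable template for any new text format: forgive the BOM at the byte boundary, never inside decoded text.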
That seems to be the way forward for new text-based formats that only care about backward compatibility with low-byte ASCII: UTF-8 only, no BOM. And unlike UTF-16, which has no surrogate space reserved for further expansion, UTF-8's scheme could be extended past the "astral planes" if we ever found a reason to go there.
(Anyone still working in CSV by choice is maybe guilty of criminal liability though. I still think the best thing Excel could do to help murder CSV is give us a file extension to force Excel to open a JSON file, like .XLJSON. Every time I've resorted to CSV has been because "the user needs to double click the file and open in Excel". Excel has great JSON support, it just won't let you double click a file for it, which is the only problem, because no business executive wants the training on "Data > From JSON" no matter how prominent in the ribbon that tool is.)
> When the whole world looks to the future, Microsoft will follow.
That ship is turning slowly. Windows backward compatibility guarantees likely mean that Windows will always have some UTF-16, but the terminals in Windows now correctly default to UTF-8 (since Windows 10) and even .NET with its compatibility decrees is more "UTF-8 native" than ever (especially when compiling for running on Linux, which is several layers of surprise for anyone that was around in the era where Microsoft picked UCS-2 as its one true format in the first place).
My take though is that CSV is not a good thing, because the format isn't completely standardized: you just can't count on people having done the right thing with escaping, on whether a particular column is intended to be handled as strings or numeric values, etc.
Where I work we publish data files in various formats. I'm planning on making a Jupyter notebook to show people how to process our data with Pandas. One critical point is that I'm going to use one of the commercial statistics data formats (like Stata) because I can load the data right the first time and not look back. (That is, "CSV is good because it is open" is wrong.)
If I am exporting files for Excel users I export an Excel file. Good Excel output libraries have been around for at least 20 years and even if you don't have fun with formulas, formatting and all that it is easy to make a file that people will load right the first time and every time.
So that you know you are dealing with UTF-8. Assuming ASCII only works if you are only dealing with English texts and data.
What you do, rather, is drop support for non-UTF-8.
Work with tech-stacks whose text handling is based strictly around Unicode and UTF-8, and find enough opportunities that way that you don't have to care about anything else.
Let the customers who cling to data in weird encodings go to someone who makes it their niche to support that.
1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
I got 99 problems, but charsets aint one of them.