How about assume utf-8, and if someone has some binary file they'd rather a program interpret as some other format, they turn it into utf-8 using a standalone program first. Instead of burning this guess-what-bytes-they-might-like nonsense into all the software.
We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
I doubt you can handle UTF-8 properly with that attitude.
The problem is, there is one very popular OS where it's very hard to enforce UTF-8 everywhere: Microsoft Windows.
It's very hard to ensure the entire software stack you depend on uses the Unicode versions of the Win32 APIs. Actually, the native character encoding in Windows is UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back. Even if you don't, you have to ensure all the low-level code you depend on does that conversion for you.
Oh, and don't forget about Unicode normalization. There is no THE UTF-8: there are a bunch of UTF-8s with different Unicode normalizations. Apple's macOS uses NFD while others mostly use NFC.
These are just some examples. When people living in the ASCII world casually say "I just assume UTF-8", in reality they're still assuming ASCII.
> Actually, the native character encoding in Windows is UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back.
Yes. You should convert your strings. Thankfully, UTF-16 is very difficult to confuse with UTF-8 because they're completely incompatible encodings. Conversion is (or should be) a relatively simple process in basically any modern language or environment. And personally, I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?). The different forms are (or should be) visually completely identical for the user - at least on modern computers with decent unicode fonts.
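Conversion really is a one-liner in most environments. A minimal Python sketch of the round trip (the helper names here are made up for illustration):

```python
# Round-tripping between UTF-8 (files, the network) and UTF-16-LE (the
# layout that Win32 "wide" APIs expect). Decoding strictly means invalid
# input raises instead of being silently mangled.

def utf8_to_utf16le(data: bytes) -> bytes:
    return data.decode("utf-8").encode("utf-16-le")

def utf16le_to_utf8(data: bytes) -> bytes:
    return data.decode("utf-16-le").encode("utf-8")

original = "naïve café".encode("utf-8")
assert utf16le_to_utf8(utf8_to_utf16le(original)) == original  # lossless
```

On Windows itself, this is what MultiByteToWideChar/WideCharToMultiByte do for you.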
The largest problem with UTF-8 (and its biggest strength) is how similar it is to ASCII. It is for this reason we should consider emoji to be a wonderful gift to software correctness everywhere. Correctly handling emoji requires that your software can handle unicode correctly - because they need multi-unit encoding with both UTF-16 and UTF-8. And emoji won't render correctly unless your software can also handle grapheme clusters.
> When people living in the ASCII world casually say "I just assume UTF-8", in reality they're still assuming ASCII.
Check! If your application deals with text, throw your favorite multi-codepoint emoji into your unit testing data. (Mine is the polar bear). Users love emoji, and your software should handle them correctly. There's no excuse! Even the Windows filesystem passes this test today.
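For the record, the polar bear is a ZWJ sequence, which is exactly what makes it such good test data: every naive way of counting its "length" gives a different answer. A quick Python check:

```python
# Polar bear = bear face + zero-width joiner + snowflake + variation selector.
polar_bear = "\U0001F43B\u200D\u2744\uFE0F"  # 🐻‍❄️

assert len(polar_bear) == 4                           # 4 code points
assert len(polar_bear.encode("utf-8")) == 13          # 13 UTF-8 bytes
assert len(polar_bear.encode("utf-16-le")) // 2 == 5  # 5 UTF-16 code units
# And it is still ONE grapheme cluster: rendering and cursor movement must
# treat those 4 code points as a single user-visible character (the stdlib
# won't tell you that; a grapheme/ICU library is needed).
```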
Older versions of macOS did enforce NFD for file names, but more recent versions don't, at least at the OS level. But many Apple programs, such as Finder, _will_ use NFD. Except that it isn't even Unicode-standardized NFD; it's Apple's own modified version of it. And this can cause issues when, for example, you create a file in Finder, then search for it using `find` and type the name of the file the exact same way, but it can't find the file because `find` got an NFC form while the actual file name is in NFD.
OTOH, in many applications, you don't really care about the normalization form used. For example, if you are parsing a CSV, you probably don't need to worry about whether one of the cells is using a single code point or two code points to represent that accented e.
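The single-vs-two-code-point situation is easy to see with the standard library:

```python
import unicodedata

# "café" in NFC ends with U+00E9; in NFD it ends with "e" + combining acute.
nfc = unicodedata.normalize("NFC", "caf\u00e9")
nfd = unicodedata.normalize("NFD", "caf\u00e9")

assert nfc != nfd        # byte-for-byte different strings...
assert len(nfc) == 4
assert len(nfd) == 5
# ...so normalize both sides before comparing, e.g. for filename lookups:
assert unicodedata.normalize("NFC", nfd) == nfc
```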
We make some B2B software running on Windows, integrating with customer systems. We get a lot of interesting files.
About a decade ago I wrote some utility code for reading files, where it'll try to detect a BOM first, and if there's none, scan for invalid UTF-8 sequences. If none are found, assume UTF-8; else assume Windows-1252. It's worked well for us so far.
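A sketch of that same detection logic in Python (the fallback choice of Windows-1252 is, as above, just what works for Western-European-leaning data):

```python
import codecs

def sniff_and_decode(data: bytes) -> str:
    # 1. Honor a BOM if present.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")
    # 2. Try strict UTF-8; any invalid byte sequence raises.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # 3. Assume Windows-1252. A few bytes (0x81, 0x8D, ...) are undefined
        #    even there, hence errors="replace" rather than a hard failure.
        return data.decode("windows-1252", errors="replace")
```

For example, `sniff_and_decode("héllo".encode("cp1252"))` comes back as `"héllo"`, because 0xE9 followed by an ASCII letter is not valid UTF-8, so the fallback kicks in.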
Still get the occasional flat file in Windows-1252 with one random field containing UTF-8, so some special handling is needed for those cases. But that's rare.
Fortunately we don't have to worry about normalization for the most part. If we're parsing then any delimiters will be one of the usual suspects and the rest data.
Microsoft Windows is a source of many a headache for me as almost every other client I write code for has to deal with data created by humans using MS Office. Ordinary users could be excused, because they are not devs but even devs don't see a difference between ASCII and UTF-8 and continue to write code today as if it was 1986 and nobody needed to support accented characters.
I got a ticket about some "folders with Chinese characters" showing up on an SMB share at work. My first thought was a Unicode issue, and sure enough: when you combine two ASCII bytes into one UTF-16 code unit, it will usually wind up in the CJK Unified Ideographs range of Unicode. Some crappy software had evidently bypassed the appropriate Windows APIs and just directly written a C-style ASCII string onto the filesystem without realizing that NTFS file names are UTF-16.
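That failure mode is easy to reproduce: any pair of ASCII bytes, read as one little-endian UTF-16 code unit, usually lands in the CJK block.

```python
# "ab" is bytes 0x61 0x62; read as UTF-16-LE that's the single
# code point U+6261 -- a CJK ideograph, not two Latin letters.
mojibake = b"ab".decode("utf-16-le")

assert mojibake == "\u6261"
assert len(mojibake) == 1
assert 0x4E00 <= ord(mojibake) <= 0x9FFF   # CJK Unified Ideographs range
```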
Do you know of a resource that explains character encoding in greater detail? Just for my own curiosity. I am learning web development and boy, they brow beat UTF-8 upon us which okay, I'll make sure that call is in my meta data, but none bother to explain how or why we got to that point, or why it seems so splintered.
Non-technical users don't want to do that, and won't understand any of that. That's the unfortunate reality of developing software for people.
If Excel generates CSV files with some Windows-1234 encoding, then my "import data from CSV" function needs to handle that, in one way or another. A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding, they won't care that Microsoft is using obsolete or weird defaults. They will see it as a bug in my program and demand that I fix my software. Even if Excel offers them a choice in encoding, they won't understand any of that and more importantly they don't want to deal with that right now, they just want the thing to work.
> then my "import data from CSV" function needs to handle that, in one way or another.
It doesn't. Well, maybe "another".
Your function or even app doesn't need to handle it. Here's what we did on a bookkeeping app: remove all the heuristics, edge cases, broken-CSV handling and validation from all the CSV ingress points.
Add one service that did one thing: receive a CSV, and normalize it. All ingress points could now assume a clean, valid, UTF8 input.
It removed thousands of LOCs, hundreds of issues from the backlog. It made debugging easy and it greatly helped the clients even.
At some point, when we offered their import runs for download, we added our normalized versions as [original name]_clean.csv. Got praise for that. Clients loved it, as they were often well aware of their internal mess of broken CSVs.
The point is to separate the cleaning stage from the import stage. Having a clean UTF-8 CSV makes debugging import issues so much easier. And there are several well-working CSV tools, such as the Python ones, that can detect not only character encodings but also record separators and the various quotation idiosyncrasies you also need to be aware of when dealing with Microsoft Office files. Other people have already thought long and hard about that stuff so you don't have to.
Plus, Excel really likes to use semicolons instead of commas for comma-separated files (in locales where the comma is the decimal separator). That's another idiosyncrasy that programmers need to take into account.
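The stdlib `csv.Sniffer` can guess the delimiter from a sample, which covers the semicolon case without hardcoding either choice:

```python
import csv
import io

# A semicolon-delimited "CSV" like Excel produces in many European locales.
sample = "name;amount\nwidget;3\ngadget;5\n"

# Restrict candidates to the two we expect and let the sniffer pick.
dialect = csv.Sniffer().sniff(sample, delimiters=";,")
assert dialect.delimiter == ";"

rows = list(csv.reader(io.StringIO(sample), dialect))
assert rows[0] == ["name", "amount"]
assert rows[1] == ["widget", "3"]
```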
>A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding...
> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
Whatever happened to the Robustness Principle[1]? I think the entire comment section of this article has forgotten it. IMO the best software accepts many formats and "deals with it," or at least attempts to, rather than just exiting with "Hahah, Error 19923 Wrong Input Format. Try again, loser."
We collectively discovered that we were underestimating the long term costs, by a lot, so its lustre has faded. This is in some sense relatively recent, so word is still getting around, as the programming world does not move as quickly as it fancies itself to.
If you'd like to see why, read the HTML 5 parsing portion of the spec. Slowly and carefully. Try to understand what is going on and why. A skim will not reveal the issue. You will come to a much greater understanding of the problem.
Some study of what happened when we tried to upgrade TCP (not the IPv4-to-IPv6 transition; that's its own thing) and why the only two protocols that can practically exist on the Internet anymore are TCP and UDP may also be of interest.
Being lenient is all well and good when the consequences are mild. When the consequences of misinterpreting or interpreting differently to a second implementation becomes costly, such as a security exploit, then the Robustness Principle becomes less obviously a win.
It's important to understand that every implementation will try to fix-up formatting problems in their own way unique to their particular implementation. From that you get various desync or reinterpretation attacks (eg. HTTP request smuggling).
That is fine in contexts where a wrong guess does no harm.
But that is not always the case, and e.g. silently "fixing" text encoding issues can often corrupt the data if you get it wrong.
By all means offer options if you want, but if you do, flag very clearly to the user that they're risking data corruption, unless any errors are very apparent and trivial to undo.
This basically loses data integrity if it's wrong though.
You might want to do that with human input if it's helpful to the user - e.g. a user enters a phone number and you strip dashes, etc. But if it's machine-to-machine, it should just follow the spec.
The article addresses this, that current thinking in many places is that the robustness principle / Postel's Law maybe wasn't the best idea.
If you reject malformed input, then the person who created it has to go back and fix it and try again. If you interpret malformed input the best you can (and get it right), then everyone else implementing the same thing in the future now also has to implement your heuristics and workarounds. The malformed input effectively becomes a part of the spec.
This is why HTML & CSS are the garbage dump they are today, and why different browsers still don't always display pages exactly alike. The reason HTML5 exists is because people finally just gave up and decided to standardize all the broken behavior that was floating around in the wild. Pre-HTML5, the web was an outright dumpster fire of browser compatibility issues (as opposed to the mere garbage dump we have today).
Anyway, it's not really important to try to convince you that Postel's Law is bad; what's important is that you know that many people are starting to think it's bad, and there's no longer any strong consensus that it was ever a good thing.
I've lived through dealing with non-UTF8 encoding issues and it was a truly gigantic pain in the ass. I'm much more on the side now of people who only want to deal with UTF8 and fully support software that tells any other encoding to go pound sand. The harder life gets for people who use other encodings (yes, particularly Microsoft) the more incentive they have to eventually get on board and stop costing everyone time and effort managing all this nonsense.
> they turn it into utf-8 using a standalone program first
I took the article to be for people who would be writing that "standalone program"?
I have certainly been in a position where I was the person who had to deal with input text files with unknown encodings. There was no-one else to hand off the problem to.
Not every encoding can make a round trip through Unicode without you writing ad hoc handling code for every single one. There's a number of reasons some of these are still in use and Unicode destroying information is one of them.
> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
Browsers used to have a menu option to choose the encoding you wanted to use to decode the page.
In Firefox, that's been replaced by the magic option "Repair Text Encoding". There is no justification for this.
They seem to be in the process of disabling that option too:
> Note: On most modern pages, the Repair Text Encoding menu item will be greyed-out because character encoding changes are not supported.
> Supporting the specific manually-selectable encodings caused significant complexity in the HTML parser when trying to support the feature securely (i.e. not allowing certain encodings to be overridden). With the current approach, the parser needs to know of one flag to force chardetng, which the parser has to be able to run in other situations anyway, to run.
> Elaborate UI surface for a niche feature risks the whole feature getting removed
> Telemetry [...] suggested that users aren’t that good at choosing correctly manually.
In other words, it's trying to protect users from themselves by dumbing down the browser. (Never mind that people who know what they are doing have probably also turned off telemetry...)
Probably because most websites now send a correct encoding header or meta tag, so the user changing it can only make things wrong. (Assuming the encoding header is never wrong, which it sometimes is in reality.)
If you give me a computer timestamp without a timezone, I can and will assume it's in UTC. It might not be, but if it's not and I process it as though it is, and the sender doesn't like the results, that's on them. I'm willing to spend approximately zero effort trying to guess what nonstandard thing they're trying to send me unless they're paying me or my company a whole lot of money, in which case I'll convert it to UTC upon import and continue on from there.
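That "convert on import and move on" policy is a couple of lines in most languages; a Python sketch:

```python
from datetime import datetime, timezone

def parse_assuming_utc(stamp: str) -> datetime:
    """Parse an ISO-ish timestamp; if it carries no zone info, pin it to UTC."""
    dt = datetime.fromisoformat(stamp)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # the documented assumption
    return dt.astimezone(timezone.utc)        # normalize offsets to UTC too

# Naive input: assumed UTC.
assert parse_assuming_utc("2024-04-29T14:03:06").isoformat() \
    == "2024-04-29T14:03:06+00:00"
# Offset input: converted to UTC on import.
assert parse_assuming_utc("2024-04-29T14:03:06-08:00").isoformat() \
    == "2024-04-29T22:03:06+00:00"
```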
Same with UTF-8. Life's too short for bothering with anything else today. I'll deal with some weird janky encoding for the right price, but the first thing I'd do is convert it to UTF-8. Damned if I'm going to complicate the innards of my code with special case code paths for non-UTF-8.
If there were some inherent issue with UTF-8 that made it significantly worse than some other encoding for a given task, I'd be sympathetic to that explanation and wouldn't be such a pain in the neck about this. For instance, if it were the case that it did a bad job of encoding Mandarin or Urdu or Xhosa or Persian, and the people who use those languages strongly preferred to use something else, I'd understand. However, I've never heard a viable explanation for not using UTF-8 other than legacy software support, and if you want to continue to use something ancient and weird, it's on you to adapt it to the rest of the world because they're definitely not going to adapt the world to you.
It depends on the domain. If you are writing calendar software, it is legitimate to have "floating time" i.e. your medication reminder is at 7pm every day, regardless of time zone, travel, or anything else.
Unfortunately Google and many other companies have decided UTC is the only way, so this causes issues with ICS files that use that format sometimes when they are generating their helpful popups in the GMail inbox.
> If you are writing calendar software, it is legitimate to have "floating time" i.e. your medication reminder is at 7pm every day, regardless of time zone, travel, or anything else.
If you have to take medication (for instance, an antibiotic) every 24 hours, it must be taken at the same UTC hour, even if you took a train to a town in another timezone. Keeping the same local time even when the timezone changes would be wrong for that use case.
> For instance, if it were the case that it did a bad job of encoding Mandarin
I don't know if you picked this example on purpose, but Chinese text encoded as UTF-8 is 50% larger than in the old encoding (GB2312). I remember people caring about this twenty years ago. I don't know of anyone who still cares about this encoding inefficiency: any compression algorithm can remove it while using negligible CPU to decompress.
That doesn't seem like the worst issue imaginable. I doubt there are too many cases where every byte counts, text uses a significant portion of the available space, and compression is unavailable or inefficient. If we were still cramming floppies full of text files destined for very slow computers, that'd be one thing. Web pages full of uncompressed text are still either so small that it's a moot point or so huge with JS, images, and fonts that the relative text size isn't that significant.
Which is all to say that you're right, but I can't imagine that it's more than a theoretical nuisance outside some extremely niche cases.
For Asian languages, UTF-8 is basically the same size as any other encoding when compressed[0] (and you should be using compression if you care about space) so in practice there is no data size advantage to using non-standard encodings.
A key aspect is that nowadays we rarely encode pure text - while other encodings are more efficient for encoding pure Mandarin, nowadays a "Mandarin document" may be an HTML or JSON or XML file where less than half of the characters are from CJK codespace, and the rest come from all the formatting overhead which is in the 7-bit ASCII range, and UTF-8 works great for such combined content.
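The mixed-content effect is easy to quantify:

```python
# Pure CJK text: UTF-8 needs 3 bytes per character, UTF-16 needs 2.
text = "你好世界"
assert len(text.encode("utf-8")) == 12
assert len(text.encode("utf-16-le")) == 8

# Wrap it in ASCII markup and the advantage flips:
html = "<p>你好</p>"
assert len(html.encode("utf-8")) == 13      # 7 ASCII bytes + 2 * 3
assert len(html.encode("utf-16-le")) == 18  # 2 bytes for every character
```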
> For instance, if it were the case that it did a bad job of encoding Mandarin
Please look up the issues caused by Han unification in Unicode. It’s an important reason why the Chinese and Japanese encodings are still used in their respective territories.
I can't help myself. The grandest of nitpicks is coming your way. I'm sorry.
> If you give me a computer timestamp without a timezone, I can and will assume it's in UTC.
Do you mean, give you an _offset_? In `2024-04-29T14:03:06.0000-08:00`, the `-08:00` is an offset. It only tells you what time this stamp occurred relative to standard time. It does not tell you anything about the region or zone itself. While I have consumed APIs that give me the timezone context as part of the response, none of them make it part of the timestamp itself.
The only time you should assume a timestamp is UTC is if it has the `Z` at the end (assuming ISO 8601) or is otherwise marked as UTC. Without that, you have absolutely no information about where or when the time occurred -- it is local time. And if your software assumes a local timestamp is UTC, then I argue it is not the sender of that timestamp's problem that your software is broken.
My desire to meet you at 4pm has no bearing on whether the DST switchover has happened, or my government decides to change the timezone rules, or {any other way the offset for a zone can change for future or past times}. My reminder to take my medicine at 7pm is not centered on UTC or my physical location on the planet. It's just at 7pm. Every day. If I go from New York to Paris then no, I do not want your software to tell me my medicine is actually supposed to be at midnight. It's 7pm.
But, assuming you aren't doing any future scheduling, calendar appointments, bookings, ticket sales, transportation departure, human-centric logs, or any of the other ways Local Time is incredibly useful -- ignore away.
As I mentioned in another reply, "remind me every day at 7PM" isn't a timestamp. It's a formula for how to determine when the next timestamp is going to occur. Even those examples are too narrow, because it's really closer to "remind me the next time you notice that it's after 7PM wherever I happen to be, including if that's when I cross a time zone and jump instantly from 6:30PM to 7:30PM".
Consider my statement more in the context of logs of past events. The only time you can reasonably assume a given file is in a particular non-UTC TZ is when it came from a person sitting in your same city, from data they collected manually, and you're confident that person isn't a time geek who uses UTC for everything. Otherwise there's no other sane default when lacking TZ/offset data. (I know they're not the same, but they're similar in the sense that they can let you convert timestamps from one TZ to another).
It's always nice to see someone who actually understands time.
"Convert to UTC and then throw away the time zone" only works when you need to record a specific moment in time so it's crazy how often it's recommended as the universal solution. It really isn't that hard to store (datetime, zone) and now you're not throwing away information if you ever need to do date math.
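Concretely, with Python's zoneinfo (3.9+; on Windows this also needs the tzdata package), "same wall time tomorrow" is not the same as "24 hours later" once DST intervenes:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
before = datetime(2024, 3, 9, 19, 0, tzinfo=ny)  # 7 PM, night before US DST
after = before + timedelta(days=1)               # "same wall time tomorrow"

assert after.hour == 19                          # still 7 PM locally
# Convert to UTC to measure actual elapsed time: only 23 hours, because
# DST started overnight and an hour of wall time vanished.
elapsed = after.astimezone(timezone.utc) - before.astimezone(timezone.utc)
assert elapsed == timedelta(hours=23)
```

Keeping (datetime, zone) lets you choose either interpretation later; a bare UTC instant can't recover the wall-clock intent.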
Developers should assume UTF-8 for text files going forward.
UTF-8 should have no BOM. It is the default, and it has no byte-order ambiguity that would need a mark in the first place. Requiring a UTF-8 BOM just destroys the happy planned property that ASCII-is-UTF-8. Why spoil that good work?
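For reference, the UTF-8 "BOM" is just U+FEFF serialized in UTF-8: three fixed bytes with no order to mark.

```python
import codecs

assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8

# A leading BOM means an ASCII-only file is no longer byte-identical to
# its UTF-8 form, which is exactly the property worth preserving.
data = codecs.BOM_UTF8 + "hello".encode("ascii")
assert data.decode("utf-8") == "\ufeffhello"   # strict utf-8 keeps the BOM
assert data.decode("utf-8-sig") == "hello"     # "utf-8-sig" strips it
```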
Others variants of Unicode have BOMs, e.g. UTF-16BE.
We know CJK languages need UTF-16 for compression.
The BOM is only a couple more bytes.
No problem, so far so good.
But there are old files, that are in 'platform encoding'.
Fine, let there be an OS 'locale', that has a default encoding.
That default can be overridden with another OS 'encoding' variable. And that can be overridden by an application arg. And there may be a final default for a specific application, that is only ever used with one language encoding. Then individual files can override all of the above ...
Text file (serialization) formats should define handling of an optional BOM followed by an ASCII header that defines the encoding of the body that follows. One can also imagine file extensions that imply a unique or default encoding, with values maintained by, say, IANA (like MIME types). XML got this right. JSON and CSV are woefully neglectful, almost to the point of criminal liability.
But in the absence of all of the above, the default-default-default-default-default is UTF-8.
We are talking about the future, not the past. Design for UTF-8 default, and BOMs for other Unicodes. Microsoft should have defined BOMs for Windows-Blah encodings, not for UTF-8!
When the whole world looks to the future, Microsoft will follow. Eventually. Reluctantly.
The specific use case the OP author was focusing on was CSV. (A format which has no place to signal the encoding inline). They noted that, to this day, Windows Excel will output CSV in Win-1252. (And the user doing the CSV export has probably never even heard of encodings).
If you assume UTF-8, you will have corrupted text.
I agree that I'm mad about Excel outputting Win-1252 CSV by default.
Programming languages have lumbered slowly towards UTF-8 by default but from time to time you find an environment with a corrupted character encoding.
I worked at an AI company that was ahead of its time (actually I worked at a few of those) where the data scientists had a special talent for finding Docker images with strange configurations so all the time I'd find out one particular container was running a Hungarian or other wrong charset.
(And that's the problem with Docker... People give up on being in control of their environment and just throw in five different kitchen sinks and it works... Sometimes)
If csv files bring criminal liability then I am guilty.
Sidenote: this particular criminal conspiracy is open to potential new members. Please join the Committee To Keep Csv Evil: https://discord.gg/uqu4BkNP5G
Jokes aside, talking about the future is grand but the problem is that data was written in the past and we need to read it in the present. That means that you do have to detect encoding and you can't just decide that the world runs on UTF-8. Even today, mainland China does not use UTF-8 and is not, as far as I know, in the process of switching.
I understand UTF-8 is mostly fine even for east asian languages though - and bytes are cheap
There is no proper CSV specification. That does bring opprobrium. RFC 4180 is from 2005, long after Unicode and XML, so people should have known better. The absence of a real standard points to disagreement, perhaps commercial disagreement, but the IETF should be independent, shouldn't it?
That failure to standardize encoding, and other issues (headers, etc.), has wasted an enormous amount of time for many creative hackers, who could have been producing value, rather than banging their head against a vague assembly of pseudo-spec-illusions. Me included, sadly.
> China does not use UTF-8 and is not, as far as I know, in the process of switching.
That’s … not true? Most Chinese software and websites are utf8 by default, and it’s been that way for a while. GBK and her sisters might linger around in legacy formats, but UTF-8 has certainly reached the point of mass adoption from what I can tell.
> We know CJK languages need UTF-16 for compression.
My understanding is that it is for the opposite of compression: it saves memory when uncompressed, versus UTF-8's multi-byte sequences needing more bytes. My understanding is that UTF-8 multi-byte sequences compress pretty well, as they have common patterns that form dictionary "words" just as easily as anything else. UTF-8 seems to be winning in the long run even for CJK and astral-plane scripts on disk, and the operating systems and applications that preferred UTF-16 in memory are mostly only doing so out of backwards compatibility, and are themselves often using more UTF-8 buffers internally, as those reflect the files at rest.
(.NET has backwards compatibility based on using UTF-16 code-unit strings by default, but has more and more UTF-8-only pathways and some interesting compile-time options now to go UTF-8 only today. Python 3 settled on Unicode strings throughout, with UTF-8 as the default encoding at the boundaries, even with input from CJK communities. UTF-8 really does seem to be slowly winning everything.)
> JSON and CSV are woefully neglectful,
As the article also points out, JSON probably got it right: UTF-8 only and BOM is an error (because UTF-8) (but parsers are allowed to gently ignore that error if they wish). https://www.rfc-editor.org/rfc/rfc8259#section-8.1
That seems to be the way forward for new text-based formats that only care about backward compatibility with low-byte ASCII: UTF-8 only, no BOM. UTF-8 (unlike UTF-16, which lacks reserved surrogate space for further extension) could be extended if we ever do find a reason to go past the "astral planes".
(Anyone still working in CSV by choice is maybe guilty of criminal liability though. I still think the best thing Excel could do to help murder CSV is give us a file extension to force Excel to open a JSON file, like .XLJSON. Every time I've resorted to CSV has been because "the user needs to double click the file and open in Excel". Excel has great JSON support, it just won't let you double click a file for it, which is the only problem, because no business executive wants the training on "Data > From JSON" no matter how prominent in the ribbon that tool is.)
> When the whole world looks to the future, Microsoft will follow.
That ship is turning slowly. Windows backward compatibility guarantees likely mean that Windows will always have some UTF-16, but the terminals in Windows now correctly default to UTF-8 (since Windows 10) and even .NET with its compatibility decrees is more "UTF-8 native" than ever (especially when compiling for running on Linux, which is several layers of surprise for anyone that was around in the era where Microsoft picked UCS-2 as its one true format in the first place).
You can fit Japanese comfortably in a 16 bit charset but Chinese needs more than that.
My take, though, is that CSV is not a good thing because the format isn't completely standardized: you just can't count on people having done the right thing with escaping, whether a particular column is intended to be handled as strings or numeric values, etc.
Where I work we publish data files in various formats, I'm planning on making a Jupyter notebook to show people how to process our data with Pandas -- one critical point is that I'm going to use one of the commercial statistics data formats (like Stata) because I can load the data right the first time and not look back. (e.g. CSV = good because it is "open" is wrong)
If I am exporting files for Excel users I export an Excel file. Good Excel output libraries have been around for at least 20 years and even if you don't have fun with formulas, formatting and all that it is easy to make a file that people will load right the first time and every time.
What you do, rather, is drop support for non-UTF-8.
Work with tech-stacks whose text handling is based strictly around Unicode and UTF-8, and find enough opportunities that way that you don't have to care about anything else.
Let the customers who cling to data in weird encodings go to someone who makes a niche out of supporting that.
Joel Spolsky spoke against this exact statistics-based approach when he wrote about Unicode[1]:
> What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle.
I don't think he was speaking against the statistics-based approach itself, just against Postel's Law in general.
Ideally people would see gibberish (or an error message) immediately if they don't provide an encoding; then they'll know something is wrong, figure it out, fix it, and never have the issue again.
But if we're in a situation where we already have lots and lots of text documents that don't have an encoding specified, and we believe it's not feasible to require everyone to fix that, then it's actually pretty amazing that we can often correctly guess the encoding.
There's the enca library (and CLI tool) which does that. I used it often before UTF-8 became overwhelming. The situation was especially dire with Russian encodings. There were three one-byte encodings in wide use: KOI8-R, mostly found on unixes; CP866, used in DOS; and CP1251, used in Windows. What's worse, with Windows you sometimes had to deal with both CP866 and CP1251, because it includes a DOS subsystem with a separate codepage.
Exactly. I used this technique at Mozilla in 2010 when processing Firefox add-ons, and it misidentified scripts as having the wrong encoding pretty frequently. There's far less weird encoding out there than there are false positives from statistics-based approaches.
If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding. But that should be a fallback, not the first thing you try.
> If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding.
Don't really even need to do that. There's only a handful other encodings still in common use, just try each of them as fallbacks and see which one works without errors, and you'll manage the vast majority of what's not UTF-8.
(We recently did just that for a system that handles unreliable input, I think I remember our fallback only has 3 additional encodings before it gives up and it's been working fine)
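For what it's worth, that whole fallback dance fits in a few lines of Python. A minimal sketch; the fallback list here is illustrative, not a recommendation, and order matters because some single-byte encodings (like KOI8-R) accept any byte sequence and so must come last:

```python
# Try UTF-8 first (strict decoding rejects any invalid sequence),
# then a short, ordered list of legacy fallbacks.
FALLBACKS = ["utf-8", "windows-1252", "koi8-r"]

def decode_best_effort(data: bytes) -> str:
    for enc in FALLBACKS:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: don't crash, make the damage visible instead.
    return data.decode("utf-8", errors="replace")
```

Windows-1252 can still fail here because a handful of its byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined, which is why a catch-all like KOI8-R sits at the end.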
Based on my past role, you can't even assume UTF-8 when the file says it's UTF-8.
Clients would constantly send CSV or other files with an explicit BOM or other marking indicating UTF-8, but the parser would choke since they had just dumped native Windows-1252 or similar into it. I think some programs emit the BOM unconditionally because it's considered standard.
The largest problem with UTF-8 (and its biggest strength) is how similar it is to ASCII. It is for this reason we should consider emoji to be a wonderful gift to software correctness everywhere. Correctly handling emoji requires that your software can handle unicode correctly - because they need multi-unit encoding with both UTF-16 and UTF-8. And emoji won't render correctly unless your software can also handle grapheme clusters.
> When people living in ASCII world casually said "I just assume UTF-8", in reality, you still assume it's ASCII.
Check! If your application deals with text, throw your favorite multi-codepoint emoji into your unit testing data. (Mine is the polar bear). Users love emoji, and your software should handle it correctly. There's no excuse! Even the windows filesystem passes this test today.
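For the record, here's what the polar bear looks like to Python. The counts differ in every encoding, which is exactly why emoji flush out broken code-unit handling:

```python
# One polar bear on screen; under the hood it is a grapheme cluster of four
# code points: bear, zero-width joiner, snowflake, variation selector 16.
bear = "\U0001F43B\u200D\u2744\uFE0F"

assert len(bear) == 4                           # code points
assert len(bear.encode("utf-8")) == 13          # UTF-8 code units (bytes)
assert len(bear.encode("utf-16-le")) // 2 == 5  # UTF-16 code units (one surrogate pair)
```

Software that truncates at a fixed number of "characters" will slice this bear in half somewhere, and the bug is immediately visible.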
It's actually worse than that.
Older versions of macOS did enforce NFD for file names, but more recent versions don't, at least at the OS level. But many Apple programs, such as Finder, _will_ use NFD. Except that it isn't even Unicode-standardized NFD; it's Apple's own modified version of it. And this can cause issues when, for example, you create a file in Finder, then search for it using `find` and type the name of the file the exact same way, but it can't find the file because `find` got an NFC form while the actual file name is in NFD.
OTOH, in many applications you don't really care about the normalization form used. For example, if you are parsing a CSV, you probably don't need to worry about whether one of the cells uses a single code point or two code points to represent that accented e.
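Python's unicodedata module makes the NFC/NFD difference easy to see, and to neutralize before comparing:

```python
import unicodedata

nfc = "\u00e9"                            # 'é' as one precomposed code point
nfd = unicodedata.normalize("NFD", nfc)   # 'e' followed by combining acute (U+0301)

assert nfc != nfd and len(nfd) == 2       # bytewise different, visually identical
assert unicodedata.normalize("NFC", nfd) == nfc   # normalize before comparing
```

If you do care (file-name matching is the classic case), normalizing both sides to the same form before comparison sidesteps the Finder-vs-find problem described above.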
About a decade ago I wrote some utility code for reading files, where it'll try to detect a BOM first, and failing that, scan for invalid UTF-8 sequences. If none are found, assume UTF-8; otherwise assume Windows-1252. Has worked well for us so far.
Still get the occasional flat file in Windows-1252 with one random field containing UTF-8, so some special handling is needed for those cases. But that's rare.
Fortunately we don't have to worry about normalization for the most part. If we're parsing then any delimiters will be one of the usual suspects and the rest data.
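That detection order can be sketched in a few lines of Python. This is a sketch of the idea, not the original utility; the 'replace' policy on the fallback is one choice among several:

```python
import codecs

def sniff_decode(data: bytes) -> str:
    # Honor a UTF-8 BOM if present; "utf-8-sig" strips it for us.
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig")
    # Strict UTF-8 decoding doubles as the "scan for invalid sequences" step.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to Windows-1252; 'replace' keeps its few undefined
        # byte values from crashing the import.
        return data.decode("windows-1252", errors="replace")
```

This works because non-ASCII Windows-1252 text is almost never coincidentally valid UTF-8, so the strict decode acts as a reliable discriminator.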
It depends. If you're writing an app, just add the necessary incantation to your manifest, and all the narrow char APIs start talking UTF-8 to you.
For a library, yeah.
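The incantation in question is the activeCodePage setting in the application manifest, documented by Microsoft for Windows 10 version 1903 and later; with it, the narrow-char (`-A`) Win32 APIs use UTF-8 as the process code page. A minimal sketch (the assembly name is a placeholder):

```xml
<?xml version="1.0" encoding="utf-8"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <!-- "YourApp" is a placeholder identity -->
  <assemblyIdentity type="win32" name="YourApp" version="1.0.0.0"/>
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```

Libraries can't assume their host process carries this manifest, which is why the "for a library, yeah" caveat stands.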
[1]Me, surely at least 1 other
If Excel generates CSV files with some Windows-1234 encoding, then my "import data from CSV" function needs to handle that, in one way or another. A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding, they won't care that Microsoft is using obsolete or weird defaults. They will see it as a bug in my program and demand that I fix my software. Even if Excel offers them a choice in encoding, they won't understand any of that and more importantly they don't want to deal with that right now, they just want the thing to work.
It doesn't. Well, maybe "another".
Your function or even app doesn't need to handle it. Here's what we did on a bookkeeping app: remove all the heuristics, edge cases, broken-CSV handling and validation from all the CSV ingress points.
Add one service that did one thing: receive a CSV, and normalize it. All ingress points could now assume a clean, valid, UTF8 input. It removed thousands of LOCs, hundreds of issues from the backlog. It made debugging easy and it greatly helped the clients even.
At some point, when we offered their import runs for download, we added our normalized versions as [original name]_clean.csv. Got praise for that. Clients loved it, as they were often well aware of their internal mess of broken CSVs.
What if you just gave them instructions?
It’s much easier to tell the people with old cassette tapes to rip them, rather than try to put a tape player in every car.
I assume you mean "rip them", as in transcode to a different format?
In that case, you need a tool that takes the old input format(s) and convert them to the new format.
For text files, you'd need a tool that takes the old text files with various encodings and converts them to UTF-8.
Isn't the point of the article to describe how an engineer would create such a tool?
Whatever happened to the Robustness Principle[1]? I think the entire comment section of this article has forgotten it. IMO the best software accepts many formats and "deals with it," or at least attempts to, rather than just exiting with "Hahah, Error 19923 Wrong Input Format. Try again, loser."
1: https://en.wikipedia.org/wiki/Robustness_principle
If you'd like to see why, read the HTML 5 parsing portion of the spec. Slowly and carefully. Try to understand what is going on and why. A skim will not reveal the issue. You will come to a much greater understanding of the problem. Some study of what had happened when we tried to upgrade TCP (not the 4->6 transition, that's its own thing) and why the only two protocols that can practically exist on the Internet anymore are TCP and UDP may also be of interest.
Being lenient is all well and good when the consequences are mild. When the consequences of misinterpreting or interpreting differently to a second implementation becomes costly, such as a security exploit, then the Robustness Principle becomes less obviously a win.
It's important to understand that every implementation will try to fix-up formatting problems in their own way unique to their particular implementation. From that you get various desync or reinterpretation attacks (eg. HTTP request smuggling).
But that is not always the case, and e.g. silently "fixing" text encoding issues can often corrupt the data if you get it wrong.
By all means offer options if you want, but if you do, flag very clearly to the user that they're risking data corruption, unless any errors are very apparent and trivial to undo.
This basically loses data integrity if it's wrong though.
You might want to do that with human input if it's helpful to the user - ie user enters a phone number and you strip dashes etc. But if it's machine to machine, it should just follow the spec.
If you reject malformed input, then the person who created it has to go back and fix it and try again. If you interpret malformed input the best you can (and get it right), then everyone else implementing the same thing in the future now also has to implement your heuristics and workarounds. The malformed input effectively becomes a part of the spec.
This is why HTML & CSS are the garbage dump they are today, and why different browsers still don't always display pages exactly alike. The reason HTML5 exists is because people finally just gave up and decided to standardize all the broken behavior that was floating around in the wild. Pre-HTML5, the web was an outright dumpster fire of browser compatibility issues (as opposed to the mere garbage dump we have today).
Anyway, it's not really important to try to convince you that Postel's Law is bad; what's important is that you know that many people are starting to think it's bad, and there's no longer any strong consensus that it was ever a good thing.
Bush hid the facts
https://en.wikipedia.org/wiki/Robustness_principle#Criticism
Postel's Law doesn't pass a software engineering smell test.
The idea that software should guess and repair bad inputs is deeply flawed. It is a security threat and a source of enshittification.
I took the article to be for people who would be writing that "standalone program"?
I have certainly been in a position where I was the person who had to deal with input text files with unknown encodings. There was no-one else to hand off the problem to.
Browsers do, kind of https://mimesniff.spec.whatwg.org/#rules-for-identifying-an-...
Browsers used to have a menu option to choose the encoding you wanted to use to decode the page.
In Firefox, that's been replaced by the magic option "Repair Text Encoding". There is no justification for this.
They seem to be in the process of disabling that option too:
> Note: On most modern pages, the Repair Text Encoding menu item will be greyed-out because character encoding changes are not supported.
( https://support.mozilla.org/en-US/kb/text-encoding-no-longer... )
This note is logical gibberish; encoding isn't something that has to be supported by the page. Decoding is a choice by the browser!
https://hsivonen.fi/no-encoding-menu/
> Supporting the specific manually-selectable encodings caused significant complexity in the HTML parser when trying to support the feature securely (i.e. not allowing certain encodings to be overridden). With the current approach, the parser needs to know of one flag to force chardetng, which the parser has to be able to run in other situations anyway, to run.
> Elaborate UI surface for a niche feature risks the whole feature getting removed
> Telemetry [...] suggested that users aren’t that good at choosing correctly manually.
In other words, it's trying to protect users from themselves by dumbing down the browser. (Never mind that people who know what they are doing have probably also turned off telemetry...)
Same with UTF-8. Life's too short for bothering with anything else today. I'll deal with some weird janky encoding for the right price, but the first thing I'd do is convert it to UTF-8. Damned if I'm going to complicate the innards of my code with special case code paths for non-UTF-8.
If there were some inherent issue with UTF-8 that made it significantly worse than some other encoding for a given task, I'd be sympathetic to that explanation and wouldn't be such a pain in the neck about this. For instance, if it were the case that it did a bad job of encoding Mandarin or Urdu or Xhosa or Persian, and the people who use those languages strongly preferred to use something else, I'd understand. However, I've never heard a viable explanation for not using UTF-8 other than legacy software support, and if you want to continue to use something ancient and weird, it's on you to adapt it to the rest of the world because they're definitely not going to adapt the world to you.
Unfortunately Google and many other companies have decided UTC is the only way, so this causes issues with ICS files that use that format sometimes when they are generating their helpful popups in the GMail inbox.
If you have to take medication (for instance, an antibiotic) every 24 hours, it must be taken at the same UTC hour, even if you took a train to a town in another timezone. Keeping the same local time even when the timezone changes would be wrong for that use case.
I don't know if you picked this example on purpose, but Chinese text encoded as UTF-8 is about 50% larger than in the old encoding (GB2312). I remember people caring about this around twenty years ago. I don't know of anyone that still cares about this encoding inefficiency: any compression algorithm removes it while using negligible CPU to decompress.
Which is all to say that you're right, but I can't imagine that it's more than a theoretical nuisance outside some extremely niche cases.
[0] https://utf8everywhere.org/#asian
Please look up the issues caused by Han unification in Unicode. It’s an important reason why the Chinese and Japanese encodings are still used in their respective territories.
> If you give me a computer timestamp without a timezone, I can and will assume it's in UTC.
Do you mean, give you an _offset_? In `2024-04-29T14:03:06.0000-08:00`, the `-08:00` is an offset. It only tells you when this stamp occurred relative to UTC. It does not tell you anything about the region or zone itself. While I have consumed APIs that give me the timezone context as part of the response, none of them made it part of the timestamp itself.
The only time you should assume a timestamp is UTC is if it has the `Z` at the end (assuming ISO 8601) or is otherwise marked as UTC. Without that, you have absolutely no information about where or when the time occurred; it is local time. And if your software assumes a local timestamp is UTC, then I argue it is not the sender of that timestamp's problem that your software is broken.
My desire to meet you at 4pm has no bearing on whether the DST switchover has happened, or my government decides to change the timezone rules, or {any other way the offset for a zone can change for future or past times}. My reminder to take my medicine at 7pm is not centered on UTC or my physical location on the planet. It's just at 7pm. Every day. If I go from New York to Paris then no, I do not want your software to tell me my medicine is actually supposed to be at midnight. It's 7pm.
But, assuming you aren't doing any future scheduling, calendar appointments, bookings, ticket sales, transportation departure, human-centric logs, or any of the other ways Local Time is incredibly useful -- ignore away.
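The offset-versus-zone distinction is easy to demonstrate with Python's zoneinfo; the dates and zones here are illustrative:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# An offset is just a number; a zone carries the rules that produce it.
stamped = datetime.fromisoformat("2024-04-29T14:03:06-08:00")
assert stamped.utcoffset() == timedelta(hours=-8)   # that's all we know

# Same wall time with a named zone: now DST rules apply, so the
# offset in late April is -7, not -8.
zoned = datetime(2024, 4, 29, 14, 3, 6, tzinfo=ZoneInfo("America/Los_Angeles"))
assert zoned.utcoffset() == timedelta(hours=-7)     # PDT
```

Note that the `-08:00` stamp can't even be Los Angeles local time on that date; the bare offset gives you no way to detect such mismatches, which is the point.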
Consider my statement more in the context of logs of past events. The only time you can reasonably assume a given file is in a particular non-UTC TZ is when it came from a person sitting in your same city, from data they collected manually, and you're confident that person isn't a time geek who uses UTC for everything. Otherwise there's no other sane default when lacking TZ/offset data. (I know they're not the same, but they're similar in the sense that they can let you convert timestamps from one TZ to another).
"Convert to UTC and then throw away the time zone" only works when you need to record a specific moment in time so it's crazy how often it's recommended as the universal solution. It really isn't that hard to store (datetime, zone) and now you're not throwing away information if you ever need to do date math.
You usually end up with having to deal with whatever eccentric sh!t that ultimately comes from the same source as the payment for the job.
UTF-8 should have no BOM. It is the default, and it has no byte-order ambiguity that needs marking. Requiring a UTF-8 BOM just destroys the carefully planned property that ASCII-is-UTF-8. Why spoil that good work?
Other variants of Unicode have BOMs, e.g. UTF-16BE. UTF-16 encodes CJK text more compactly, so those encodings have their place. The BOM is only a couple more bytes. No problem, so far so good.
But there are old files, that are in 'platform encoding'. Fine, let there be an OS 'locale', that has a default encoding. That default can be overridden with another OS 'encoding' variable. And that can be overridden by an application arg. And there may be a final default for a specific application, that is only ever used with one language encoding. Then individual files can override all of the above ...
Text file (serialization) formats should define handling of optional BOM followed by an ASCII header that defines the encoding of the body that follows. One can also imagine .3 filetypes that have a unique or default encoding, with values maintained by, say, IANA (like MIME types). XML got this right. JSON and CSV are woefully neglectful, almost to the point of criminal liability.
But in the absence of all of the above, the default-default-default-default-default is UTF-8.
We are talking about the future, not the past. Design for UTF-8 default, and BOMs for other Unicodes. Microsoft should have defined BOMs for Windows-Blah encodings, not for UTF-8!
When the whole world looks to the future, Microsoft will follow. Eventually. Reluctantly.
If you assume UTF-8, you will have corrupted text.
I agree that I'm mad about Excel outputting Win-1252 CSV by default.
Don't interpret user-supplied strings at all. Define max lengths as byte lengths.
Remain agnostic of encoding. Especially in libraries.
It's easier than people think it is thanks to some very clever people's work a long time ago.
I worked at an AI company that was ahead of its time (actually I worked at a few of those) where the data scientists had a special talent for finding Docker images with strange configurations, so I'd constantly discover that one particular container was running with a Hungarian or other wrong charset.
(And that's the problem with Docker... People give up on being in control of their environment and just throw in five different kitchen sinks and it works... Sometimes)
Sidenote: this particular criminal conspiracy is open to potential new members. Please join the Committee To Keep Csv Evil: https://discord.gg/uqu4BkNP5G
Jokes aside, talking about the future is grand but the problem is that data was written in the past and we need to read it in the present. That means that you do have to detect encoding and you can't just decide that the world runs on UTF-8. Even today, mainland China does not use UTF-8 and is not, as far as I know, in the process of switching.
I understand UTF-8 is mostly fine even for east asian languages though - and bytes are cheap
There is no authoritative CSV specification, and that deserves opprobrium. RFC 4180 is from 2005, long after Unicode and XML, so people should have known better. The absence of a real standard points to disagreement, perhaps commercial disagreement, but the IETF should be independent, should it not?
That failure to standardize encoding, and other issues (headers, etc.), has wasted an enormous amount of time for many creative hackers, who could have been producing value, rather than banging their head against a vague assembly of pseudo-spec-illusions. Me included, sadly.
Let me reframe it as a Schelling Point [1] - the uncoordinated coordination problem.
You arrange to meet your file on a certain day in New York, no place or time were mentioned. When and where will you go? It seems impossible.
But perhaps you go at noon, to the UTF-8 Building in midtown Manhattan. Are you there now?
[1] https://en.wikipedia.org/wiki/Focal_point_(game_theory)
That’s … not true? Most Chinese software and websites are utf8 by default, and it’s been that way for a while. GBK and her sisters might linger around in legacy formats, but UTF-8 has certainly reached the point of mass adoption from what I can tell.
My understanding is that it is for the opposite of compression: it saves memory when uncompressed, versus the UTF-8 multi-byte sequences needing more bytes. My understanding is that UTF-8 multi-byte sequences compress pretty well, as they have common patterns that form dictionary "words" just as easily as anything else. UTF-8 seems to be winning in the long run even for CJK and astral-plane languages on disk, and the operating systems and applications that preferred UTF-16 in memory are mostly only doing so out of backwards compatibility and are themselves often using more UTF-8 buffers internally, as those reflect the files at rest.
(.NET is stuck with UTF-16 code-unit strings by default for backwards compatibility, but has more and more UTF-8-only pathways and some interesting compile-time options now to use UTF-8 only. Python 3 settled on UTF-8 as its default encoding, even with input from CJK communities. UTF-8 really does seem to be slowly winning everything.)
> JSON and CSV are woefully neglectful,
As the article also points out, JSON probably got it right: UTF-8 only and BOM is an error (because UTF-8) (but parsers are allowed to gently ignore that error if they wish). https://www.rfc-editor.org/rfc/rfc8259#section-8.1
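CPython's json module is an example of a parser taking exactly that posture: it quietly tolerates a BOM on bytes input (where it may be an artifact of the transport) but rejects it in an already-decoded str (where it can only be a mistake):

```python
import json

# Bytes input: the UTF-8 BOM is detected and skipped, not treated as data.
assert json.loads(b'\xef\xbb\xbf{"a": 1}') == {"a": 1}

# str input: a leading BOM is an error per RFC 8259.
try:
    json.loads('\ufeff{"a": 1}')
    raise AssertionError("expected a parse error")
except ValueError:
    pass  # rejected, with a hint to decode using utf-8-sig
```

That split is a reasonable template for any new text format: forgive the BOM at the byte boundary, never inside decoded text.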
That seems to be the way forward for new text-based formats that only care about backward compatibility with low-byte ASCII: UTF-8 only, no BOM. And unlike UTF-16, which has no surrogate space reserved for further expansion, UTF-8's scheme could be extended past the "astral planes" if we ever found a reason to go there.
(Anyone still working in CSV by choice is maybe guilty of criminal liability though. I still think the best thing Excel could do to help murder CSV is give us a file extension to force Excel to open a JSON file, like .XLJSON. Every time I've resorted to CSV has been because "the user needs to double click the file and open in Excel". Excel has great JSON support, it just won't let you double click a file for it, which is the only problem, because no business executive wants the training on "Data > From JSON" no matter how prominent in the ribbon that tool is.)
> When the whole world looks to the future, Microsoft will follow.
That ship is turning slowly. Windows backward compatibility guarantees likely mean that Windows will always have some UTF-16, but the terminals in Windows now correctly default to UTF-8 (since Windows 10) and even .NET with its compatibility decrees is more "UTF-8 native" than ever (especially when compiling for running on Linux, which is several layers of surprise for anyone that was around in the era where Microsoft picked UCS-2 as its one true format in the first place).
My take though is that CSV is not a good thing, because the format isn't completely standardized: you just can't count on people having done the right thing with escaping, on whether a particular column is intended to be handled as strings or numeric values, etc.
Where I work we publish data files in various formats. I'm planning on making a Jupyter notebook to show people how to process our data with Pandas. One critical point is that I'm going to use one of the commercial statistics data formats (like Stata) because I can load the data right the first time and not look back. (That is, "CSV is good because it is open" is wrong.)
If I am exporting files for Excel users I export an Excel file. Good Excel output libraries have been around for at least 20 years and even if you don't have fun with formulas, formatting and all that it is easy to make a file that people will load right the first time and every time.
So that you know you are dealing with UTF-8. Assuming ASCII only works if you are only dealing with English texts and data.
What you do, rather, is drop support for non-UTF-8.
Work with tech-stacks whose text handling is based strictly around Unicode and UTF-8, and find enough opportunities that way that you don't have to care about anything else.
Let the customers who cling to data in weird encodings go to someone who makes it their niche to support that.
1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
I got 99 problems, but charsets aint one of them.