capitainenemo · 4 years ago
This is also a security vulnerability because MySQL/MariaDB silently truncate strings at the first invalid character. This can result in data manipulation attacks where a higher level layer validates the complete input as UTF-8 for insertion into the DB, but the database only stores half the string.

https://bugzilla.mozilla.org/show_bug.cgi?id=1253201
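To make the failure mode concrete, here's a minimal Python sketch (not MySQL code) of what a non-strict utf8mb3 column does: everything from the first supplementary-plane character onward is silently dropped, so a fully validated input can be stored in half.

```python
def utf8mb3_truncate(s: str) -> str:
    """Simulate a non-strict MySQL "utf8" (utf8mb3) column: the value is
    silently cut at the first character outside the BMP, i.e. anything
    needing a 4-byte UTF-8 sequence."""
    for i, ch in enumerate(s):
        if ord(ch) > 0xFFFF:  # needs 4 bytes in UTF-8
            return s[:i]
    return s

# A validation layer sees valid UTF-8, but the column keeps only the prefix:
print(utf8mb3_truncate("admin\U0001F600' OR 1=1 --"))  # -> admin
```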

evanelias · 4 years ago
Silent truncation only occurs if strict sql_mode is disabled. Strict sql_mode has been the default since MySQL 5.7, released over 6 years ago. All prior versions have hit EOL support, so there is literally no supported version of MySQL which has this behavior out-of-the-box.

The topic of MySQL utf8 vs utf8mb4 was discussed in depth just six days ago on HN: https://news.ycombinator.com/item?id=29793916

capitainenemo · 4 years ago
That's good to know; however, I was encountering this MySQL truncation just last year in COTS software, so clearly there's a long tail. As a related example, log4j 1.x has been deprecated for years, but Atlassian uses it across the board, with their own maintained version.

In fact, when the log4j thing came up, we audited all our existing COTS/internal and virtually all of it was still on the "EOL" log4j 1.x.

Since you're familiar with this. What happens with an existing upgrade - is strict mode enabled by a typical distro package upgrade or does it only impact new installs?

Ginden · 4 years ago
> Silent truncation only occurs if strict sql_mode is disabled. Strict sql_mode has been the default since MySQL 5.7, released over 6 years ago.

What happens if you upgrade from older version? Are defaults from previous version used, or is it default only on fresh installs?

kukx · 4 years ago
The strict sql_mode is often manually disabled. I bet it is off on most shared hosting.
purple_ferret · 4 years ago
Lots of people turn it off because Warnings are easier to ignore than Errors are
capitainenemo · 4 years ago
I do find this behaviour rather incomprehensible, actually. I understand that if "utf8" is BMP-only, it can't store characters outside the BMP. What puzzles me is that the approach they went for seems to be almost the worst possible one from a security and data-integrity perspective.

Assuming a SQL error can't be thrown for some legacy reason (and you'd think a flag to opt into that would be an option), why not replace the invalid parse with U+FFFD like most things do? Or at a bare minimum, only strip the invalid character which at least makes attacks a bit harder and results in a little less data loss.
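For comparison, this is the substitution behaviour the comment is asking for, as implemented by Python's decoder (a generic illustration of the U+FFFD convention, not anything MySQL does):

```python
# Most decoders substitute U+FFFD for input they can't represent,
# keeping the rest of the string intact instead of truncating.
bad = b"caf\xc3\xa9 \xff tail"  # \xff can never appear in valid UTF-8
print(bad.decode("utf-8", errors="replace"))  # -> café, then U+FFFD, then ' tail'
```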

pshc · 4 years ago
Quietly corrupting data and shipping with the wrong default settings is what MySQL is all about! Keeps the consultants busy.
capitainenemo · 4 years ago
Just wanted to reply to myself that I just learned from evanelias' comment a "strict mode" flag is indeed an option. I suppose it's time to audit all my mysqls to see which ones have this enabled, and what other mysql-isms this flag might break.

I know it's definitely not enabled on one updated and maintained BSD - this could be because they didn't want to break existing systems during upgrade.

I suspect this problem is still extremely common. But good to know there's some form of defense.

tyre · 4 years ago
Another fun one that came up at my first job: If you run up against the size limit of a text column in MySQL, it silently truncates.

This came up when a developer used YAML for some configuration file. There was all sorts of confounding behaviour because YAML is still valid when truncated.

Every database has limits, but at least alert the user! I’d rather a query fail than alter data.

Edit:

This is something that might be useful in CS classes, to reflect what being a real-world engineer often entails: set an assignment with some silly nonsense like this and have students debug it.

cornel_io · 4 years ago
There's a setting for this, but like many settings MySQL has (along with InnoDB, the storage engine) historically chosen exceptionally bad defaults. It's possible to tune the thing to be screaming fast and pretty robust, but it takes a lot of expertise and TBH these days most people would be much better off going either NoSQL or Postgres instead.
marginalia_nu · 4 years ago
Why were you putting YAML in a relational database?
leksak · 4 years ago
If the point of CS education is to produce programmers and software engineers as opposed to computer scientists, yes.
dehrmann · 4 years ago
It's like MySQL and PHP were meant for each other.
anon776 · 4 years ago
How dare you. Also accurate.
ajnin · 4 years ago
The day I realized that is the day I stopped using MySQL. It was a very long time ago; I was burned by this exact issue. I don't know what's more fundamental for a database than storing the data you ask it to store. If it stores something else, it has failed at its job spectacularly. At the time I experienced this bug I think it didn't even give a warning, but in a way a warning is worse: it knows it's storing incorrect data but does so anyway.

And it still has not been fixed after all these years. I don't know what use a character encoding is that only partially supports Unicode (you just have to try it to find out in what way) and is used by nobody else. I don't really care that at some point in the future utf8 is going to become an alias for utf8mb4; a DB capable of this behaviour cannot be trusted. The episode also led me to discover Postgres, which seems superior in just about every way, so I have no need to go back.

wanderr · 4 years ago
Any data in your database that can come from external input should be treated as untrusted and validated before it's used. Otherwise validation bugs or bypasses will result in bad data and exploits that persist beyond the fix. Edit: I’m not arguing against the need to use utf8mb4
kstrauser · 4 years ago
While that’s true, I trust PostgreSQL to store exactly what I’ve asked it to store. At some point, you have to trust something to do its job, or else everything built on top of it is a castle of sand.

Imagine a bug like this in ext4. No one would reasonably contend that the layers on top of it should be validating that the files you write out are the ones you’ll read back in. We write unit tests for all kinds of stuff, but we’re not that thorough.

capitainenemo · 4 years ago
Agreed. The confusing part here as I see it is where validation layer A (correctly) asserts the data is valid UTF-8 and safe then assumes the database persists what it passes to it, since no error is reported.

Then, subsystem B trusts reading the database field (since it passed validation layer A).

Obviously more validation layers can be added, but at this point validation layer C called by subsystem B needs to know what the initial input from layer A is in order to differentiate it from the db value which was manipulated - a rather tricky thing to do sometimes. (I guess you could add a hash to the db to check the db is storing your strings, but really.. come on)

Upgrading to utf8mb4 is probably safer than hoping enough validation layers thrown at it solves the problem.

colpabar · 4 years ago
Sure, but the issue here is that the default mysql "utf8" encoding is not actually utf8. You can write as many validation layers as you want, but if they are assuming that utf8 actually means utf8, they won't help, and mysql will potentially screw it up when it gets stored.
tomwojcik · 4 years ago
I need to share something as literally today I fixed a bug in our project that's somewhat related.

MS SQL encodes everything with UTF-16. Emoji code points require up to 4 bytes; if it's a grapheme cluster (an emoji constructed from several emoji), it will be even more.

We are using Django. If you check length of an emoji, it will give you `1` and Django assumes utf8 everywhere. If you try to save it to PostgreSQL with char field `max_length=1` it will work just fine, but on MS SQL it will fail as it requires 2 characters (4 bytes) in the DB!

I tried it with MS SQL 2017, 2019 with different collations on nvarchar and I'm pretty sure there's no way around it.

> Because in CHAR(n) and VARCHAR(n) or in NCHAR(n) and NVARCHAR(n), the n defines the byte storage size, not the number of characters that can be stored, it's important to determine the data type size you must convert to, in order to avoid data truncation.

https://docs.microsoft.com/en-US/sql/relational-databases/co...
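The length mismatch is easy to reproduce in plain Python; the counts below are standard codec behaviour, independent of Django or any particular database:

```python
s = "\U0001F600"                    # one emoji
print(len(s))                       # 1 code point — what a max_length=1 check sees
print(len(s.encode("utf-8")))       # 4 bytes in UTF-8
print(len(s.encode("utf-16-le")))   # 4 bytes = two UTF-16 code units (a surrogate pair)
```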

gecko · 4 years ago
Generically, anything by Microsoft will historically have used UCS-2, and will use UTF-16 these days, so this is utterly unsurprising to me as an experienced Windows dev. Conversely, Linux (and POSIX, more generally) deciding that filename encoding is a per-filename and untracked thing is a bit lit, from my perspective. Point being: when it comes to handling unicode and foreign characters, just, like...always read the documentation. Assume nothing.
capitainenemo · 4 years ago
Yep. Microsoft is the main reason for 2 of my favourite unicode links. https://utf8everywhere.org/ and https://simonsapin.github.io/wtf-8/

And, apparently it's mildly inaccurate to say it uses UTF-16... it's more like UCS-2 with UTF-16 hacked in, with no validation. Thus WTF-8.
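The distinction shows up in Python's codecs: strict UTF-16 refuses a lone surrogate, while the permissive `surrogatepass` handler (roughly the "no validation" behaviour described above) round-trips it anyway. A small illustration, not Windows API code:

```python
lone = "\ud800"  # an unpaired high surrogate: fine in "UCS-2 with hacks", illegal in UTF-16
try:
    lone.encode("utf-16-le")
except UnicodeEncodeError:
    print("strict UTF-16 rejects the lone surrogate")

# Permissive encoding keeps the raw code unit anyway:
print(lone.encode("utf-16-le", errors="surrogatepass"))  # -> b'\x00\xd8'
```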

pimeys · 4 years ago
With SQL Server 2019, you can write UTF-8 data to `VARCHAR` and `CHAR` fields, if the table collation is `LATIN1_GENERAL_100_CI_AS_SC_UTF8`.

I feel the pain of UTF-16. When I was writing the Rust crate for the TDS protocol, all N-columns for strings always require you to do a full copy from `Vec<u16>` to `Vec<u8>` and back.

tomwojcik · 4 years ago
I used this collation on 2019 when testing and the only thing that has changed was the error message (more verbose).
tootie · 4 years ago
This kind of bug gives me nightmares. I'm afraid to ask how long it took you to solve.
tomwojcik · 4 years ago
2 days to identify and understand the problem because I wasn't able to reproduce at all, only on prod. Then ~1 day to fix but I've decided to handle not emojis per se, but unicode characters (which makes more sense but requires more knowledge) which was finished within the next 2 days. In the meantime I set up a test project with PostgreSQL, 3 MySQL databases and 2 MS SQL Server so all operations were executed on all 6 at once.

The worst thing is Django won't raise an exception on max_length exceeded even with MS SQL because on the application level its length is 1, so only the DB (DataError on constraint created by Django) will complain.

throw0101a · 4 years ago
Note:

> The utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead. utf8 is currently an alias for utf8mb3, but it is now deprecated as such, and utf8 is expected subsequently to become a reference to utf8mb4. Beginning with MySQL 8.0.28, utf8mb3 is also displayed in place of utf8 in columns of Information Schema tables, and in the output of SQL SHOW statements.

* https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-...

ghusbands · 4 years ago
Maybe they should rename/alias it to brokenutf8, to make it more likely that people will notice a problem.
ehsankia · 4 years ago
That's not needed, what is needed is what they said they would do:

> and utf8 is expected subsequently to become a reference to utf8mb4

Once utf8 points to the right thing, the mistake will stop happening as no one is explicitly using utf8mb3 by accident.

Not sure why this wasn't done right away, and only will be "subsequently". Has it been done yet?

lmm · 4 years ago
Hmm, I wonder if they'll deprecate "LATIN1" as an alias for "CP1252 + 8 arbitrary characters" (a charset that literally only exists in MySQL) as well.
ksec · 4 years ago
I really wish they'd do this in a new major version release, but they don't seem to be keen on releasing 9.0.
iliketrains · 4 years ago
Fun story: when I was working at one of the FAANG companies, I placed the Night With Stars character (it does not seem to render here [0]) in my preferred name. Some time later, I was notified by HR that I had to change my preferred name and could only use UTF-8 characters with a maximum length of 3 bytes. I was quite confused by such a specific demand; I'm pretty sure it was exactly this issue. I replaced it with ᗧ···ᗣ···

[0] https://emojipedia.org/night-with-stars/

jiggawatts · 4 years ago
Are you called Bobby Tables by any chance?
urbandw311er · 4 years ago
Fantastic response
donatj · 4 years ago
Is Medium secretly mining crypto in the background? I have no idea what it is but their site makes my CPU spike and my fans turn on.

https://jdon.at/JTHj2G

Update - here's a video, it's making hundreds of graphql requests a minute…

https://jdon.at/z05ImC

dark-star · 4 years ago
In the past, Unicode was assumed to be 64k of codepoints, so a 3-byte UTF-8 sequence was considered "long enough", especially since there were surrogate pairs for the rare cases where you have to encode higher code points.

Only "recently" have longer UTF-8 sequences (most visibly, emoji) become widespread enough for this to be a problem.

Yes, it could have been avoided if they had allowed arbitrary-length UTF-8 sequences from the beginning, but I can see that they probably just wanted to optimize a bit.

What I don't understand is why they had to create a different encoding (the utf8mb4) instead of just extending the existing utf8 encoding, since 4-byte UTF-8 is obviously backward-compatible with 3-byte UTF-8... (unless they always used 3 bytes for every character, which would be stupid as UTF-8 has been explicitly designed for variable-length encodings)

Bonus: many filesystems also do not allow 4+ byte UTF-8 code points in filenames. Test your company's file server to see if it allows you to save a file as "(some random emoji).doc". A few very expensive storage systems also have problems with that (and they have the same workaround: convert your filesystem to a different encoding, instead of simply extending the existing encoding to allow 4+ bytes).
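The 3-byte/4-byte boundary is easy to check: UTF-8 needs at most 3 bytes up to U+FFFF (the BMP) and 4 from U+10000 on, which is why a 3-byte limit cuts off exactly the supplementary planes. A quick check in Python:

```python
# UTF-8 length grows with the code point; 3 bytes covers exactly the BMP
for cp in (0x41, 0x7FF, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF):
    print(f"U+{cp:06X} -> {len(chr(cp).encode('utf-8'))} bytes")
# U+00FFFF is the last 3-byte code point; everything above needs 4
```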

raphlinus · 4 years ago
While non-BMP Unicode has only become popular somewhat recently due to emoji, it's been part of the relevant official specs for a quarter century.

I know you were kinda joking by using scare quotes for "recently", but let me fill that in for people who might be less familiar with the history. Unicode broke out of the 16-bit limit as of version 2.0, released in July 1996, so yeah, a quarter century. The first RFC for UTF-8 (RFC 2044, Oct 1996) explicitly supports UCS-4. As far as I know, there has never been a time where UTF-8 restricted to the BMP (ie 3 bytes) has been in widespread use. In other words, I am questioning your assertion that 3-byte UTF-8 has ever been considered "long enough," except perhaps in sloppy implementations made without much care for following specs.

avar · 4 years ago
Indeed. It's somewhere between sad and amusing, for those of us who care about i18n, that encoding issues in various software that we'd been pointing out for the better part of the last 20+ years suddenly became no big deal to fix in the last ~5 years, once 4-byte Unicode emoji got widely deployed for users whose languages would otherwise fit in US-ASCII or Latin-1.

It's also a lesson for future generations: if you want an encoding specification to be fully implemented, make sure that doing so is synonymous with the ability to display something like the dog poop emoji. Clearly "this allows millions of people to read their native language" was too boring a reason to care.

chungy · 4 years ago
> except perhaps in sloppy implementations made without much care for following specs.

Describes MySQL perfectly ;)

ghusbands · 4 years ago
Almost everything you claim here is wrong. UTF-8 has never been a 3-byte encoding: the spec initially specified [1] an up-to-six-byte encoding (after Unicode had already gone past 16 bits [2]) and was then reduced to 4 bytes by RFC 3629 [3] in 2003. MySQL is the only piece of software I know of that invented its own length. (Though a lot of software often failed to validate it at all.)

There's nothing recent about it - it was 22 years ago! There's no significant optimisation advantage in the different length, and you can see from the patch that reduced the max-length [4] that it wasn't about optimisation. I don't think you can name a single file system that restricted UTF-8 to three bytes.

[1] https://datatracker.ietf.org/doc/html/rfc2044 [2] https://unicode.org/faq/utf_bom.html [3] https://datatracker.ietf.org/doc/html/rfc3629 [4] https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...

dark-star · 4 years ago
It is you who does not know what he's talking about:

> UTF-8 has never been a 3 byte encoding

I never claimed that

> There's nothing recent about it

The non-BMP characters are "recent" because 10 years ago the non-BMP planes were mostly unallocated, except for some small areas. Also, I said it "became popular recently", due to emoji. Before that, non-BMP codepoints were rarely used.

> I don't think you can name a single file system that restricted UTF-8 to three bytes.

WAFL[1] (unless you "format" it as utf8mb4, which was only implemented a few years ago...)

[1] https://docs.netapp.com/ontap-9/topic/com.netapp.doc.cdot-fa...

chungy · 4 years ago
> Many filesystems also do not allow 4+ byte UTF 8 code points in filenames.

I can't even think of a single example. Most filesystems just offer filenames of up to 255 8-bit units, where 4-byte UTF-8 sequences are totally a non-issue. ZFS supports utf8only=on, which enforces that filenames conform to UTF-8, in which case 4-byte sequences are still not a problem.

What's the filesystem that doesn't allow it?

brandmeyer · 4 years ago
Apple filesystems are infamous for breaking this convention. IIUC, they will support codepoints which expand to 4 bytes in UTF-8, but their implicit normalization rules can trip up programs which expect behavior closer to POSIX norms.
bearjaws · 4 years ago
It's way more than emoji: people copy-pasting from Excel will inadvertently add characters that need utf8mb4 to your database. MySQL handles this extremely poorly, and the result is ugly mangled characters on display.
xigoi · 4 years ago
It's not backward-compatible if someone relies on the errors. I'm guessing it's a case of spacebar heating.
Anthony-G · 4 years ago
For those unfamiliar with the spacebar heating reference: https://xkcd.com/1172/
int_19h · 4 years ago
> why they had to create a different encoding (the utf8mb4) instead of just extending the existing utf8 encoding, since 4-byte UTF-8 is obviously backward-compatible with 3-byte UTF-8

Because CHAR and VARCHAR columns have max length specified, and this translates to the corresponding amount being reserved in storage. For a variable-length encoding, this is normally computed assuming the largest possible value for every codepoint.
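As a back-of-the-envelope illustration of that sizing rule (the multiplication is the worst case described above; the function name is ours, not a MySQL API):

```python
def worst_case_bytes(n_chars: int, max_bytes_per_char: int) -> int:
    # Worst-case reservation: every character at the charset's maximum width
    return n_chars * max_bytes_per_char

print(worst_case_bytes(255, 3))  # VARCHAR(255) utf8mb3: 765 bytes
print(worst_case_bytes(255, 4))  # VARCHAR(255) utf8mb4: 1020 bytes
```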

mrcarruthers · 4 years ago
And here's the commit that changed it: https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e.... It was originally set to max 6 bytes but for some reason someone dropped it to 3.

They also seem to be slowly phasing it out. Internally it's utf8mb3 and utf8 is just an alias. The idea is to eventually make utf8 an alias to utf8mb4.

isotopp · 4 years ago
Back then MySQL used the memory engine for implicit temporary tables, and memory engine has no variable width data types. So a "varchar" becomes a "char", and a "varchar(255) charset utf8" becomes an allocation of six times 255 = 1530 bytes (filled with a lot of padding).

Reducing this to 3 bytes is still bad, but only half as bad as before.

Only MySQL 8 introduced innodb typed temporary tables (unlogged, even), so with MySQL 8 this is no longer a problem.

Animats · 4 years ago
That was set up when Microsoft and Java had standardized on UTF-16. So this can represent the UTF-16 subset of Unicode, which is Unicode Plane 0, the Basic Multilingual Plane (BMP). The higher-numbered "astral planes" of UTF-8 were rarely used. All modern languages with a significant user base are covered in Plane 0. Plane 1, the Supplementary Multilingual Plane, with Cretan Linear B, Egyptian hieroglyphics, and such, was seldom needed. Few people had fonts for those, anyway.

Because of the way UTF-8 is encoded, it takes 3 bytes to represent the UTF-16 set. That's because it only takes one byte for ASCII characters. Hence, 3-byte MySQL data.

Emoji, though, were put in Unicode Plane 1. That's where mass demand for the astral planes came from. More four byte UTF-8 characters started showing up in data.

tremon · 4 years ago
A UTF-16 "subset" of Unicode doesn't exist. All UTF-* encodings are just that: encodings. They can all represent the entire Unicode character set. (Except for UTF-7, which never was an official standard.)

What you're referring to is UCS-2 [0], which is UTF-16 without support for the high surrogates that make up the rest of the Unicode character set. But to imply that UTF-16 is a subset of Unicode is just not true.

[0] https://en.wikipedia.org/wiki/UCS-2

Dylan16807 · 4 years ago
That they can all represent all of unicode is true, but they're not "just" encodings either. The current code space of unicode is built around UTF-16. That's why there are 17 planes.
light_hue_1 · 4 years ago
There's so much wrong with this answer. It completely confuses Unicode, with Unicode encodings, and with non-Unicode encodings.

> So this can represent the UTF-16 subset of Unicode

There is no UTF-16 subset of Unicode. UTF-16 has surrogate pairs, it can represent all of Unicode. You are talking about UCS-2, a format which has no surrogate pairs.

> That was set up when Microsoft and Java had standardized on UTF-16

Java standardized on UCS-2 initially (which is what you're talking about). In the mid 2000s they moved to actual UTF-16.

> The higher-numbered "astral planes" of UTF-8

"astral planes" are not a UTF-8 concept. They're a Unicode concept. A Unicode code point has 6 hex digits. The plane is just the first 2 of those 6 digits. So code points 0000– FFFF are in the BMP (Basic Multilingual Plane), etc. People nickname planes above the 00 plane, BMP, astral planes.

> Plane 1 ... was seldom needed. Few people had fonts for those, anyway.

So "rarely", that the majority of humanity writes in languages that aren't in BMP. CJK (Chinese, Japanese, Korean and sometimes Vietnamese) languages need characters outside of BMP.

> Emoji, though, were put in Unicode Plane 1. That's where mass demand for the astral planes came from. More four byte UTF-8 characters started showing up in data.

That's totally untrue. The demand always existed by the majority of humanity, you just never cared about this massive problem until it impacted you.

Dylan16807 · 4 years ago
> That's totally untrue. The demand always existed by the majority of humanity, you just never cared about this massive problem until it impacted you.

I think you swung so far in the other direction you also landed in untrue territory. Originally the intent was to encode all characters in "modern use" and 16 bits was probably enough to do that for the entire world.

Also CJK is about one quarter of the world population.

loulouxiv · 4 years ago
UTF-16 can represent the full range of Unicode codepoints by using couples of surrogates
chrismorgan · 4 years ago
Technical amendment: UTF-16 can represent the full range of Unicode scalar values with surrogate pairs. Code points includes the surrogates U+D800–U+DFFF, scalar values don’t. Like all other Unicode encodings, UTF-16 cannot represent surrogates.

That’s where the real problem lies: almost nothing that uses UTF-16 actually uses UTF-16, but rather potentially ill-formed UTF-16.
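The surrogate-pair mapping described above is a simple bit split; here is a sketch of the standard UTF-16 algorithm in Python:

```python
def surrogate_pair(cp):
    """Split a supplementary-plane scalar value into UTF-16 surrogates."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

print([hex(u) for u in surrogate_pair(0x1F4A9)])  # -> ['0xd83d', '0xdca9']
```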

mort96 · 4 years ago
You're right. Replace UTF-16 with UCS-2 and the comment sounds at least slightly more correct.
Animats · 4 years ago
Sort of. Applications using UTF-16 have to be aware of pairs at the application level. Many are not.