Understanding UUIDs, ULIDs and string representations

I really liked this article but I feel it misses one somewhat important point about using incremental numbers: They are trivially guessable and one needs to be very cautious when exposing them to the outside world.

If you encounter a URL like https://fancy.page/users/15, chances are that the 15 is a numeric ID and that 1 to 14 also exist. The lower numbers tend to be admin accounts, as they are usually created first. An attacker might use this to extract data or to gain access to something internal. One could argue that using UUIDs only hides a security hole in this case, but that's better than nothing, I guess.

traceroute66 · 4 years ago

Beyond the obvious and important security implications of incremental numbers, there is one other major problem with them.

They make life hell for database clustering, merges and migrations.

In addition, on a more minor level, in a client-centric (apps, browser JS etc.) world, the use of incremental numbers is an unnecessary pain point. If you use UUIDs, the client can generate its own without needing a call back to the API (unless necessary in context, obviously).

Frankly, IMHO in the 21st century, the use of incremental numbers for IDs in databases thoroughly deserves to be consigned to the history books. The desperate clutching at straws arguments that went before (storage space, indexing etc.) are no longer applicable in the modern database and modern computing environment.

jandrewrogers · 4 years ago

This greatly overstates the benefits of UUIDs, and ignores the myriad ways in which they demonstrably have poor properties in real systems. I've worked on several large systems designed around UUIDs at scale that we had to later modify to not use UUIDs due to their manifest issues in practice. And then there is the reality that v3/4/5 UUIDs are expressly prohibited in some application contexts due to their defects and weaknesses.

Also, sequence generators are a non-problem in competent architectures, since you can trivially permute the number such that it is both collision-free and non-reversible (or only reversible with a key).
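
The comment does not say which permutation it has in mind. One possible way to get a keyed, collision-free mapping is a small Feistel network over the 64-bit sequence value, sketched below. The function names, the round count and the use of HMAC-SHA256 as the round function are all assumptions made for illustration; this is not a vetted cipher.

```python
import hashlib
import hmac


def _round(key: bytes, round_no: int, half: int) -> int:
    """Keyed round function: 32 bits taken from HMAC-SHA256 (illustrative choice)."""
    msg = bytes([round_no]) + half.to_bytes(4, "big")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest()[:4], "big")


def permute64(n: int, key: bytes, rounds: int = 4) -> int:
    """Map a 64-bit sequence number to a 64-bit ID. A Feistel network is a bijection."""
    left, right = n >> 32, n & 0xFFFFFFFF
    for r in range(rounds):
        left, right = right, left ^ _round(key, r, right)
    return (left << 32) | right


def unpermute64(m: int, key: bytes, rounds: int = 4) -> int:
    """Invert permute64. Only possible for someone who holds the key."""
    left, right = m >> 32, m & 0xFFFFFFFF
    for r in reversed(range(rounds)):
        left, right = right ^ _round(key, r, left), left
    return (left << 32) | right
```

Because the mapping is a bijection on 64-bit values, two different sequence numbers can never produce the same external ID, and the database can keep using the plain sequence internally.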

It is still common to use structured 128-bit keys in very large scale databases, and it is good practice in some cases, but these are not UUIDs because the standardized versions all have problems that can't be ignored. This leads to the case where there are myriad non-standard UUID-like identifiers (in that they are 128-bit) but not interoperable.

hderms · 4 years ago

Is it really true that concerns around UUIDs as primary keys are wholly irrelevant? Maybe I'm working off outdated information but in high scale environments there are a lot of downsides primarily related to the random write patterns into B-trees causing page splitting and things like that.

sudhirj · 4 years ago

There have been / still are tons of attacks where you can see other people's data by just incrementing and decrementing the ID in the URL.

Will see if I can add a section about security implications. There's a similar time-based argument to be made for ULIDs as well: in some cases you don't want to inadvertently expose a timestamp.

pqdbr · 4 years ago

I once discovered by accident that a big hospital in my big city used incremental IDs for loading exam results (one of my exams wasn't loading while the others were, so I just opened dev tools and 2 minutes later I noticed I could access 500_000 exams of random people just by changing something like /exam/ID).

UUIDs could have prevented the leak even if they still managed to completely disregard any authentication logic on the backend.

drewcoo · 4 years ago

This is the problem with incremental numbers:

https://en.wikipedia.org/wiki/German_tank_problem

9dev · 4 years ago

More trivially, it also gives insights into your business if I can determine the upper bound of your resources by trial-and-error guessing. If the highest user ID is 94, that may be a (hopefully unwarranted!) red flag to potential customers or investors.

justinholmes · 4 years ago

Using https://hashids.org is sensible for mapping to external ID.

vyrotek · 4 years ago

I recently learned about this. We're thinking of using them with our project really soon. Are there any gotchas to be aware of with it?

jrochkind1 · 4 years ago

The OP proposes using `ULID`s, which are the same number of bytes as UUIDs, but have an initial timestamp component (ms since epoch), plus a subsequent random component. These are sequential (not exactly "incremental"), so given two of them you can tell which came first, but they aren't really "guessable": you'd need to guess not only an exact timestamp (not infeasible, if more of a challenge than with incremental integers) but also a large subsequent random component (infeasible).
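
To make that layout concrete, here is a minimal sketch of ULID encoding: a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford base32 characters. The function names are made up for this example; real ULID libraries may differ in details.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U); its characters are in ASCII order,
# so comparing encoded strings compares the underlying numbers.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms: int, randomness: int) -> str:
    """Pack a 48-bit timestamp and 80 random bits into a 26-character string."""
    value = (timestamp_ms << 80) | randomness
    chars = []
    for _ in range(26):  # 26 * 5 = 130 bits, enough for the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def new_ulid() -> str:
    """Generate a ULID from the current time and 10 bytes of OS randomness."""
    now_ms = time.time_ns() // 1_000_000
    return encode_ulid(now_ms, int.from_bytes(os.urandom(10), "big"))
```

The timestamp occupies the leading characters, which is why two ULIDs from different milliseconds sort in creation order even though the tail is random.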

Apparently there are some proposals to make official UUID variants with this sort of composition too, which some threads in this discussion go into more detail on.

Sjoerd · 4 years ago

They aren't guessable, except for ULIDs generated by the same process in the same millisecond. To keep chronological order even within the same timestamp, ULIDs generated within the same millisecond become incremental. This can become relevant, for example, when an attacker requests a password reset for himself and the victim simultaneously.
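
A minimal sketch of that monotonic behaviour, assuming the generator simply adds one to the previous random component whenever the millisecond has not changed. The class and method names are hypothetical.

```python
import os


class MonotonicUlidSource:
    """Yields (timestamp_ms, randomness) pairs that keep increasing within one millisecond."""

    def __init__(self) -> None:
        self.last_ms = -1
        self.last_rand = 0

    def generate(self, now_ms: int) -> tuple[int, int]:
        if now_ms == self.last_ms:
            # Same millisecond: increment instead of drawing fresh randomness.
            # This is the predictable case described above.
            if self.last_rand == 2**80 - 1:
                raise OverflowError("random component exhausted for this millisecond")
            self.last_rand += 1
        else:
            self.last_ms = now_ms
            self.last_rand = int.from_bytes(os.urandom(10), "big")
        return self.last_ms, self.last_rand
```

Anyone who receives one ID from such a generator can guess the IDs handed out just before or after it in the same millisecond, so these values should not double as secrets such as reset tokens.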

jandrewrogers · 4 years ago

This is often dealt with trivially using collision-free hashing. It exports a number in the same domain as the sequence (i.e. 64-bit sequence -> 64-bit id) but is not reversible and is guaranteed to be unique.

noodlesUK · 4 years ago

This is true with identifiers which are already random. But unless you’re doing something like keyed hashing, a naive implementation of, say, SHA256(predictable_id) isn’t going to solve this problem against a determined attacker. I’d like to learn a bit more about what you’re discussing here.
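
The difference can be sketched in a few lines. An unkeyed hash of a small predictable ID can be reversed by simply hashing every candidate, while a keyed hash (HMAC here, as one common option) cannot be reproduced without the secret. All names below are illustrative.

```python
import hashlib
import hmac


def naive_token(user_id: int) -> str:
    """Unkeyed hash of a predictable ID: anyone can recompute it."""
    return hashlib.sha256(str(user_id).encode()).hexdigest()


def keyed_token(user_id: int, key: bytes) -> str:
    """Keyed hash: recomputing it requires the secret key."""
    return hmac.new(key, str(user_id).encode(), hashlib.sha256).hexdigest()


def recover_id(token: str, limit: int = 100_000):
    """Attacker's view: try every small ID and compare unkeyed hashes."""
    for candidate in range(limit):
        if naive_token(candidate) == token:
            return candidate
    return None
```

Here `recover_id(naive_token(15))` finds the original ID almost instantly, whereas the same search against a keyed token comes back empty.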

leetrout · 4 years ago

I was not aware of this!

Do you have a favorite link for more information on this?

The whole section on serial number IDs is a bit FUDy in my opinion, especially this:

> If you suddenly have a million people who want to buy things on your store, you can't ask them to wait because your sequence generator can't number their order line items fast enough. And because a sequence must store each number to disk before giving it out, your entire system is bottle-necked by the speed of rewriting a number on one SSD or hard disk — no matter how many servers you have.

There’s maybe a handful of apps in the world that see so much traffic that this would be a problem. Unless you expect to reach Amazon-scale anytime soon, or need distributed ID generation (like generating them in mobile apps or SPAs), just starting with a simple BIGSERIAL (or rather, BIGINT GENERATED BY DEFAULT AS IDENTITY as the state of the art is) will be good enough to get started.

You can always add complexity to your app later. Taking it away once added is much more difficult.

halpert · 4 years ago

UUIDs aren’t particularly complex. In a lot of ways, they’re more convenient than using an auto-incrementing identifier. For instance, if you’re using DynamoDB, getting an incrementing integer is way more complicated than a random UUID.

mikl · 4 years ago

Well, they do have opportunities for screw-ups. We recently had some bad data loss issues because part of an application was using a buggy UUID generator that produced lots of collisions. If you google for UUID collision bugs, it’s not an uncommon occurrence.

Besides, UUIDs have fragmentation issues. I’d use ULID if I had need to generate IDs in a distributed fashion.

sudhirj · 4 years ago

Think you’re right about the FUD part, will tone it down. Still think numbers are a horrible idea, but this shouldn’t be a consideration for most people.

jandrewrogers · 4 years ago

Yes, this was hyperbole. Durably generating tens of millions per second is entirely within the realm of possibility. However, events generated by intentional human action, like buying something or sending a text message, historically max out at hundreds of thousands per second.

AtlasBarfed · 4 years ago

Sequence numbers don't scale to distributed databases and distributed data creation though.

"Handful" is wrong. Any major system will start to run into this as soon as you start saying the word "scale" in design meetings.

UUIDs can also provide room for encoding other information, like the type of the object, where it was created, etc, since MAC address is often integrated in the UUID.

mikl · 4 years ago

“Handful” is referring to cases where “your sequence generator can't number their order line items fast enough” is true. And I stand by that.

And if you want “other information”, good database design would put that other information in a column of its own.

Distributed databases have their place. But the tradeoffs they bring are often not worth it for your 1.0/MVP app.

elevader · 4 years ago

jdknezek · 4 years ago

There's also a proposal for UUIDv6-8, lexicographically sortable variants.

https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...

stevesimmons · 4 years ago

To summarise the differences:

* UUIDv6 - sortable, with a layout matching UUIDv1 for backward compatibility, except the time chunks have been reordered so the uuid sorts chronologically

* UUIDv7 - sortable, based on nanoseconds since the Unix epoch. Simpler layout than UUIDv6 and more flexibility about the number of bits allocated to the time part versus sequence and randomness. The nice aspect here is the uuids sort chronologically even when created by systems using different numbers of time bits.

* UUIDv8 - more flexibility for layout. Should only be used if UUIDv6/7 aren't suitable. Which of course makes them specific to that one application which knows how to encode/decode them.

UUIDv7 is thus the better choice in general.
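
For a feel of what a time-ordered UUID looks like in code, here is a minimal sketch. Note the assumption: it uses the 48-bit millisecond timestamp layout that was later published in RFC 9562, which is simpler than the nanosecond-based draft summarised above. The function name is made up.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None) -> uuid.UUID:
    """Build a version-7 style UUID: 48-bit ms timestamp, then version, variant and random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 of them used
    rand_a = rand >> 68                           # 12 bits placed after the version
    rand_b = rand & ((1 << 62) - 1)               # 62 bits placed after the variant

    value = (unix_ms & ((1 << 48) - 1)) << 80     # timestamp in the top 48 bits
    value |= 0x7 << 76                            # version nibble = 7
    value |= rand_a << 64
    value |= 0b10 << 62                           # RFC 4122 / RFC 9562 variant bits
    value |= rand_b
    return uuid.UUID(int=value)
```

Since the timestamp sits in the most significant bits, the usual hex string form sorts chronologically across different milliseconds.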

(I recently wrote Python and C# implementations - https://github.com/stevesimmons/uuid7 and https://github.com/stevesimmons/uuid7-csharp)

sbuttgereit · 4 years ago

I see this pop up from time to time and it looks interesting. Does anyone know if there's actual progress on this getting adopted? I don't have any background on how to evaluate such a draft or how seriously to take it. Is this draft under serious debate by those who could choose to adopt it, or was it just written by someone with high hopes of throwing a draft out there and getting some attention for their idea?

Brad Peabody did the original -00 draft, which was discussed as an FYI at an IETF meeting in March 2020. See [1], around 50 lines from the bottom.

Kyzer Davis has since submitted two further revisions -01 and -02 in April and October 2021. See history in [2].

The current -02 draft is due to expire in April 2022. Presumably Kyzer Davis will try to get it discussed before then.

The GitHub repo tracking these drafts is https://github.com/uuid6/uuid6-ietf-draft/.

[1] https://datatracker.ietf.org/meeting/107/materials/minutes-1...

[2] https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-...

skunkworker · 4 years ago

I'm liking ULIDs more and more recently. As a UUIDv4 is random, insert performance is going to be subpar compared to bigserial, but a ULID, which includes the time, gives you slightly quicker inserts. It also allows for some tiered storage architectures, where if you know the ULID, you know where to look (approximately).

hn_go_brrrrr · 4 years ago

Depends on your storage system. For the one I work with most, a common prefix on your primary keys will hurt performance because it causes hotspotting. A UUID primary key would be the best case because it optimally shards writes. A ULID would be the worst case -- I would need to store it with the bits reversed.
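
A small sketch of that bit-reversal trick, assuming the ID is handled as a 128-bit integer. The function name is hypothetical, and whether reversal helps at all depends on how a given store shards its keys.

```python
def reverse_bits128(value: int) -> int:
    """Reverse the bit order of a 128-bit integer so the fast-changing low bits lead."""
    return int(format(value, "0128b")[::-1], 2)
```

After reversal, two IDs that share a timestamp prefix differ in their leading bits instead of their trailing ones, which spreads consecutive writes across key ranges. Applying the function twice gives back the original value.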

And conversely, if storage is chunked (e.g. parquet files on S3), having time-ordered uuids may turn it into essentially an append-only log-structured store.

Here hotspotting is the aim, since it lets you efficiently prune query plans from index scans to direct reads of the right chunk.

globular-toast · 4 years ago

Why not UUIDv1?

tinus_hn · 4 years ago

This source provides some insight on the ‘risks’ of collisions:

http://www.h2database.com/html/advanced.html#uuid

If you generate 70 trillion UUIDs, the odds of two of these being duplicates are approximately the same as the chance of one person being hit by a meteorite this year.

To put it bluntly, adding structures like time, MAC addresses or domain IDs to UUIDs to avoid collisions is really not useful, and considering the downside that it leaks that information, it is a bad idea.
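
The collision figure can be checked with the usual birthday approximation, p ≈ 1 - exp(-n² / (2 · 2^bits)). The sketch below assumes 122 random bits, as in a version 4 UUID; the function name is made up.

```python
import math


def collision_probability(n: int, random_bits: int = 122) -> float:
    """Approximate chance that at least two of n random IDs collide (birthday bound)."""
    exponent = (n * n) / (2 * 2**random_bits)
    return -math.expm1(-exponent)  # 1 - exp(-x), accurate for very small x


p = collision_probability(70 * 10**12)  # 70 trillion UUIDs
```

For 70 trillion IDs this comes out at a few times 10^-10, so with a sound random source collisions are not a practical concern.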

dataflow · 4 years ago

For anyone else wondering: ULID = Universally Unique Lexicographically Sortable Identifier

prepend · 4 years ago

Thanks. I read the article to find out what this meant. ULID was mentioned 16 times but never spelled out.

UUID wasn’t either but I at least knew that.

fenollp · 4 years ago

We use K-Sortable Globally Unique IDs: https://github.com/segmentio/ksuid

Some differences:

128 bit strongly generated payload (instead of 80 bits for ULIDs).

Only 32 bit time precision but that's wall clock time anyway.

Base62 encoded.

Author here, self-posted. AMA.

piaste · 4 years ago

I think Base85 [0] warrants a mention for those needing to minimize the string representation length of their UUIDs.

A year ago or so we had to store a reference to one of our entities into a legacy third-party system, which used char(20) as the column size and of course couldn't be changed.

Since Base85 encodes a UUID as exactly 20 ASCII characters, it saved me from having to add an extra indirection. (Also, and to be honest mainly, from giving ammo to our CEO who had never liked UUIDs)

Of the various 85-character encodings, I thought Z85 [1] was the best one. It's not URL-safe, but it's safe for copy-pasting into queries, source code, XML, JSON, CSV, etc.

[0] https://en.wikipedia.org/wiki/Ascii85

[1] https://rfc.zeromq.org/spec/32/
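
The 20-character claim is easy to check: 16 bytes split into four 4-byte groups, each encoded as 5 characters. The sketch below uses Python's `base64.b85encode`, which is an assumption of convenience; it uses a different 85-character alphabet from Z85, so its output is not interchangeable with Z85 strings. The helper names are made up.

```python
import base64
import uuid


def uuid_to_base85(u: uuid.UUID) -> str:
    """Encode the 16 raw UUID bytes as 20 base85 characters."""
    return base64.b85encode(u.bytes).decode("ascii")


def base85_to_uuid(s: str) -> uuid.UUID:
    """Decode a 20-character base85 string back into a UUID."""
    return uuid.UUID(bytes=base64.b85decode(s))
```

The result fits a char(20) column exactly, compared with 36 characters for the hyphenated hex form.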

Zamicol · 4 years ago

Thank you! I was familiar with Ascii85 but not Z85. I just added it to the "Other projects" section of my base converter. https://convert.zamicol.com/

Nice, thanks, haven’t seen this before. Can’t say I like the symbols in there, but will add it as a reference.

Should probably add base58 as well, Bitcoin uses it.

wch · 4 years ago

I just want to say that I really enjoyed reading this article. It's among the clearest, most accessible writing about a technical subject that I've encountered in a while.

dlbucci · 4 years ago

Great article! I hadn't heard of this, but it sounds like something I might suddenly be using in the near future! Also, I loved this line: "Given the way we're going, humanity in its present form isn't likely to exist then, so when this becomes an issue it'll be somebody else's problem. More likely something else's problem."

hl4hknz · 4 years ago

Great write up!

Do you know if lex62 has any performance disadvantage versus bases that are in 2^n? (32, 64 etc?)

I always assumed the conversion to 2^n bases could be done more efficiently.

Hmm. I haven’t actually gone low level enough to answer that, but I suppose it’s possible. It doesn’t seem likely to be significant in a web application server scenario, though. The random jitter on the LAN line to the DB is probably going to be more significant.
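
For what it's worth, the mechanical difference between the two kinds of base can be sketched as follows. A power-of-two base reads fixed groups of bits with shifts and masks, while base 62 needs repeated division of the whole number. The alphabets and function names are assumptions for illustration, and nothing here measures actual speed.

```python
B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
B62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def to_base32(value: int, length: int = 26) -> str:
    """Each output character is just a 5-bit slice of the input."""
    return "".join(B32[(value >> (5 * i)) & 0x1F] for i in reversed(range(length)))


def to_base62(value: int, length: int = 22) -> str:
    """Each output character needs a division of the remaining big number."""
    digits = []
    for _ in range(length):
        value, rem = divmod(value, 62)
        digits.append(B62[rem])
    return "".join(reversed(digits))
```

Both alphabets are in ASCII order and both outputs are fixed width, so string comparison still matches numeric order in either encoding.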
