Permanent identifiers should not carry data. This is the cardinal sin of data management. You always run into situations where you thought, "surely this never changes, so it's safe to squeeze into the ID to save a lookup", and then it changes anyway. Then people suddenly find out they have a new gender identity, and the final digit of their ID numbers needs to change too.
Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right? Well, wrong, since although the date doesn't change, your knowledge of it might. Immigrants who didn't know their exact date of birth got assigned 1 Jan by default... And then people with actual birthdays on 1 Jan got told, "sorry, you can't have that as birth date, we've run out of numbers in that series!"
Librarians in the analog age can be forgiven for cramming data into their identifiers, to save a lookup. When the lookup is in a physical card catalog, that's somewhat understandable (although you bet they could run into trouble over it too). But when you have a powerful database at your fingertips, use it! Don't make decisions you will regret just to shave off a couple of milliseconds!
> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right? Well, wrong, since although the date doesn't change, your knowledge of it might. Immigrants who didn't know their exact date of birth got assigned 1 Jan by default... And then people with actual birthdays on 1 Jan got told, "sorry, you can't have that as birth date, we've run out of numbers in that series!"
To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.
But in any case, the intention for encoding a timestamp into a UUID isn't any implied meaning. It's to guarantee uniqueness, with the side effect that IDs are more or less monotonically increasing. Whether this is actually desirable depends on your application, but if the application is as an indexed key for insertion into a database, it's usually more useful for performance than a fully random ID, as it avoids rewriting lots of leaf nodes of B-trees. If you insert a load of such keys, they form a cluster on one side of the tree, which can then rebalance with only the top levels needing to be rewritten.
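To make that concrete, here's a minimal Python sketch of the UUIDv7 bit layout from RFC 9562 (48-bit Unix millisecond timestamp, then version and variant bits, then randomness). The function name is mine, not from any particular library:

```python
import os
import time
import uuid

def uuid7_like() -> uuid.UUID:
    # 48-bit Unix timestamp in milliseconds, placed in the top bits.
    ts_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (ts_ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    return uuid.UUID(int=value)

a, b = uuid7_like(), uuid7_like()
# A later key sorts after an earlier one (down to ms granularity), so B-tree
# inserts land on the right-hand edge instead of splitting random leaf pages.
assert a < b or a.bytes[:6] == b.bytes[:6]
```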
>To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.
You still have that problem from organic birthdays and also the problem of needing to change ids to correct birth dates.
> To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.
well, till you run out of numbers for the immigrants that don't have an exact birth date
And then have to enter/handle a non-date through all systems? How do you know if this non-dated person is over the age of majority? Eligible for a pension?
Maybe the answer is to evenly spread the defaults over 365 days.
YY.MM.DD-AAA.BB
In either the AAA or BB component there is something about the gender.
But it does mean that there is a limit of people born per day of a certain gender.
But for a given year, using a moniker will only delay the inevitable. Sure, there are more numbers, but still limited as there are SOME parts that need to reflect reality. Year, gender (if that's still the case?) etc.
Fantastic real life example. Italian PNs also carry the gender, which is something you can change surgically, and you'll eventually run into the issue when operating at scale.
I don't agree with the absolute statement, though. Permanent identifiers should not generally carry data. But there are situations where you want to have a way to reconcile, or you have space or speed constraints, so you may accept the trade-off, md5 your data and store it in a primary index as a UUID. Your index will fragment and thus you will vacuum, but life will still be good overall.
I've worked on a system where ULIDs (not UUIDv7, but similar) were used with a cursor to fetch data in chronological order and then—surprise!—one day records had to be backdated, meaning that either the IDs for those records had to be counterfeited (potentially violating invariants elsewhere) or the fetching had to be made smarter.
You can choose to never make use of that property. But it's tempting.
> You're not going to try and extract a timestamp from a uuid.
I totally used uuidv7s as "inserted at" in a small project and I had methods to find records created between two timestamps that literally converted timestamps to uuidv7 values so I could do "WHERE id BETWEEN a AND b"
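For illustration, a sketch of that trick. `uuid7_bound` and the table name are hypothetical; the idea is just to zero out or saturate all non-timestamp bits of a v7 value:

```python
import uuid
from datetime import datetime, timezone

def uuid7_bound(ts: datetime, low: bool) -> uuid.UUID:
    ts_ms = int(ts.timestamp() * 1000)
    rest = 0 if low else (1 << 80) - 1  # all 80 non-timestamp bits
    value = (ts_ms << 80) | rest
    # Pin version (7) and variant (0b10) bits so the two bounds bracket
    # exactly the v7 values carrying this timestamp.
    value = (value & ~(0xF << 76)) | (0x7 << 76)
    value = (value & ~(0x3 << 62)) | (0b10 << 62)
    return uuid.UUID(int=value)

a = uuid7_bound(datetime(2024, 1, 1, tzinfo=timezone.utc), low=True)
b = uuid7_bound(datetime(2024, 2, 1, tzinfo=timezone.utc), low=False)
query = "SELECT * FROM records WHERE id BETWEEN %s AND %s"  # params: (a, b)
```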
> You're not going to try and extract a timestamp from a uuid.
What? The first 48 bits of a UUIDv7 are a UNIX timestamp in milliseconds.
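Which makes extraction a one-liner in most languages. In Python, for instance (the example value is made up):

```python
import uuid
from datetime import datetime, timezone

u = uuid.UUID("01920f7a-3c4e-7abc-9def-0123456789ab")  # made-up v7 value
ms = u.int >> 80                                       # top 48 bits: Unix ms
print(datetime.fromtimestamp(ms / 1000, tz=timezone.utc))
```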
Whether or not this is a meaningful problem or a benefit to any particular use of UUIDs requires thinking about it; in some cases it’s not to be taken lightly and in others it doesn’t matter at all.
I see what you’re getting at, that ignoring the timestamp aspect makes them “just better UUIDs,” but this ignores security implications and the temptation to partition by high bits (timestamp).
The curious thing about the article is that it's definitely premature optimization for smaller databases, but when the database gets to the scale where these optimizations start to matter, you actually don't want to do what they suggest.
Specifically, if your database is small, the performance impact is probably not very noticeable. And if your database is large (eg. to the extent primary keys can't fit within 32-bit int), then you're actually going to have to think about sharding and making the system more distributed... and that's where UUID works better than auto-incrementing ints.
I agree there's a scale below which this (or any) optimization matters and a scale above which you want your primary key to have locality (in terms of which shard/tablet/... is responsible for the record). But...
* I think there is a wide range in the middle where your database can fit on one machine if you do it well, but it's worth optimizing to use a cheaper machine and/or extend the time until you need to switch to a distributed db. You might hit this middle range soon enough (and/or it might be a painful enough transition) that it's worth thinking about it ahead of time.
* If/when you do switch to a distributed database, you don't always need to rekey everything:
** You can spread existing keys across shards via hashing on lookup or reversing bits. Some databases (e.g. DynamoDB) actually force this.
** Allocating new ids in the old way could be a big problem, but there are ways out. You might be able to switch allocation schemes entirely without clients noticing if your external keys are sufficiently opaque. If you went with UUIDv7 (which addresses some but not all of the article's points), you can just keep using it. If you want to keep using dense(-ish), (mostly-)sequential bigints, you can amortize the latency by reserving blocks at a time.
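A sketch of the block-reservation idea from the last point, assuming some central sequence you can bump by n in a single round trip; all names here are mine:

```python
import threading

class BlockAllocator:
    """Hand out ids locally from a reserved range; one central call per block."""

    def __init__(self, reserve_block, block_size: int = 1000):
        self._reserve = reserve_block  # central call: reserve n ids, return start
        self._size = block_size
        self._ids = iter(())
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            for i in self._ids:
                return i
            start = self._reserve(self._size)  # the only central round trip
            self._ids = iter(range(start + 1, start + self._size))
            return start

_next = 1
def reserve_block(n: int) -> int:
    # Stand-in for e.g. "UPDATE seq SET next = next + %s RETURNING next - %s".
    global _next
    start = _next
    _next += n
    return start

alloc = BlockAllocator(reserve_block)
assert [alloc.next_id() for _ in range(3)] == [1, 2, 3]  # one central call so far
```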
This is actually a very deep and interesting topic.
Stripping information from an identifier disconnects a piece of data from the real world which means we no longer can match them. But such connection is the sole purpose of keeping the data in the first place. So, what happens next is that the real world tries to adjust and the "data-less" identifier becomes a real world artifact. The situation becomes the same but worse (eg. you don't exist if you don't remember your social security id). In extreme cases people are tattooed with their numbers.
The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
> Stripping information from an identifier disconnects a piece of data from the real world which means we no longer can match them. But such connection is the sole purpose of keeping the data in the first place.
The identifier is still connected to the user's data, just through the appropriate other fields in the table as opposed to embedded into the identifier itself.
> So, what happens next is that the real world tries to adjust and the "data-less" identifier becomes a real world artifact. The situation becomes the same but worse (eg. you don't exist if you don't remember your social security id). In extreme cases people are tattooed with their numbers.
Using a random UUID as primary key does not mean users have to memorize that UUID. In fact in most cases I don't think there's much reason for it to even be exposed to the user at all.
You can still look up their data from their current email or phone number, for instance. Indexes are not limited to the primary key.
> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
A fully random primary key takes into account that things change - since it's not embedding any real-world information. That said I also don't think there's much issue with embedding creation time in the UUID for performance reasons, as the article is suggesting.
> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.
E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.
It's much cleaner and easier to adapt if each person gets an internal context-less identifier and you use their phone number to convert from their external ID/phone number to an internal ID. The old account still has an identifier, there's just no external identifier that translates to it. Likewise if you have to change your identifier scheme, you can have multiple external IDs that translate to the same internal ID (i.e. you can resolve both their old ID and their new ID to the same internal ID without insanity in the schema).
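A minimal sketch of that indirection with sqlite3; the schema and names are made up for illustration:

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE person (internal_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE external_id (
    kind        TEXT NOT NULL,   -- 'phone', 'legacy_id', ...
    value       TEXT NOT NULL,
    internal_id TEXT NOT NULL REFERENCES person(internal_id),
    PRIMARY KEY (kind, value)    -- each external ID resolves to one person
);
""")
alice, bob = str(uuid.uuid4()), str(uuid.uuid4())
db.execute("INSERT INTO person VALUES (?, 'Alice')", (alice,))
db.execute("INSERT INTO person VALUES (?, 'Bob')", (bob,))
db.execute("INSERT INTO external_id VALUES ('phone', '+4712345678', ?)", (alice,))

# The number changes hands: only the mapping row moves; everything keyed by
# Alice's internal ID is untouched, and she keeps an identifier.
db.execute("UPDATE external_id SET internal_id = ? "
           "WHERE kind = 'phone' AND value = '+4712345678'", (bob,))
```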
> Stripping information from an identifier disconnects a piece of data from the real world which means we no longer can match them. But such connection is the sole purpose of keeping the data in the first place.
The surrogate key's purpose isn't to directly store the natural key's information, rather, it's to provide an index to it.
> The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.
There isn't 'another' - there's just one. The surrogate key. The other pieces of information you're describing are not the means of indexing the data. They are the pieces of data you wish to retrieve.
An identifier is just a piece of common token the system can use to operate on the same entity.
You need it, because it's maybe the one lone unchangeable thing. Taking a person for example:
* date of birth can be changed, if there was error and correction in documents
* any and nearly all existing physical characteristics can change over time, either due to brain things (deciding to change gender), aging, or accidents (fingerprints no longer apply if you burn your skin enough)
* DNA might be good enough, but that's one fucking long identifier to share and one hard to validate in the field.
So a unique ID attached to a few other parts to identify the current iteration of an individual is the best we have, and the best we will get.
You can't take into account the fact that things change when you don't know what those changes might be. You might end up needing to either rebuild a new database, have some painful migration, or support two codepaths to work with both types of keys.
Your comment is sufficiently generic that it’s impossible to tell what specific part of the article you’re agreeing with, disagreeing with, or expanding upon.
I don't think the timestamped UUIDs are "carrying data", it is just a heuristic to improve lookup performance. If the timestamp is wrong, it will just run as slow as the non-timestamped UUID.
If you take the gender example, for 99% of people, it is male/female and it won't change, and you can use that for load balancing. But if later, you found out that the gender is not the one you expect for that bucket, no big deal, it will cause a branch misprediction, but instead of happening 50% of the times when you use a random value, it will only happen 1% of the times, significant speedup with no loss in functionality.
As soon as you encode imperfect data in an immutable key, you always have to check when you retrieve it. If that piece of data isn't absolutely 100% guaranteed to be perfect, then you have to query both halves of the load balanced DB anyway.
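A toy sketch of that hint-with-fallback pattern (all names hypothetical): route to the bucket the encoded attribute suggests, but be prepared to check the rest, since the encoded value may be stale:

```python
def find_user(user_id, hint, buckets):
    hit = buckets[hint].get(user_id)
    if hit is not None:
        return hit                     # fast path, the ~99% case
    for name, bucket in buckets.items():
        if name != hint:
            hit = bucket.get(user_id)  # slow path: the hint was wrong/stale
            if hit is not None:
                return hit
    return None

buckets = {"m": {"u1": {"name": "Ann"}}, "f": {}}
print(find_user("u1", "f", buckets))   # a stale hint still resolves
```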
More broadly, this is the ages old surrogate vs natural key discussion, but yes the comment completely misses the point of the article. I can only assume they didn't read it in full!
Perhaps you can clarify something for me, because I think I'm missing it.
> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits
So presumably the format is DDMMYYXXXXX (for some arbitrary number of X's), where the X's represent e.g. an automatically incrementing number of some kind?
Which means that if it's DDMMYYXXX then you can only have 1000 people born on DDMMYY, and if it's DDMMYYXXXXX then you can have 100,000 people born on DDMMYY.
So in order for there to be so many such entries in common that people are denied use of their actual birthday, then one of the following must be true:
1. The XXX counter must be extremely small, in order for it to run out as a result of people 'using up' those Jan 1 dates each year
2. The number of people born on Jan 1 or immigrating to Norway without knowledge of their birthday must be colossal
If it was just DDMMXXXXX (no year) then I can see how this system would fall apart rapidly, but when you're dealing with specifically "people born on Jan 1 2014 or who immigrated to Norway and didn't know their birthday and were born on/around 2014 so that was the year chosen" I'm not sure how that becomes a sufficiently large number to cause these issues. Perhaps this only occurs in specific years where huge numbers of poorly-documented refugees are accepted?
(Happy to be educated, as I must be missing something here)
The cause is more just "not having enough bits". UUID is 128 bit. You're not running out even if you use part for timestamp, the random part will be big enough.
Like, it's a valid complaint... just not for the discussion at hand.
Also, we do live in reality, and while having an entirely random one might be perfect in data-modelling theory, in reality having it prefixed by a date has many advantages performance-wise.
> Permanent identifiers should not carry data. This is like the cardinal sin of data management
As long as you don't use the data and have actual fields for what's also encoded in UUID, there is absolutely nothing wrong with it, provided there is enough of the random part to get around artifacts in real life data.
Like the other poster said, this is a problem with default values not encoding the birthday into the personnummer.
I think it also is important to remember the purpose of specific numbers. For instance I would argue a PN without the birthday would be strictly worse. With the current system (I only know the Swedish one, but assume it's the same) I only have to remember a 4-digit number (because the number is birth date + a unique 4 digits). If we would instead use completely random numbers I would have to remember at least an 8-digit number (and likely, to be future proof, you'd want at least 9 digits). Sure, that's fine for myself (although I suspect some people already struggle with it), but then I also have to remember the numbers for my 2 kids and my partner and things quickly become annoying. Especially because one doesn't use the numbers often enough that it becomes easy, but still often enough that it becomes annoying to look up, especially when one doesn't always carry their phone with them.
It sounds to me like you’re just arguing for premature optimization of another kind (specifically, prematurely changing your entire architecture for edge cases that probably won’t ever happen to you).
If you have an architecture already, obviously it's hard to change and you may want to postpone it until those edge cases which probably won't ever happen to you, happen. But for new architectures, value your own grey hairs over small performance improvements.
> Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right?
I guess that Norway has solved it in the same or similar way as Sweden? So a person is identified by the PNR, and those systems that need to track a person over several PNRs (government agencies) use PRI. And a PRI is just the first PNR assigned to a person with a 1 inserted in the middle. If that PRI is occupied, use a 2, and so on.
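If I understand the scheme right, something like this sketch; I'm assuming the inserted digit goes right after the six birth-date digits, which may not match the real rule:

```python
def pri_for(first_pnr: str, taken: set) -> str:
    for digit in "123456789":
        # date digits, inserted digit, then the original serial part
        candidate = first_pnr[:6] + digit + first_pnr[6:]
        if candidate not in taken:
            return candidate
    raise ValueError("no free PRI for this PNR")

print(pri_for("7001011234", set()))  # -> 70010111234
```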
Did you read the article? He doesn’t recommend natural keys, he recommends integer-based surrogates.
> A prime example of premature optimization.
Disagree. Data is sticky, and PKs especially so. Moreover, if you’re going to spend time optimizing anything early on, it should be your data model.
> Don't make decisions you will regret just to shave off a couple of milliseconds!
A bad PK in some databases (InnoDB engine, SQL Server if clustered) can cause query times to go from sub-msec to tens of msec quite easily, especially with cloud solutions where storage isn’t node-local. I don’t just mean a UUID; a BIGINT PK on a 1:M can destroy your latency for the simple reason of needing to fetch a separate page for every record. If instead the PK is a composite of (<linked_id>, id) - e.g. (user_id, id) - where id is a monotonic integer, you’ll have WAY better data locality.
Postgres suffers a different but similar problem with its visibility map lookups.
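A toy way to see the locality argument above without a database: a clustered index stores rows in PK sort order, so just sorting candidate keys shows which rows end up adjacent:

```python
import uuid

rows = [(user_id, seq) for user_id in (1, 2, 3) for seq in range(4)]

by_composite = sorted(rows)  # (user_id, id) order: one user's rows contiguous
by_uuid4 = sorted((uuid.uuid4(), user_id) for user_id, _ in rows)

print([u for u, _ in by_composite])  # 1,1,1,1,2,2,2,2,3,3,3,3
print([u for _, u in by_uuid4])      # users interleaved across the keyspace
```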
I read it (and regret it; it was a waste of my time). Their arguments are:
* integer keys are faster;
* uuidv7 keys are faster;
* if you want obfuscated keys, use integers and do some obfuscation of your own (!!!).
I can get on board with uuidv7 (with the trade-off, of course, of stronger guessability). The integer keys argument is strange. At that point, you need to come up with a custom-built system to avoid id collisions in a distributed system, only to achieve a 2x saving (the absolute minimum you should do is 64-bit keys). Very puzzling suggestion, and to me very wrong.
Note that in this entire article, the recommendation is not about using natural keys (email address, some composite of user identification etc.), so I am skipping that whole discussion.
> Did you read the article? He doesn’t recommend natural keys, he recommends integer-based surrogates.
I am not a cryptographer, but I would want his recommendation reviewed by a cryptographer. And then I would have to implement it. UUIDs have been extensively reviewed by cryptographers, I have a variety of excellent implementations I can use, I know they solve the problem well. I know they can cause performance issues; they're a security feature that is easy to implement, and I can deal with the performance issues if and when they crop up. (Which, in my experience, it's unusual. Even at a large company, most databases I encounter do not have enough data. I will err on the side of security until it becomes a problem, which is a good problem to have.)
I think you're attacking a straw man. The article doesn't say "instead of UUIDv4 primary keys, use keys such as birthdays with exposed semantic meaning". On the contrary, they have a section about how to use sequence numbers internally but obfuscated keys externally. (Although I agree with dfox's and formerly_proven's comments [1, 2] that the XOR method they proposed for this is terrible. Reuse of a one-time pad is probably the most basic textbook example of bad cryptography. They referred to the values as "obfuscated" so they probably know this. They should have just gone with a better method instead.)
[1] https://news.ycombinator.com/item?id=46272985
[2] https://news.ycombinator.com/item?id=46273325
Insert order or time is information. And if you depend on that information you are going to be really disappointed when back dated records have to be inserted.
I don't think the objection is that it exposes semantic meaning, but that any meaningful information is contained within the key at all, eg. even a UUID that includes timestamp information about when it was generated is "bad" in a sense, as it leaks information. Unique identifiers should be opaque and inherently meaningless.
Same with Austrian social security numbers, which, in some cases, don't contain the person's birth date and in some cases don't contain any existing date at all.
Yet many websites enforce a valid date and pull the person's birthdate from it...
This is incredibly database-specific. In Postgres random PKs are bad. But in distributed databases like Cockroach, Google Cloud Datastore, and Spanner it is the opposite - monotonic PKs are bad. You want to distribute load across the keyspace so you avoid hot shards.
In Google Cloud Bigtable we had the issue that our domain's primary key was a sequential integer autogenerated by another app. So we just reversed it, and it distributed automatically quite nicely.
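E.g. something like this, with the width and decimal format assumed for illustration:

```python
def distributed_key(n: int, width: int = 10) -> str:
    # Reverse the zero-padded decimal digits: consecutive ids now differ in
    # their leading characters, so they scatter instead of forming a hot shard.
    return str(n).zfill(width)[::-1]

assert distributed_key(12345) == "5432100000"
assert distributed_key(12346) == "6432100000"
```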
Even in a distributed database you want increasing (even if not monotonic) keys since the underlying b-tree or whatever will very likely behave badly for entirely random data.
UUIDv7 is very useful for these scenarios since:
A: A hash or modulus of the key will be practically random due to the lower bits being random or pseudo-random (i.e. it distributes well between nodes)
B: the first bits are sortable, so the underlying storage on each node won't go bananas.
It is, although you can have sharded PostgreSQL, in which case I agree with your assessment that you want random PKs to distribute them.
It's workload-specific, too. If you want to list ranges of them by PK, then of course random isn't going to work. But then you've got competing tensions: listing a range wants the things you list to be on the same shard, but focusing a workload on one shard undermines horizontal scale. So you've got to decide what you care about (or do something more elaborate).
It's also application specific. If you have workload that's write heavy, has temporal skew and is highly concurrent, but rarely creates new records, you're probably better off with a random PK, even in PG.
I wouldn't say it is incredibly database specific, it is more database type specific. For most general, non-sharded, databases, random key values can be a problem as they lead to excess fragmentation in b-trees and similar structures.
As long as the key has sufficient entropy (i.e. not monotonic sequential ints), that ensures the keyspace is evenly distributed, correct? So UUID>=v4, ULID, KSUID, possibly snowflake, should be fine for the sake of even distribution of the hashes.
I think they address this in the article when they say that this advice is specific to monolithic applications, but I may be misremembering (I skimmed).
100%. You can use rendezvous hashing to determine the shard(s). The hash of a sequence should be randomly distributed as changing the LSB should propagate to 50% change in the output bits.
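A minimal rendezvous-hashing sketch, with blake2b standing in for whatever hash you'd actually pick:

```python
import hashlib

def shard_for(key: str, shards: list) -> str:
    # Highest-random-weight hashing: score every (shard, key) pair and pick
    # the winner. Adding/removing a shard only moves keys whose winner changed.
    def score(shard: str) -> int:
        digest = hashlib.blake2b(f"{shard}/{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(shards, key=score)

print(shard_for("user:1234", ["shard-a", "shard-b", "shard-c"]))
```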
The article sums up some valid arguments against UUIDv4 as PKs but the solution the author provides on how to obfuscate integers is probably not something I'd use in production. UUIDv7 still seems like a reasonable compromise for small-to-medium databases.
I tend to avoid UUIDv7 and use UUIDv4 because I don't want to leak the creation times of everything.
Now this doesn't work if you actually have enough data that the randomness of the UUIDv4 keys is a practical database performance issue, but I think you really have to think long and hard about every single use of identifiers in your application before concluding that v7 is the solution. Maybe v7 works well for some things (e.g identifiers for resources where creation times are visible to all with access to the resource) but not others (such as users or orgs which are publicly visible but without publicly visible creation times).
I'm also not a huge fan of leaking server-side information; I suspect UUIDv7 could still be used in statistical analysis of the keyspace (in a similar fashion to the German tank problem for integer IDs). Also, leaking data about user activity times (from your other comment) is a *really* good point that I hadn't considered.
I've read people suggest using a UUIDv7 as the primary key and a UUIDv4 as a user-visible one as a remedy.
My first thought when reading the suggestion was, "well but you'll still need an index on the v4 IDs, so what does this actually get you?" But the answer is that it makes joins less expensive; you only require the index once, when constructing the query from the user-supplied data, and everything else operates with the better-for-performance v7 IDs.
To be clear, in a practical sense, this is a bit of a micro-optimization; as far as I understand it, this really only helps you by improving the data locality of temporally-related items. So, for example, if you had an "order items" table, containing rows of a bunch of items in an order, it would speed up retrieval times because you wouldn't need to do as many index traversals to access all of the items in a particular order. But on, say, a users table (where you're unlikely to be querying for two different users who happen to have been created at approximately the same time), it's not going to help you much. Of course the exact same critique is applicable to integer IDs in those situations.
Although, come to think of it, another advantage of a user-visible v4 with a v7 PK is that you could use a different index type on the v4 ID. Specifically, I would think that a hash index for the user-visible v4 might be a halfway-decent way to go.
I'm still not sure either way if I like the idea, but it's certainly not the craziest thing I've ever heard.
It's not leaking that's the concern. It's that not having the names of objects be easily enumerable is a strongly security-enhancing feature of a system.
Yes of course everyone should check and unit test that every object is owned by the user or account loading it, but demanding more sophistication from an attacker than taking "/my_things/23" and loading "/my_things/24" is a big win.
If all you want is to obfuscate the fact that your social media site only has 200 users and 80 posts, simply use a permutation over the autoincrement primary key. E.g. IDEA or CAST-128, then encode in base64. If someone steps on your toes because somewhere in your codebase you're using a forbidden legacy cipher, just use AES-128. (This is sort of the degenerate/tautological base case of format-preserving encryption)
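Roughly like this sketch, using the pyca/cryptography package. AES-128 on a single block is a keyed permutation of 128-bit values, so distinct PKs map to distinct opaque tokens and back; key management is elided:

```python
import base64
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = os.urandom(16)  # in practice a fixed application secret

def obfuscate(pk: int) -> str:
    enc = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
    block = enc.update(pk.to_bytes(16, "big")) + enc.finalize()
    return base64.urlsafe_b64encode(block).rstrip(b"=").decode()

def deobfuscate(token: str) -> int:
    dec = Cipher(algorithms.AES(KEY), modes.ECB()).decryptor()
    raw = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))
    return int.from_bytes(dec.update(raw) + dec.finalize(), "big")

token = obfuscate(42)  # 22 URL-safe characters
assert deobfuscate(token) == 42
```

(ECB is fine here only because the plaintext is exactly one unique block; don't generalize this to multi-block data.)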
I always thought they are used and stored as they are because the kind of transformation you mention seems terribly expensive given the YT's scale, and I don't see a clear benefit of adding any kind of obfuscation here.
I didn't see my primary use case for UUID's covered: sharing identifiers across entities is dangerous.
I wrote a CRUD app for document storage. It had user id's and document id's. I wrote a method GetDocumentForUser(docID, userID) that checked permissions for that user and document and returned the document if permitted. I then, stupidly, called that method with GetDocumentForUser(userID, docID), and it took me a good half hour to work out why this never returned anything.
It never returned anything because a valid userID will never be a valid docID. If I had used integers it would have returned documents, and I probably wouldn't have spotted it while testing, and I would have shipped a change that cheerfully handed people other people's documents.
I will put up with a fairly considerable amount of performance hit to avoid having this footgun lurking. And yes, I know there are other ways around this (e.g. types) but those come with their own trade-offs too.
I work on an application where we encrypt the integer primary key and then use the bytes to generate something that looks like a UUID.
In our case, we don't want database IDs in an API and in URLs. When IDs are sequential, it enables things like dictionary attacks and provides estimates about how many customers we have.
Encrypting a database ID makes it very obvious when someone is trying to scan, because the UUID won't decrypt. We don't even need a database round trip.
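I don't know their actual implementation, but the "won't decrypt" check presumably relies on embedded redundancy, since a raw block decryption as such never fails. A sketch of one way to do it, with the magic prefix and all names being my own invention:

```python
import os
import uuid
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY = os.urandom(16)  # fixed application secret in practice
MAGIC = b"\x00" * 8   # redundancy that must survive the round trip

def to_public_id(pk: int) -> uuid.UUID:
    # Encrypt (MAGIC || 64-bit pk) into one AES block, presented in UUID form.
    enc = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
    block = enc.update(MAGIC + pk.to_bytes(8, "big")) + enc.finalize()
    return uuid.UUID(bytes=block)

def to_db_id(public: uuid.UUID):
    dec = Cipher(algorithms.AES(KEY), modes.ECB()).decryptor()
    plain = dec.update(public.bytes) + dec.finalize()
    if plain[:8] != MAGIC:
        return None   # scan attempt: reject without a database round trip
    return int.from_bytes(plain[8:], "big")

assert to_db_id(to_public_id(123)) == 123
assert to_db_id(uuid.uuid4()) is None  # fails MAGIC except with ~2**-64 chance
```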
* How do you manage the key for encrypting IDs? Injected to app environment via envvar? Just embedded in source code? I ask this because I'm curious as to how much "care" I should be putting into managing the secret material if I were to adopt this scheme.
* Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the plain AES suffice? I assume that the size of IDs would never exceed the block size of AES, but again, I'm not a cryptographer so not sure if it's safe to do so.
The same way we manage all other secrets in the application. (Summarized below)
> Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the plain AES suffice? I assume that the size of IDs would never exceed the block size of AES, but again, I'm not a cryptographer so not sure if it's safe to do so.
I don't have the source handy at the moment. It's one of the easier to use symmetric algorithms available in .Net. We aren't talking military-grade security here. In general: a 32-bit int encrypts to 64-bits, so we pad it with a few unicode characters so it's 64-bits encrypted to 128 bits.
---
As far as managing secrets in the application: We have a homegrown configuration file generator that's adapted to our needs. It generates both the configuration files, and strongly-typed classes to read the files. All configuration values are loaded at startup, so we don't have to worry about runtime errors from missing configuration values.
Secrets (connection strings, encryption keys, etc.) are encrypted in the configuration file as base64 strings. The certificate to read/write secrets is stored in Azure Keyvault.
The startup logic in all applications is something like:
1: Determine the environment (production, qa, dev)
2: Get the appropriate certificate
3: Read the configuration files, including decrypting secrets (such as the primary key encryption keys) from the configuration files
4: Populate the strongly-typed objects that hold the configuration values
5: These objects are dependency-injected to runtime objects
Key management seems to be as important as backups. I understand that something so small (an encryption key) could seem unimportant because database backups are so big, lol, but they really do share important attributes (do not lose your keys, do not lose your data, do not expose your keys, do not expose your data, etc. etc.)
Counterargument... I do technical diligence so I talk to a lot of companies at points of inflection, and I also talk to lots who are stuck.
The ability to rapidly shard everything can be extremely valuable. The difference between "we can shard on a dime" and "sharding will take a bunch of careful work" can be expensive. If the company has poor margins, this can be the difference between "can scale easily" and "we're not getting this investment".
I would argue that if your folks have the technical chops to be able to shard while avoiding surrogate guaranteed-unique keys, great. But if they don't... a UUID on every table can be a massive get-out-of-jail-free card, and for many companies this is much, much more important than some minor space and time optimizations in the DB.
I can see how sharding could be difficult with a bigint FK, but UUIDv7 would still play nice, if I understand your point correctly. The point of the article is that monotonically increasing foreign keys have performance benefits over random UUIDv4 FKs in PostgreSQL.
Sort of related, but we had to shard as usage grew and didn’t have uuids and it was annoying. Wasn’t the most annoying bit though. Whole thing is pretty complex regardless of uuid, if you have a highly interconnected data model that needs to stay online while migrating.
Right, but if you start off with uuids and the expectation that you might use them to shard, you'll wind up factoring that into the data model. Retrofitting, as you rightly say, can be much harder.
> For many business apps, they will never reach 2 billion unique values per table, so this will be adequate for their entire life. I’ve also recommended always using bigint/int8 in other contexts.
I'm sure every DBA has a war story that starts with a similar decision in the past.
> Random values don’t have natural sorting like integers or lexicographic (dictionary) sorting like character strings. UUID v4s do have "byte ordering," but this has no useful meaning for how they’re accessed.
Might the author mean that random values are not sequential, so ordering them is inefficient? Of course random values can be ordered - and ordering by what he calls "byte ordering" is exactly how all integer ordering is done. And naive string ordering too, like we would do in the days before Unicode.
Using an UUIDv4 as primary key is a trade-off: you use it when you need to generate unique keys in a distributed manner. Yes, these are not datetime ordered and yes, they take 128 bits of space. If you can't live with this, then sure, you need to consider alternatives. I wonder if "Avoid UUIDv4 Primary Keys" is a rule of thumb though.
> you use it when you need to generate unique keys in a distributed manner
Just to complement this with a point: there isn't any mainstream database management system out there that is distributed in the sense that it requires UUIDs to generate its internal keys.
There exist some you can find on the internet, and some institutions have internal systems that behave this way. But as a near-universal rule, the thing people know as a "database" isn't distributed in this sense, and if the column creation is done inside the database, you don't need them.
I do not understand why 128 bits is considered too big - you clearly can't have less, as on 64 bits the collision probability on real world workloads is just too high, for all but the smallest databases.
Auto-incrementing keys can work, but what happens when you run out of integers? Also, distributed dbs probably make this hard, and they can't generate a key on client.
There must be something in Postgres that wants to store the records in PK order; while that could be an okay default, I'm pretty sure you can change this behavior, as it isn't great for write-heavy workloads.
Hi there. Thanks for the feedback. I updated that section to hopefully convey the intent more. The type of ordering we care about for this topic is really B-Tree index traversal when inserting new entries and finding existing entries (single and multiple values, i.e. an IN clause, updates, deletes, etc.).
There's a compelling example I re-created from Cybertec showing the pages needed and accessed for equivalent user-facing results, comparing storing PKs as big integers vs. UUID v4s, and how many more pages were needed for v4 UUIDs.
I found that to be helpful to support my real world experience as a consultant on various "medium sized" Postgres databases (e.g. single to 10s of millions of records) where clients were experiencing excessive latency for queries, and the UUID v4 PK/FK selection made for reasons earlier was one of the main culprits. The indexes wouldn't fit into memory, resulting in a lot of sequential scans. I'd confirm this by showing an alternative schema design and set of queries where everything was the same except integer PKs/FKs were used. Smaller indexes (fit in memory), reliable index scans, less latency, faster execution time.
Isn't part of this that inserting into a btree index is more performant when the keys are increasing rather than being random? A random id will cause more re-balancing operations than always inserting at the end. Increasing ids are also more cache friendly
The point is how closely located data you access often is. If data is roughly sorted by creation time then data you access close to one another in time is stored close to one another on disk. And typically access to data is correlated with creation time. Not for all tables but for many.
Accessing data in totally random locations can be a performance issue.
Depends on lots of things ofc but this is the concern when people talk about UUID for primary keys being an issue.
Values of the same type can be sorted if an order is defined on the type.
It's also strange to contrast "random values" with "integers". You can generate random integers, and they have a "sorting" (depending on what that means though)
Why would you need to order by UUID? I am missing something here. Most of the time we use UUID keys for being able to create a new key without coordination and most of the time we do not want to order by primary key.
Most common database indexes are ordered, so if you are using UUIDv4 you will not only bloat the index you will also have poor locality. If you try to use composite keys to fix locality, you'll end up with an even more bloated index.
I have seen a lot of people sort by (generated) integer values to return the rows "in creation order" assuming that sorting by an integer is somehow magically faster than sorting by a proper timestamp value (which give a more robust "creation order" sorting than a generated integer value).
Any fixed sized bitstring has an obvious natural ordering, but since they're allocated randomly they lack the density and locality of sequential allocation.
> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits.
You can already feel the disaster rising because some program will always expect the latter.
> Maybe the answer is to evenly spread the defaults over 365 days.
And it doesn't fix the problem, it just makes it less likely.
> Don't make decisions you will regret just to shave off a couple of milliseconds!
Random vs time biased uuids are not a decision to shave off ms that you will regret.
Most likely they will be a decision that shaves off seconds (yes, really - especially when you consider locality effects) and you'll regret nothing.
> You're not going to try and extract a timestamp from a uuid.
Hyrum's Law suggests that someone will.
So, a random library: https://pkg.go.dev/github.com/google/uuid#UUID.Time
> Time returns the time in 100s of nanoseconds since 15 Oct 1582 encoded in uuid. The time is only defined for version 1, 2, 6 and 7 UUIDs.
> Your comment is sufficiently generic that it’s impossible to tell what specific part of the article you’re agreeing with, disagreeing with, or expanding upon.
I think IDs should not carry information. Yes, that also means I think UUIDv7 was wrong to squeeze a creation date into their ID.
Isn't that clear enough?
> If you take the gender example, for 99% of people, it is male/female and it won't change, and you can use that for load balancing.
As long as you're not in China or India around specific years ...
GP's point stands strong.
> And a PRI is just the first PNR assigned to a person with a 1 inserted in the middle. If that PRI is occupied, use a 2, and so on.
PRI could of course have been a UUID instead.
> I think IDs should not carry information.
Do you have the same criticism for serial identifiers? How about hashes? What about the version field in UUIDs?
> Same with Austrian social security numbers, which, in some cases, don't contain the person's birth date and in some cases don't contain any existing date at all.
Someone should have told Julius Caesar and Gregory XIII that :-p
> UUIDv7 is very useful for these scenarios
Why? This is just another case of keys containing information and is not smart.
The obvious solution is to have a field that drives distribution, allowing rebalancing or whatever.
As a consumer of these databases we're stuck with them as designed, which means we have to worry about key distribution.
Now this doesn't work if you actually have enough data that the randomness of the UUIDv4 keys is a practical database performance issue, but I think you really have to think long and hard about every single use of identifiers in your application before concluding that v7 is the solution. Maybe v7 works well for some things (e.g identifiers for resources where creation times are visible to all with access to the resource) but not others (such as users or orgs which are publicly visible but without publicly visible creation times).
I've read people suggest using a UUIDv7 as the primary key and a UUIDv4 as a user-visible one as a remedy.
My first thought when reading the suggestion was, "well, but you'll still need an index on the v4 IDs, so what does this actually get you?" But the answer is that it makes joins less expensive; you only hit that index once, when translating the user-supplied ID, and everything else operates on the better-for-performance v7 IDs.
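A rough sketch of that shape, using sqlite3 purely for illustration (all table and column names made up):

    import sqlite3, uuid

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users (
            id        BLOB PRIMARY KEY,  -- UUIDv7: internal, join-friendly
            public_id BLOB UNIQUE,       -- UUIDv4: the only ID users see
            name      TEXT
        );
        CREATE TABLE orders (
            id      BLOB PRIMARY KEY,            -- UUIDv7
            user_id BLOB REFERENCES users(id)    -- joins run on the v7 key
        );
    """)

    def load_orders(public_id: uuid.UUID):
        # One index lookup translates the user-supplied v4 ID...
        row = db.execute("SELECT id FROM users WHERE public_id = ?",
                         (public_id.bytes,)).fetchone()
        if row is None:
            return []
        # ...and everything after that operates on the internal v7 key.
        return db.execute("SELECT id FROM orders WHERE user_id = ?",
                          (row[0],)).fetchall()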
To be clear, in a practical sense, this is a bit of a micro-optimization; as far as I understand it, this really only helps you by improving the data locality of temporally-related items. So, for example, if you had an "order items" table, containing rows of a bunch of items in an order, it would speed up retrieval times because you wouldn't need to do as many index traversals to access all of the items in a particular order. But on, say, a users table (where you're unlikely to be querying for two different users who happen to have been created at approximately the same time), it's not going to help you much. Of course the exact same critique is applicable to integer IDs in those situations.
Although, come to think of it, another advantage of a user-visible v4 with a v7 PK is that you could use a different index type on the v4 ID. Specifically, I would think a hash index on the user-visible v4 might be a halfway-decent way to go.
I'm still not sure either way if I like the idea, but it's certainly not the craziest thing I've ever heard.
See perhaps "UUIDv47 — UUIDv7-in / UUIDv4-out (SipHash‑masked timestamp)":
* https://github.com/stateless-me/uuidv47
* Sept 2025: https://news.ycombinator.com/item?id=45275973
Yes of course everyone should check and unit test that every object is owned by the user or account loading it, but demanding more sophistication from an attacker than taking "/my_things/23" and loading "/my_things/24" is a big win.
(What do you think YouTube video IDs are?)
I shared this article a few weeks ago, discussing the problems with this kind of approach: https://notnotp.com/notes/do-not-encrypt-ids/
I believe it can make sense in some situations, but do you really want to implement such crypto-related complexity?
I actually have no idea. What are they?
(Also, what is the format of their `si=...` thing?)
I wrote a CRUD app for document storage. It had user IDs and document IDs. I wrote a method GetDocumentForUser(docID, userID) that checked permissions for that user and document and returned the document if permitted. I then, stupidly, called that method with GetDocumentForUser(userID, docID), and it took me a good half hour to work out why it never returned anything.
It never returned anything because a valid userID will never be a valid docID. If I had used integers, it would have returned documents, I probably wouldn't have spotted it while testing, and I would have shipped a change that cheerfully handed people other people's documents.
I will put up with a fairly considerable performance hit to avoid having this footgun lurking. And yes, I know there are other ways around this (e.g. types), but those come with their own trade-offs too.
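For what it's worth, here's what the "types" route can look like in Python (names hypothetical); a checker like mypy or pyright turns the swapped call into a static error instead of a silent data leak:

    from typing import NewType

    UserID = NewType("UserID", int)
    DocID = NewType("DocID", int)

    def get_document_for_user(doc_id: DocID, user_id: UserID):
        ...  # permission check + fetch

    # get_document_for_user(UserID(7), DocID(23))  # rejected by the checker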
In our case, we don't want database IDs in the API or in URLs. When IDs are sequential, that enables things like enumeration attacks and gives away estimates of how many customers we have.
Encrypting a database ID makes it very obvious when someone is trying to scan, because the UUID won't decrypt. We don't even need a database round trip.
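To make the idea concrete, here's a rough single-block sketch assuming the third-party `cryptography` package; the padding marker and key handling are illustrative assumptions, and this is exactly the sort of construction you'd want a cryptographer to review before trusting it:

    import os, uuid
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    KEY = os.urandom(16)  # in practice, loaded from secret storage

    def encode_id(n: int) -> uuid.UUID:
        block = n.to_bytes(8, "big") + b"\x00" * 7 + b"\x01"  # pad + marker
        enc = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
        return uuid.UUID(bytes=enc.update(block) + enc.finalize())

    def decode_id(token: uuid.UUID):
        dec = Cipher(algorithms.AES(KEY), modes.ECB()).decryptor()
        block = dec.update(token.bytes) + dec.finalize()
        if block[8:] != b"\x00" * 7 + b"\x01":
            return None  # doesn't decrypt cleanly: likely a scan
        return int.from_bytes(block[:8], "big")

A random 128-bit token only hits the marker by accident with probability around 2^-64, so scans are cheap to reject without touching the database.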
* How do you manage the key for encrypting IDs? Injected into the app environment via an envvar? Just embedded in source code? I ask because I'm curious how much "care" I should put into managing the secret material if I were to adopt this scheme.
* Is the ID encrypted using an AEAD scheme (e.g. AES-GCM)? Or does plain AES suffice? I assume that the size of the IDs would never exceed the block size of AES, but again, I'm not a cryptographer, so I'm not sure if it's safe to do so.
The same way we manage all other secrets in the application. (Summarized below)
> Is the ID encrypted using an AEAD scheme (e.g. AES-GCM)? Or does plain AES suffice? I assume that the size of the IDs would never exceed the block size of AES, but again, I'm not a cryptographer, so I'm not sure if it's safe to do so.
I don't have the source handy at the moment. It's one of the easier-to-use symmetric algorithms available in .NET. We aren't talking military-grade security here. In general: a 32-bit int encrypts to 64 bits, so we pad it with a few Unicode characters so that the 64 bits encrypt to 128 bits.
---
As far as managing secrets in the application: we have a homegrown configuration-file generator adapted to our needs. It generates both the configuration files and strongly-typed classes to read them. All configuration values are loaded at startup, so we don't have to worry about runtime errors from missing configuration values.
Secrets (connection strings, encryption keys, etc.) are encrypted in the configuration file as base64 strings. The certificate to read/write secrets is stored in Azure Key Vault.
The startup logic in all applications is something like:
1: Determine the environment (production, qa, dev)
2: Get the appropriate certificate
3: Read the configuration files, including decrypting secrets (such as the primary key encryption keys) from the configuration files
4: Populate the strongly-typed objects that hold the configuration values
5: These objects are dependency-injected into runtime objects
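In rough, hypothetical Python rather than .NET (every name here is made up, and the certificate-based decryption is stubbed so the sketch actually runs):

    import base64, os
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class AppConfig:                        # the strongly-typed object (step 4)
        connection_string: str
        id_encryption_key: bytes

    def decrypt_secret(value_b64: str, cert: str) -> bytes:
        return base64.b64decode(value_b64)  # stand-in for cert-based decryption

    def load_config(raw: dict) -> AppConfig:
        env = os.environ.get("APP_ENV", "dev")   # step 1: environment
        cert = f"cert-for-{env}"                 # step 2: fetch certificate
        return AppConfig(                        # steps 3-4: decrypt, populate
            connection_string=raw["db"],
            id_encryption_key=decrypt_secret(raw["id_key"], cert),
        )

    # step 5: the AppConfig instance is handed to the DI container
    cfg = load_config({"db": "Server=...;",
                       "id_key": base64.b64encode(b"k" * 16).decode()})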
We're not worried about key compromises.
If the key is lost, we have much bigger problems.
The ability to rapidly shard everything can be extremely valuable. The difference between "we can shard on a dime" and "sharding will take a bunch of careful work" can be expensive. If the company has poor margins, this can be the difference between "can scale easily" and "we're not getting this investment".
I would argue that if your folks have the technical chops to shard while avoiding surrogate guaranteed-unique keys, great. But if they don't... a UUID on every table can be a massive get-out-of-jail-free card, and for many companies this is much, much more important than some minor space and time optimizations on the DB.
Worth thinking about.
I'm sure every DBA has a war story that starts with a similar decision made in the past.
But the author does not say timestamp ordering, he says ordering. I think he actually means, and believes, that there is some problem with ordering UUIDv4s.
Just to complement this with a point: there isn't any mainstream database management system out there that is distributed in the sense that it requires UUIDs to generate its internal keys.
There exist some you can find on the internet, and some institutions have internal systems that behave this way. But as a near-universal rule, the thing people know as a "database" isn't distributed in this sense, and if key creation is done inside the database, you don't need UUIDs.
Auto-incrementing keys can work, but what happens when you run out of integers? Also, distributed DBs probably make this hard, and they can't generate a key on the client.
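One common workaround, short of jumping to 128-bit UUIDs, is a Snowflake-style 64-bit key that each client can mint without per-insert coordination. A rough sketch with illustrative field widths and epoch (41-bit millisecond timestamp, 10-bit node ID, 12-bit sequence; not any specific product's layout, and clock-skew handling omitted):

    import threading, time

    EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch

    class SnowflakeGen:
        def __init__(self, node_id: int):
            assert 0 <= node_id < 1024    # 10-bit node ID
            self.node_id = node_id
            self.last_ms = -1
            self.seq = 0
            self.lock = threading.Lock()

        def next_id(self) -> int:
            with self.lock:
                now = time.time_ns() // 1_000_000 - EPOCH_MS
                if now == self.last_ms:
                    self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence
                    if self.seq == 0:                  # overflow: wait a tick
                        while now <= self.last_ms:
                            now = time.time_ns() // 1_000_000 - EPOCH_MS
                else:
                    self.seq = 0
                self.last_ms = now
                return (now << 22) | (self.node_id << 12) | self.seq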
There must be something in Postgres that wants to store the records in PK order, which, while it could be an okay default, I'm pretty sure you can change, as it isn't great for write-heavy workloads.
Accessing data in totally random locations can be a performance issue.
It depends on lots of things, of course, but this is the concern when people talk about UUIDs as primary keys being an issue.
Values of the same type can be sorted if an order is defined on the type.
It's also strange to contrast "random values" with "integers". You can generate random integers, and they have an ordering (depending on what that means, though).