The best advice I can give you is to use bigserial for B-tree friendly primary keys and consider a string-encoded UUID as one of your external record locator options. Consider other simple options like PNR-style (airline booking) locators first, especially if nontechnical users will quote them. It may even be OK if they’re reused every few years. Do not mix PK types within the schema for a service or application, especially a line-of-business application. Use UUIDv7 only as an identifier for data that is inherently timecoded, otherwise it leaks information (even if timeshifted). Do not use hashids - they have no cryptographic qualities and are less friendly to everyday humans than the integers they represent; you may as well just use the sequence ID. As for the encoding, do not use base64 or other hyphenated alphabets, nor any identifier scheme that can produce a leading ‘0’ (zero) or ‘+’ (plus) when encoded (for the day your stuff is pasted via Excel).
Generally, the principles of separation of concerns and mechanical sympathy should be top of mind when designing a lasting and purposeful database schema.
Finally, since folks often say “I like stripe’s typed random IDs” in these kinds of threads: Stripe are lying when they say their IDs are random. They have some random parts but when analyzed in sets, a large chunk of the binary layout is clearly metadata, including embedded timestamps, shard and reference keys, and versioning, in varying combinations depending on the service. I estimate they typically have 48-64 bits of randomness. That’s still plenty for most systems; you can do the same. Personally I am very fond of base58-encoded AES-encrypted bigserial+HMAC locators with a leading type prefix and a trailing metadata digit, and you can in a pinch even do this inside the database with plv8.
I don't understand the recommendation of using bigserial with a uuid column when you can use UUIDv7. I get that it made sense years ago when there was no UUIDv7, but why people keep recommending it over UUIDv7 now beats me.
As UUIDv7 IDs hold time information, they can help bad actors with timing attacks or pattern recognition, because they contain a timestamp linked to the record.
You can infer the time that elapsed between two UUIDv7 IDs.
They can only be used if they're not shown to the user (so not in the form mysite.com/mypage?id=0190854d-7f9f-78fc-b9bc-598867ebf39a).
A bigserial starting at a high number doesn't give away that time information.
Just out of curiosity, why string-encoded UUIDs? A native/binary UUID column type should not take more than 16 bytes, whereas its hexadecimal string representation requires 36 bytes (if we assume 1 byte per character). That would result in an index size more than twice as big as required.
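For reference, the difference is easy to check; a quick sketch (byte counts from a stock Postgres, per value, excluding row overhead):

  SELECT pg_column_size('0190854d-7f9f-78fc-b9bc-598867ebf39a'::uuid) AS uuid_bytes,  -- 16
         pg_column_size('0190854d-7f9f-78fc-b9bc-598867ebf39a'::text) AS text_bytes;  -- 37 (36 chars + varlena header)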
IMO using bigserial by default is wrong. Use whatever data type is appropriate. Not every table will grow to 4 billion rows, and not every table will grow to even 60k rows. The ID data type leaks into every foreign key referencing the table, and many foreign keys will usually be indexed, so this further degrades performance. There are multiple data types for a reason.
Using 32 bit ints for IDs is insane in today’s world. If an attacker can control record generation, e.g. creating a record via API, then they can easily exhaust your ID space. A lot of kernel vulnerabilities stem from using incrementing 32 bit integers as an identifier. If you’re considering using 32 bits for an ID, don’t do it!
An int PK does not give you 4 billion rows. For example, in Postgres every time an insert transaction has to rollback, the id's that would have been used by that insert are discarded and gone forever. Likewise people often don't realize that Postgres' very convenient upsert syntax (insert on conflict do update) will consume id's from the sequence _every time it runs_. If you do an upsert with 100 rows, it will peel off 100 new id's just in case they're needed... if they aren't used, they're discarded. This can chew through 4 billion id's really, really fast.
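A minimal sketch of that behaviour (hypothetical table; the id default is evaluated for every proposed row, conflict or not):

  CREATE TABLE t (id serial PRIMARY KEY, k text UNIQUE, v int);
  INSERT INTO t (k, v) VALUES ('a', 1), ('b', 2)
      ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;   -- consumes ids 1 and 2
  INSERT INTO t (k, v) VALUES ('a', 9), ('b', 9)
      ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;   -- no new rows, but ids 3 and 4 are gone
  INSERT INTO t (k, v) VALUES ('c', 3)
      ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;
  SELECT id FROM t WHERE k = 'c';                     -- returns 5, not 3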
Personally I would only use an int PK in a table where you know the bounds of the data with very high confidence. Like, at my last job the software involved working with our client's physical store locations. If we managed to capture the entire North American market for the type of business that we worked with, we'd be looking at about 50k stores.
Defaulting to 64-bit integers internally is to me a matter of mechanical sympathy, it has little to do with row capacity. It’s just a word size that current CPUs and memory architectures like working with.
There is unlikely to be significant performance degradation for int vs bigint, but it will be a huge PITA if, 10 years later, with tons of legacy code written, that table grows past 4B rows...
This is simply bad advice. There are very, very few cases where you shouldn’t use a bigserial/bigint as a primary key. Perhaps the only case is where it is impossible for the rows to grow beyond a specific bound: an “AWS region” table or a “countries” table. These are almost always lookup tables of some kind, and the domain makes it impossible for it to grow past a certain bound.
Outside of that, put absolutely no thought into it and pick a bigint/bigserial.
Trying to be clever about this ignores the fact that tables and their usage change, and what you believe now will not be correct later.
And when you realise you’ve been too clever, it’s 3am on a Sunday, your production database has exploded and nothing works. You now need to rewrite 30 tables and their foreign keys ASAP with upper management breathing down your neck.
I agree and apologise for writing bigserial out of ossified habit. Identity columns hew to the SQL standard, bigserial is a pg-ism. I actually do still use it when the ORM’s DDL generator prefers as much (looking at you Rails), there’s no gain from fighting the middleware.
For other readers: the two are almost the same in behaviour, they differ primarily in that identity columns don’t need a USAGE grant on their associated sequence, and the ALWAYS makes it harder (but despite the term not actually impossible) to override the default value on row insert.
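For readers who haven't used identity columns, a minimal sketch of both points (the ALWAYS behaviour and its escape hatch):

  CREATE TABLE account (
      id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      name text NOT NULL
  );
  INSERT INTO account (id, name) VALUES (42, 'x');                          -- rejected
  INSERT INTO account (id, name) OVERRIDING SYSTEM VALUE VALUES (42, 'x');  -- the override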
Postgres' UUID datatype: https://www.postgresql.org/docs/current/datatype-uuid.html#D...
django.db.models.fields.UUIDField: https://docs.djangoproject.com/en/5.0/ref/models/fields/#uui... :
> class UUIDField: A field for storing universally unique identifiers. Uses Python’s UUID class. When used on PostgreSQL and MariaDB 10.7+, this stores in a uuid datatype, otherwise in a char(32)
> [...] Lookups on PostgreSQL and MariaDB 10.7+: Using iexact, contains, icontains, startswith, istartswith, endswith, or iendswith lookups on PostgreSQL don’t work for values without hyphens, because PostgreSQL and MariaDB 10.7+ store them in a hyphenated uuid datatype type.
From the sqlalchemy.types.Uuid docs: https://docs.sqlalchemy.org/en/20/core/type_basics.html#sqla... :
> Represent a database agnostic UUID datatype.
> For backends that have no “native” UUID datatype, the value will make use of CHAR(32) and store the UUID as a 32-character alphanumeric hex string.
> For backends which are known to support UUID directly or a similar uuid-storing datatype such as SQL Server’s UNIQUEIDENTIFIER, a “native” mode enabled by default allows these types will be used on those backends.
> In its default mode of use, the Uuid datatype expects Python uuid objects, from the Python uuid module
From the docs for the uuid Python module: https://docs.python.org/3/library/uuid.html :
> class uuid.SafeUUID: Added in version 3.7.
> safe: The UUID was generated by the platform in a multiprocessing-safe way
And there's not yet a uuid.uuid7() in the uuid Python module.
UUIDv7 leaks timing information (https://news.ycombinator.com/item?id=40886496), which is ironic, because UUIDs are usually used to avoid the "guess an autoincrement integer key" issue.
Naive question. Above comment suggests using bigserial as internal identifier and uuid as public facing ID.
Now let's say there's a user table and a post table. Both will have only the uuid available in the APIs. So every time the API requests a post of the user, or the user of a post, we will find the relevant row using the uuid, right? Since the uuid will be sent by the public-facing APIs?
How would bigserial be used here?
I don't know much about databases in general, BTW.
Edit: Can I use UUIDv4 as PK, since it doesn't leak timestamp data? Why is bigserial + UUIDv7 better?
Each object has an external key and an internal key. This separation allows you to migrate to other layouts, technologies, etc. without breaking your customer's links or records.
Internally, your database looks like:
  User
      ID          - uint128
      external_id - UUID (of some sort)
      name        - string

  Post
      ID          - uint128
      UserID      - uint128 (references User.ID)
      external_id - UUID
      ...
Then you have secondary indices on the external_id columns in both tables, and the UserID column on Post. You can then join from one to the other using the ID column.
e.g.

  -- Don't forget to bind USERID_FROM_REQUEST instead of
  -- concatenating the string; no Johnny Tables here!
  SELECT count(*)
  FROM Post
  JOIN User ON User.ID = Post.UserID
  WHERE User.external_id = <%USERID_FROM_REQUEST%>;
There should be significant performance benefits from using int's for the joining keys (at least compared to strings), but my experience might be old.
If you have distributed data creation (creating data on the client) and a CRDT-style mechanism for syncing, then you can't use bigserial, for the simple fact that it is sequential. The best solution here is UUIDv7, since you can generate these at the client even when offline.
That's not true, you can increment by 2, 10, 100, or any number. I'm not saying that's necessarily the best solution, but it's not true that you can't use it.
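A sketch of that interleaving with two writers (names and offsets arbitrary); note this only works for a known, fixed set of writers, which is part of why the parent prefers UUIDv7 for offline clients:

  CREATE SEQUENCE ids_node_a START WITH 1 INCREMENT BY 2;  -- node A takes the odd ids
  CREATE SEQUENCE ids_node_b START WITH 2 INCREMENT BY 2;  -- node B takes the even ids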
One challenge with PNR is actually restricting the alphabet appropriately. They sure are easy to say aloud -- just five or six letters in many cases -- but how do you ensure you have (a) enough letters to get a reasonable bitwidth and (b) not form ridiculous words?
Take all the Roman alphabet apart from the vowels - 21 characters at length 6 gives you roughly 86 million possibilities (21^6), which is plenty for most applications.
You can still get vaguely offensive sequences like FKNNGR or BLKCNT, but at some point you have to put this down not to your software being offensive or hateful but to humans finding patterns in randomness.
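A throwaway Postgres sketch of such a generator; note random() is not cryptographically secure, so treat this purely as illustration:

  -- 21 consonants at length 6: 21^6 ≈ 86 million codes
  SELECT string_agg(substr('BCDFGHJKLMNPQRSTVWXYZ', floor(random() * 21)::int + 1, 1), '')
  FROM generate_series(1, 6);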
For example https://sqids.org/ ensures that there are no profanities in the generated ids. And it allows you to add additional words that you want to avoid.
> Stripe are lying when they say their IDs are random
Where does Stripe make that claim?
I think most of the value of Stripe-style IDs is in their other properties, like the information they bear, readability, copy-pasteability (which uuids lack - double-click to select often stops at hyphens), etc.
I've never played enough with UUIDs in Postgres, but I wonder if you could publicly expose only the random bits (i.e. the second half of a UUIDv7) and have another non-b-tree index on SUBSTR(id, 16) for quick lookups based on that value. Similar is done for "short commit hashes" too.
Though I would wonder why go with UUIDs in that case at all?
Offhand, I remember _reading_ about that but haven't ever used it in practice so please test and confirm...
# Postgres can index function results (including what you described)
# Postgres does not magically know to USE those results. To use them you must e.g. JOIN (probably works for WHERE?) or ORDER or LIMIT by the same expression that went into the index (which is how it's matched up).
Generally, the Primary Key should either be 'small' (BigInt can count as such) and only used internally or be something strongly relevant to how the records in that table can be limited for most (or the most) queries.
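If anyone wants to try the upthread idea, a minimal sketch with a plain btree expression index (hypothetical users table; the planner only uses it when the query repeats the exact indexed expression):

  CREATE INDEX users_short_id_idx ON users (right(id::text, 12));
  SELECT * FROM users WHERE right(id::text, 12) = '598867ebf39a';  -- can use the index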
> They have some random parts but when analyzed in sets, a large chunk of the binary layout is clearly metadata, including embedded timestamps, shard and reference keys, and versioning, in varying combinations depending on the service.
100% this. Context is everything, and advice without understanding context should be suspect at best.
That said, I've found that contexts can change. Personally I started with the database being local (single user). It then migrated to remote (web), after that to distributed, and from there to mobile.
As contexts have changed I've found we needed to switch to UUID (and the switch was a pain). A lot of our problems would simply not exist if we'd used uuid from the start.
So now my "default advice" is uuid, unless there's a -specific- reason not to. For me, performance is not that reason (your database outlives multiple generations of hardware, and faster hardware will dwarf optimisation-via-data-type).
And don't get me started on natural versus surrogate identifiers - that ship sailed decades ago.
My somewhat naive understanding was that random UUIDs were not that big of a deal in Postgres because it does not cluster by primary key. And of course a UUID (16 bytes) is larger than a serial (4 bytes) or bigserial (8 bytes) by a factor of 2-4. This certainly might matter for an index, but at the whole-table level, where you have 20+ bytes of overhead per row, this doesn't seem like a big deal for anything except very narrow tables with large row counts.
So far my impression is that there are a whole lot of other things I need to worry about in Postgres before I spend time considering serial vs. random UUID vs. ordered UUID. Am I wrong here and this is something that really matters and you should invest more time in?
Random UUID's are super useful when you have distributed creation of UUID's, because you avoid conflicts with very high probability and don't rely on your DB to generate them for you, and they also leak no information about when or where the UUID was created.
Postgres is happier with sequence ID's, but keeping Postgres happy isn't the only design goal. It does well enough for all practical purposes if you need randomness.
> Postgres is happier with sequence ID's, but keeping Postgres happy isn't the only design goal.
It literally is the one thing in the entire stack that must always be happy. Every stateful service likely depends on it. Sad DBs means higher latency for everyone, and grumpy DBREs getting paged.
> Random UUID's are super useful when you have distributed creation of UUID's, because you avoid conflicts with very high probability and don't rely on your DB to generate them for you
See Snowflake IDs for a scheme that gives you the distributed-generation benefit of random UUIDs but is strictly increasing. It's really UUIDv7, except it fits in your bigint column. No entropy required.
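For the curious, a rough in-database sketch in the spirit of the well-known Instagram/Snowflake layout; the epoch, names, and bit split here are arbitrary choices, not a standard:

  CREATE SEQUENCE shard_1_seq;
  CREATE FUNCTION next_snowflake() RETURNS bigint AS $$
  DECLARE
      our_epoch bigint := 1704067200000;  -- custom epoch in ms (2024-01-01 here)
      now_ms    bigint := floor(extract(epoch FROM clock_timestamp()) * 1000);
      shard_id  int    := 1;                               -- up to 10 bits of node/shard id
      seq       bigint := nextval('shard_1_seq') % 4096;   -- 12 bits of sequence per ms
  BEGIN
      RETURN ((now_ms - our_epoch) << 22) | (shard_id << 12) | seq;
  END;
  $$ LANGUAGE plpgsql;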
> So far my impression is that there are a whole lot of other things I need to worry about in Postgres before I spend time considering serial vs. random UUID vs. ordered UUID. Am I wrong here and this is something that really matters and you should invest more time in?
Like any sort of optimization I believe this will depend on your workload and what's important.
For me when I switched to UUIDv7 a few months ago, it was basically no effort to switch v4 to v7 on a relatively new system. I was observing much higher batch insertion latencies than I expected, and producing inserts that touch less of the btree on an index created a very noticeable reduction in insertion latencies. But my workloads and latencies may look nothing like yours. On amazon RDS instances with EBS volumes and relatively low memory, insertion latency stood out, so using strategies that reduce the number of disk blocks that are needed has an outsized performance impact.
This of course would produce different results on different hardware / system sizing.
While Postgres does not cluster tables, uuids do affect indexes significantly: random insertion causes increased index slack, which translates to cache bloat and thus longer traversals and lower cache residency.
> Am I wrong here and this is something that really matters and you should invest more time in?
Specifically, no - you don't need to worry about it. Reconfiguring your tables to use a different style of unique identifier if your tables have a unique identifier is a bit of a pain but no more so than any other instance of renaming a column - if you want to minimize downtime you add the new column, migrate data to the column, deploy code that utilizes the new column and then finally retire the old column. Even if the previous version of the table lacked any sort of unique key it is still possible to add one after the fact (it's a bit technically harder to properly keep them in sync but it is possible to do safely).
It's just a question of the cost of doing so versus the benefits. I work in a system that exclusively uses integral keys, and our data is such that we don't really suffer any downsides from that choice. If you're working in a larger system with less confidence in the security practices of other teams, then avoiding sequential keys, so that you have obscurity to fall back on if someone really drops the ball on real security, isn't the worst idea... but I think the really compelling reason to prefer UUIDs is the power of distributed generation. That really only applies to inherently decentralized or astoundingly large products, though - and if your product eventually grows to astoundingly large, you'll have plenty of time to switch first (probably the wake-up call will be closing in on running out of 32-bit serial unique keys).
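A sketch of the add-backfill-switch dance described a couple of comments up (hypothetical orders table; gen_random_uuid() is built in since Postgres 13, and in production you would batch the backfill):

  ALTER TABLE orders ADD COLUMN public_id uuid;                             -- 1. add, nullable
  UPDATE orders SET public_id = gen_random_uuid() WHERE public_id IS NULL;  -- 2. backfill
  ALTER TABLE orders ALTER COLUMN public_id SET NOT NULL;
  CREATE UNIQUE INDEX CONCURRENTLY orders_public_id_key
      ON orders (public_id);                                               -- run outside a transaction
  -- 3. deploy code that uses public_id everywhere, then retire the old column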
Reconfiguring tables to use a different kind of unique ID (primary key in this context) can be a much bigger pain than an ordinary column rename if it is in use by foreign key constraints.
UUIDv7 are good enough that you can just avoid thinking and use it as a “default”. The worst consequences of doing so are some slightly impacted performance and leak of some timing information but these are extremely minor drawbacks.
Using completely random UUIDs is IMO the worst choice. It’s fine right up until it isn’t, and then you are stuck in hell with no good way out.
> Using completely random UUIDs is IMO the worst choice
Can't say that without context. I've worked in systems where even the version bits were randomized, and for good reason (I know - technically no longer a UUID).
A dev at $WORK thought it wasn't a big deal too because they "tested" on their unrestricted dev setup with 3 rows in the table.
Of course production crashed when their latest SQL query did a multi-join on the real table with a few million rows. The size of the uuids needed for the join filled the available RAM, everything slowed to a crawl, and the system collapsed.
The uuid as primary key can be seen as a risk multiplier: it will amplify any performance issue you may have, converting a temporary slowness into a full system stop.
Postgres doesn't automatically reorder rows by clustering on a primary key, but you can perform clustering manually (https://www.postgresql.org/docs/current/sql-cluster.html), and if the table isn't completely filled it can attempt to maintain rows in order (as long as there is room for it).
You still get performance hits from Visibility Map lookups, and WAL bloat.
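For reference, the manual clustering mentioned above is a one-shot physical reorder; Postgres does not maintain the order on later writes (hypothetical posts table):

  CLUSTER posts USING posts_pkey;  -- rewrites the table in index order; rerun as needed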
As a DBRE, I believe it always matters, and you should invest time in it. Pragmatically, it is unlikely to have noticeable effects until your tables are at least in the 1E5 rows range, if not higher. Unfortunately, by that point, it’s likely that you’ll find other things of higher importance to deal with, so the problem will only grow.
I see some comments conflating privacy of sequence statistics with global uniqueness considerations and UX.
If your concern is globally unique identifiers (i.e. so that you can merge tables across multiple instances of your database), then UUID is exactly what you want. This is entirely what it is designed for.
If your concern is the privacy of sequence statistics, then UUID incidentally solves your problem. It may not be precisely what you want, and could continue to leak private information depending on the specific variant used. If you want privacy of sequence statistics, then I would suggest something like a sha256 hash of the primary key concatenated with a cryptographic salt stored in a separate column. These make excellent identifiers in places like APIs and URLs.
If you desire a unique identifier that has a high quality UX, then this is in addition to the above columns. This sequence generally has lower entropy than the cryptographic approach (or the UX would suffer), so additional measures should be taken to protect the privacy of the identifiers (e.g. expire redemption tokens after a period of time and reissue via email).
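A rough sketch of the salted-hash column idea (hypothetical users table; digest() and gen_random_bytes() come from the pgcrypto extension):

  CREATE EXTENSION IF NOT EXISTS pgcrypto;
  ALTER TABLE users ADD COLUMN salt text DEFAULT encode(gen_random_bytes(16), 'hex');
  ALTER TABLE users ADD COLUMN public_id text;
  UPDATE users SET public_id = encode(digest(id::text || salt, 'sha256'), 'hex');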
Autoincrementing integers are really nice if you don't actually need UUIDs. Lots of tricks you can apply with them.
> It may not be precisely what you want, and could continue to leak private information depending on the specific variant used. If you want privacy of sequence statistics, then I would suggest something like a sha256 hash of the primary key concatenated with a cryptographic salt stored in a separate column. These make excellent identifiers in places like APIs and URLs.
Is there a reason not to use version 4 UUID and if time ordering is needed, save timestamp explicitly as another column?
Most of the comments here seem to gear towards "you must always…". But as you rightfully point out, "it depends": it's an architectural trade-off depending on the required qualities.
Call me old-fashioned, but I really like integer autoincrement primary keys. They're easy to understand and obviously simple to sort. Furthermore, when working on large batch projects you can just store the last primary key as your high-water mark and get everything greater than that.
I suppose TSID works to this end, but certainly more complicated.
It is simpler but like everything else it depends on the application. For a private app you can generally get away with it. Something that's more public facing? The ID will most likely leak information. As systems mature and you introduce things like replication, having IDs that are more universal starts looking good. In general, starting off with a uuid like uuid v4 or nanoid is a good bet.
Be advised using the last auto-incremented value as a "high water mark" and getting everything greater than that is not 100% reliable, because in some scenarios auto-incrementing values can be written out of order.
Yup, not many people seem to know about CACHE vals and/or sequence reservation. It's a problem that I've had to explain a couple of times already to colleagues. And unfortunately hidden assumptions are oft the most dangerous kind.
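For the unfamiliar, a minimal sketch of the CACHE clause in question: each session pre-allocates a block of values, so ids from concurrent writers can commit out of numeric order, and unused cached values are simply discarded:

  CREATE SEQUENCE orders_id_seq CACHE 100;  -- each session grabs 100 values at a time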
It would be nice for these comparisons to also include 'int64' so people can see how much of an overhead UUID's are compared to the traditional approach.
I personally prefer ULID since it is compatible with a UUID type, and you also get a timestamp lexicographically built into the ID, so that sorting by ID also means sorting by timestamp. There are multiple PG extensions to make it easy to drop in and use.
How do you deal with ULID exposing the timestamp (since it is lexicographically sortable)?
Maybe your ULID is not public facing? Or this is not an issue for your application?
I want to use something URL-friendly too, since uuid sucks...
UUIDs are miserable to work with. For 99% of use cases, sequential IDs are fine. Can they be guessed? Sure, but your software should guard against unauthorized access. Security through obscurity or randomness is a poor excuse for using UUIDs as PKs. If you don't want to expose your IDs, use a slug.
The dev experience of debugging degrades so much when UUIDs are involved, it is depressing.
For example it's much harder to spot patterns and wrong IDs in SQL query results if you are looking at these giant blobs of random characters called UUIDs.
This is especially true for PostgreSQL which increments sequences for upserts (even if no records are inserted).
That's how I've hit 32-bit limits on tables that had only a couple million rows.
---
I would only use 32-bit for very selective items that are used in a lot of FKs, like a tenant ID.
> Personally I would only use an int PK in a table where you know the bounds of the data with very high confidence.
Never again.
Secondly, it's not about how many rows the table has currently - you have to consider the complete lifetime of the table: when rows are deleted/created and how often.
So what you've said is true, but the set of cases where smallint is appropriate is much smaller than expected.
> You can still get vaguely offensive sequences like FKNNGR or BLKCNT, but at some point you have to put this down not to your software being offensive or hateful but to humans finding patterns in randomness.
Do you ensure that your software does not form ridiculous words in every language? Or just another US-centric thing?
The idea of engineering identifiers to avoid being ridiculous is ridiculous to me, honestly...
Depends on what you mean by ridiculous.
Why not just use the UUID type??
Could you share this analysis? Seems interesting.
I don't trust anyone who gives recommendations sight unseen.
> Is there a reason not to use version 4 UUID and if time ordering is needed, save timestamp explicitly as another column?
I personally prefer the extra 128 bits of entropy, but I have no evidence that says it's more secure in practice.
Just hope you never have to merge tables from two databases together.
I know of only one person in my entire career that had to do this. And they managed it just fine despite working with auto-incrementing big ints.
Yet some folks advocate that all projects should pay an expensive insurance against this elusive event of two databases being merged.
The size of that problem depends on your situation.
https://datatracker.ietf.org/doc/html/rfc9562
(which wasn't yet finished at the time of the article)