The main driver behind lexically sortable identifiers is that you generally insert into databases in time order so if your ids are sorted by time then you are appending to the end of the table. If, on the other hand, your ids are random (e.g. the prevalent UUIDv4) then your database is spending all it's time shuffling everything around to insert your new rows in the middle of everything.
Once you have time-based identifiers as primary keys you can do fancy things like using some id you know was generated at time x as a where clause on some other table to dramatically cut down the search space. This kind of dirty trick can really improve performance sometimes when it really matters.
However, the main contender for lexically sorted IDs is ... UUIDs! The new UUID specs have ULID versions of UUIDs.
That spec lists 16 prior lexically-sortable identifiers, and has good motivation for adding time sorting to the UUID standard.
Two things I particularly like about the UUIDv7 standard is that it is easy to swap into existing codebases that are using UUIDv4, and that it is easy to extract the timestamp part in systems that don't know about it and are treating it as a string e.g. in lots of databases you can do some sql like TIMESTAMP_MILLIS(CAST(CONCAT("0x", LEFT(REPLACE(id, '-', ''), 12)) AS INT64)). This is a massive advantage over systems that use non-hex encodings.
So these days, use UUIDv7 if you want to generate your ID outside of a database and the time the id was created is not a secret; else use UUIDv4. Your database will thank you.
> your database is spending all it's time shuffling everything around to insert your new rows in the middle of everything
Is it? The order of data on disk tends to be the insertion order. What might get shuffled around is the index for the primary key, but indexes can handle that quite well.
I learned the lexically-sortable-id trick from wizened DBAs in the 90s. It's an old trick. I've applied it myself to speed things up and have used both the ULID types described in the article and UUIDv7 etc.
I even advocate using UUID7 on BigQuery where there are no indexes. Having sortable keys is nice even if it doesn't add speed.
You don't need a total order for the locality properties. E.g. a 32bit ms timestamp that overflows every ~50 days + 96bit of entropy will give you the same locality with a much lower collision probability.
That being said, a lot of index types don't care about this kind of locality and might even benefit from higher entropy (randomized algorithms).
But if you for instance use MS SQL and the uniqueidentifier type, sorted UUIDv7 is not sorted im the database because MS SQL swapsnthe bytes around. So ULID (stored as binary(16)) is an advantage in that case to avoid making that mistake.
I really like that ULID looks differentt from UUIDv4. UUIDv7 is something sorted that on the surface looks like it will be random and is mixed with random IDs etc..
It depends per storage system. I think we evaluated this on Postgres b-tree and weren’t super jazzed about it. Some storage systems don’t like hotspots. They can write two far-apart keys in parallel, but adjacent keys need to be serialized. Sometimes this happens with sharding systems.
What drives me nuts is people storing uuids in databases as hex strings. They're bytes, and if you do it right, even the type 1's are lexigraphically sortable.
Hex strings may require more storage, but they remove the need to convert back and forth and make it easier to manually query the database when debugging. The best type is your database's uuid type if it has one.
There is a draft for a UUID variant called v7 which is basically the same like ulid
It is absolutely valid
In particular when you want ids to be in mostly ascending order but where you can’t use an auto incrementing int
It’s sortable is a key feature because it can help make searching tables more efficient in situations where the data is naturally tied to the time at which the entry was made and your queries relate to time also
For example, if you append rows with ulid or uuid v7, then you can binary search the records in the file without first sorting or making an extra index of the ids as they are already for the most part in ascending order
If your team hasn't engineered in monotonically increasing distributed clocks, the results of sorting are only partly correct. I'd rather not promote an option we can't rely on.
They’re only partially correct no matter what. The timestamp in a ULID has only has millisecond resolution, so two IDs generated in the same millisecond, even within the same process, will be randomly ordered.
In practice though there are a lot of advantages to having approximately-time-ordered IDs, and I’ve found the pitfalls easy enough to avoid.
At this point I think that ideal setup is to use numerical id for primary and foreign keys and maintaining a separate uuid field for everything else.
The reason being that index size matters a lot (for caches and other things) and index size depends on underlying field size, obviously.
Whether to use UUID or ULID is depends on tooling. While it's not hard to write ascending UUID generator and I did it myself few times, using standard APIs might be better approach despite the shortcomings.
I wouldn’t, it’s redundant for all intents and purposes unless you’re Google, which you are not. UUIDs/ULIDs provide strong uniqueness guarantees on their own.
Yes, I'd keep a uniqueness constraint, because it prevents bugs from inserting the same value more than once. From concurrency issues to broken id generation, I've seen it happen before. So unless there are other considerations, having the database enforce uniqueness is a good thing in my book.
Depends on how many new U(U/L)IDs you insert, but when it's part of an index anyway then I personally would, as the performance penalty would be small (much smaller than anything else you do).
This topic resurfaces quite frequently, I've noticed. One thing I find quite frustrating is that most of the time this is discussed, people don't remember that exposing the creation time of a database row to the public is not acceptable in many systems. This makes v7 UUIDs, ULIDs, KSUIDs, etc. not appropriate for the task.
In any application where the user is able to first create a draft of something (a video upload, a news article, a blog post, ...) and then release it some day in the future, we probably don't want the public to be aware of when the author created the initial draft. For instance, news websites often start drafting articles about people that are speculated to pass away soon.
The problem also affects numeric/auto-incremented IDs. The "Finnish BBC" YLE uses numeric IDs in their news article URLs, making it possible to ballpark when the article was first started. Here's [1] an article about the death of former president Martti Ahtisaari from 2023/10/16 and here's [2] the next integer ID dated 2023/5/5 (5 months before).
I'm wondering about the security of using a random alphabet with this instead of the default one. In my mind this amounts to a form of cryptography, but I have no idea how to analyse how much security it gives.
EDIT: Reading the faq I see that they insist that sqids cannot provide any encryption. This does not fit with my understanding of the word. Using unique random alphabet is probably the oldest form of encryption. Whether or not it is secure enough depends on your threat model. What I want and what I need is a way to calculate the security provided by a random alphabet.
I've been doing something that includes a type character as the first character - followed by 11 random base58 digits. This allows > 2^64 possible ids and a type in ~12 bytes. I occasionally wish they were ordered, but most of my tables usually have a 'created_at' field anyway should ordering matter.
I feel like uuid / ulid are just overkill for most situations - and they're long and kinda ugly imo.
The encoding is completely different? And given that encoding I don't go to any great means to store them as a special type in the db - just text. uuid being hexidecimal, databases often want to store them as binary since that's half the bytes...
Once you have time-based identifiers as primary keys you can do fancy things like using some id you know was generated at time x as a where clause on some other table to dramatically cut down the search space. This kind of dirty trick can really improve performance sometimes when it really matters.
However, the main contender for lexically sorted IDs is ... UUIDs! The new UUID specs have ULID versions of UUIDs.
The new UUID spec includes lexically sortable variants, of which UUIDv7 is gaining widespread support. https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122b...
That spec lists 16 prior lexically-sortable identifiers, and has good motivation for adding time sorting to the UUID standard.
Two things I particularly like about the UUIDv7 standard is that it is easy to swap into existing codebases that are using UUIDv4, and that it is easy to extract the timestamp part in systems that don't know about it and are treating it as a string e.g. in lots of databases you can do some sql like TIMESTAMP_MILLIS(CAST(CONCAT("0x", LEFT(REPLACE(id, '-', ''), 12)) AS INT64)). This is a massive advantage over systems that use non-hex encodings.
So these days, use UUIDv7 if you want to generate your ID outside of a database and the time the id was created is not a secret; else use UUIDv4. Your database will thank you.
Is it? The order of data on disk tends to be the insertion order. What might get shuffled around is the index for the primary key, but indexes can handle that quite well.
Here are some nice benchmarks: https://www.toomanyafterthoughts.com/uuids-are-bad-for-datab...
I learned the lexically-sortable-id trick from wizened DBAs in the 90s. It's an old trick. I've applied it myself to speed things up and have used both the ULID types described in the article and UUIDv7 etc.
I even advocate using UUID7 on BigQuery where there are no indexes. Having sortable keys is nice even if it doesn't add speed.
That being said, a lot of index types don't care about this kind of locality and might even benefit from higher entropy (randomized algorithms).
[1] https://news.ycombinator.com/item?id=40272446
I really like that ULID looks differentt from UUIDv4. UUIDv7 is something sorted that on the surface looks like it will be random and is mixed with random IDs etc..
Deleted Comment
- it’s slightly more complex than UUID, but not enough to be a problem
- it’s sortable (in time?) which UUID can also be (but usually recommended not to be)
- it can produce slightly more ids per second, but not enough to make a difference.
So, it’s a tie, a tie and a tie. Why would you switch?
It is absolutely valid
In particular when you want ids to be in mostly ascending order but where you can’t use an auto incrementing int
It’s sortable is a key feature because it can help make searching tables more efficient in situations where the data is naturally tied to the time at which the entry was made and your queries relate to time also
For example, if you append rows with ulid or uuid v7, then you can binary search the records in the file without first sorting or making an extra index of the ids as they are already for the most part in ascending order
> I am not going to cover all the reasons why Ulid is better, but in the end, it's up to your personal preference.
It gives you the sortable parts without the "recommended not to" parts.
In practice though there are a lot of advantages to having approximately-time-ordered IDs, and I’ve found the pitfalls easy enough to avoid.
Eg messaging, tweets etc: if two messages are sent at pretty much the same time, it does not matter which order they show in.
The reason being that index size matters a lot (for caches and other things) and index size depends on underlying field size, obviously.
Whether to use UUID or ULID is depends on tooling. While it's not hard to write ascending UUID generator and I did it myself few times, using standard APIs might be better approach despite the shortcomings.
https://sqids.org/
Even for huge tables 64 bits are enough. And for many tables 32 bits is more than enough.
ULIDs and Primary Keys (2022) - https://news.ycombinator.com/item?id=40016413 - April 2024 (33 comments)
Ulid: Universally Unique Lexicographically Sortable Identifier - https://news.ycombinator.com/item?id=39878319 - March 2024 (6 comments)
ULID – Sortable Unique Identifier - https://news.ycombinator.com/item?id=34281969 - Jan 2023 (23 comments)
Using ULIDs at Incident.io - https://news.ycombinator.com/item?id=34230652 - Jan 2023 (1 comment)
Understanding UUIDs, ULIDs and string representations - https://news.ycombinator.com/item?id=29794186 - Jan 2022 (100 comments)
Going Deep on UUIDs and ULIDs - https://news.ycombinator.com/item?id=28948815 - Oct 2021 (2 comments)
Universally Unique Lexicographically Sortable Identifier - https://news.ycombinator.com/item?id=23160641 - May 2020 (2 comments)
ULID: Universally Unique Lexicographically Sortable Identifier - https://news.ycombinator.com/item?id=18768909 - Dec 2018 (129 comments)
Universally Unique Lexicographically Sortable Identifier in Go - https://news.ycombinator.com/item?id=13116308 - Dec 2016 (52 comments)
ULID: Universally Unique Lexicographically Sortable Identifier - https://news.ycombinator.com/item?id=12205158 - Aug 2016 (3 comments)
In any application where the user is able to first create a draft of something (a video upload, a news article, a blog post, ...) and then release it some day in the future, we probably don't want the public to be aware of when the author created the initial draft. For instance, news websites often start drafting articles about people that are speculated to pass away soon.
The problem also affects numeric/auto-incremented IDs. The "Finnish BBC" YLE uses numeric IDs in their news article URLs, making it possible to ballpark when the article was first started. Here's [1] an article about the death of former president Martti Ahtisaari from 2023/10/16 and here's [2] the next integer ID dated 2023/5/5 (5 months before).
[1] https://yle.fi/a/74-20030335
[2] https://yle.fi/a/74-20030336
Deleted Comment
I'm wondering about the security of using a random alphabet with this instead of the default one. In my mind this amounts to a form of cryptography, but I have no idea how to analyse how much security it gives.
EDIT: Reading the faq I see that they insist that sqids cannot provide any encryption. This does not fit with my understanding of the word. Using unique random alphabet is probably the oldest form of encryption. Whether or not it is secure enough depends on your threat model. What I want and what I need is a way to calculate the security provided by a random alphabet.
I feel like uuid / ulid are just overkill for most situations - and they're long and kinda ugly imo.