klabb3 · 3 years ago
A couple of suggestions:

Lock down the prefix string now before it’s too late and document it. I see in Go that it’s lowercase ascii, which seems fine except for compound types (like “article-comment”). May be worth looking at allowing a single separator given that many complex projects (and ORMs) can’t avoid them.

The Go implementation has no tests. This is very unit-testable. Add tests goddammit!

For Go, I’d align with Google’s UUID implementation, with proper parse functions and an internal byte array instead of strings. Strings are for rendering (and in your case, the prefix). Right now, it looks like the parsing is too permissive, and goes into generation mode if the suffix is empty. And the SplitN+index thing will panic if there are no underscores, no? Anyway, tests will tell.
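
A minimal sketch of the stricter parsing described above, not the library's actual code. The `TypeID` struct and `Parse` function are hypothetical names, and the rules (non-empty lowercase ASCII prefix, non-empty suffix) are assumptions made for illustration. The point is that a missing underscore or empty suffix becomes an error rather than a panic or a silent fallback to generation.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TypeID is a hypothetical parsed form. The suffix is kept as a string
// here; a fuller implementation would decode it into a byte array.
type TypeID struct {
	Prefix string
	Suffix string
}

// Parse splits on the last underscore and rejects malformed input
// instead of indexing blindly into a split result.
func Parse(s string) (TypeID, error) {
	i := strings.LastIndexByte(s, '_')
	if i < 0 {
		return TypeID{}, errors.New("typeid: missing '_' separator")
	}
	prefix, suffix := s[:i], s[i+1:]
	if prefix == "" {
		return TypeID{}, errors.New("typeid: empty prefix")
	}
	if suffix == "" {
		return TypeID{}, errors.New("typeid: empty suffix")
	}
	// Assumed rule: the prefix is lowercase ASCII letters only.
	for _, c := range prefix {
		if c < 'a' || c > 'z' {
			return TypeID{}, fmt.Errorf("typeid: invalid prefix character %q", c)
		}
	}
	return TypeID{Prefix: prefix, Suffix: suffix}, nil
}

func main() {
	for _, in := range []string{"user_abc123", "nounderscore", "user_", "User_abc"} {
		id, err := Parse(in)
		fmt.Println(in, id, err)
	}
}
```

Each of the malformed inputs in `main` returns an error, which is the kind of case a table-driven unit test would pin down.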

As for the actual design decisions, I tried to poke holes but I fold! I think this strikes the sweet spot between the different tradeoffs. Well done!

dloreto · 3 years ago
Thanks for the feedback!

We have tests for the base32 encoding which is the most complicated part of the implementation (https://github.com/jetpack-io/typeid-go/blob/main/base32/bas...) but your point stands. We'll add a more rigorous test suite (particularly as the number of implementations across different languages grows, and we want to make sure all the implementations are compatible with each other)

Re: prefix, is the concern that I haven't defined the allowed character set as part of the spec?

dolmen · 3 years ago
There are no tests.

There is just a single test. Which only tests the decoding of a single known value. No encoding test.

Go has infrastructure for benchmarking and fuzzing. Use it!

Also, you took code from https://github.com/oklog/ulid/blob/main/ulid.go which has "Copyright 2016 The Oklog Authors" but this is not mentioned in your base32.go.

klabb3 · 3 years ago
> We have tests for the base32 encoding which is the most complicated part of the implementation

I didn’t look into it much but it seems like a great encoding even outside of this project. Predictable length, reasonable density, “double clickable” etc. I’ve been annoyed with both hex and base64 for a while so it’s pretty cool just by itself.

> Re: prefix, is the concern that I haven't defined the allowed character set as part of the spec?

Yeah, the worry is almost entirely “subtle deviations across stacks”, which is usually due to ambiguous specs. It’s so annoying when there are minor differences, compatibility options etc (like base64 which has another “URL-friendly” encoding - ugh).

kbumsik · 3 years ago
> Re: prefix, is the concern that I haven't defined the allowed character set as part of the spec?

It would be great if you added suggestions for compound types (like “article-comment”) to the README, as OP suggested.

avgcorrection · 3 years ago
> The Go implementation has no tests. This is very unit-testable. Add tests goddammit!

Yep. The readme asks people to provide other implementations. Having a test suite would be good for third-party code.

dloreto · 3 years ago
A follow up:

1. We've now implemented pretty thorough testing: https://github.com/jetpack-io/typeid-go/blob/main/typeid_tes...

2. I clarified the prefix in the spec

Thanks for the feedback!

rtheunissen · 3 years ago
I'll write some tests for this.
jhoechtl · 3 years ago
I think your misconception is that the prefix is something fixed? You decide on the prefix depending on the usage domain.
tomcam · 3 years ago
> Add tests goddammit!

Hey, you’re pretty smart. How about you add them?

klabb3 · 3 years ago
I’m by no means a test police. I’m in fact opposed to a lot of mindless testing for the sake of it. But there are places where unit tests shine, and this is one of them.

If you mean that criticism is only allowed if you are willing to commit labor, I disagree with that. I always welcome critique myself - it may be something that I’ve missed. The maintainers always have the last word. As long as there are no hidden expectations, it’s all good.

aartav · 3 years ago
I've been doing this kind of thing for years with two notable differences:

1. I don't believe people actually hand type-in these values, so I'm not really concerned about the 'l' vs '1' issue. I do base 32 without `eiou` (vowels) to reduce the likelihood of words (profanity) sneaking in.

2. I add two base-32 characters as a checksum (salted of course). This prevents having to go look at the datastore when the value is bogus either by accident or malice. I'm unsure why other implementations don't do this.
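
One way the two points above could be combined, as a sketch only. The commenter does not say how the checksum is computed, so the use of HMAC-SHA256 with a secret key as the "salt" is an assumption, and `Seal`, `Verify`, and `checksum` are hypothetical names. The alphabet is the 36 alphanumerics minus `e`, `i`, `o`, `u`, which leaves exactly 32 symbols.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// 10 digits + 22 letters (a-z without e, i, o, u) = 32 symbols.
const alphabet = "0123456789abcdfghjklmnpqrstvwxyz"

// checksum derives two base-32 characters (10 bits) from a keyed hash
// of the ID body. The key plays the role of the salt.
func checksum(key []byte, body string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(body))
	sum := mac.Sum(nil)
	v := uint16(sum[0])<<8 | uint16(sum[1])
	return string([]byte{alphabet[(v>>11)&31], alphabet[(v>>6)&31]})
}

// Seal appends the two check characters to the body.
func Seal(key []byte, body string) string {
	return body + checksum(key, body)
}

// Verify recomputes the check characters without touching a datastore.
func Verify(key []byte, id string) bool {
	if len(id) < 3 {
		return false
	}
	body, tag := id[:len(id)-2], id[len(id)-2:]
	return hmac.Equal([]byte(tag), []byte(checksum(key, body)))
}

func main() {
	key := []byte("example-secret") // placeholder key for illustration
	id := Seal(key, "7zzk3m9q")
	fmt.Println(id, Verify(key, id))
}
```

With only 10 check bits, a randomly guessed ID still passes about 1 time in 1024, so this filters most bogus values cheaply but is not an authentication mechanism.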

sokoloff · 3 years ago
> base 32 without `eiou` (vowels) to reduce the likelihood of words (profanity) sneaking in.

We had “analrita” as an autogenerated password that resulted in a complaint many years ago. Might consider adding ‘a’ as an excluded letter.

michaelt · 3 years ago
Presumably base 32 means 26 letters + 10 digits - 4 banned letters

So adding an excluded letter is not easy.

manquer · 3 years ago
Wouldn’t that be excluded because i is already removed ?

tlrobinson · 3 years ago
I agree with the addition of the checksum, however I’m curious:

> either by accident or malice

1. If you don’t believe people hand type these, then how else will they accidentally enter an invalid value? I suppose copy/paste errors, or if a page renders it as uppercase, though you should just normalize the case if it’s base 32.

2. How does a two-character (non-cryptographically secure) checksum help in the case of malice?

dloreto · 3 years ago
The checksum idea is interesting. I'm considering whether it makes sense to add it as part of the TypeID spec.
veec_cas_tant · 3 years ago
What value does the checksum provide? I think I'm missing something because I really don't see a benefit.
zrail · 3 years ago
I implemented number two as part of an encoding scheme a few months ago. I'm not sure how much it's saved in terms of database lookups but it's aesthetically pleasing to know it won't hit a more inscrutable error while trying to decode.
ajkjk · 3 years ago
Unrelated, but this links to "Crockford's alphabet", https://www.crockford.com/base32.html , which is a base-32 system that includes all alphanumeric characters except I and L (which are confusable with 1), O (which is confusable with 0), and U (????). The page says the reason for excluding U is "accidental obscenity". What the heck is it talking about?
kibwen · 3 years ago
> The page says the reason for excluding U is "accidental obscenity".

Crockford is being cheeky. To make a nice base32 alphabet out of non-confusable alphanumeric characters you only need to exclude O, I, and L. This leaves you with 33 characters still, so you need to remove one more, and it doesn't matter which one you remove, so you might as well pick an arbitrary reason for the last character that gets removed (and it's not the worst reason, if your goal is to use these as user-readable IDs, although obviously it's not even remotely bulletproof).
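
The counting argument above can be checked mechanically. This is a small sketch (the function name `crockfordAlphabet` is just for illustration) that starts from the 36 digits and uppercase letters and drops I, L, O, and U.

```go
package main

import (
	"fmt"
	"strings"
)

// crockfordAlphabet removes the three confusable letters (I, L, O)
// plus U from the 36 alphanumerics, leaving 32 symbols.
func crockfordAlphabet() string {
	const all = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
	var b strings.Builder
	for _, c := range all {
		if strings.ContainsRune("ILOU", c) {
			continue
		}
		b.WriteRune(c)
	}
	return b.String()
}

func main() {
	a := crockfordAlphabet()
	fmt.Println(a, len(a))
}
```

Dropping only I, L, and O would leave 33 symbols, one too many for 5 bits per character, which is why a fourth letter has to go.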

pluijzer · 3 years ago
You could argue that U can be confused with V.

dolmen · 3 years ago
This assumes that English is the only relevant language regarding curse words. Which is quite biased.
quickthrower2 · 3 years ago
U is a fairly new letter anyway.
codeulike · 3 years ago
If I and O are already excluded and you also exclude U that removes a lot of potential rude looking three letter combinations like *** and *** and *** and also the four letter ones like **** and **** and the dreaded ****. Of course because you have A then **** is still a possibility but very very unlikely
titanomachy · 3 years ago
Wow I didn't know HN even had obscenity filters, and I've been here for many years.

Guess that's a credit to the general civility of the community.

EDIT: It appears that other people in this thread are freely using profanity, so either your comment was targeted by automation due to the unusual density of banned words, or it's a joke that went over my head :)

AceJohnny2 · 3 years ago
you accidentally the whole thing
pavlov · 3 years ago
True Latinists find the letter U vulgar to the point of obscenity because it didn’t exist in Cicero’s time.
oleganza · 3 years ago
Trve Latinists wovld appreciate yovr point.
Zamicol · 3 years ago
There's more!

- base 58 - Satoshi's/Bitcoin's https://en.wikipedia.org/wiki/Binary-to-text_encoding#Base58

- "base62" - Keybase's saltpack https://github.com/keybase/saltpack

- The famous "Adobe 85" - https://en.wikipedia.org/wiki/Ascii85

- basE91 - https://base91.sourceforge.net

At work we defined several new "bases" for QR code. IMHO, it is an under applied area of computer science.

hinkley · 3 years ago
A coworker and I came up with basically this same set about 4 years before Crockford. We were trying to solve the url slug problem, and they were long enough that we felt 5 bits per byte would reduce transcription annoyances.

In the end I think we had a couple of characters to spare, and so, sitting by ourselves because everyone else had gone home for the day, we ranked swear words by how offensive they were to prioritize removal of a few extra letters. Then I convinced him that slurs were a bigger problem so we focused on that, which got rid of the letter n, instead of u

tggr is just cute, n**r is an uncomfortable conversation with multiple HR teams (we were B2B)

I'm a bit fuzzy now on what our ultimate character set was, because typically you're talking [a-z][0-9], and there are a lot of symbols you can't use in urls and some that are difficult to dictate. My recollection is that we eliminated 0, l, and 1, but I think we relied on transcription happening either from all caps or all lowercase. 0o are not a problem. Nor are 1L.

hinkley · 3 years ago
Other comments are jogging my memory. I think we went case sensitive (62 characters -> 30 spares), eliminated aA4, eE3, iI1l, oO0 (maybe Q), uU, which is 16 characters, 14 to go. Remove the remaining 7 numbers (once you remove most for leetspeak what's the point of the rest?), nN, yY. That leaves 2 left and I can't recall what we did with those. Maybe kK or rR.

Y is pretty versatile for pissing people off.

theptip · 3 years ago
pizzapill · 3 years ago
E-Mail accounts seem the worst. Let's just write letters again; if you need a pencil I recommend penisland.net
programmarchy · 3 years ago
There's enough comedic content in this article for several Silicon Valley episodes.
jszymborski · 3 years ago
FUCK
Racing0461 · 3 years ago
yep, youtube video ids has/had? same issue where it would have things like fag/f4g etc in it.

eg: google "allinurl:fag site:youtube.com"

avgcorrection · 3 years ago
> The page says the reason for excluding U is "accidental obscenity". What the heck is it talking about?

Because he’s an American?

deanmen · 3 years ago
The F word has a U in it. Sure you could just say FVCK
bongobingo1 · 3 years ago
Or Fwck if you doubly mean it.

programmarchy · 3 years ago
Yeah, wtf?
atonse · 3 years ago
Obviously by “wtf” you must mean “why the face?” Right? Right?? :-)
inopinatus · 3 years ago
I'm not wild about the Crockford encoding. In practice I've found it to be a flat-out mistake when you come to provide technical support or analysis for values encoded this way. The Crockford alphabet is based on design goals that are rarely encountered in practice, such as pronouncing identifiers over the phone. It introduces ambiguity, which is a disaster for grepping logs or any other circumstances where you might query or cross-reference based on the encoded string instead of the decoded value, then permits hyphens, a leading source of cut-and-paste and line-break errors.

Note that people generally do not type in object identifiers, but they do frequently cut-and-paste them between applications and chat/forum interfaces, forward them by email, search for them in log files. Verbal transmission is rare to non-existent. Under these conditions, pronunciation proves irrelevant, and case-insensitivity becomes an impediment, but consistency and paste/break resilience become necessary.

Base 58 offers a bijective encoding that fits these concerns much more effectively and is more compact to boot. Similarly inspired by Stripe, I've been using type-prefixed base58-encoded UUIDs for object identifiers for some years. user_1BzGURpnHGn6oNru84B3Ri etc.
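
A rough sketch of what a type-prefixed, base58-encoded 128-bit identifier could look like. This is not the commenter's code: the `encodeBase58` helper is hypothetical, the alphabet is the Bitcoin one, and the value here is 16 random bytes rather than a real UUID. Leading zero bytes are mapped to `1` so that the encoding can be reversed.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// Bitcoin's base58 alphabet: no 0, O, I, or l.
const b58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

// encodeBase58 treats the input as a big-endian integer and writes it
// in base 58, preserving leading zero bytes as leading '1' characters.
func encodeBase58(b []byte) string {
	n := new(big.Int).SetBytes(b)
	base := big.NewInt(58)
	mod := new(big.Int)
	var out []byte
	for n.Sign() > 0 {
		n.DivMod(n, base, mod)
		out = append(out, b58[mod.Int64()])
	}
	for _, x := range b {
		if x != 0 {
			break
		}
		out = append(out, b58[0])
	}
	// Digits were produced least significant first, so reverse them.
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return string(out)
}

func main() {
	var id [16]byte
	if _, err := rand.Read(id[:]); err != nil {
		panic(err)
	}
	fmt.Println("user_" + encodeBase58(id[:]))
}
```

A 128-bit value needs at most 22 base58 characters, compared with 26 in base 32 or 32 in hex.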

Edit to add: to be fair to Douglas Crockford, his encoding of base 32 was designed two decades ago, when the usage landscape looked quite different.

dloreto · 3 years ago
I hear you ... and I debated using either base58 or base64url. I do like the more compact encoding they provide.

Ultimately I ended up leaning towards a base32 encoding, because I didn't want to pre-suppose case sensitivity. For example, you might want to use the id as a filename, and you might be in an environment where you're stuck with a case insensitive filesystem.

Note that TypeID is using the Crockford alphabet and always in lowercase – *not* the full rules of Crockford's encoding. There's no hyphens allowed in TypeIDs, nor multiple encodings of the same ID with different variations of the ambiguous characters.
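
To illustrate the restriction described above (lowercase Crockford alphabet only, no hyphens, no alternate spellings), here is a hedged sketch of a validity check. The function name `validSuffix` is made up for this example and the check ignores length and overflow rules that a full spec would also define.

```go
package main

import (
	"fmt"
	"strings"
)

// Lowercase Crockford alphabet: digits plus a-z without i, l, o, u.
const alphabet = "0123456789abcdefghjkmnpqrstvwxyz"

// validSuffix accepts only characters from the alphabet above, so each
// value has exactly one textual form: no uppercase, no hyphens, and no
// aliases such as 'o' for '0' or 'l' for '1'.
func validSuffix(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		if strings.IndexByte(alphabet, s[i]) < 0 {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"01h455vb4pex5vsknk084sn02q", "01H455VB", "01h4-55vb", "o1h455vb"} {
		fmt.Println(s, validSuffix(s))
	}
}
```

Because only one spelling is accepted, the encoded string can be grepped or compared byte for byte without normalizing it first.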

ash · 3 years ago
I agree that pronouncing identifiers over the phone is rare. But I’m occasionally typing identifiers from:

1. a screenshot or a screen share that contains an identifier

2. another device where I can’t easily take an identifier

inopinatus · 3 years ago
That's fair. From experience I think the most common problem with screenshots is [0O] and [Il] ambiguity. As a point of comparison I'm willing to suggest that both base58 and crockford32 handle the matter reasonably, albeit differently, through their omitted-characters and decoding tables.

One feature I do like from crockford32, that base58 lacks, and which also assists transcription from noisy sources, is the check symbol. So much that it is quite unfortunate that this check symbol is optional. In 2023 it's hard to fight the urge to specify a mandatory emoji to encode a check value (caveat engineer: this is not actually a good idea :))

Lazare · 3 years ago
I agree; base58 or base62 (which KSUIDs use) have a lot to recommend them. Crockford's base32 works, but I don't love it.

My first choice would be to just use type-prefixed KSUIDs, which gives you 160-bit K-sortable IDs with base62 encoding, which works great unless you need 128-bit IDs for compatibility reasons.

yencabulator · 3 years ago
Wait, where's the hyphen in Crockford Base32? https://en.wikipedia.org/wiki/Base32#Crockford's_Base32

My favorite base-32 encoding is z-base-32, which I find just gentler on the eyes: https://philzimmermann.com/docs/human-oriented-base-32-encod...

The biggest problems with base58 are 1) it works for integers, less so for arbitrary binary data like crypto keys 2) case-sensitivity ISnOtNIcEtOLoOKaT (in my opinion).

inopinatus · 3 years ago
The specification of crockford32 is at https://www.crockford.com/base32.html

z-base32 has some nice ideas, although I don’t really give a damn how these things look except where that has functional/ergonomic consequences, since none of them have real aesthetic value. The beauty of numbers is in their structural properties, not their representations. If we really cared about how it feels I’d suggest using an S/KEY-style word mapping instead to get some poetry out of it.

stephen · 3 years ago
Neat! Love the "type-safe" prefix; we'd called them "tagged ids" in our ORM that auto-prefixes the otherwise-ints-in-the-db with similar per-entity tags:

https://joist-orm.io/docs/advanced/tagged-ids

We'd used `:` as our delimiter, but kinda regretting not using `_` because of the "double-click to copy/paste" aspect...

In theory it'd be really easy to get Joist to take "uuid columns in the db" and turn them into "typeids in the domain model", but probably not something that could be configured/done via userland atm...that'd be a good idea though.

wongarsu · 3 years ago
Reddit does something similar, but optimized for string length: elements have ids like "t3_15bfi0" where t3_ is a prefix for the type (t3 is a post, t1 a comment, t5 a subreddit, etc) and the remaining is a base36 encoding of the autoincrementing primary key.
stephen · 3 years ago
Nice!

The `t<X>` makes sense; we currently guess a tag name of "FooBarZaz" --> "fbz", but allow the user to override it, so you could hand-assign "t1", "t2", etc. as you added entities to the system.
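
As an illustration of the "FooBarZaz" --> "fbz" guess mentioned above, here is a small sketch that takes the lowercase initials of a CamelCase name. The function `guessTag` is hypothetical and is not taken from Joist, whose actual rule may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// guessTag builds a tag from the uppercase letters of a CamelCase
// entity name, lowercased.
func guessTag(name string) string {
	var b strings.Builder
	for _, r := range name {
		if unicode.IsUpper(r) {
			b.WriteRune(unicode.ToLower(r))
		}
	}
	return b.String()
}

func main() {
	fmt.Println(guessTag("FooBarZaz"))
	fmt.Println(guessTag("Author"))
}
```

Two entities can easily produce the same initials, which is presumably why a user override is useful.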

Abbreviating/base36-ing even the auto-incremented numeric primary key to optimize lengths is neat; we haven't gotten to the size of ids where that is a concern, but it sounds like a luxurious problem to have! :-)

hamburglar · 3 years ago
My company has a typed internal ID system that originally used colons as delimiters but we quickly switched to dots (.) as the delimiter because it’s very annoying to have url-encoded IDs balloon in size because colons need to be %-encoded. Makes your urls ugly and long.
wood_spirit · 3 years ago
UUIDv7 has been taking HN by storm for years now! When is it going to become a proper standard, and when are libraries and databases and all the rest going to natively support it?
vbezhenar · 3 years ago
What kind of support do you expect? I'm pretty sure that the vast majority of software does not care about any particular bits in a UUID, so you can use it today. If some software does care about particular bits, just imitate UUIDv4; those bits could be randomly generated as well. If you need a generation procedure, write it yourself, it's easy.
dolmen · 3 years ago
+1

ID generation is usually private to a company's scope and rarely needs to be "universally unique".

kijeda · 3 years ago
It would appear to be in the final stages of standardization in the IETF: https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122b...
Daegalus · 3 years ago
It's been going through drafts and improvements. It's very close to being standardized, and many libraries are supporting it already, or new offerings are being added. For example I maintain the Dart UUID library, and my latest beta major release has v6, v7 and a custom v8. There is a list of them somewhere, I know I get pinged on every new draft by the authors because I am listed as a library maintainer on one of their pages.
Nelkins · 3 years ago
How much does it change between drafts? Close enough to where I could use it in production?
eezing · 3 years ago
“…can be selected for copy-pasting by double-clicking”

Details matter.

bombela · 3 years ago
I have some complaints about UUIDs. Why not just combine time + random number without the ceremony of UUID versioning? And for when locality doesn't matter, just use a 128-bit random number directly.

And in my experience most people somehow think a UUID must be stored in the human-friendly hex representation, dashes included, wasting so much space in the database, network, and memory.

rjh29 · 3 years ago
Many people had the same idea. For example ULID https://github.com/ulid/spec is more compact and stores the time so it is lexically ordered.
jerf · 3 years ago
While this isn't the worst area I see this in, there does seem to be a tendency in the UUID space to speak as if one use case stands for all and therefore there is a best UUID format.

The reality is that it is just like any other engineering situation. Sit down, write down your requirements, and see what, if anything, solves it.

Reading about the advantages of various formats is very helpful in helping you skip learning about certain things the hard way and use somebody else's experience of learning them the hard way instead. From that point of view I recommend at least glancing through them all. Sortability and time-based locality is one that you may not naturally think about, and if you need it, you will appreciate not learning that the hard way four years into a project after you threw that data away and then realizing you needed it. And some UUID formats actually managed to introduce small security issues into themselves (thinking MAC address leak from UUID v1 here), nice to avoid those too.

If you have a use case where there's an existing solution then, hey, great, go ahead and use it. Maybe if anyone ever needs that but in another language they can pull a library there too.

But if not, don't sweat it. The biggest use of UUIDs I personally have I specified as "just send me a unique string, use a UUID library of your choice if it makes you feel better". I think I've got a unique format per source of data in this system and it's fine. I don't have volume problems, it's tens of thousands of things per day. I don't have any need to sort on the UUID, they're not really the "identifier", they're just a unique token generated for a particular message by the originator of the message so we can detect duplicate arrivals downstream in a heterogeneous system where I can't just defer that task to the queue itself since we have multiple. I don't even need them to be globally unique, I just need them unique within a rather small shard, and in principle I wouldn't even mind if they get repeated after a certain amount of time (though I left the system enforcing across all time anyhow for simplicity). In this particular case, I do indeed generate my own UUIDs for the stuff I'm originating by just grabbing some stuff from /dev/urandom and encoding it base64, with a size selected such that base64 doesn't end the encoding with ==. Even that's just for aesthetics' sake rather than any actual problem it would cause.

stronglikedan · 3 years ago
> combining time + random number

You can't guarantee that this will be globally unique.

ceejayoz · 3 years ago
No identifier can guarantee that. We just get close enough to be acceptable.

Per Wikipedia, the probability of finding a duplicate within 103 trillion version-4 UUIDs is one in a billion.
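
That figure can be sanity-checked with the standard birthday approximation, p ≈ 1 − exp(−n² / (2·2^b)), where a version-4 UUID has b = 122 random bits. The function name `collisionProb` is just for this sketch.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProb approximates the chance of at least one duplicate
// among n values drawn uniformly from 2^bits possibilities.
func collisionProb(n float64, bits int) float64 {
	x := n * n / (2 * math.Exp2(float64(bits)))
	return -math.Expm1(-x) // 1 - exp(-x), accurate for tiny x
}

func main() {
	p := collisionProb(103e12, 122)
	fmt.Printf("%.3g\n", p)
}
```

For 103 trillion UUIDs the result lands at roughly one in a billion, in line with the quoted number.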

so-youre-saying-theres-a-chance.gif

hot_gril · 3 years ago
The only worthwhile UUID standard IMO is v4 (simple random), and I still don't get why it needs dashes. The other ones don't really accomplish anything.
deepsun · 3 years ago
The worst thing about dashes is you cannot easily double-click it whole to copy-paste.