Very nice! This took me about 30 minutes to re-implement for Magic: The Gathering cards (with data from mtgjson.com), and then about 40 minutes to create the embeddings. It does rather well at finding similar cards for when you want more than a 4-of, or of course for Commander. That's quite useful for weirder effects where one doesn't have the common options memorized!
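In case anyone wants to replicate it, the shape of what I did is roughly this (a rough sketch, not my exact code; the mtgjson field names and the model are from memory, so double-check them against the AtomicCards.json schema):

    # Sketch: embed card name + type line + rules text from mtgjson's
    # AtomicCards.json and look up nearest neighbours by cosine similarity.
    import json
    import numpy as np
    from sentence_transformers import SentenceTransformer

    with open("AtomicCards.json") as f:
        cards = json.load(f)["data"]  # {card name: [face/printing dicts]}

    names, docs = [], []
    for name, faces in cards.items():
        face = faces[0]
        # "type" and "text" are what matter for "similar effect" search
        docs.append(f"{name} | {face.get('type', '')} | {face.get('text', '')}")
        names.append(name)

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(docs, normalize_embeddings=True)  # unit-length rows

    def similar(card_name, k=10):
        q = emb[names.index(card_name)]
        scores = emb @ q  # cosine similarity, since rows are normalized
        top = np.argsort(-scores)[1 : k + 1]  # skip the card itself
        return [(names[i], round(float(scores[i]), 3)) for i in top]

    print(similar("Lightning Bolt"))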
I was thinking about redoing this with Magic cards too (I already have quite a lot of code for preprocessing that data), so it's good to know it works there too! :)
Because then you couldn't use a pretrained LLM to give you the embeddings.
If you added these numerics as extra dimensions, you would need to train a new model that somehow learns the meaning of those extra dimensions based on some measure.
The embedding model outputs a vector, which is a list of floats. If we wrap the embedding model with a function that adds a few extra dimensions (one for each of these numeric variables, perhaps compressed into the range zero to one), then we would end up with vectors that have a few extra dimensions (e.g. 800 dimensions instead of 784). Vector similarity should still just work, no?
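Something like this toy sketch is what I have in mind (the numeric fields and their min/max values are placeholders):

    # Toy sketch: append min-max scaled numeric fields to the text embedding,
    # so they become extra dimensions in the similarity computation.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed_with_numerics(text, numerics, mins, maxs):
        # numerics/mins/maxs: the raw numeric fields and their dataset-wide
        # min/max, used to squash each value into the range [0, 1]
        vec = model.encode(text, normalize_embeddings=True)
        scaled = (np.asarray(numerics, dtype=float) - mins) / (maxs - mins)
        return np.concatenate([vec, scaled])  # model dims + len(numerics)

    mins, maxs = np.array([0.0, 0.0, 0.0]), np.array([16.0, 15.0, 15.0])
    a = embed_with_numerics("Flying, vigilance", [4, 3, 4], mins, maxs)
    b = embed_with_numerics("Flying, lifelink", [3, 2, 3], mins, maxs)
    print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity

Whether a handful of extra dimensions carries much weight next to the hundreds of text dimensions is a separate question, but mechanically the cosine still computes fine.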
I would be interested in how this might work with just looking for common words between the text fields of the JSON file weighted by e.g. TF-IDF or BM25.
I wonder if you might get similar results. Also would be interested in the comparative computational resources it takes. Encoding takes a lot of resources, but I imagine look-up would be a lot less resource-intensive (i.e. time and/or memory).
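For reference, the baseline I'm imagining is something like this (sketch only; in practice the documents would be the concatenated text fields from the JSON):

    # Sketch of a lexical baseline: TF-IDF over the text fields, cosine
    # similarity for lookup. No GPU needed, and the "index" is a sparse matrix.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import linear_kernel

    docs = [
        "Destroy target creature.",
        "Counter target spell.",
        "Destroy target artifact or enchantment.",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs)  # the cheap "encoding" step

    def lookup(query, k=2):
        q = vectorizer.transform([query])
        scores = linear_kernel(q, tfidf).ravel()  # cosine, since TF-IDF rows are L2-normalized
        return scores.argsort()[::-1][:k]

    print(lookup("destroy a creature"))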
You almost certainly don't want to use MiniLM-L6-v2.
MiniLM-L6-v2 is for symmetric search, i.e. documents similar to the query text.
MiniLM-L6-v3 is for asymmetric search, i.e. documents that would have answers to the query text.
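The code change between the two is a one-line model swap; the difference is what the checkpoints were trained on. Roughly (model names from memory, so verify them against the sentence-transformers pretrained-models docs):

    # Same API either way; only the checkpoint changes.
    from sentence_transformers import SentenceTransformer, util

    sym = SentenceTransformer("all-MiniLM-L6-v2")        # symmetric: text similar to text
    asym = SentenceTransformer("msmarco-MiniLM-L-6-v3")  # asymmetric: short query -> answer passage

    docs = ["The capital of France is Paris.", "Paris Hilton is a celebrity."]
    query = "What is the capital of France?"

    doc_sims = util.cos_sim(sym.encode(docs), sym.encode(docs))  # doc-to-doc similarity
    hits = util.semantic_search(asym.encode(query), asym.encode(docs), top_k=1)
    print(hits)  # [[{'corpus_id': ..., 'score': ...}]]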
This is also an amazing lesson in...something: sentence-transformers spells this out in their docs, over and over, but never this directly: there's a doc on how to build a proper search pipeline, and a doc on the correct model for each type of search, but not one that says "hey, use this one".
And yet, I'd wager there's $100+M invested in vector DB startups who would be surprised to hear it.
It's a note that embeddings R&D is orthogonal to whatever happens with generative AI even though both involve LLMs.
I'm not saying that generative AI will crash, but if it's indeed at the top of the S-curve there could be issues, not to mention the cost and legal problems that are only increasing.
While there is no real definition of "LLM", I'm not sure I would say both involve LLMs. There is a trend towards using the hidden state of an LLM as an embedding, but this is relatively recent, and overkill for most use cases. Plenty of embedding models are not large, and it's fairly trivial to train a small domain-specific embedding model that has incredible utility.
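For context on "fairly trivial": with sentence-transformers' classic fit API it's on the order of a dozen lines, given labeled pairs from your domain (the pairs and scores below are made-up placeholders; in practice you'd want thousands of them):

    # Sketch: fine-tune a small embedding model on domain-specific similarity pairs.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("all-MiniLM-L6-v2")  # ~22M params, hardly "large"

    train_examples = [
        InputExample(texts=["how do I reset my password", "password reset steps"], label=0.9),
        InputExample(texts=["how do I reset my password", "shipping times to Canada"], label=0.05),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
    train_loss = losses.CosineSimilarityLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
    model.save("my-domain-embedder")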
I think the author is implying that even if you can't extract real-world value from generative AI, the current AI hype has pushed embeddings to a point where they can provide real-world value to a lot of projects (like the semantic search demonstrated in the article, where no generative AI was used).
Is it because:
A) It's easier to just embed everything, or
B) Treating those numeric fields as separate dimensions would mean their interactions wouldn't be considered (without PCA), or
C) Something else?
> It’s super effective!
> minimaxir obtains HN13
Can you compare distances just like that in a 2D space post-UMAP?
I was under the impression that UMAP makes metrics meaningless.