Very nice! This took me about 30 minutes to re-implement for Magic: The Gathering cards (with data from mtgjson.com), and then about 40 minutes to create the embeddings. It does rather well at finding similar cards for when you want more than a 4-of, or of course for Commander. That's quite useful for weirder effects where one doesn't have the common options memorized!
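In case anyone wants to replicate it, the shape of what I did is roughly this (a rough sketch, not my exact code; the mtgjson field names and the model are from memory, so double-check them against the AtomicCards.json schema):

    # Sketch: embed card name + type line + rules text from mtgjson's
    # AtomicCards.json and look up nearest neighbours by cosine similarity.
    import json
    import numpy as np
    from sentence_transformers import SentenceTransformer

    with open("AtomicCards.json") as f:
        cards = json.load(f)["data"]  # {card name: [face/printing dicts]}

    names, docs = [], []
    for name, faces in cards.items():
        face = faces[0]
        # "type" and "text" are what matter for "similar effect" search
        docs.append(f"{name} | {face.get('type', '')} | {face.get('text', '')}")
        names.append(name)

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(docs, normalize_embeddings=True)  # unit-length rows

    def similar(card_name, k=10):
        q = emb[names.index(card_name)]
        scores = emb @ q  # cosine similarity, since rows are normalized
        top = np.argsort(-scores)[1 : k + 1]  # skip the card itself
        return [(names[i], round(float(scores[i]), 3)) for i in top]

    print(similar("Lightning Bolt"))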
I was thinking about redoing this with Magic cards too (I already have quite a lot of code for preprocessing that data), so it's good to know it works there too! :)
Because then you couldn't use a pretrained LLM to give you the embeddings.
If you added these numerics as extra dimensions, you would need to train a new model that somehow learns the meaning of those extra dimensions based on some measure.
The embedding model outputs a vector, which is a list of floats. If we wrap the embedding model with a function that adds a few extra dimensions (one for each of these numeric variables, perhaps compressed into the range zero to one), then we would end up with vectors that have a few extra dimensions (e.g. 800 dimensions instead of 784). Vector similarity should still just work, no?
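Something like this toy sketch is what I have in mind (the numeric fields and their min/max values are placeholders):

    # Toy sketch: append min-max scaled numeric fields to the text embedding,
    # so they become extra dimensions in the similarity computation.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed_with_numerics(text, numerics, mins, maxs):
        # numerics/mins/maxs: the raw numeric fields and their dataset-wide
        # min/max, used to squash each value into the range [0, 1]
        vec = model.encode(text, normalize_embeddings=True)
        scaled = (np.asarray(numerics, dtype=float) - mins) / (maxs - mins)
        return np.concatenate([vec, scaled])  # model dims + len(numerics)

    mins, maxs = np.array([0.0, 0.0, 0.0]), np.array([16.0, 15.0, 15.0])
    a = embed_with_numerics("Flying, vigilance", [4, 3, 4], mins, maxs)
    b = embed_with_numerics("Flying, lifelink", [3, 2, 3], mins, maxs)
    print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity

Whether a handful of extra dimensions carries much weight next to the hundreds of text dimensions is a separate question, but mechanically the cosine still computes fine.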
I would be interested in how this might work with just looking for common words between the text fields of the JSON file weighted by e.g. TF-IDF or BM25.
I wonder if you might get similar results. Also would be interested in the comparative computational resources it takes. Encoding takes a lot of resources, but I imagine look-up would be a lot less resource-intensive (i.e. time and/or memory).
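For reference, the baseline I'm imagining is something like this (sketch only; in practice the documents would be the concatenated text fields from the JSON):

    # Sketch of a lexical baseline: TF-IDF over the text fields, cosine
    # similarity for lookup. No GPU needed, and the "index" is a sparse matrix.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import linear_kernel

    docs = [
        "Destroy target creature.",
        "Counter target spell.",
        "Destroy target artifact or enchantment.",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs)  # the cheap "encoding" step

    def lookup(query, k=2):
        q = vectorizer.transform([query])
        scores = linear_kernel(q, tfidf).ravel()  # cosine, since TF-IDF rows are L2-normalized
        return scores.argsort()[::-1][:k]

    print(lookup("destroy a creature"))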
You almost certainly don't want to use MiniLM-L6-v2.
MiniLM-L6-v2 is for symmetric search, i.e. documents similar to the query text.
MiniLM-L6-v3 is for asymmetric search, i.e. documents that would have answers to the query text.
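The code change between the two is a one-line model swap; the difference is what the checkpoints were trained on. Roughly (model names from memory, so verify them against the sentence-transformers pretrained-models docs):

    # Same API either way; only the checkpoint changes.
    from sentence_transformers import SentenceTransformer, util

    sym = SentenceTransformer("all-MiniLM-L6-v2")        # symmetric: text similar to text
    asym = SentenceTransformer("msmarco-MiniLM-L-6-v3")  # asymmetric: short query -> answer passage

    docs = ["The capital of France is Paris.", "Paris Hilton is a celebrity."]
    query = "What is the capital of France?"

    doc_sims = util.cos_sim(sym.encode(docs), sym.encode(docs))  # doc-to-doc similarity
    hits = util.semantic_search(asym.encode(query), asym.encode(docs), top_k=1)
    print(hits)  # [[{'corpus_id': ..., 'score': ...}]]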
This is also an amazing lesson in...something: sentence-transformers spells this out in their docs, over and over, but never this directly: there's a doc on how to build a proper search pipeline, and a doc on the correct model for each type of search, but not one that says "hey, use this one".
And yet, I'd wager there's $100+M invested in vector DB startups who would be surprised to hear it.
It's a note that embeddings R&D is orthogonal to whatever happens with generative AI even though both involve LLMs.
I'm not saying that generative AI will crash, but if it's indeed at the top of the S-curve there could be issues, not to mention the cost and legal problems that are only increasing.
While there is no real definition of "LLM", I'm not sure I would say both involve LLMs. There is a trend towards using the hidden state of an LLM as an embedding, but this is relatively recent, and overkill for most use cases. Plenty of embedding models are not large, and it's fairly trivial to train a small domain-specific embedding model that has incredible utility.
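For context on "fairly trivial": with sentence-transformers' classic fit API it's on the order of a dozen lines, given labeled pairs from your domain (the pairs and scores below are made-up placeholders; in practice you'd want thousands of them):

    # Sketch: fine-tune a small embedding model on domain-specific similarity pairs.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("all-MiniLM-L6-v2")  # ~22M params, hardly "large"

    train_examples = [
        InputExample(texts=["how do I reset my password", "password reset steps"], label=0.9),
        InputExample(texts=["how do I reset my password", "shipping times to Canada"], label=0.05),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
    train_loss = losses.CosineSimilarityLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
    model.save("my-domain-embedder")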
I think the author is implying that even if you can't extract real-world value from generative AI, the current AI hype has pushed embeddings to a point where they can provide real-world value to a lot of projects (like the semantic search demonstrated in the article, where no generative AI was used).
Is it because:
A) It's easier to just embed everything, or
B) Treating those numeric fields as separate dimensions would mean their interactions wouldn't be considered (without PCA), or
C) Something else?
> It’s super effective!
> minimaxir obtains HN13
Can you compare distances just like that in a 2D space post-UMAP?
I was under the impression that UMAP makes metrics meaningless.