Much improved new version. Search for words similar to the query. For example, "death" will find "death", "dying", "dead", "killing"... Incredibly useful for exploring large text datasets where exact matches are too restrictive.
You can compute the similarity much faster by using BLAS. Good BLAS libraries have SIMD-optimized implementations. Or if you do multiple tokens at once, you can do a matrix-vector multiplication (sgemv), which will be even faster in many implementations. Alternatively, there is probably also a SIMD implementation in Go using assembly (it has been 7 years since I looked at anything in the Go ecosystem).
You could also normalize the vectors while loading. Then during runtime the cosine similarity is just the dot product of the vectors (whether it pays off depends on the size of your embedding matrix and the size of the haystack that you are going to search).
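For what it's worth, here is a minimal Go sketch of the normalize-on-load idea (the function names are illustrative, not the project's actual code): once every embedding is scaled to unit length, the query-time cosine similarity reduces to a plain dot product, and that dot-product loop is exactly the part that BLAS sdot (or sgemv, for a batch of tokens) could replace.

    package w2v

    import "math"

    // normalize scales v to unit length in place at load time, so that
    // query-time cosine similarity reduces to a plain dot product.
    func normalize(v []float32) {
        var sum float64
        for _, x := range v {
            sum += float64(x) * float64(x)
        }
        n := float32(math.Sqrt(sum))
        if n == 0 {
            return
        }
        for i := range v {
            v[i] /= n
        }
    }

    // dot is all that remains per token once vectors are normalized;
    // this loop is what a BLAS sdot call (or sgemv across many tokens)
    // would batch and vectorize.
    func dot(a, b []float32) float32 {
        var s float32
        for i := range a {
            s += a[i] * b[i]
        }
        return s
    }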
The SIMD situation in Go is still rather abysmal; it's likely easier to just use FFI (though FFI calls are slow, so I guess you're stuck with ugly Go asm if you are using short vectors). As usual, there's a particular high-level language that does it very well, and it even has a standard vector-similarity function nowadays...
That's totally clever and sounds really useful. It's one of those ideas where you go "Why didn't I think of that?" when stumbling over the materials, word2vec in this case.
I take a Show HN to be a solicitation for feedback.
It took me a few tries to get this to run. I tried passing -model_path, but it complained that it couldn't find config.json. I then made a config.json in the pwd (which contained the executable), but it couldn't expand "~". I then tried searching *.txt, but it couldn't find anything unless I specified just one file.
In terms of offering an experience similar to grep, it's a lot slower. It's not quite as slow with the SLIM file, but I wonder if there's some low-hanging fruit for optimization. Perhaps caching similarity for input tokens?
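On the caching idea: since a haystack repeats the same tokens over and over, memoizing the per-token score for the current query should cut out most of the repeated work. A hypothetical sketch, not the tool's actual structure:

    package w2v

    // simCache memoizes per-token similarity for the current query, so
    // each distinct token in the haystack is scored only once. Names
    // here are hypothetical; score stands in for the expensive vector
    // lookup plus cosine computation.
    type simCache struct {
        score func(string) float32
        seen  map[string]float32
    }

    func newSimCache(score func(string) float32) *simCache {
        return &simCache{score: score, seen: make(map[string]float32)}
    }

    func (c *simCache) get(token string) float32 {
        if s, ok := c.seen[token]; ok {
            return s // already scored this token for this query
        }
        s := c.score(token)
        c.seen[token] = s
        return s
    }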
The model of one word to one vector breaks down really quickly once you introduce the context and complexity of human language. That's why we moved to contextual embeddings, but even those have issues.
Curious: would it handle negation of trained keywords, e.g. "not urgent"?
For a lot of cases, word2vec/glove still work plenty well. It also runs much faster and lighter when doing development -- the FSE library [1] does 0.5M sentences/sec on CPU, whereas the fastest sentence_transformers [2] do something like 20k sentences/sec on a V100 (!!).
For the drawbacks:
Word embeddings are only good at similarity search style queries - stuff like paraphrasing.
Negation they'll necessarily struggle with. Since word embeddings are generally summed or averaged into a sentence embedding, a negation won't shift the sentence vector around the way it would in an LM embedding.
Also things like homonyms are issues, but this is massively overblown as a reason to use LM embeddings (at least for Latin/Germanic languages).
Most people use LM embeddings because they've been told it's the best thing by other people, rather than benchmarking accuracy and performance for their use case.
1. https://github.com/oborchers/Fast_Sentence_Embeddings
2. https://www.sbert.net/docs/sentence_transformer/pretrained_m...
This implementation does not, because the query has to be a single word. One way to extend it to phrases is to average the vectors of each word in the phrase. Another way is to have a word2vec model that embeds phrases (I think the large Google News model does have phrases, with spaces converted to underscores). But going from words to phrases opens up a whole new can of worms: how large a phrase to consider from the input stream, etc. Plus, I don't think averaging the words in a phrase is the same as learning an embedding for the phrase. Sentence embedding models are necessary for that, but they are far too slow for this use case, as pointed out by others.
To summarize, this is a simple implementation that works for the simplest use case - semantic matching of words.
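For reference, the averaging approach mentioned above is only a few lines. A sketch with illustrative names; lookup stands in for whatever vocabulary lookup the model provides:

    package w2v

    // phraseVector embeds a phrase by averaging its word vectors, the
    // first extension mentioned above. Out-of-vocabulary words are
    // simply skipped; nil is returned if no word was in vocabulary.
    func phraseVector(words []string, lookup func(string) ([]float32, bool)) []float32 {
        var sum []float32
        n := 0
        for _, w := range words {
            v, ok := lookup(w)
            if !ok {
                continue
            }
            if sum == nil {
                sum = make([]float32, len(v))
            }
            for i, x := range v {
                sum[i] += x
            }
            n++
        }
        if n == 0 {
            return nil
        }
        for i := range sum {
            sum[i] /= float32(n)
        }
        return sum
    }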
I wonder if it would be possible to easily add support for multiple CPUs? It seems to be taking at most 150% CPU, so on my workstation it could be (assuming high parallelism) 10 times as fast.
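If the bottleneck really is per-line scoring, a worker pool is the natural shape for this in Go. A minimal sketch with hypothetical names; a real implementation would also need to preserve line order and carry file/line metadata:

    package w2v

    import (
        "runtime"
        "sync"
    )

    // scoreLines fans per-line similarity scoring out over all cores.
    // Run it in its own goroutine and consume out concurrently, or it
    // will block once the channel buffer fills.
    func scoreLines(lines <-chan string, score func(string) float32, out chan<- float32) {
        var wg sync.WaitGroup
        for i := 0; i < runtime.NumCPU(); i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for line := range lines {
                    out <- score(line) // result order is not preserved here
                }
            }()
        }
        wg.Wait()
        close(out)
    }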
Alas the word2vec repository has reached its quota:
    fetch: Fetching reference refs/heads/master
    batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
    error: failed to fetch some objects from 'https://github.com/mmihaltz/word2vec-GoogleNews-vectors.git/info/lfs'
This would be really useful if it could take a descriptive phrase or a compound phrase (like SQL 'select X and Y and Z') and match against the semantic cluster(s) that the query forms. IMO that's the greatest failing of today's search engines -- they're all one-hit wonders.
https://github.com/arunsupe/semantic-grep/blob/b7dcc82a7cbab...
You can read the vector all at once. See e.g.:
https://github.com/danieldk/go2vec/blob/ee0e8720a8f518315f35...
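Concretely, the suggestion amounts to something like the sketch below (assuming the usual little-endian float32 layout of word2vec binary files); binary.Read fills the whole slice in one call instead of looping one value at a time:

    package w2v

    import (
        "encoding/binary"
        "io"
    )

    // readVector reads a whole embedding in a single call rather than
    // one float at a time. Assumes little-endian float32 values, the
    // usual layout of word2vec .bin files.
    func readVector(r io.Reader, dims int) ([]float32, error) {
        vec := make([]float32, dims)
        if err := binary.Read(r, binary.LittleEndian, vec); err != nil {
            return nil, err
        }
        return vec, nil
    }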
Well, Apple did experiment with embeddings for next word prediction on iOS in 2018 (apparently, they also used it for many other features)! https://machinelearning.apple.com/research/can-global-semant... / https://archive.is/1oCwr
The configuration thing is unclear to me. I think that "current directory" means "same directory as the binary", but it could mean pwd.
Neither of those is good: configuration doesn't belong where the binaries go, and it's obviously wrong to look for configs in the working directory.
I suggest checking $XDG_CONFIG_HOME, and defaulting to `~/.config/sgrep/config.toml`.
That extension is not a typo, btw. JSON is unpleasant to edit for configuration purposes, TOML is not.
Or you could use an environment variable directly; if the only thing that needs configuring is the model's location, that would be fine as well.
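Putting those suggestions together, the lookup order could be sketched like this (SGREP_CONFIG and the sgrep directory name are the commenter's suggestion, not the tool's current behavior):

    package w2v

    import (
        "os"
        "path/filepath"
    )

    // configPath resolves the config file as suggested above: an
    // explicit env var wins, then $XDG_CONFIG_HOME, then the
    // ~/.config fallback.
    func configPath() string {
        if p := os.Getenv("SGREP_CONFIG"); p != "" {
            return p
        }
        if xdg := os.Getenv("XDG_CONFIG_HOME"); xdg != "" {
            return filepath.Join(xdg, "sgrep", "config.toml")
        }
        home, err := os.UserHomeDir()
        if err != nil {
            return "config.toml" // last resort: working directory
        }
        return filepath.Join(home, ".config", "sgrep", "config.toml")
    }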
If that were the on-ramp, I'd be giving feedback on the program instead. I do think it's a clever idea and I'd like to try it out.
One workaround might be to (roughly sketched below):
- check whether the search query contains more than one word
- spin up a more modestly sized LLM, such as Mistral 7B
- ask the LLM to condense the query / find a single-word synonym for it
- send that to sgrep
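A rough sketch of that pipeline, assuming a local `ollama` binary with a Mistral model pulled and an `sgrep` executable on $PATH (the prompt and the tool names are illustrative):

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "strings"
    )

    func main() {
        query := strings.Join(os.Args[1:], " ")
        // Multi-word query: ask a small local LLM for a one-word stand-in.
        if strings.Contains(strings.TrimSpace(query), " ") {
            prompt := "Answer with one word that best summarizes: " + query
            if out, err := exec.Command("ollama", "run", "mistral", prompt).Output(); err == nil {
                query = strings.TrimSpace(string(out))
            }
        }
        // Hand the (possibly condensed) query to sgrep as usual.
        cmd := exec.Command("sgrep", query)
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }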
Something like SBERT addresses this.
> Alas the word2vec repository has reached its quota:
So here are other sources I found for it: https://stackoverflow.com/a/43423646
I also found https://huggingface.co/fse/word2vec-google-news-300/tree/mai... but I'm unsure if that's the correct format for this tool. The first source, from Google Drive, seems to work, and there's little chance of it being malicious.
Haven't tried the Hugging Face model, but it looks very different. Unlikely to work.
Like grep but for natural language questions. Based on Mistral LLMs.