I'm curious about this as well. There are potentially many different versions of embedding models used in production and correlating different versions together could be very important.
Thank you! Comparing this and the link the other commenter posted, what handles the actual search querying? Does instructor-xl include an LLM in addition to the embeddings? The other commenter's repo uses Pinecone for the embeddings and OpenAI for the LLM.
My apologies if I am completely mangling the vocabulary here - I have an, at best, rudimentary understanding of this stuff that I am trying to hack my education on.
Edit: If you're at the SF meetup tomorrow, I'd happily buy you a beverage in return for this explanation :)
I have a system that downloads all my data from Google, Facebook, Twitter, and others. Geo data is fun to look at, but now the text and images have some more meaning to glean. I’m thinking about going back to it. Not sure how to package a bunch of Python stuff in an app, though.
I've done some quick-and-dirty testing with OpenAI's embedding API + Zilliz Cloud. The 1st gen embeddings leave something to be desired (https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...), but the 2nd gen embeddings are actually fairly performant relative to many open source models with MLM loss.
I'll have to dig out the notebook that I created for this, but I'll try to post it here once I find it.
Please do and thanks in advance for any insights you can provide -- it would be great to understand any benchmarking improvement with ada-002 from your previous findings, and whether you tested the specific OpenAI text-search-*-{query,doc} models as a comparison for large document search.
Very interested in this - I've been using embeddings / semantic search for information retrieval from PDFs, using ada-002, and have been impressed by the results in testing.
The reasons the article listed, namely a) lock-in and b) cost, have given me pause with embedding our whole corpus of data. I'd much rather use an open model but don't have much experience in evaluating these embedding models and search performance - still very new to me.
Like what you did with ada-002 vs Instruct XL, have there been any papers or prior work evaluating the different embedding models?
It’s fine to use their embeddings for a proof of concept, but since you don’t own it, you probably shouldn’t rely on it because it could go away at any time.
For example, mapping embeddings of Llama to GPT-3?
That way you can see how similar the models “understand the world”.
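One rough way to try this without learning an explicit mapping is to compare the similarity structure of the two spaces: embed the same texts with both models, compute all pairwise cosine similarities in each space, and correlate the two lists. This works even when the models have different dimensions. A minimal sketch, assuming you already have the vectors; the toy vectors and the names `space_agreement`, `model_a`, and `model_b` are made up for illustration and do not come from any real model:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_similarities(vectors):
    """Cosine similarity for every unordered pair of texts, in a fixed order."""
    return [cosine(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def space_agreement(emb_a, emb_b):
    """Correlate the pairwise-similarity structure of two embedding spaces.

    emb_a[i] and emb_b[i] must be embeddings of the same i-th text.
    """
    return pearson(pairwise_similarities(emb_a), pairwise_similarities(emb_b))

# Hypothetical embeddings of the same three texts from two different models.
# Note the differing dimensions (2 vs 3): only the similarities are compared.
model_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
model_b = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [1.0, 0.0, 0.2]]

print(space_agreement(model_a, model_b))
```

A value near 1 means both models consider the same pairs of texts close or far apart; it says nothing about whether individual dimensions line up.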
>download all my tweets (about 20k) and build a semantic searcher on top ?
How can I utilize 3rd party embeddings with OpenAI's LLM API? Am I correct to understand from this article that this is possible?
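As far as I understand it, this works because the two pieces are decoupled: the embedding model is only used to pick which passages are relevant, and the LLM only ever sees plain text in its prompt. A minimal sketch of that flow, where `toy_embed` is a bag-of-words stand-in for whatever embedding model you choose (not a real model), `top_k` and `build_prompt` are hypothetical helper names, and the actual LLM API call is left out:

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for any embedding model: a sparse bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_k(query, docs, k=2, embed=toy_embed):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, passages):
    """Paste the retrieved passages into a text prompt for any LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

docs = [
    "the invoice is due on friday",
    "embeddings map text to vectors",
    "the cat sat on the mat",
]
question = "when is the invoice due"
prompt = build_prompt(question, top_k(question, docs, k=1))
print(prompt)  # this string is what you would send to the LLM of your choice
```

In a real setup the documents would be embedded once and stored in a vector database instead of being re-embedded on every query.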
https://github.com/mayooear/gpt4-pdf-chatbot-langchain for example
No code needed :)
Generally MiniLM is a good baseline. For faster models you want this library:
https://github.com/oborchers/Fast_Sentence_Embeddings
For higher quality ones, just take the bigger/slower models in the SentenceTransformers library
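If you want to compare candidates on your own data, a small hand-labeled set of queries and a recall@k number is often enough to start with. A rough sketch, assuming each query has exactly one correct document; `word_embed` is a toy stand-in that you would swap for the model under test, and the corpus, queries, and `recall_at_k` helper are made up for illustration:

```python
import math
from collections import Counter

def word_embed(text):
    """Toy stand-in for the model under test: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recall_at_k(embed, queries, corpus, gold, k=1):
    """Fraction of queries whose correct document appears in the top k results.

    gold[i] is the index in corpus of the correct document for queries[i].
    """
    doc_vecs = [embed(d) for d in corpus]
    hits = 0
    for query, answer in zip(queries, gold):
        q = embed(query)
        ranked = sorted(range(len(corpus)),
                        key=lambda i: cosine(q, doc_vecs[i]), reverse=True)
        hits += answer in ranked[:k]
    return hits / len(queries)

# Hypothetical evaluation set: three documents and one query per document.
corpus = ["resetting your password",
          "shipping and delivery times",
          "refund policy for orders"]
queries = ["how do i reset my password",
           "shipping times for my order",
           "can i get a refund"]
gold = [0, 1, 2]

print(recall_at_k(word_embed, queries, corpus, gold, k=1))
```

Running the same function with each candidate's `embed` gives directly comparable numbers, as long as the queries and labels stay fixed.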
(Although there are a lot more advantages to just having Office 2021, like the flat fee)