Just migrated all embeddings to this same model a few weeks ago at my company, and it's a game changer. The 32k context window is a 64x increase over our previously used model. Plus, being natively multilingual and producing standard 1024-dimensional vectors made it a seamless transition, even with millions of embeddings across thousands of databases.
Depends on your needs. You certainly don't want 32k-token chunks in a standard RAG pipeline.
My use case is basically a recommendation engine, where I retrieve a list of forum topics similar to the one currently being read. Since it's dynamic user-generated content, topics can vary from 10 to 100k tokens. Ideally I would generate embeddings from an LLM-generated summary, but that would increase inference costs considerably at the scale I'm operating at.
Having a larger context window out of the box meant a simple swap of embedding models greatly increased the quality of recommendations.
I do recommend using https://github.com/huggingface/text-embeddings-inference for fast inference.
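For reference, a minimal client sketch of that setup, assuming a local TEI instance on port 8080 (the /embed endpoint and "inputs" payload follow the project's README; the URL, topics, and k are just illustrative):

```python
# Query a text-embeddings-inference server and rank forum topics by cosine similarity.
import requests
import numpy as np

TEI_URL = "http://localhost:8080/embed"  # assumption: local TEI deployment

def embed(texts: list[str]) -> np.ndarray:
    resp = requests.post(TEI_URL, json={"inputs": texts}, timeout=30)
    resp.raise_for_status()
    return np.array(resp.json())  # shape: (len(texts), dim), e.g. dim = 1024

def top_k_similar(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 5) -> list[int]:
    # Cosine similarity: normalize both sides, then take the dot product.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)[:k].tolist()

# Usage: embed candidate topics once, then rank them against the topic being read.
topics = ["How to tune JVM garbage collection?", "Best GC settings for large heaps", "Favorite hiking trails"]
topic_vecs = embed(topics)
current_vec = embed(["JVM GC pauses on a 64GB heap"])[0]
print(top_k_similar(current_vec, topic_vecs, k=2))
```

In practice you'd store the topic vectors in your database and only embed new or updated topics, rather than re-embedding the corpus per request.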
I’ve generated embeddings for “objects” or whole documents to get similarity scores. Helps with “relevant articles” type features.
I’ve also made embeddings for paragraphs or fixed-size chunks for RAG lookups. Good for semantic search.
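A fixed-size chunker is not much more than slicing on a word or token budget with some overlap; a rough sketch (the chunk size and overlap here are illustrative, and a real pipeline would count model tokens rather than whitespace-split words):

```python
# Fixed-size chunking with overlap for RAG-style lookups.
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks
```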
I don’t understand why you would want embeddings on sentences.
> Chunking Strategies
> Sentence-level chunking works well for most use cases, especially when the document structure is unclear or inconsistent.
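For what it's worth, sentence-level chunking can be as simple as splitting on sentence punctuation and packing a few sentences per chunk; a minimal sketch, where the regex and group size are illustrative rather than taken from any particular guide:

```python
import re

# Naive sentence-level chunking: split on ., ! or ? followed by whitespace,
# then pack a few sentences per chunk so each embedding stays short.
def chunk_sentences(text: str, sentences_per_chunk: int = 3) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```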