Ah, didn’t realize it was every language. Yes, I’m using a lightweight open model - but my use case also doesn’t require anything heavyweight. Wikipedia articles are very feature-dense and easy to distinguish from one another, so it doesn’t take a massive feature vector to create meaningful embeddings.
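To give a sense of what I mean, here's a minimal sketch of that kind of pipeline, assuming sentence-transformers and the small "all-MiniLM-L6-v2" model - both are just examples I'm picking for illustration, not necessarily what anyone in this thread is actually running:

```python
# Minimal sketch: embed a few Wikipedia-style summaries with a small open model.
# Assumes `pip install sentence-transformers`; "all-MiniLM-L6-v2" is an example of a
# lightweight model with a modest 384-dim embedding size.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

articles = [
    "Alan Turing was an English mathematician and computer scientist...",
    "The mitochondrion is an organelle found in most eukaryotic cells...",
    "The Peace of Westphalia ended the Thirty Years' War in 1648...",
]

# One vector per article; normalizing makes dot product equal cosine similarity.
embeddings = model.encode(articles, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384)

# Pairwise cosine similarities: off-diagonal values stay low because the
# topics are so distinct, even with a small embedding dimension.
print(np.round(embeddings @ embeddings.T, 2))
```

The point being that when the documents are this topically distinct, even a small embedding dimension separates them cleanly.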
Where does the performance difference come from, and on what kind of CPU & GPU? I didn't even know llama.cpp had a 32-bit option. For now I'm pretty skeptical that it's a fair comparison.