I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?
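For anyone unfamiliar with the first trick: speculative decoding only saves work when a small draft model can guess the big model's tokens, which is why it tops out well below these speeds. A minimal greedy sketch of the propose-and-verify loop (the model call signatures are my assumption, and real implementations accept/reject by sampling against both distributions rather than exact match):

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, prefix, k=4):
    # One propose-and-verify round. draft_model / target_model are assumed
    # to be callables mapping token ids (1, seq) -> logits (1, seq, vocab).
    p = prefix.size(1)

    # 1) The cheap draft model proposes k tokens autoregressively.
    draft = prefix
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)

    # 2) The big target model scores all k proposals in ONE forward pass.
    #    Logits at position i predict token i+1, so positions p-1..p+k-1 give
    #    the target's choice for each proposed slot plus one bonus token.
    target_preds = target_model(draft)[:, p - 1:].argmax(-1)  # (1, k+1)

    # 3) Keep the longest run of proposals the target agrees with, then emit
    #    the target's own token at the first disagreement (or the bonus
    #    token if everything was accepted).
    proposed = draft[:, p:]                                   # (1, k)
    accepted = 0
    while accepted < k and proposed[0, accepted] == target_preds[0, accepted]:
        accepted += 1
    return torch.cat(
        [prefix, proposed[:, :accepted], target_preds[:, accepted:accepted + 1]],
        dim=-1,
    )
```

The payoff is bounded by how often the draft model is right, so it buys a small constant factor at best, nothing like the order-of-magnitude speedups in question.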
Groq has paying customers below the enterprise level and actually makes all of its models broadly available to everyone, unlike Cerebras, which is very selective, so they have that going for them. But in terms of sheer speed, and on the largest models, Groq doesn't really compare.
Since I see DuckDB mentioned, folks wanting something serverless may also be interested in LanceDB, which is written in Rust with most features built out for Python.
https://github.com/lancedb/lancedb
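For anyone curious what the embedded usage looks like, a minimal sketch based on LanceDB's documented Python API (the path, table name, and toy vectors are made up):

```python
import lancedb

# "Serverless": the database is just a directory on local disk (or object storage).
db = lancedb.connect("./lancedb_demo")

# Create a table straight from Python dicts; the vector column is inferred.
table = db.create_table(
    "docs",
    data=[
        {"vector": [0.1, 0.2, 0.3], "text": "hello world"},
        {"vector": [0.9, 0.8, 0.7], "text": "goodbye world"},
    ],
)

# Nearest-neighbor search, no server process required.
hits = table.search([0.1, 0.2, 0.25]).limit(1).to_list()
print(hits[0]["text"])  # -> "hello world"
```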
Side note: I wrote a proof of concept where embedding generation is handled inside PostgreSQL itself, independent of the index.
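Not the parent's actual PoC, just one way that idea can look: a PL/Python function that computes the vector inside Postgres, so generation is decoupled from whatever index (if any) sits on the column. The function name, model, and connection string are all illustrative, and it assumes the plpython3u and pgvector extensions plus sentence-transformers installed on the server:

```python
import psycopg  # psycopg 3; the connection string below is a placeholder

EMBED_FN = """
CREATE OR REPLACE FUNCTION embed(doc text) RETURNS vector AS $$
    # SD is PL/Python's per-session cache, so the model loads once per backend.
    if "model" not in SD:
        from sentence_transformers import SentenceTransformer
        SD["model"] = SentenceTransformer("all-MiniLM-L6-v2")
    # A Python list stringifies to '[...]', which pgvector's input parser accepts.
    return SD["model"].encode(doc).tolist()
$$ LANGUAGE plpython3u;
"""

with psycopg.connect("dbname=demo") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")      # pgvector
    conn.execute("CREATE EXTENSION IF NOT EXISTS plpython3u")
    conn.execute(EMBED_FN)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs "
        "(id serial PRIMARY KEY, body text, embedding vector(384))"
    )
    # Embedding generation happens server-side, independent of any index:
    conn.execute("INSERT INTO docs (body, embedding) VALUES (%s, embed(%s))",
                 ("hello world", "hello world"))
```

The nice property is that you can add, rebuild, or drop an HNSW/IVF index on the embedding column later without touching how the vectors get produced.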