Main differences:
* *Cost-efficiency:* USEARCH / FAISS / HNSW keep most of the index in RAM; at the billion scale that often means hundreds of GB. In CHEESE, both build and search stream from disk. For the 5.5 B-compound Enamine set the footprint is ~1.7 TB of NVMe plus ~4 GB of RAM (only the centroids), so it can run on a laptop and still scale to tens of billions of vectors. This is also a huge difference versus commercial vector-DB providers (Pinecone, Milvus, ...), who would bill you many thousands of USD per month for such a workload because of the RAM-heavy instances.
* *Vector type:* The USEARCH demo uses binary fingerprints with Tanimoto distance. I use 256-D float embeddings trained to approximate 3-D shape and electrostatic overlap, searched with Euclidean distance.
* *Latency vs. accuracy:* BigANN-style work optimises for QPS and millisecond latency. Chemists usually submit queries one by one, so they don’t mind 1–6 s if the top hits are chemically meaningful. I pull entire clusters from disk and scan them exactly to keep recall high.
So the trade-off is a few seconds slower, but far cheaper hardware and results optimized for accuracy.
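To make the "centroids in RAM, clusters streamed from disk, exact scan" idea concrete, here is a toy sketch of that search pattern (IVF-style). This is not CHEESE's actual code: the k-means, file layout, cluster count, and `nprobe` parameter are all my own stand-ins for illustration.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
dim, n_clusters, n_vecs = 8, 4, 1000
vectors = rng.normal(size=(n_vecs, dim)).astype(np.float32)

# Tiny k-means to partition the vectors (a real system trains this offline).
centroids = vectors[rng.choice(n_vecs, n_clusters, replace=False)]
for _ in range(5):
    assign = np.linalg.norm(vectors[:, None] - centroids[None], axis=2).argmin(1)
    for c in range(n_clusters):
        if np.any(assign == c):
            centroids[c] = vectors[assign == c].mean(0)
assign = np.linalg.norm(vectors[:, None] - centroids[None], axis=2).argmin(1)

# "Disk" layout: one file per cluster; only the centroids stay in RAM.
outdir = tempfile.mkdtemp()
ids_by_cluster = [np.flatnonzero(assign == c) for c in range(n_clusters)]
for c, ids in enumerate(ids_by_cluster):
    np.save(os.path.join(outdir, f"cluster{c}.npy"), vectors[ids])

def search(query, k=5, nprobe=2):
    # Step 1 (RAM only): rank centroids by Euclidean distance to the query.
    nearest = np.linalg.norm(centroids - query, axis=1).argsort()[:nprobe]
    all_ids, all_d = [], []
    for c in nearest:
        # Step 2: stream that cluster back from disk and scan it exactly.
        block = np.load(os.path.join(outdir, f"cluster{c}.npy"), mmap_mode="r")
        all_ids.append(ids_by_cluster[c])
        all_d.append(np.linalg.norm(block - query, axis=1))
    ids, d = np.concatenate(all_ids), np.concatenate(all_d)
    top = d.argsort()[:k]
    return ids[top], d[top]
```

The recall/latency knob is `nprobe`: scanning more clusters costs more disk reads but recovers neighbours that fell just across a cluster boundary.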
I wonder how OpenCitations populates their data? One example I tried showed 9 references where the paper had 30+.
Generally, an About page is always appreciated for such web tools with minimal UX, particularly when it's rather automagical.
APIs used:
- *OpenCitations API (v2)*
  - Endpoint: https://opencitations.net/index/api/v2/references/
  - Purpose: retrieves a list of all references from a paper by its DOI
  - Data format: JSON containing cited DOIs and metadata
- *DOI content negotiation*
  - Endpoint: https://doi.org/{DOI}
  - Purpose: fetches metadata and formatted citations for DOIs
  - Formats: BibTeX, RIS, CSL JSON, RDF XML, etc.; implements CSL (Citation Style Language) for text-based citations
- *Local citation style files*
  - Purpose: provides access to thousands of citation styles
  - Storage: pre-generated JSON files with style information
Where it helps
- Deep-dive reading – fetch the bulk RIS file, dump a seminal paper’s entire bibliography into Zotero/Mendeley, and follow the threads.
- Bulk citing – grab BibTeX entries for a cluster of related papers without hunting them down one by one.
- LLM grounding – feed language models a clean reference list so they stop hallucinating citations.