intalentive · a year ago
The problem with Parquet is it’s static. Not good for use cases that involve continuous writes and updates. Although I have had good results with DuckDB and Parquet files in object storage. Fast load times.

If you host your own embedding model, you can transmit compressed numpy float32 arrays as bytes, then decode them back into numpy arrays.
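A minimal sketch of that round trip, assuming numpy plus the stdlib's zlib (the compression ratio on raw float32 is modest, but the decode side is just a buffer reinterpretation):

```python
import zlib
import numpy as np

def encode(vec: np.ndarray) -> bytes:
    # Serialize a float32 vector to compressed bytes for transport.
    return zlib.compress(np.asarray(vec, dtype=np.float32).tobytes())

def decode(payload: bytes) -> np.ndarray:
    # Decompress and reinterpret the raw buffer as float32.
    return np.frombuffer(zlib.decompress(payload), dtype=np.float32)

vec = np.random.default_rng(0).random(768, dtype=np.float32)
assert np.array_equal(decode(encode(vec)), vec)  # lossless round trip
```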

Personally I prefer using SQLite with usearch extension. Binary vectors then rerank top 100 with float32. It’s about 2 ms for ~20k items, which beats LanceDB in my tests. Maybe Lance wins on bigger collections. But for my use case it works great, as each user has their own dedicated SQLite file.
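The binary-then-rerank pattern can be sketched in plain numpy (a toy stand-in for what the usearch extension does; sizes and data here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(20_000, 256)).astype(np.float32)
query = rng.normal(size=256).astype(np.float32)

# 1-bit quantization: keep only the sign of each dimension, packed 8 per byte.
docs_bits = np.packbits(docs > 0, axis=1)   # shape (20000, 32), uint8
query_bits = np.packbits(query > 0)         # shape (32,), uint8

# Hamming distance = popcount(xor); shortlist the 100 closest binary codes.
hamming = np.unpackbits(docs_bits ^ query_bits, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:100]

# Rerank the shortlist with exact float32 cosine similarity.
subset = docs[candidates]
sims = subset @ query / (np.linalg.norm(subset, axis=1) * np.linalg.norm(query))
reranked = candidates[np.argsort(-sims)]
```

The binary pass touches 32 bytes per item instead of 1 KB, which is where the speed comes from; the float32 rerank only ever sees 100 rows.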

For portability there’s Litestream.

dijksterhuis · a year ago
> The problem with Parquet is it’s static. Not good for use cases that involve continuous writes and updates.

parquet is columnar storage, so its use case is lots of heavy filtering/aggregation within analytical workloads (OLAP).

constant writes/updates, i.e. basically transactional (OLTP) use cases, are never going to have great performance in columnar storage. it's the wrong format to use for that.

for faster writes/updates you'd want row-based storage, e.g. CSV or an actual database. which i'm glad to see is where you kind of ended up anyway.

yorwba · a year ago
There's no reason why an update query that doesn't change the file layout and only twiddles some values in place couldn't be made fast with columnar storage.

When you run a read query, there's one phase that determines the offsets where values are stored and another that reads the value at a given offset. For an update query that doesn't change the offsets, you can change the direction from reading the value at an offset to writing a new value to that location instead, and it should be plenty fast.

Parquet libraries just don't seem to consider that use case worth supporting for some reason and expect people to generate an entire new file with mostly the same content instead. Which definitely doesn't have great performance!
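A toy illustration of the idea with a raw fixed-width float32 column; real Parquet pages are compressed and encoded, so the offsets aren't this trivial, which is presumably why libraries rewrite whole files instead:

```python
import os
import struct
import tempfile

# A decoded fixed-width float32 column on disk (stand-in for a Parquet page).
path = os.path.join(tempfile.mkdtemp(), "col.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))

# Read path: fixed width makes the offset of row 2 trivial to compute.
offset = 2 * 4  # row index * sizeof(float32)

# Update path: reuse the same offset, but write instead of read.
with open(path, "r+b") as f:
    f.seek(offset)
    f.write(struct.pack("<f", 99.0))

with open(path, "rb") as f:
    assert struct.unpack("<4f", f.read()) == (1.0, 2.0, 99.0, 4.0)
```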

csunbird · a year ago
Parquet files being immutable is not a bug, it is a feature. That is how you accomplish good compression and keep the columnar data organized.

Yes, it is not useful for continuous writes and updates, but that is not what it was designed for. Use a database (e.g. SQLite, just like you suggested) if you want to ingest real-time/streaming data.

pantsforbirds · a year ago
I've had great luck using either Athena or DuckDB with parquet files in s3 using a few partitions. You can query across the partitions pretty efficiently and if date/time is one of your partitions, then it's very efficient to add new data.
jt_b · a year ago
> The problem with Parquet is it’s static. Not good for use cases that involve continuous writes and updates. Although I have had good results with DuckDB and Parquet files in object storage. Fast load times.

You can use glob patterns in DuckDB to query remote parquets though to get around this? Maybe break things up using a hive partitioning scheme or similar.

memhole · a year ago
I like the pattern described too. Only snag is deletes and updates. Ime, you have to delete the underlying file or create and maintain a view that handles the data you want visible.
banku_brougham · a year ago
Really cool article, I've enjoyed your work for a long time. You might add a note for those jumping into a sqlite implementation that duckdb reads parquet and has launched a few vector similarity functions which cover this use case perfectly:

https://duckdb.org/2024/05/03/vector-similarity-search-vss.h...

jt_b · a year ago
I have tinkered with using DuckDB as a poor man's vector database for a POC and had great results.

One thing I'd love to see is some sort of row-group-level metadata statistics for embeddings within a parquet file - something that would allow readers to push predicates down to the HTTP request level and completely avoid loading non-relevant rows from a remote file, particularly one stored on S3-compatible storage that supports byte-range requests. I'm not sure what the implementation would look like - how to define the sorting algorithm that organizes the "close" rows together, how the metadata would be calculated, or what the reader side would look like - but I'd love to be able to implement some of the same patterns with vector search as with geoparquet.
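One way such metadata could work, sketched in numpy with a hypothetical statistic (per-row-group centroids); note this only prunes well if similar rows were clustered together at write time, which is exactly the open sorting problem mentioned above:

```python
import numpy as np

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(10_000, 64)).astype(np.float32)
group_size = 1_000  # rows per hypothetical parquet row group

# Hypothetical per-row-group statistic: the centroid of each group's vectors.
# Small enough (n_groups x dim) to live in the file footer metadata.
groups = embeddings.reshape(-1, group_size, 64)
centroids = groups.mean(axis=1)

def search(query: np.ndarray, keep: int = 3):
    # Reader-side "predicate pushdown": rank groups by centroid distance and
    # only fetch the closest ones (byte-range requests, in the remote case).
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:keep]
    fetched = groups[order].reshape(-1, 64)
    sims = fetched @ query
    return order, fetched[np.argmax(sims)]

query = rng.normal(size=64).astype(np.float32)
group_ids, best = search(query)  # touched 3 of 10 row groups
```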

jt_b · a year ago
I thought about this some more and did some research - and found an indexing approach using HNSW, serialized to parquet, and queried from the browser here:

https://github.com/jasonjmcghee/portable-hnsw

Opens up efficient query patterns for larger datasets for RAG projects where you may not have the resources to run an expensive vector database

mhh__ · a year ago
I still don't like dataframes but oh my God Polars is so much better than pandas.

I was doing some time series calculations, simple equity price adjustments basically, in Polars and my two thoughts were:

- WTF, I can actually read the code and test it.

- it's running so fast it seems like it's broken.

eskaytwo · a year ago
There’s some nice plugins too, some are finance related: https://github.com/ddotta/awesome-polars
mhh__ · a year ago
The one thing I really want is for someone to make it so I can use it in F#. Presumably it's possible given how the python bit is implemented under the hood?
LaurensBER · a year ago
Yeah, the readability difference is immense. I worked for years with Pandas and I still cannot "scan" it as quickly as with a "normal" programming language or SQL. Then there's the whole issue with (multi)-indexes, serialisation, etc.

Polars makes programming fun again instead of a chore.

stephantul · a year ago
Check out Unum’s usearch. It beats anything, and is super easy to use. It just does exactly what you need.

https://github.com/unum-cloud/usearch

esafak · a year ago
Have you tested it against Lance? Does it do predicate pushdown for filtering?
ashvardanian · a year ago
USearch author here :)

The engine supports arbitrary predicates for C, C++, and Rust users. In higher level languages it’s hard to combine callbacks and concurrent state management.

In terms of scalability and efficiency, the only tool I’ve seen coming close is Nvidia’s cuVS if you have GPUs available. FAISS HNSW implementation can easily be 10x slower and most commercial & venture-backed alternatives are even slower: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search...

In this use-case, I believe SimSIMD raw kernels may be a better choice. Just replace NumPy and enjoy speedups. It provides hundreds of hand-written SIMD kernels for all kinds of vector-vector operations for AVX, AVX-512, NEON, and SVE across F64, F32, BF16, F16, I8, and binary vectors, mostly operating in mixed precision to avoid overflow and instability: https://github.com/ashvardanian/SimSIMD

stephantul · a year ago
Usearch is a vector store afaik, not a vector db. At least that’s how I use it.

I haven’t compared it to lancedb, I reached for it here because the author mentioned Faiss being difficult to use and install. usearch is a great alternative to Faiss.

But thanks for the suggestion, I’ll check it out

dwagnerkc · a year ago
If you want to try it out. Can lazily load from HF and apply filtering this way.

  df = (
    pl.scan_parquet('hf://datasets/minimaxir/mtg-embeddings/mtg_embeddings.parquet')
    .filter(
        pl.col("type").str.contains("Sorcery"),
        pl.col("manaCost").str.contains("B"),
    )
    .collect()
)

Polars is awesome to use, would highly recommend. Single node it is excellent at saturating CPUs, if you need to distribute the work put it in a Ray Actor with some POLARS_MAX_THREADS applied depending on how much it saturates a single node.

thomasfromcdnjs · a year ago
Lots of great findings

---

I'm curious if anyone knows whether it is better to pass structured or unstructured data to embedding APIs? If I ask ChatGPT, it says it is better to send unstructured data. (Looking at the author's GitHub, it looks like he generated embeddings from JSON strings.)

My use case is for jsonresume, I am creating embeddings by sending full json versions as strings, but I've been experimenting with using models to translate resume.json's into full text versions first before creating embeddings. The results seem to be better but I haven't seen any concrete opinions on this.

My understanding is that unstructured data is better because it carries textual/semantic meaning through natural language, i.e.

  skills: ['Javascript', 'Python']
is worse than:

  Thomas excels at Javascript and Python
Another question: What if the search was also a json embedding? JSON <> JSON embeddings could also be great?
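For illustration, a hypothetical template that renders the structured fields as natural language before embedding (the field names and wording here are made up, not the author's scheme):

```python
import json

# Hypothetical resume.json fragment.
resume = json.loads('{"name": "Thomas", "skills": ["Javascript", "Python"]}')

def to_text(r: dict) -> str:
    # Render structured fields as a natural-language sentence before embedding.
    return f"{r['name']} excels at {' and '.join(r['skills'])}."

text = to_text(resume)
# → "Thomas excels at Javascript and Python."
```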

minimaxir · a year ago
In general I like to send structured data (see the input format here: https://github.com/minimaxir/mtg-embeddings), but the ModernBERT base for the embedding model used here specifically has better benefits implicitly for structured data compared to previous models. That's worth another blog post explaining why.
notpublic · a year ago
please do explain why
vunderba · a year ago
I'd say the more important consideration is "consistency" between incoming query input and stored vectors.

I have a huge vector database that gets updated/regenerated from a personal knowledge store (markdown library). Since the user is most likely to input a comparison query in the form of a question "Where does X factor into the Y system?" - I use a small 7b parameter LLM to pregenerate a list of a dozen possible theoretical questions a user might pose to a given embedding chunk. These are saved as 1536 dimension sized embeddings into the vector database (Qdrant) and linked to the chunks.

The real question you need to ask is - what's the input query that you'll be comparing to the embeddings? If it's incoming as structured, then store structured, etc.

I've also seen (anecdotally) similarity degradation for smaller chunks, so keep that in mind as well.

k2so · a year ago
A neat trick in Vespa (vectors DB among other things) documentation is to use hex representation of vectors after converting them to binary.

This trick can be used to reduce your payload sizes. Vespa supports this format, which is particularly useful when the same vectors are referenced multiple times in a document. For ColBERT- or ColPali-like cases (where you have many embedding vectors per document), this can massively reduce the size of the vectors stored on disk.

https://docs.vespa.ai/en/reference/document-json-format.html...

Not sure why this is not more commonly adopted though
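The trick itself is tiny: binarize by sign, pack the bits, hex-encode (a sketch with numpy; Vespa's actual feed format is in the linked docs):

```python
import numpy as np

vec = np.array([0.12, -0.5, 0.3, 0.9, -0.1, 0.0, 0.7, -0.2], dtype=np.float32)

# Binarize by sign (positive -> 1), pack 8 dims per byte, then hex-encode
# for the JSON document feed.
bits = np.packbits(vec > 0)
hex_payload = bits.tobytes().hex()
# 8 float32 values (32 bytes) collapse to a single hex byte: "b2"
```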

jtrueb · a year ago
Polars + Parquet is awesome for portability and performance. This post focused on python portability, but Polars has an easy-to-use Rust API for embedding the engine all over the place.
blooalien · a year ago
Gotta love stuff that has multiple language bindings. Always really enjoyed finding powerful libraries in Python and then seeing they also have matching bindings for Go and Rust. Nice to have easy portability and cross-language compatibility.