jabo · 2 years ago
Plug: If you're ever looking for an open source alternative to Pinecone, we recently added vector search to Typesense: https://typesense.org/docs/0.24.1/api/vector-search.html

The key thing is that it's in-memory and lets you combine attribute-based filtering with nearest-neighbor search.
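Here's a rough sketch of what that combination looks like with the Python client, going by the 0.24.1 docs linked above (collection and field names are made up):

    import typesense

    client = typesense.Client({
        "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
        "api_key": "xyz",
        "connection_timeout_seconds": 2,
    })

    # Assumes a collection with a float[] "embedding" field (num_dim set at
    # creation time). filter_by narrows candidates; vector_query ranks them.
    results = client.multi_search.perform({
        "searches": [{
            "collection": "products",
            "q": "*",
            "filter_by": "in_stock:true && price:<100",
            "vector_query": "embedding:([0.12, 0.45, 0.87, 0.21], k:10)",
        }]
    }, {})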

We're also working on a way to automatically generate embeddings from within Typesense using any ML models of your choice.

So Algolia + Pinecone + Open Source + Self-Hostable with a cloud hosted option = Typesense

Dachande663 · 2 years ago
This is what we’re using. We already sync database content to a Typesense DB for regular search, so it wasn’t much more work to add embeddings, and now we can do semantic search.
gangster_dave · 2 years ago
Could you do boosting with Typesense, like favoring more recent results?
jabo · 2 years ago
Yup. If you store timestamps as Unix timestamps you can have Typesense sort on text match score and the timestamp field: https://typesense.org/docs/guide/ranking-and-relevance.html#...
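For example, with the Python client (collection and field names are hypothetical; `_text_match` is Typesense's relevance score):

    import typesense

    client = typesense.Client({
        "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
        "api_key": "xyz",
    })

    results = client.collections["posts"].documents.search({
        "q": "vector search",
        "query_by": "title",
        # relevance first, then recency (created_at: a hypothetical int64
        # field holding a Unix timestamp)
        "sort_by": "_text_match:desc,created_at:desc",
    })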
arecurrence · 2 years ago
I was using Pinecone before installing pgvector in Postgres. Pinecone works and all, but having the vectors in Postgres resulted in an explosion of use for us. Full relational queries with WHERE clauses, ORDER BY, etc. AND vector embeddings is wicked.
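For example, something like this (a sketch; table and column names are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    cur = conn.cursor()

    # <=> is pgvector's cosine-distance operator (<-> is L2, <#> is negative
    # inner product); everything else is plain SQL.
    cur.execute(
        """
        SELECT id, title
        FROM documents
        WHERE category = %s
          AND published_at > now() - interval '30 days'
        ORDER BY embedding <=> %s::vector
        LIMIT 5
        """,
        ("news", "[0.1, 0.2, 0.3]"),
    )
    print(cur.fetchall())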
itake · 2 years ago
Why do you use pgvector instead of pgANN? My understanding is pgANN is built with FAISS. When I compared pgvector with FAISS, pgvector was 3-5x slower.

https://github.com/netrasys/pgANN

arecurrence · 2 years ago
There is certainly a wide variety of problems today for which pgvector is unsuitable due to performance limitations... but fear not! This is an area that is getting significant focus right now.

A marqo.ai dev is currently working on adding HNSW-IVF and HNSW support https://news.ycombinator.com/item?id=35551684 and the maintainer has recently noted that they are actively working on an IVFPQ/ScaNN implementation https://github.com/pgvector/pgvector/issues/93

The pgANN creator actually asked about performance a month ago here: https://github.com/pgvector/pgvector/issues/58

Expect to see performance improve dramatically later this year.

extesy · 2 years ago
If I understand correctly, pgANN is using the standard "cube" extension: https://github.com/netrasys/pgANN#setup
sv123 · 2 years ago
This hits home. It's a big ask to keep data in sync with yet another store. We already balance MS SQL and Algolia, and all the plumbing required to catch updates, deletes, etc.; adding another feels like a bridge too far. Hopefully MS will get on this train at some point and catch up to Postgres.
neximo64 · 2 years ago
What does it do? Is it just storing vectors and computing cosine similarity? Why is there a whole new category of DB for that?
javier2 · 2 years ago
How is disk usage in pgvector?
arecurrence · 2 years ago
There's a section in the readme describing how much space vectors take: https://github.com/pgvector/pgvector. Also, the maintainer is actively working on product quantization: https://github.com/pgvector/pgvector/issues/93

Supabase was also asking for sparse vectors https://github.com/pgvector/pgvector/issues/81

Speaking of the repo, they have a number of features they want to add if anyone is interested in contributing; there's lots of room for advancement. Many of these features already have active branches: https://github.com/pgvector/pgvector/issues/27

Freebytes · 2 years ago
Do you know of anything similar available for SQL Server 2019?
indeed30 · 2 years ago
You have to say that Pinecone got the timing absolutely right.
alooPotato · 2 years ago
when were they founded?
chicagobuss · 2 years ago
2021
swyx · 2 years ago
there's at least $168m being poured into vector DBs this year.

recent vector database fundraises:

- Chroma - $18M seed https://www.trychroma.com/blog/seed

- Weaviate - $50m A https://www.theinformation.com/articles/index-ventures-leads...

- Pinecone - $100M B

1xdevloper · 2 years ago
swyx · 2 years ago
i had a feeling i missed something.. thanks!
opisthenar84 · 2 years ago
Zilliz also raised $60M, although it was last year before the ChatGPT hype. https://www.businesswire.com/news/home/20220824005057/en/Vec...
mmq · 2 years ago
Not recent, but this is the company behind Milvus: https://www.businesswire.com/news/home/20220824005057/en/Vec...
opisthenar84 · 2 years ago
Imo the most advanced vector DB out there is Vespa https://vespa.ai/.

Harder to set up than wrappers like Chroma, but very powerful.

swyx · 2 years ago
what makes it most advanced / when do you need something like Vespa vs the newer options?
mikeryan · 2 years ago
I honestly hope they use it to improve their documentation. I consider myself a pretty adept developer, but I don't have much background in AI, and I was looking for a solution for building out a recommendation engine and ended up at Pinecone.

Maybe I'm not the target audience, but after spending some time poking around I honestly couldn't even figure out how to use it.

Even a simple Google search for "What is a Vector Database" ends up with this page:

https://www.pinecone.io/learn/vector-database/#what-is-a-vec...

> Pinecone is a vector database that makes it easy for developers to add vector-search features to their applications

Um okay... what's "vector search"? For that matter, what the eff is a "vector" to begin with? Only about a third of the way down the page do we finally start defining what a vector is...

Maybe I'm not their target audience but I ended up poking around for about an hour or two before just throwing up my hands and thinking it wasn't for me.

Ended up just sticking with Algolia, since we had them in place for Search anyway...

linuxdude314 · 2 years ago
Respectfully, if you don’t know what a vector is, you probably don’t need a vector DB.

When they say “vector-search” they mean semantic search. I.e. “which document is the most semantically similar to the query text”.

So how do we establish semantic similarity?

In a database like Elasticsearch, you store text and the DB indexes the text so you can search.

In a vector DB you don’t just store the raw text, you store a vectorized version of the text.

A vector can be thought of as an array of numbers. To get a vector representation we need some way to take a string and map it to an array while also capturing the notion of semantics.

This is hard, but machine learning models save the day! The first popular model used for this was called “word2vec” while a more modern model is BERT.

These take an input like “fish” and output a vector like [ 3.12 … 4.092 ] (with over a thousand more elements).

So let’s say we have a sentence that we vectorized and we want to compare some user input to see how similar it is to that sentence. How?

If we call our sentence A and the input vector B, we can compute a number between -1 and 1 that tells us how similar they are (1 means the vectors point in the same direction).

This is called cosine similarity and is computed by taking the dot product of the two vectors and dividing by both of their magnitudes.

When you load a bunch of vectors into a vector DB, the principal operation you will perform is "give me the top K documents that are similar to the input". The database's indexing process builds an (approximate) k-nearest-neighbors index over all vectors in the DB so that these queries can be answered quickly at query time.

Without the indexing process there is no real difference between a vector DB and a key-value store.
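In numpy, that whole similarity computation is a few lines:

    import numpy as np

    def cosine_similarity(a, b):
        # dot product divided by the product of the magnitudes
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    fish = np.array([3.12, 0.50, 4.09])
    trout = np.array([3.00, 0.45, 4.10])
    print(cosine_similarity(fish, trout))  # ~1.0: nearly parallel vectors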

mikeryan · 2 years ago
> Respectfully, if you don’t know what a vector is, you probably don’t need a vector DB.

I wasn't looking for one ;-) I was looking for a recommendation engine; more generally, I'm looking for ways to use ML and AI to improve various features and workflows.

Which I guess is my point, I don't know who Pinecone's target market is but from following this thread it seems like all the folks who know how to do what they do have alternatives that suit them better. If they are targeting folks like me they're not doing it well.

Pinecone's examples[1] (hat tip to jurassic in this thread - I've seen these) all show potential use cases that I might want to leverage, but when you dive into them (for example the Movie Recommender[2] - my use case) I end up with this:

> The user_model and movie_model are trained using Tensorflow Keras. The user_model transforms a given user_id into a 32-dimensional embedding in the same vector space as the movies, representing the user’s movie preference. The movie recommendations are then fetched based on proximity to the user’s location in the multi-dimensional space.

It took me another 5 minutes of googling to parse that sentence. And while I could easily get the examples to run, I was still running back and forth to Google to figure out what they were doing - again, the documentation is poor here. I'm not a Python dev, but I could follow it; still, I had to google tqdm just to figure out it was a progress-bar library.

Also, and this is not unique to Pinecone, I've found that while "here's how to build a movie recommender based on these datasets" is frequently well documented in this space, there's very little on how to build a model using your own datasets, i.e. how to take an example like this and do it with your own data.

[1] https://docs.pinecone.io/docs/examples

[2] https://docs.pinecone.io/docs/movie-recommender

jurassic · 2 years ago
I don't have an ML background and have found the examples pretty easy to follow. https://docs.pinecone.io/docs/examples
bugglebeetle · 2 years ago
Perfect example of AI gold rush nonsense. Pinecone has zero moat and quite a few free alternatives (Faiss, Weaviate, pgvector). Their biggest selling point is that AI hype train people don’t Google “alternatives to pinecone” when cloning the newest trending repo (or I guess, ask ChatGPT).
jhj · 2 years ago
> Pinecone has zero moat and quite a few free alternatives (Faiss, Weaviate, pgvector)

Faiss is a collection of algorithms for in-memory exact and approximate high-dimensional (e.g., > ~30 dimensional) dense vector k-nearest-neighbor search; it doesn't add or really consider persistence (beyond full index serialization to an in-memory or on-disk binary blob), fault tolerance, replication, domain-specific autotuning, and the like. The "vector database" companies like Pinecone, Weaviate, Zilliz and whatnot add these other features to turn it into a complete service; they're not really the same thing. pgvector seems to be DB-backed IndexFlat and IndexIVFFlat (?) from the Faiss library at present, but is of course not a complete service.

However, which kind of approximate indexing you want to use depends very much upon the data you're indexing; where in the tradeoff space between latency, throughput, encoding accuracy, NN recall, and memory/disk consumption you want to be (these are the fundamental tradeoffs in the vector search domain); and whether you are performing batched queries or not. To access the full range of tradeoffs you'd need to use all of the options available in Faiss or similar low-level libraries, which may be difficult to use or require knowledge of the underlying algorithms.
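For a concrete sense of one of those tradeoffs, here's a minimal exact vs. IVF-approximate sketch with Faiss (random data, illustrative parameters):

    import faiss
    import numpy as np

    d = 128
    xb = np.random.rand(100_000, d).astype("float32")

    # Exact search: perfect recall, but query cost is a linear scan.
    flat = faiss.IndexFlatL2(d)
    flat.add(xb)

    # Approximate search: IVF partitions the space into nlist cells and only
    # scans nprobe of them per query, trading recall for speed.
    quantizer = faiss.IndexFlatL2(d)
    ivf = faiss.IndexIVFFlat(quantizer, d, 1024)
    ivf.train(xb)
    ivf.add(xb)
    ivf.nprobe = 16  # more probes -> better recall, higher latency

    D, I = ivf.search(xb[:5], 10)  # distances and ids of the 10 nearest neighbors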

(I'm the author of the GPU half of Faiss)

moab · 2 years ago
Spot on. There is zero moat, and the self-hosted alternatives are rapidly catching up to (if not already better than) Pinecone. There are good open-source contributions coming from bigcorp beyond Meta too, e.g., DiskANN (https://github.com/microsoft/DiskANN).
zzzzzzzza · 2 years ago
why use this instead of something like pgvector?
mmq · 2 years ago
Most demos people share could just use numpy arrays, similar to this: https://twitter.com/karpathy/status/1647374645316968449
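In that spirit, brute-force top-k over a numpy matrix (assuming unit-normalized embeddings) is only a few lines:

    import numpy as np

    # 10k documents embedded in 64 dims, unit-normalized so that a dot
    # product equals cosine similarity
    embeddings = np.random.randn(10_000, 64).astype("float32")
    embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

    query = embeddings[0]
    scores = embeddings @ query
    top_k = np.argsort(-scores)[:10]  # indices of the 10 most similar rows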
bugglebeetle · 2 years ago
It’s true, but I also wouldn’t want any developer besides Karpathy to implement a production service using numpy arrays (I use Faiss).
chadash · 2 years ago
Oracle DB has quite a few free alternatives. Yet Larry Ellison is the world's 4th wealthiest person.
boc · 2 years ago
You're telling me the VC that went hog-wild on NFTs and Crypto is fomo'ing into AI without any DD?

I'm shocked.

Deleted Comment

nemothekid · 2 years ago
Maybe I am fundamentally missing something, but a "cloud database company" seems like the most boring tech? No one is calling Planetscale or Yugabyte nonsense because there are free alternatives like Postgres.
AYBABTME · 2 years ago
To be fair PlanetScale is a whole lot more than hosted MySQL.
findjashua · 2 years ago
i mean it’s Andreessen. They’re like the SoftBank of Silicon Valley
teaearlgraycold · 2 years ago
It has a moat for those that start using it. Vendor lock in.
m1117 · 2 years ago
Andreessen Horowitz should've just googled
danielmarkbruce · 2 years ago
Is it possible Andreessen is misunderstanding how Pinecone/vector DBs are used? It seems like they are pitching it as "memory for large language models" or something. Are people using vector DBs in some way I'm not aware of? To me it's a database to help you do a semantic search. A multi-token string is converted into a single embedding - maybe 1000 words into one embedding. This is helpful because you can quickly find the relevant parts of a document to answer a question, and LLMs have token limits, but the idea that it's helping the LLM keep state or something seems off.
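A rough sketch of that pattern (the model choice and file are illustrative, not a recommendation):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # naive chunking: split a long document on blank lines
    long_document = open("contract.txt").read()      # placeholder file
    chunks = long_document.split("\n\n")
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)

    q = model.encode("what does it say about termination?",
                     normalize_embeddings=True)
    best = np.argsort(-(chunk_vecs @ q))[:3]
    # these top chunks are what gets stuffed into the LLM prompt
    context = "\n".join(chunks[i] for i in best)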

Is it possible they are confusing the use of embeddings across whole swaths of text to do a semantic search with the embeddings that happen on a per token basis as data runs through an LLM? Same word, same basic idea, but used so differently that they may as well be different words?

darkteflon · 2 years ago
I might be mistaken, but my understanding from having played around with LangChain for a couple months is that because you’ve got to keep all your state in the context window, giving the model access to a vectorstore containing the entire chat history allows it to retrieve relevant messages against your query that can then be stuffed or mapreduced into the context window for the next response.

The alternative - and I believe the way the ChatGPT web app currently works - is just to stuff/mapreduce user input and machine response into the context window on each step, which quickly gets quite lossy.
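A rough sketch of the first approach (this is the general idea, not LangChain's actual API; the model choice is illustrative):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    history_msgs, history_vecs = [], []

    def remember(message):
        history_msgs.append(message)
        history_vecs.append(model.encode(message, normalize_embeddings=True))

    def recall(query, k=3):
        q = model.encode(query, normalize_embeddings=True)
        scores = np.stack(history_vecs) @ q    # cosine similarity (unit vectors)
        top = np.argsort(-scores)[:k]
        return [history_msgs[i] for i in top]  # stuff these into the next prompt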

kordlessagain · 2 years ago
You aren't mistaken. Keeping state, or storing memories, is where it's at with prompts. The trick is knowing what to remember and what to forget.

I consider vector engines to be "hot" models, given they are storing the vector representations of text already run through the "frozen" model.

Having written something a while back that indexes documents and enters into discussion with them, I'm pretty sure ChatGPT is using some type of embedding lookup/match/distance on the history in the window. That means not all text is submitted at the next entry, but whatever mostly matches what is entered by the user (in vector space) is likely pulled in and sent over in the final prompt.

rahimnathwani · 2 years ago
Yes, one of the memory options in Langchain is to use a vector store:

https://python.langchain.com/en/latest/modules/memory/types/...

It also has a more basic version that just keeps a log of past messages.

I don't know whether there's a way (or even a need) to combine these approaches. In a long conversation, it might be useful to trust more recent information more than earlier messages, but Langchain's vector memory doesn't care about sequence.

monological · 2 years ago
Why not just use pgvector for Postgres? It’s free and works great. You can even do cosine distance between embeddings.
jillesvangurp · 2 years ago
Same with OpenSearch and Elasticsearch, both of which have added vector search (with slight differences between their implementations). And since vector search is computationally expensive, there is a lot of value in narrowing down your result set with a regular query before calculating the best matches on the narrowed-down set.
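For example, a filtered kNN query with the Elasticsearch 8.x Python client might look like this (index and field names are made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    resp = es.search(
        index="articles",
        knn={
            "field": "embedding",                # a dense_vector field
            "query_vector": [0.12, 0.45, 0.87],  # truncated for brevity
            "k": 10,
            "num_candidates": 100,
            # narrow the candidate set with a regular filter before ranking
            "filter": {"term": {"language": "en"}},
        },
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("title"))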

From what I've seen, the big limitation currently is dimensionality. Most of the more advanced models have high dimensionality, and Elasticsearch and Lucene in particular limit it to 1024; several of the OpenAI models have much higher dimensionality. OpenSearch works around this by supporting alternative implementations to Lucene for vectors.

Of course it's a sane limitation from a cost and computation point of view; having these huge embeddings doesn't scale that well. But it does limit the quality of the results unless you can train your own models and tailor them to your use case.

If you are curious how to use this stuff: I invested some time a few weeks ago getting my kt-search Kotlin library to support it and wrote some documentation: https://jillesvangurp.github.io/kt-search/manual/KnnSearch.h.... The result quality was underwhelming IMHO, but that might be down to my complete lack of experience with this stuff.

I have no experience with Pinecone and I'm sure it's great. But I do share the sentiment that they might not come out on top for this. There are too many players here and it's a fast-moving field. OpenAI just moved the whole field forward enormously in terms of what is possible and feasible.

dmix · 2 years ago
A good sales team, tons of devs, and some custom integrations might answer this question in the future
hospitalJail · 2 years ago
>A good sales team

It quite angers me that people (on HN) will consider the following to be benefits worth mentioning as pros to the consumer:

>Sleek/shiny finish

>Marketing/Branding

>Ability to Monetize

We aren't shareholders; all three of these are bad for the customer.

breckenedge · 2 years ago
How fast is it compared to Pinecone?