What is a Vector Database? (2021)

Unfortunately this piece is nebulous on what an embedding is. Apparently it is saved as an array of floats, and it has some string of text it is associated with, and the float arrays are compared by "similarity".

None of these explains what an embedding really is. My best guess is that the embedding represents the meaning of the natural language string it was generated from, such that strings with "similar" embeddings have similar meaning. But that's just speculation.

ta20211004_1 · 2 years ago

> My best guess is that the embedding represents the meaning of the natural language string it was generated from, such that strings with "similar" embeddings have similar meaning. But that's just speculation.

Yeah, you've got it. A mapping from words to vectors such that semantic similarity between words is reflected in mathematical similarity between vectors.

An idea of how you might train this thing: lets say the words "king" and "queen" are being embedded. In your training data there are lots of examples where "king" and "queen" are interchangeable, for example in the sentence "The ___ is dead, long live the ____", either word is appropriate in either slot, so each time we see an example like this we nudge "king" and "queen" a little closer together in some sense. However you also find phrases where they are not interchangeable, such as "The first born male will one day be ____". So when you see those examples you nudge "king" a little closer in some sense to other words which appropriately complete the sentence (which does not include "queen" in this case).

In this way, repeated over a giant training set with thousands of words, concepts like "male/female" and "royalty", "person/object" and tons of others end up getting reflected in the relationships between the vectors.

These vectors are then useful representations of words to ML models.

atq2119 · 2 years ago

Right, makes sense. But then what do you actually do with a database?

Starting with: what do you store in it?

Maybe sentence/vector pairs. But what does that give you? What do you do with that data algorithmically? What's the equivalent of a SELECT statement? What's the application that benefits an end user? That part still seems rather hazy.

manytree8 · 2 years ago

Great explanation, thank you!

therealdrag0 · 2 years ago

How is each dimension maintained to have a sticky meaning among scenarios?

gregsadetsky · 2 years ago

There are good sibling explanations by @ta20211004_1 and @HarHarVeryFunny, but if I can try in an additional way:

Imagine you wanted to go from words to numbers (which are easier to work with mathematically), like you wanted to assign a number to some words.

How could you do it? Well you could do it randomly: cat could be 2, dog could be 10, sweater could be 4.534 and frog could be 8.

Not super useful, but hey - words are now numbers! How can we make this "better"?

What if we decided on a way to put words on a line - let's say we ordered words by how much they had to do with animals. Let's say 10 meant it's a very animal-related word, and 0 is very not-animal related. So cat and dog would be 10, and maybe zoo would be 9, and fur could be 8. But something like sweater would be 1 (depending if the sweater was made from animal wool...?)

What now? Well what's cool is that if you assign words on that "animal-ness" line, you can find the words that are "similar" by looking at the numbers that are close. So, words whose value is around 6 are probably similar in meaning. At least, in terms of how much they relate to animals.

That's the core idea. Ordering words by animal-ness is not that useful in the real world, so maybe we can place words on a 2d grid instead of a line. Horizontally, it would go from 0 to 10 (not animal at all - very animal) and vertically, it could be ordered by brightness - 0 for dark, and 10 for bright.

So now, bright animals will congregate together in one part of the grid, and dark non animals will also live close together. For example, a dark frog might be in the bottom right at position (10, 0) - very animal (right end of the x axis) but not bright (bottom of the y axis). Any other word whose position is close to (10, 0) would presumably also be animal-y and dark.

That's really it. The magic is that... this works in thousands of dimensions. Each dimension being some way that "AIs" see words / our world. It's harder to think about what each dimension "is" or represents. But embeddings are really just that - the position in a space with a huge number of dimensions. Just like dark frogs were (10, 0) in our simple example, the word "frog" might be (0.124, 0.51251, 0.61, 0.2362, 0.236236, ..............) as an embedding.

That's it!

boopbeepbop · 2 years ago

Wow. Great explanation.

The example you used going from 1 to 2 to n dimensions really made sense

aidanf · 2 years ago

An embedding is a collection of learned vectors.

Each vector is an array of n floats that represent a location of a thing in an n-dimensional space. The idea of learning an embedding is that you have some learning process that will put items that are similar into similar parts of that vector space.

The vectors don’t necessarily need to represent words and the model that produces them doesn’t necessarily to be a language model.

For example, embeddings are widely used to generate recommendations. Say you have a dataset of users clicking on products on a website. You could assume that products that get clicked in the same session are probably similar and use that dataset to learn an embedding for products. This would give you vector representing each product. When you want to generate recommendations for a product, you take the vector for that product and then search through the set of all product vectors to find those that are closest to it in the vector space.

opwieurposiu · 2 years ago

An embedding is a a way to map words into a high-dimensional "concept space", so they can be processed by ML algorithms. The most popular one is word2vec

https://jalammar.github.io/illustrated-word2vec/

crabbone · 2 years ago

Sorry, that's even less helpful in the context of a database... but thanks for trying.

cubefox · 2 years ago

Okay, "mapping into concept space" is at least compatible with my meaning theory, but by itself it doesn't say much, since in principle anything can be mapped to anything.

HarHarVeryFunny · 2 years ago

Embeddings are a mapping of some type of thing (pictures, words, sentences, etc) to points in a high-dimensional space (e.g. few hundred dimensions) such that items that are close together in this space have some similarity.

The general idea is that the items you are embedding may vary in very many different ways, so trying to map them into a low dimensional space based on similarity isn't going to be able to capture all of that (e.g. if you wanted to represent faces in a 2-D space, you could only use 2 similarity measures such as eye and skin color). However a high enough dimensional space is able to represent many more axis of similarity.

Embeddings are learnt from examples, with the learning algorithm trying to map items that are similar to be close together in the embedding space, and items that are dissimilar to be distant from each other. For example, one could generate an embedding of face photos based on visual similarity by training it with many photos of each of a large number of people, and have the embedding learn to group all photos of the same person to be close together, and further away from those of other individuals. If you now had a new photo and wanted to know who it is (or who it most looks like), you'd generate the embedding for the new photo and determine what other photos it is close to in the embedding space.

Another example would be to create an embedding of words, trying to capture the meanings of words. The common way to to this is to take advantage of the fact that words are largely defined by use/context, so you can take a lot of texts and embed the constituent words such that words that are physically close together in the text are close together in the embedding space. This works surprisingly well, and words that end up close together in the embedding space can be seen to be related in terms of meaning.

Word embeddings are useful as an input to machine learning models/algorithms where you want the model to "understand" the words, and so it is useful if words with similar meaning have similar representations (i.e. their embeddings are close together), and vice versa.

gk1 · 2 years ago

Choose your flavor:

https://www.pinecone.io/learn/vector-embeddings/

https://www.pinecone.io/learn/vector-embeddings-for-develope...

vharuck · 2 years ago

As opwieurposiu said, embeddings are high-dimensional vectors. Often, they're created by classic math techniques (e.g. principal component analysis), or they are extracted from a model that proved useful for something else.

For example, a neural net model accepts a massive number of input values that directly map to the input. So those initial values don't add any info. But a layer further inside the model, with fewer values and probably close to the end, is smaller and should reflect what the model's learned. Like a lot of deep learning, three values work but don't give much insight.

If I'm wrong, I hope somebody more knowledge corrects me. I got my understanding from basic into tutorials and Wolfram's essay on ChatGPT: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

fudged71 · 2 years ago

A word or sentence embedding is a long array of numbers that represents the semantic "position" in a high dimensional space, which allows you to find the distance between any two sentences in this semantic space. My understanding of paragraph and document embeddings is that they are an average of all the sentence vectors combined as one point, which lets you find the distance between any two sentences in this semantic space.

crabbone · 2 years ago

Yeah... for a while I wanted to understand what a vector database is, but this article reads like a thinly-veiled advertorial: too many buzzwords, and the content feels like the author doesn't really have a good knowledge of the subject and is just trying to advertise the tech their company is selling.

Buttons840 · 2 years ago

An embedding is a series of numbers that have been gradually shifted to better fit some purpose. The gradients tell me that if I increase the first number of embedding X a little, the model will perform better, so I do.

I don't understand how so much money has been poured in to these companies?

I get the why the techniques are suitable, but I just assumed who ever wants to do this kind of retrieval can probably implement a suitable Approx. NN library themselves?

Especially so, because getting good embeddings is the hard part, not the search?

itsoktocry · 2 years ago

>I don't understand how so much money has been poured in to these companies?

First time here? Just kidding. But not.

You have to separate the VC hype with the product, because the VCs always need something to overhype. Half these people were pumping money in to crypto and whatever-the-hell-web3-is/was just a couple months ago, this is just the next thing they like. Half these companies probably aren't remotely good companies.

The VC money hardly ever makes sense.

jderiksen · 2 years ago

I am on a small team that initially rolled our own semantic search system. We quickly ran into issues around scaling, maintenance, and performance. Since we want to focus on delivering features and not turning into a DevOps team, we switched to Pinecone and it has met our needs pretty well. We would like to see auto-scaling and I believe that this feature is in the works. Support has been very responsive and helpful when we do have questions and issues.

There are plenty of LLMs to choose from with regard to finding sources of embeddings. Some free, some for money.

moneywoes · 2 years ago

may I ask what your use case is for the vector db?

struggling to see the reason for the sudden demand

heipei · 2 years ago

For the same reason you have money going to various SQL-as-a-service companies that run Postgres / MySQL for you as a service: Some folks would rather eat the network latency, give up control of their data and complicate their compliance process than operating a database themselves.

pantulis · 2 years ago

The difference with SQL is that it's not like storing vector embeddings outside your perimeter suppose a big compliance issue --at least in security or legal terms. Giving up control of their data and network latency are legit concerns, that's for sure.

hiyou102 · 2 years ago

A lot of people use some form of managed services if they are in the cloud. Be it S3 or Dynamo DB. Generally cheaper than running things yourself and operationally much easier too.

jstx1 · 2 years ago

Anyone who wants this can implement their own library themselves? When has this worked for any problem ever?

Searching efficiently is a problem, and there's several open source and proprietary solutions but I don't get how you can put it in the "everyone should roll their own" category.

softwaredoug · 2 years ago

At billion vector scale, doing this yourself is pretty impossible

dmezzetti · 2 years ago

Faiss has long discussed strategies for scaling to 1B - 1T records here - https://github.com/facebookresearch/faiss/wiki/Indexing-1G-v...

There are plenty of options available to run your own local vector database, txtai is one of them. Ultimately depends if you have a sizable development team or not. But saying it is impossible is a step too far.

quickthrower2 · 2 years ago

If you have a billion vectors, is “yourself” a large tech company who does stuff like roll their own browsers, programming languages, invents kubernetes etc. Probably could roll this! And indeed sell this.

gpderetta · 2 years ago

Last time I had to deal with vector representation of documents was more than 10 years ago, so I'm a bit rusty, but billion vector scale sound relatively trivial.

VHRanger · 2 years ago

not particularly?

1B vectors * 300dimensions * float32 (4 Bytes) ~= 1.2TB

This pretty much still runs on consumer hardware.

Just run that on a 4TB nvme ssd, or a RAID array of ssd's if you're frisky.

ramoz · 2 years ago

lol, not true. Even for huge vectors (1000 page docs), today you can do this with enough disk storage with something like leveldb on a single node, and in memory with something like ScaNN for nearest neighbor.

Deleted Comment

ShamelessC · 2 years ago

It is at the intersection of technology investors "know" (databases) and technology investors don't know, but have been told is about to blow up (ML).

It is also effectively "roll your own Google/Shazam/whatever", which probably makes for a fancy demo to those who don't know how trivial it is to implement.

Basically investors are morons on average.

carimura · 2 years ago

Because if AI is the gold rush VC's want to find the Levi's and Wells Fargo's.

mritchie712 · 2 years ago

If you want to play with a vector database and already use postgres, there's pgvector[0]. It's easy to add as an extension (supports Postgres 11+).

Supabase wrote a solid tutorial[1] (you don't need to run it on Supabase).

0 - https://github.com/pgvector/pgvector

1 - https://supabase.com/blog/openai-embeddings-postgres-vector

imaurer · 2 years ago

I am bullish Pgvector because I am “postgres for everything guy”.

Current concerns are the scaling and recall performance.

The author is looking at product quantization along with other ideas: https://github.com/pgvector/pgvector/issues/27

More details on product quantization: https://mccormickml.com/2017/10/13/product-quantizer-tutoria...

A nice repo that tracks the ANN relative performance of different indexes: https://mccormickml.com/2017/10/13/product-quantizer-tutoria...

Also shoutout to Weaviate because they have great docs, are open source and have very informative YouTube channel.

https://weaviate.io/

videlov · 2 years ago

Over the past couple of days I tried 11 different vector databases, in order to evaluate and decide which one we'd choose for our use case.

I ended up choosing Weaviate specifically because of the nice docs, but beyond that, time will tell.

Yep, we're (https://www.definite.app/) using pgvector and I was initially concerned about scaling, but it doesn't seem it will be a problem for our use case. I definitely wouldn't use it if I was building a feature for Slack, but works for us!

supriyo-biswas · 2 years ago

Can you search both by an equality comparison and a vector search in weaviate? I’d like to do something along the lines of `SELECT * FROM table t WHERE cosine_dist(:my_embedding, t.doc_embedding) < :x AND some_column = “XYZ”`

KyeRussell · 2 years ago

Amen. After suffering through many years of people telling me to use document databases when I was much better served with—at most—Postgres with a jsonb field, I feel vindicated enough to feel justified in doing my due diligence before going off the beaten track.

Not that document databases don’t have their place, but…MongoDB is webscale and all that.

mmaia · 2 years ago

Yup. pgvector will do it for a lot of projects, specially if you're just trying things out. It think of it as using PostgreSQL full text search before you need to deploy a decidated solution.

swe_dima · 2 years ago

Google Cloud still doesn't support this plugin, big shame

AWS just added yesterday. Hosting options tracked here:

https://github.com/pgvector/pgvector/issues/54

tornato7 · 2 years ago

Also plugging my crappy vector database, which you probably shouldn't use for anything but a fun project, however it can be set up and used in seconds. https://github.com/corlinp/Victor

justchad · 2 years ago

I'm bullish on pgvector as well. Now that RDS supports it as well as plenty of other cloud providers it seems like a no-brainer to be able to stick with your existing stack (assuming it's postgres). Andrew Kane is such a prolific open-source maintainer as well.

Replit now has Postgres databases. Do you know if it's possible to use pgvector on replit?

time_to_smile · 2 years ago

Out of curiosity, what's the use case here?

It seems like if the goal is to "play around with vector databases", why not just install it on your local machine? Part of using these tools is learning how they work and configuring them yourself.

If the goal is "start developing products using vector data bases" then it seems like you would surely want something a bit more under your control than using replit.

andre-z · 2 years ago

An open-source Pinecone alternative https://github.com/qdrant/qdrant With a cloud offering along https://cloud.qdrant.io with 1GB cluster for free to try out. Disclaimer: I'm part of the team.

vlovich123 · 2 years ago

Curious with why you went for an Apache license. Aren’t you worried about copy-cat services? Or does the OSS version lack the scaling/distributed features that would be more difficult to replicate? I think that was ES’s fatal mistake and their licensing games are unlikely to pan out.

softfalcon · 2 years ago

The Coral Project [0] (commenting platform used on Washington Post, New York Times, The Verge) uses an Apache 2.0 license [1]. Which doesn't seem to have prevented it from raking in big SaaS customers.

A lot of people worry about copy-cat services, but it's kind of rare that someone will be able to compete with you as the original in hosting your own service as well as you can. Especially when you consider support and maintenance requirements of a new product you aren't personally developing.

I could see copy-cat services being more of an issue in the late stage of a product though? When everyone knows lots about how to stand it up and use it?

[0] https://coralproject.net/

[1] https://github.com/coralproject/talk/blob/develop/LICENSE

zcesur · 2 years ago

qdrant also pays open source contributors: https://news.ycombinator.com/item?id=35828003

disclaimer: i'm a founder at algora.io, the platform that enables these paid contributions

ffback · 2 years ago

"Algora charges a 23% fee over your rewarded bounties (20% Algora fee + 3% Stripe fee). The fee is applied when you complete your bounty payments."

https://docs.algora.io/bounties/payments#compliance

If algora.io didn't charge %23 of the bounty I would have tried to contribute. It felt unfair to me.

bobosha · 2 years ago

A +1 for qdrant from a happy user. we use qdrant in production with a 50-100MM rows scale. Haven't experienced many bottlenecks thus far, and has performed quite well.

@qdrant_team: perhaps you should look into offering it as a service, a la pinecone.

edit: oops just checked your (updated) website and notice you have an offering already. Congrats! will check it out. ty =)

You are welcome! Feel free to reach out for the early adopter discount ;)

echelon · 2 years ago

Perfect for our next zero shot model. We'll give it a spin!

Thanks for building this.

mbrochh · 2 years ago

Is it different/better than https://milvus.io ?

jcmoscon · 2 years ago

Hey this is pretty cool! I will try it! Thanks for sharing it.

monkeydust · 2 years ago

Might try this, using pinecone but there documentation even for simple use cases is pretty poor.

woile · 2 years ago

how much do one need to know of databases to work at qdrant? Sounds like a nice place to work, specially if remote

Well, to work on the core of the Qdrant engine https://github.com/qdrant/qdrant you should have some db knowledge but even more important are Rust skills. However, we have also other products, like the cloud platform https://cloud.qdrant.io there we are looking for different skills.

zh217 · 2 years ago

If anyone wants to try a FOSS vector-relational-graph hybrid database for more complicated workloads than simple vector search, here it is: https://github.com/cozodb/cozo/

About the integrated vector search: https://docs.cozodb.org/en/latest/releases/v0.6.html

It also does duplicate detection (Minhash-LSH) and full-text search within the query language itself: https://docs.cozodb.org/en/latest/releases/v0.7.html

HN discussion a few days ago: https://news.ycombinator.com/item?id=35641164

Disclaimer: I wrote it.

digdugdirk · 2 years ago

Glad I hopped into this thread while your comment was recent enough to be at the top. This is super interesting! Apologies if you went over this in your other post (or the docs, I'll be digging into this over the weekend) but could you share a bit about why you went this route? What you tried, what the hangups were/are with other approaches, and if there are any interesting possibilities with your approach that other vector databases just wouldn't be able to do?

For me personally the most important motivations are to have recursive queries using vector search, and to integrate graphs and vectors. Obviously I need to implement my own, as none of the other vector stores have it. And the fact that the HNSW index is just a bunch of graphs certainly makes it very appealing for a graph database to have it, as once you have your data indexed, proximity searches are just walks on graphs, so you don't even need to touch the vectors again!

drunkan · 2 years ago

Thanks for the links and discussions, I’m keeping an eye on this one it looks really promising, at least in the hybrid area compared to the much hyped surrealDB whose graph implementation looks more like an afterthought when you get down to the technical details, functionality and performance

dcl · 2 years ago

Y not Elasticsearch?

I don't see it addressed in the article, but Elastic 8 has ANN support, and every other feature you'd expect out of a ranking system. Vectors are only one piece of the puzzle for building such a system. (honest question, not trying to troll, as I truly do <3 these pinecone articles)

(Similarly, Y not Solr, Vespa, etc etc) :)

sidi · 2 years ago

There currently isn't a way to filter docs alongside a KNN query, and the dimension support is limited to 1024 (a Lucene limitation) and OpenAI embeddings are 1536 dimensions - also indexing performance is not comparable. Wishing this changes, as they're a good stack for the reasons you state

True though I do think 2k dims is coming it 8.8

Plus Elasticsearch is a breeze to operate and scale in a fault-tolerant matter.

bbarnett · 2 years ago

I suspect missing sarcasm tags here. The very least, from a lack of a stable, security update only release beanch.

trgn · 2 years ago

Vector will eventually just be another data-type in all db-systems. Already so many production systems have their data replicated across multiple dbs, just to accommodate different use-cases. I'm not keen in adding yet another one.

hobs · 2 years ago

In the ANN benchmarks Elastic sets the bottom bar afaict.http://ann-benchmarks.com/

That appears to be the old community maintained plugin, Elastic KNN, not the official Lucene based HNSW implementation.

Hey Doug :)

We always encourage folks to do their own testing. Everyone has different performance requirements, data shapes/sizes, budgets, and expectations of the user experience.

Elasticsearch is a great option. But clearly there's a large cohort of smart teams that decided the combination of performance + cost + scale + [etc] on Pinecone makes more sense for them.

Hey Greg! Yes I am trolling a bit, to see what the answers might be.

IMO - the real reason "Y Not Elasticsearch" is not because they're dumb or its bad. It's actually because they're not building for the search / AI market like you all are :)

When someone runs out of RAM with their Numpy array, they google, and you guys come up really speaking to that audience, building features, showing people how to build specific solutions, etc.

carom · 2 years ago

I looked at the concepts in FAISS and it seems fairly straightforward. In non-jargon you have dimensionality reduction and neighborhoods.

DR is taking a long embedding and doing something to make it shorter. An easy to follow method for this is minhash.

Neighborhoods is representing a cluster of embeddings with a single representative to speed up comparisons. For example, find me the two closest representatives then doing a deeper comparison on all the residents.

Now the feature I haven't seem that will probably cause me to build instead of buy. Most seem designed for a single organization and a single use. For example, Spotify song recommender.

I would like to store embedding from multiple models and be able to search per model. I would also like fine grain user access control, so users could search their embeddings and grant access to others.

If the different models use the same dimensionality, you can keep their embeddings within different namespaces inside the same index. See: https://docs.pinecone.io/docs/namespaces

If you mean "user access control" within your company, there are basic access controls within Pinecone. See: https://docs.pinecone.io/docs/add-users-to-projects-and-orga...

If you mean for your end-users, you can use namespaces again to separate embeddings for different users inside one index. See: https://docs.pinecone.io/docs/multitenancy

There isn't yet a combination of the two, where you provide Pinecone API access to end-users.

Thank you, I'll definitely play with pinecone before I build. The dimensionality might vary between models or versions of models. Additionally, the end goal would be to expose it to users and not have to post filter. So probably an index per user. Not sure how expensive that is to recalculate regularly.

nutanc · 2 years ago

Do any of the vector databases have support for bit embeddings. We have created bit embeddings[1] for sentences and they save a lot of space. Currently we are just using numpy and sometimes faiss to search through these bit embeddings. Would love for one of the vector dbs to support bit embeddings natively. Then we don't have to engineer that piece :)

[1] https://gpt3experiments.substack.com/p/building-a-new-embedd...

esafak · 2 years ago

These are called binary embeddings, and they have been used successfully at pinterest (https://www.arxiv-vanity.com/papers/1908.01707/) and Tencent (https://paperswithcode.com/paper/binary-embedding-based-retr...)

I can't speak for the competition, but weaviate seems to support them: https://weaviate.io/developers/weaviate/concepts/binary-pass...

bckr · 2 years ago

Love this idea. Do you have measurements on how it impacts performance of algorithms?