xavcochran (u/xavcochran)

xavcochran commented on Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust) github.com/HelixDB/helix-... · Posted by u/GeorgeCurtis

quantike · 7 months ago

I appreciate the reply! Yeah, it sounds like the correct path forward is swapping out the type for an enum of the numeric types you want to cover.

I'd be curious whether there's some benefit to runtime memory utilization from baking in the precision of the vector if it's known at comptime/runtime. In my own usage of vector DBs I've only ever used a single precision (f32), and often have a single, known dimension. But if Helix is aiming for something more general purpose, then it makes sense to offer the mixing of precision and dimension in the internals.

Cheers

xavcochran · 7 months ago

The benefit of baking in the dimension and the size of individual elements (the precision) is that the vector's total size is known at compile time, so it can be allocated on the stack instead of on the heap.
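A minimal sketch of the difference, assuming a const-generic wrapper. `FixedVector`, `DynVector` and `fixed_size_bytes` are hypothetical names for illustration, not Helix's actual types:

```rust
use std::mem::size_of;

/// Vector whose dimension N is fixed at compile time.
/// The elements are stored inline, so a local value of this type
/// needs no heap allocation.
pub struct FixedVector<const N: usize> {
    pub data: [f32; N],
}

/// Vector whose dimension is only known at runtime.
/// The struct holds a pointer, capacity and length; the elements
/// themselves live in a separate heap allocation.
pub struct DynVector {
    pub data: Vec<f32>,
}

impl<const N: usize> FixedVector<N> {
    /// Dot product of two vectors that are guaranteed by the type
    /// system to have the same dimension.
    pub fn dot(&self, other: &Self) -> f32 {
        self.data
            .iter()
            .zip(other.data.iter())
            .map(|(a, b)| a * b)
            .sum()
    }
}

/// Size in bytes of a FixedVector<N>, computed at compile time.
pub fn fixed_size_bytes<const N: usize>() -> usize {
    size_of::<FixedVector<N>>()
}
```

With this layout the compiler knows that `FixedVector<4>` is exactly four `f32`s wide, whereas `DynVector` is always three machine words regardless of how many elements it points to. Mixing precisions would mean adding a second generic parameter for the element type.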

xavcochran commented on Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust) github.com/HelixDB/helix-... · Posted by u/GeorgeCurtis

iannacl · 7 months ago

Looks really interesting. A couple of questions: Can you explain how helix handles writes? What are you using for keys? UUIDs? I'm curious if you've done, or are thinking about, any optimizations here.

Feel free to point me to docs / code if these are lazy questions :)

xavcochran · 7 months ago

We utilize some of LMDB's optimizations, such as the APPEND put flag. We also rely on LMDB's duplicate handling, which stores duplicates as multiple values under a single key instead of repeating the key. This means we can get all values for one key in one call rather than making a call for each duplicate.

For keys we are using UUIDs, specifically the v6 timestamped UUIDs, so that they are lexicographically ordered by creation time. This means keys can be inserted into LMDB using the APPEND flag, so LMDB shortcuts to the rightmost leaf in its B-Tree (rather than starting at the root) and appends the new record. It can do this because the records are ordered by creation time, meaning each new record is guaranteed to be larger (in terms of big-endian byte order) than the previous record.
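A rough sketch of why the byte order matters, using only the standard library. The key layout here (timestamp in the high 64 bits, counter in the low 64 bits) is a simplified assumption and not the real UUID v6 bit layout, and `OrderedStore` is a hypothetical stand-in for the database, not LMDB's API:

```rust
use std::collections::BTreeMap;

/// Hypothetical key: timestamp in the high 64 bits, counter in the low
/// 64 bits, serialized big-endian so byte-wise comparison matches
/// creation order.
pub fn make_key(timestamp: u64, counter: u64) -> [u8; 16] {
    let id: u128 = ((timestamp as u128) << 64) | (counter as u128);
    id.to_be_bytes()
}

/// Stand-in for an ordered key-value store. `append` only accepts keys
/// strictly greater than the current maximum, which is the precondition
/// an append-style insert relies on.
pub struct OrderedStore {
    map: BTreeMap<[u8; 16], Vec<u8>>,
}

impl OrderedStore {
    pub fn new() -> Self {
        OrderedStore { map: BTreeMap::new() }
    }

    pub fn append(&mut self, key: [u8; 16], value: Vec<u8>) -> Result<(), &'static str> {
        if let Some((last, _)) = self.map.last_key_value() {
            if key <= *last {
                return Err("key is not greater than the current maximum");
            }
        }
        self.map.insert(key, value);
        Ok(())
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }
}
```

Keys built from increasing timestamps always pass the `append` check. If the same ids were serialized little-endian, byte-wise comparison would no longer follow creation order (256 would sort before 255), and the append shortcut would not be safe.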

We also store the UUIDs as u128 values rather than strings, for two reasons that both come down to size. A u128 takes up 16 bytes, whereas a string UUID takes up 36 bytes. This means we store about 56% less data, and LMDB has to decode about 56% fewer bytes when doing code accesses.
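The 16-byte versus 36-byte comparison can be checked with a small sketch. The function names are hypothetical and only illustrate the arithmetic:

```rust
/// Format a 128-bit id in the canonical hyphenated 8-4-4-4-12 hex form.
pub fn to_hyphenated(id: u128) -> String {
    let hex = format!("{:032x}", id);
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}

/// The same id as a fixed 16-byte big-endian key.
pub fn binary_key(id: u128) -> [u8; 16] {
    id.to_be_bytes()
}

/// Percentage of bytes saved by the binary form over the text form.
pub fn savings_percent(binary_len: usize, text_len: usize) -> f64 {
    100.0 * (1.0 - binary_len as f64 / text_len as f64)
}
```

Calling `savings_percent(16, 36)` gives roughly 55.6, which is where the rounded 56% figure comes from.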

For the outgoing/incoming edges of nodes, we store them as fixed-size values, which means LMDB packs them together, removing the 8-byte header per key-value pair.
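To illustrate the idea of packing fixed-size entries, here is a sketch that compares a headerless layout with one that carries an 8-byte length prefix per record. This models the concept only: the 16-byte entry size and the length-prefix format are assumptions for illustration, not LMDB's actual page layout or Helix's edge encoding:

```rust
/// Hypothetical fixed entry size: one 16-byte id per edge.
pub const EDGE_ENTRY_LEN: usize = 16;

/// Pack ids back to back with no per-record header. Because every
/// entry has the same length, record i starts at i * EDGE_ENTRY_LEN.
pub fn pack_fixed(ids: &[u128]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(ids.len() * EDGE_ENTRY_LEN);
    for id in ids {
        buf.extend_from_slice(&id.to_be_bytes());
    }
    buf
}

/// Decode a buffer produced by `pack_fixed`.
pub fn unpack_fixed(buf: &[u8]) -> Vec<u128> {
    buf.chunks_exact(EDGE_ENTRY_LEN)
        .map(|chunk| {
            let mut arr = [0u8; EDGE_ENTRY_LEN];
            arr.copy_from_slice(chunk);
            u128::from_be_bytes(arr)
        })
        .collect()
}

/// Same ids, but each record is preceded by an 8-byte length prefix,
/// as a variable-size layout would need.
pub fn pack_with_headers(ids: &[u128]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(ids.len() * (8 + EDGE_ENTRY_LEN));
    for id in ids {
        buf.extend_from_slice(&(EDGE_ENTRY_LEN as u64).to_be_bytes());
        buf.extend_from_slice(&id.to_be_bytes());
    }
    buf
}
```

For three edges the packed form is 48 bytes against 72 bytes with headers, so the saving grows linearly with the number of edges per node.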

In the future, we are also going to separate the properties from the stored value as empty property objects still take up 8 bytes of space. We will also make it so nothing is inserted if the properties are empty.

You can see most of this in action in the storage core file: https://github.com/HelixDB/helix-db/blob/main/helixdb/src/he...

xavcochran commented on Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust) github.com/HelixDB/helix-... · Posted by u/GeorgeCurtis

BlooIt · 7 months ago

Shameless plug: If you're exploring graph+vector databases, check out https://github.com/Pometry/Raphtory/ — with a full Python SDK and built-in support for most common graph algorithms.

It’s built in Rust with native vector support. The open-source version is in-memory, but the commercial version supports disk-based scaling (we tested it with a 3TB graph on an M1 MacBook + insert all 100x faster than existing GraphDBs).

xavcochran · 7 months ago

Looking at your benchmarks, you say that inserting 1k edges takes around 500,000 ns/iteration. Is that 500,000 ns per edge insertion, or for all 1k of them?

xavcochran commented on Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust) github.com/HelixDB/helix-... · Posted by u/GeorgeCurtis

lleymrl651 · 7 months ago

Congrats on the launch!

xavcochran · 7 months ago

thank you! any feedback would be much appreciated