Deleted Comment
Deleted Comment
Deleted Comment
Feel free to point me to docs / code if these are lazy questions :)
For keys we are using UUIDs, but using the v6 timestamped uuids so that they are easily lexicographically ordered at creation time. This means keys inserted into LMDB are inserted using the APPEND flag, meaning LMDB shortcuts to the rightmost leaf in its B-Tree (rather than starting at the root) and appends the new record. It can do this because the records are ordered by creation time meaning each new record is guaranteed to be larger (in terms of big-endian byte order) than the previous record.
We also store the UUIDs as u128 values for two reasons. The first is that a u128 takes up 16 bytes where as a string UUID takes up 36 bytes. This means we store 56% less data and LMDB has to decode 56% less bytes when doing code accesses.
For the outgoing/incoming edges for nodes, we store them as fixed sizes which means LMDB packs them in, removing the 8 byte header per Key-Value pair.
In the future, we are also going to separate the properties from the stored value as empty property objects still take up 8 bytes of space. We will also make it so nothing is inserted if the properties are empty.
You can see most of this in action in the storage core file: https://github.com/HelixDB/helix-db/blob/main/helixdb/src/he...
It’s built in Rust with native vector support. The open-source version is in-memory, but the commercial version supports disk-based scaling (we tested it with a 3TB graph on an M1 MacBook + insert all 100x faster than existing GraphDBs).
I'd be curious if there's some benefit to the runtime-memory utilization to baking in the precision of the vector if it's known at comptime/runtime. In my own usage of vector DBs I've only ever used a single-precision (f32), and often have a single, known dimension. But if Helix is aiming for something more general purpose, then it makes sense to offer the mixing of precision and dimension in the internals.
Cheers