I wrote a simple DynamoDB-style database with a Python dictionary and Google's pygtrie (a "trie" data structure)
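For anyone curious what the skeleton of that looks like, here's a minimal stdlib-only sketch: a dict for point lookups plus an ordered-key prefix scan standing in for pygtrie's trie queries. The class and method names are made up for illustration, and `bisect` over sorted keys is a stand-in for the actual trie.

```python
import bisect

class TinyKV:
    """Sketch of a dict + trie KV store. The sorted key list emulates
    the ordering a trie gives you; pygtrie's CharTrie would replace it."""

    def __init__(self):
        self.data = {}   # exact-match lookups, like the plain dict store
        self.keys = []   # sorted keys, emulating the trie's ordering

    def put(self, key, value):
        if key not in self.data:
            bisect.insort(self.keys, key)
        self.data[key] = value

    def get(self, key):
        return self.data[key]

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose keys share `prefix` --
        the query a trie answers natively."""
        i = bisect.bisect_left(self.keys, prefix)
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            yield self.keys[i], self.data[self.keys[i]]
            i += 1
```

The prefix scan is what makes this "DynamoDB style": a composite key like `user#1` lets you fetch one item exactly or range-scan everything under `user#`.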
It's still a toy but I kept adding features.
I then kept working on it and added distribution with consistent hashing, rudimentary SQL joins, Cypher graph database queries and document storage. You can even query documents with SQL.
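The consistent-hashing part fits in a few lines. This is a hedged sketch, not the actual hash-db code: the `HashRing` name, the virtual-node count, and the MD5 choice are all illustrative assumptions.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: nodes are placed at several points on a ring,
    and a key belongs to the first node clockwise of its hash."""

    def __init__(self, nodes, vnodes=8):
        self.ring = []  # sorted list of (hash, node) points
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))  # first point clockwise of h
        return self.ring[i % len(self.ring)][1]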
I didn't get around to changing the graph storage to be multimodal.
It takes very little code to write something with lots of features.
How is that word pronounced? If it's pronounced "tree", it clashes with the name of another data structure. But if it's pronounced "try", it clashes with the name of a reserved word in many languages.
The idea was independently described in 1960 by Edward Fredkin, who coined the term trie, pronouncing it /ˈtriː/ (as "tree"), after the middle syllable of retrieval. However, other authors pronounce it /ˈtraɪ/ (as "try"), in an attempt to distinguish it verbally from "tree".
Minutes of fun to be had in trying to pronounce it at the precise midpoint of however you pronounce "tree" and "try".
(That word breaks my brain, too - the above is a coping mechanism.)
In the past I recall reading/hearing "trie-tree" (pronounced "try tree") as an ambiguity reducer, but I don't know how common that is now, or ever was.
Bitcask is an excellent paper that is not overwhelming to understand and offers a great stepping stone in building your own data stores. The simple yet powerful design of an append-only file is elegant and performant.
I’d love to read about more such implementations, if anyone has any recommendations.
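Not the actual Bitcask format, but a sketch of its core idea: an append-only data file plus an in-memory "keydir" mapping each key to the offset of its latest value. The class name and record header layout here are invented for illustration.

```python
import os
import struct

class MiniCask:
    """Bitcask-flavoured sketch: every put appends a record to the data
    file; the keydir remembers where the latest value for each key lives."""

    def __init__(self, path):
        self.keydir = {}  # key -> (value_offset, value_size)
        self.f = open(path, "ab+")

    def put(self, key, value):
        k, v = key.encode(), value.encode()
        self.f.seek(0, os.SEEK_END)
        header = struct.pack(">II", len(k), len(v))
        offset = self.f.tell() + len(header) + len(k)  # where the value starts
        self.f.write(header + k + v)
        self.f.flush()
        self.keydir[key] = (offset, len(v))

    def get(self, key):
        offset, size = self.keydir[key]
        self.f.seek(offset)
        return self.f.read(size).decode()
```

A real Bitcask also rebuilds the keydir by scanning the data files at startup and periodically merges out stale records; both are omitted here.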
This is most exhaustively talked about in the context of Log-Structured Merge Trees, like LevelDB and RocksDB. They essentially structure the append-only/write-once chunks into tiers of generations (each compaction shifting the still-alive data to a more long-lived tier), reminiscent of generational GC except now shaped like a tree.
I think what Bitcask proposes (routinely collecting data files, marking them as stale, and GCing them later) is quite a simple solution. It would work for a lot of use cases.
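A toy in-memory version of that tiering/compaction idea (names and the memtable limit are made up; real LSM engines flush to sorted files on disk and compact across generations):

```python
class TinyLSM:
    """Pocket-sized LSM-style store: writes land in a mutable memtable,
    which is flushed to immutable sorted segments; compaction merges
    segments and drops shadowed (stale) entries -- the GC step."""

    MEMTABLE_LIMIT = 2  # tiny, to force flushes in the demo

    def __init__(self):
        self.memtable = {}
        self.segments = []  # oldest first; each is a sorted list of pairs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all segments, keeping only the newest value per key."""
        merged = {}
        for segment in self.segments:  # later segments overwrite earlier
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []
```

Real engines binary-search each sorted segment (and use Bloom filters to skip them); the linear scan here is just to keep the sketch short.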
If building a distributed KV store interests you, you might want to check out Georgia Tech's OMSCS Distributed Computing course: https://omscs.gatech.edu/cs-7210-distributed-computing . While rewarding, this class was hell and still gives me nightmares....
Going through this class now, and I can attest that it is in fact tough as hell. However, I already feel as if it will be the most rewarding learning experience in the entire program.
I like how the author says "fast" KV store but includes no performance analysis. From looking at the implementation, my guess is that this is a very slow KV store.
If you need a fast, performant KV store, stick to RocksDB or LevelDB. Don't reinvent the wheel. Most of the performance comes from optimizing for the OS and CPU.
Bitcask was a storage engine created by Basho for their Riak DB (a DynamoDB-like distributed database).
The other storage engine Riak could use was Google's LevelDB.
At the time, Bitcask was as fast as or faster than LevelDB for most tasks. This was, I think, around 2010. I have no idea how it compares with a recent version of RocksDB, but at the time it was pretty fast.
I am a big fan of Build Your Own X educational projects. Recently, I released a Go version of my build-your-own-KV-store project.
I have set up this project in TDD fashion with the tests. So, you start with simple functions, pass the tests, and the difficulty level goes up. There are hints if you get stuck. When all the tests pass, you will have written a persistent key-value store.
Building your own database is a great exercise. There are lots of topics to explore, like distributed computing, how indexes work, caching, parsing, etc., and you can pick and choose what to implement.
Strictly speaking, building a database is more systems engineering than distributed computing, but folks skip right past that when they opt to use "boltdb" or "rocksdb" or "innodb" and so on.
How are the keys indexed and looked up? How does it handle disk fragmentation? This has always seemed to me to be one of the harder problems with databases.
I've built an append-only scheme where the on-disk format consists of 1 gigantic splay tree with updates being written as a modified sub-tree. The last element you write to disk is always the (latest) root node. Append-only implicitly solves most disk fragmentation concerns. It also has GC capabilities by having the physical log divided into multiple files. The actual "cleanup" is simply taking an old file and feeding its items into the storage engine again. The heuristics for determining when to do this are based upon statistics collected at insert/update time.
Assuming the root node knows how to find the various things you are looking for (i.e. physical offsets to child nodes), this is how you can address the storage.
The trickiest part is finding the latest root node in adverse scenarios (i.e., plug pulled / partial write to disk). You can develop a two-tier scheme where a 32-bit magic cookie is scanned for in blocks from back to front; once it is encountered, the relevant offsets are applied and the candidate attempts deserialization + checksum. Partial writes can't be recovered in this scheme; they wind up as wasted bytes in the log.
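The back-to-front cookie scan described above can be sketched roughly like this. The `ROOT` magic, record layout, and CRC32 checksum are illustrative choices, not a spec:

```python
import struct
import zlib

MAGIC = b"ROOT"  # the 32-bit magic cookie (4 bytes)

def append_root(f, payload: bytes):
    """Write a root record: MAGIC + payload length + payload + CRC32."""
    f.write(MAGIC + struct.pack(">I", len(payload)) + payload
            + struct.pack(">I", zlib.crc32(payload)))

def find_latest_root(data: bytes):
    """Scan back to front for the last MAGIC whose record checksums OK.
    A torn final write fails the length or CRC check, so we fall back
    to the previous intact root; truly partial records are just wasted
    bytes in the log."""
    pos = data.rfind(MAGIC)
    while pos != -1:
        try:
            (n,) = struct.unpack_from(">I", data, pos + 4)
            payload = data[pos + 8 : pos + 8 + n]
            (crc,) = struct.unpack_from(">I", data, pos + 8 + n)
            if len(payload) == n and zlib.crc32(payload) == crc:
                return payload
        except struct.error:
            pass  # record runs off the end of the file: a partial write
        pos = data.rfind(MAGIC, 0, pos)
    return None
```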
At some point I had intended to use this for a work project, but then SQLite came in and ruined my little party.
If you want to read about an option that doesn't suffer from fragmentation (no pedantic replies please), check out LSM Trees which are what RocksDB uses.
Designing Data-Intensive Applications has a VERY clear and understandable chapter about them. Once you've finished reading it, you'll have the tools to whip together a toy implementation for something like this.
https://github.com/samsquire/hash-db
There's an AVL tree that farleyknight wrote and a B-tree that I wrote that need to be integrated into it.
https://github.com/codr7/whirlog
I've found that reinventing wheels is a great way to learn, even if you never use them.
https://github.com/mr-karan/barreldb/
https://en.wikipedia.org/wiki/Log-structured_merge-tree
https://en.wikipedia.org/wiki/Tracing_garbage_collection#Gen...
By the way, the course is built on top of Distributed System's lab: https://github.com/emichael/dslabs
go - https://github.com/avinassh/go-caskdb
python - https://github.com/avinassh/py-caskdb