I wrote a simple DynamoDB-style database with a Python dictionary and Google's pygtrie (a "trie" data structure)
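For anyone curious what the skeleton of that looks like, here's a minimal stdlib-only sketch: a dict for point lookups plus an ordered-key prefix scan standing in for pygtrie's trie queries. The class and method names are made up for illustration, and `bisect` over sorted keys is a stand-in for the actual trie.

```python
import bisect

class TinyKV:
    """Sketch of a dict + trie KV store. The sorted key list emulates
    the ordering a trie gives you; pygtrie's CharTrie would replace it."""

    def __init__(self):
        self.data = {}   # exact-match lookups, like the plain dict store
        self.keys = []   # sorted keys, emulating the trie's ordering

    def put(self, key, value):
        if key not in self.data:
            bisect.insort(self.keys, key)
        self.data[key] = value

    def get(self, key):
        return self.data[key]

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose keys share `prefix` --
        the query a trie answers natively."""
        i = bisect.bisect_left(self.keys, prefix)
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            yield self.keys[i], self.data[self.keys[i]]
            i += 1
```

The prefix scan is what makes this "DynamoDB style": a composite key like `user#1` lets you fetch one item exactly or range-scan everything under `user#`.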
It's still a toy but I kept adding features.
I then kept working on it and added distribution with consistent hashing, rudimentary SQL joins, Cypher graph database queries and document storage. You can even query documents with SQL.
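The consistent-hashing part fits in a few lines. This is a hedged sketch, not the actual hash-db code: the `HashRing` name, the virtual-node count, and the MD5 choice are all illustrative assumptions.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: nodes are placed at several points on a ring,
    and a key belongs to the first node clockwise of its hash."""

    def __init__(self, nodes, vnodes=8):
        self.ring = []  # sorted list of (hash, node) points
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))  # first point clockwise of h
        return self.ring[i % len(self.ring)][1]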
I didn't get around to changing the graph storage to be multimodal.
It takes very little code to write something with lots of features.
How is that word pronounced? If it's pronounced "tree", it clashes with the name of another data structure. But if it's pronounced "try", it clashes with the name of a reserved word in many languages.
The idea was independently described in 1960 by Edward Fredkin, who coined the term trie, pronouncing it /ˈtriː/ (as "tree"), after the middle syllable of retrieval. However, other authors pronounce it /ˈtraɪ/ (as "try"), in an attempt to distinguish it verbally from "tree".
Minutes of fun to be had in trying to pronounce it at the precise midpoint of however you pronounce "tree" and "try".
(That word breaks my brain, too - the above is a coping mechanism.)
In the past I recall reading/hearing "trie-tree" (pronounced "try tree") as an ambiguity reducer, but I don't know how common that is now, or ever was.
Bitcask is an excellent paper that is not overwhelming to understand and offers a great stepping stone in building your own data stores. The simple yet powerful design of an append-only file is elegant and performant.
I’d love to read about more such implementations, if anyone has any recommendations.
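Not the actual Bitcask format, but a sketch of its core idea: an append-only data file plus an in-memory "keydir" mapping each key to the offset of its latest value. The class name and record header layout here are invented for illustration.

```python
import os
import struct

class MiniCask:
    """Bitcask-flavoured sketch: every put appends a record to the data
    file; the keydir remembers where the latest value for each key lives."""

    def __init__(self, path):
        self.keydir = {}  # key -> (value_offset, value_size)
        self.f = open(path, "ab+")

    def put(self, key, value):
        k, v = key.encode(), value.encode()
        self.f.seek(0, os.SEEK_END)
        header = struct.pack(">II", len(k), len(v))
        offset = self.f.tell() + len(header) + len(k)  # where the value starts
        self.f.write(header + k + v)
        self.f.flush()
        self.keydir[key] = (offset, len(v))

    def get(self, key):
        offset, size = self.keydir[key]
        self.f.seek(offset)
        return self.f.read(size).decode()
```

A real Bitcask also rebuilds the keydir by scanning the data files at startup and periodically merges out stale records; both are omitted here.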
This is most exhaustively talked about in the context of Log-Structured Merge Trees, like LevelDB and RocksDB. They essentially structure the append-only/write-once chunks into tiers of generations (each compaction shifting the still-alive data to a more long-lived tier), reminiscent of generational GC except now shaped like a tree.
I think what Bitcask proposes (routinely collecting data files, marking them as stale, and GCing them later) is quite a simple solution. It would work for a lot of use cases.
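A toy in-memory version of that tiering/compaction idea (names and the memtable limit are made up; real LSM engines flush to sorted files on disk and compact across generations):

```python
class TinyLSM:
    """Pocket-sized LSM-style store: writes land in a mutable memtable,
    which is flushed to immutable sorted segments; compaction merges
    segments and drops shadowed (stale) entries -- the GC step."""

    MEMTABLE_LIMIT = 2  # tiny, to force flushes in the demo

    def __init__(self):
        self.memtable = {}
        self.segments = []  # oldest first; each is a sorted list of pairs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all segments, keeping only the newest value per key."""
        merged = {}
        for segment in self.segments:  # later segments overwrite earlier
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []
```

Real engines binary-search each sorted segment (and use Bloom filters to skip them); the linear scan here is just to keep the sketch short.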
If building a distributed KV store interests you, you might want to check out Georgia Tech's OMSCS Distributed Computing course: https://omscs.gatech.edu/cs-7210-distributed-computing . While rewarding, this class was hell and still gives me nightmares....
Going through this class now, and I can attest that it is in fact tough as hell. However, I already feel as if it will be the most rewarding learning experience in the entire program.
I like how the author says "fast" KV store but includes no performance analysis. From looking at the implementation, my guess is that this is a very slow KV store.
If you need a fast, performant KV store, stick to RocksDB or LevelDB. Don't reinvent the wheel. Most of the performance comes from optimizing for the OS and CPU.
Bitcask was a storage engine created by Basho for their Riak DB (a DynamoDB-like distributed database).
The other storage engine Riak could use was Google's LevelDB.
At the time, Bitcask was as fast as or faster than LevelDB for most tasks. This was, I think, around 2010. I have no idea how it compares with a recent version of RocksDB, but at the time it was pretty fast.
I am a big fan of Build Your Own X educational projects. Recently, I released a Go version of my build-your-own-KV-store project.
I have set up this project in TDD fashion with the tests. So, you start with simple functions, pass the tests, and the difficulty level goes up. There are hints if you get stuck. When all the tests pass, you will have written a persistent key-value store.
Building your own database is a great exercise. There are lots of topics to explore, like distributed computing, how indexes work, caching, parsing, etc., and you can pick and choose what to implement.
Strictly speaking, building a database is more systems engineering than distributed computing, but folks skip right past that when they opt to use "boltdb" or "rocksdb" or "innodb" and so on.
How are the keys indexed and looked up? How does it handle disk fragmentation? This has always seemed to me to be one of the harder problems with databases.
I've built an append-only scheme where the on-disk format consists of 1 gigantic splay tree with updates being written as a modified sub-tree. The last element you write to disk is always the (latest) root node. Append-only implicitly solves most disk fragmentation concerns. It also has GC capabilities by having the physical log divided into multiple files. The actual "cleanup" is simply taking an old file and feeding its items into the storage engine again. The heuristics for determining when to do this are based upon statistics collected at insert/update time.
Assuming the root node knows how to find the various things you are looking for (i.e. physical offsets to child nodes), this is how you can address the storage.
The trickiest part is finding the latest root node in adverse scenarios (i.e., plug pulled / partial write to disk). You can develop a two-tier scheme where a 32-bit magic cookie is scanned for in blocks from back to front; once it is encountered, the relevant offsets are applied and the candidate attempts deserialization + checksum. Partial writes can't be recovered in this scheme; they wind up as wasted bytes in the log.
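The back-to-front cookie scan described above can be sketched roughly like this. The `ROOT` magic, record layout, and CRC32 checksum are illustrative choices, not a spec:

```python
import struct
import zlib

MAGIC = b"ROOT"  # the 32-bit magic cookie (4 bytes)

def append_root(f, payload: bytes):
    """Write a root record: MAGIC + payload length + payload + CRC32."""
    f.write(MAGIC + struct.pack(">I", len(payload)) + payload
            + struct.pack(">I", zlib.crc32(payload)))

def find_latest_root(data: bytes):
    """Scan back to front for the last MAGIC whose record checksums OK.
    A torn final write fails the length or CRC check, so we fall back
    to the previous intact root; truly partial records are just wasted
    bytes in the log."""
    pos = data.rfind(MAGIC)
    while pos != -1:
        try:
            (n,) = struct.unpack_from(">I", data, pos + 4)
            payload = data[pos + 8 : pos + 8 + n]
            (crc,) = struct.unpack_from(">I", data, pos + 8 + n)
            if len(payload) == n and zlib.crc32(payload) == crc:
                return payload
        except struct.error:
            pass  # record runs off the end of the file: a partial write
        pos = data.rfind(MAGIC, 0, pos)
    return None
```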
At some point I had intended to use this for a work project, but then SQLite came in and ruined my little party.
If you want to read about an option that doesn't suffer from fragmentation (no pedantic replies please), check out LSM Trees which are what RocksDB uses.
Designing Data-Intensive Applications has a VERY clear and understandable chapter about them. Once you've finished reading it, you'll have the tools to whip together a toy implementation for something like this.
https://github.com/samsquire/hash-db
There's an AVL tree that farleyknight wrote and a B-tree that I wrote that need to be integrated into it.
https://github.com/codr7/whirlog
I've found that reinventing wheels is a great way to learn, even if you never use them.
https://github.com/mr-karan/barreldb/
https://en.wikipedia.org/wiki/Log-structured_merge-tree
https://en.wikipedia.org/wiki/Tracing_garbage_collection#Gen...
By the way, the course is built on top of Distributed System's lab: https://github.com/emichael/dslabs
go - https://github.com/avinassh/go-caskdb
python - https://github.com/avinassh/py-caskdb