foobiekr · 2 years ago
We have run foundationdb in production for roughly 10 years. It is solid, mostly trouble free (with one very important exception: you must NEVER allow any node on the cluster to exceed 90% full), robust and insanely fast (10M+ tx/sec). It is convenient, has a nice programming model, and the client includes the ability to inject random failures.

That said, I think most coders just can't deal with it. For reasons I won't go into, I came to fdb already fully aware of the compromises that software transactional memories have, and fdb roughly matches the semantics of those: retry on failure, a maximum transaction size, a maximum transaction time, and so on. For those who haven't used it, start here: https://apple.github.io/foundationdb/developer-guide.html ; especially the section on transactions.
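
A minimal sketch of what that programming model looks like in the Python binding (the key name is illustrative): the @fdb.transactional decorator reruns the whole function on a retryable error, so the body has to be idempotent and fit inside the size/time limits.

    import fdb
    fdb.api_version(710)
    db = fdb.open()

    @fdb.transactional
    def increment(tr, key):
        # On a conflict the decorator calls tr.on_error() and reruns this
        # whole body -- so no side effects outside the transaction.
        v = tr[key]
        n = int(v) if v.present() else 0
        tr[key] = str(n + 1).encode()

    increment(db, b'counter')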

These constraints are _very_ inconvenient for many kinds of applications, so, ok, you'd like a wrapper library that handles them gracefully and hides the details (for example, counting a range).

This seems like it should be easy to do - after all, the expectation is that _application developers_ do it directly - but in practice it isn't, and it introduces a layering violation into the data modeling if any part of your application does direct key access. I recommend people try it. It can surely be done, but that layer is now as critical as the DB itself, and that has interesting risks.
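
To make the "count of range" example concrete, here is a rough sketch (names and batch size invented) of what such a wrapper has to do - and note what it quietly gives up: the total is no longer a consistent snapshot.

    import fdb
    fdb.api_version(710)

    def count_range(db, begin, end, batch=10_000):
        # Pages through [begin, end) one transaction per batch so no single
        # transaction exceeds the ~5s / 10MB limits. NOTE: the result is not
        # a consistent snapshot -- concurrent writers can skew the count.
        total, cursor = 0, begin
        while True:
            tr = db.create_transaction()
            kvs = list(tr.get_range(cursor, end, limit=batch))
            total += len(kvs)
            if len(kvs) < batch:
                return total
            cursor = fdb.KeySelector.first_greater_than(kvs[-1].key)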

At heart, the problem is that the limits are low enough that normal applications can and do run into them, and they are annoying. It would be really nice if the FDB team would build this next layer themselves with the same degree of testing, but they have not, and I think it's pretty clear that a small-transaction KV store turns out not to be enough to build complex layers on in actuality.

Emphasis on the tested part - it's all well and good for fdb to be rock solid, but what needs to be there is that the actual interface used by 90% of applications is rock solid, and if you exceed basic small-size keys or times, that isn't really true.

Dave_Rosenthal · 2 years ago
I think that’s a good and really fair summary.

- If you’re a developer wanting to build an application, you should really use a well designed layer between yourself and FDB. A few are out there.

- If you’re a dev thinking you want to build a database from scratch you probably should just use FDB as the storage engine and work on the other parts. To start, at very least!

(One last thing that I think is a bit overlooked with FDB is how easy it is to build basically any data structure you can in memory in FDB instead. Not that it solves the transaction timeout stuff, etc. but if you want to build skip list, or a quad tree, or a vector store, or whatever else, you can quite easily use FDB to build a durable, distributed, transactional, multi-user version of the same. You don’t have to stick to just boring tables.)
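
A minimal sketch of that idea in the Python binding (subspace name and FIFO scheme are made up; a real version would use versionstamps): a durable, transactional queue falls out of key ordering almost for free.

    import os, time
    import fdb
    fdb.api_version(710)
    q = fdb.Subspace(('queue',))

    @fdb.transactional
    def enqueue(tr, item):
        # Keys sort by (timestamp, random), so iteration order is FIFO-ish.
        tr[q.pack((int(time.time() * 1e6), os.urandom(4)))] = item

    @fdb.transactional
    def dequeue(tr):
        for k, v in tr.get_range(q.range().start, q.range().stop, limit=1):
            del tr[k]   # claim and remove atomically
            return v
        return None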

foobiekr · 2 years ago
I think I would say that, for me, the biggest issue is that what little there is in the way of well-written layers is all Java. Nothing against Java, but I'd be looking for Go and Rust, not Java.

We did use fdb to back more complex data structures (a B+ tree and a kind of skip list) and it's very cool. fdb basically presents the model of a software transactional memory, and it's kind of wonderful, but it's not wonderful enough.

Another issue that I forgot to mention is that the comprehensibility of keys is your own problem. Keys and values are just bytes - if you don't start day one with a shared envelope for all writers, you _will_ be in pain eventually. This can get kind of ugly if you somehow end up leaking keys that you can't identify.
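
A cheap version of that shared envelope, using the Python binding's tuple layer (names here are illustrative): if every writer goes through named subspaces, any key you find later can be unpacked and identified.

    import fdb
    fdb.api_version(710)

    # One shared module all writers import: every key carries a type tag.
    users  = fdb.Subspace(('app', 'user'))
    orders = fdb.Subspace(('app', 'order'))

    key = users.pack((42, 'email'))
    print(users.unpack(key))  # (42, 'email') -- recoverable, not opaque bytes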

ptttr · 2 years ago
> you should really use a well designed layer between yourself and FDB. A few are out there.

Any recommendations? All I could find is https://github.com/FoundationDB/awesome-foundationdb#layers - not sure how complete and up-to-date that list is.

Keyframe · 2 years ago
Not to take away from your main point, and I appreciate it, but I am interested in one minor point you made: you wrote, "and insanely fast (10M+ tx/sec)". When you say that, what does it mean, what's the context? Is it for the cluster or for one machine (what kind of cluster and networking, which machines)? What size of transactions, is there an acknowledgement after each, are they truly transactional or batched in one go?
foobiekr · 2 years ago
Medium-size multi-key transactions of substantial read-dependency complexity (many dependent keys loaded as part of the tx) and moderate key size; each tx of modest total size on the write side. This is a self-run AWS cluster, which means crap networking vs. a purpose-built on-prem network, on instances with NVMe storage.

fdb transactions are real transactions. These aren't batches.

pajep · 2 years ago
How does foundationdb compare to tidb and cockroachdb?
foobiekr · 2 years ago
CockroachDB is many cool things but not even remotely in the same category as fdb in terms of the transaction rates it can deal with per unit cost, and you wouldn’t be mapping complex data structures into cdb; that’s not what it is for.

They just aren't the same thing. It’s like comparing a binary tree to json. If you squint you can see how they could be similar but really aren’t.

mike_hearn · 2 years ago
An obvious question you face when deploying something like FDB is how to write your app on top of it. In that respect FDB is like RocksDB: you get a transactional key/value store, but that's a very low-level interface for apps to work with.

FDB provides "layers", such as the Record layer. It helps map data to keys and values. But a more sophisticated solution that I sometimes wish would take off is this library:

https://permazen.io/

It's a small(ish) open source project that implements an ORM-like API but significantly cleaned up, and it can run on any K/V backend. There's an FDB plugin for it, so you can connect your app directly to an FDB cluster using it. And with that you get built-in indexing, derived data, triggers, you can do queries using the Java collections API, there's a CLI, there's an API for making GUIs and everything else you might need for a business CRUD app. It's got a paper of its own and is quite impressive.

There are a few big gaps vs an RDBMS though:

1. There's no query planner. You write your own plans by using functional maps/filters/folds etc in regular Java (or Kotlin or Python or any other language that can run on the JVM).

2. It's weak on analytics, because there's no access control and the ad-hoc query language is less convenient than SQL.

3. There's no network protocol other than FDB itself, which assumes low latency networks. So if there's a big distance between the user generating the queries and the servers, you have a problem and will need to introduce an app specific protocol (or move the code).

dang · 2 years ago
https://permazen.io/ hasn't appeared on HN before*. If you'd be willing to post it and then email hn@ycombinator.com a heads-up, we'll put the submission in the second-chance pool (https://news.ycombinator.com/pool, explained at https://news.ycombinator.com/item?id=26998308), so it will get a random placement on HN's front page.

* and the only previous related submission appears to be https://news.ycombinator.com/item?id=21646037.

pi-r-p · 2 years ago
In my company, we tested FDB for two years, then we wrote a new backend for the Warp 10 timeseries database... Performance is really impressive; we dropped the HBase backend when we released Warp 10 3.0. Note we can isolate customers easily on the same FDB cluster (tenants are not explained anywhere on the internet; it is a quite recent FDB feature).

more info: https://blog.senx.io/introducing-warp-10-3-0/

mcsoft · 2 years ago
We have seriously looked at FoundationDB to replace our SQL-based storage for distributed writes. We decided not to proceed unless we are about to outgrow the existing deployment, a standard leader-follower setup on off-the-shelf hardware. The limiting factor for the latter would be the number of NVMe drives we could put into a single machine. That gives us a couple dozen TB of structured data (we don't store blobs in the database) before we have to worry.

fdb is best when your workload is pretty well-defined and will stay that way for a decade or so. That is not usually the case for new products, which evolve fast. The two most famous installations of fdb are iTunes and Snowflake metadata. When you rewrite a petabyte-size database on fdb, you transform continuous SRE/devops opex into a developer capex investment. It comes with reduced risk of occasional data loss. For me it's mostly a financial decision, not really a technical one.

Jgrubb · 2 years ago
> transform continuous SRE/devops opex costs into developers capex investment

Would you mind expanding/educating me on this point? When I think of capex I think of “purchasing a thing that’s depreciated over a time window”. If you’d said “transform SRE/COGS costs into developer/R&D/opex costs” I would’ve understood, but eventually the thing leaves development and goes back into COGS.

foobiekr · 2 years ago
Basically the SREs don't have anything to do with fdb for the most part. You add a node, quiesce a node, delete a node. Otherwise it's self-balancing and trouble-free from an SRE pov.
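
For the curious, that node lifecycle is a handful of fdbcli commands (the address here is made up): exclude drains a node and re-replicates its data elsewhere, status shows progress, and include brings it back.

    $ fdbcli
    fdb> exclude 10.1.2.3:4500
    fdb> status
    fdb> include 10.1.2.3:4500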

See my other message for the developer issues, though. IMHO fdb as it is today is too hard for most developers if their use case is anything beyond redis simple keys.

psd1 · 2 years ago
AIUI:

- developer time is approximately fungible with money

- project delivery is building a thing, that you own, and that has value, and that you will use to produce other value...

- ...which can therefore be entered on the balance sheet.

I've just left a company a little after it floated. In the run-up to the float, we were directed to maximise our capital time logged. That meant any kind of feature delivery. Bugfixes were opex.

I believe this was done to grow the balance sheet and maximise market cap.

mcsoft · 2 years ago
I assume a couple of things here: 1) that SRE costs would be lower with fdb at scale due to its handling of outages (e.g. auto-resharding); and 2) that a migration project from *sql to fdb would be finite (hence an investment I hastily called capex).

Would love to hear from anyone with experience in fdb whether these assumptions hold.

endisneigh · 2 years ago
Were you planning on using the Record or Document layer if you went with it? Or maybe making your own layer?
mcsoft · 2 years ago
We'd use the Record layer, but it was Java-only then. It would require us either to rewrite parts of our backend to Java or to implement some wrappers.

zinodaur · 2 years ago
If there's just one Sequencer, and every ReadVersion request to the Proxy eventually hits the Sequencer 1:1, how does the Sequencer not get crushed? Or is the scaling limit just "the number of ReadVersion requests a Sequencer machine can handle per second", which admittedly is a cheap request to respond to?
richieartoul · 2 years ago
Requests to the sequencer are batched heavily. If the sequencer fails, the cluster goes through a recovery and will be unavailable for 2-3 seconds and then recover.
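
A toy sketch of the idea - not the actual FDB code - showing how a proxy can coalesce many concurrent GetReadVersion calls into a single sequencer round-trip:

    import asyncio

    class ReadVersionBatcher:
        """Toy model of proxy-side batching (not FDB's implementation)."""

        def __init__(self, fetch_version):
            self.fetch_version = fetch_version  # coroutine: one sequencer RPC
            self.waiters = []

        async def get_read_version(self):
            loop = asyncio.get_running_loop()
            fut = loop.create_future()
            self.waiters.append(fut)
            if len(self.waiters) == 1:
                # First request in a batch schedules one flush; everything
                # arriving in the same event-loop tick piggybacks on it.
                loop.call_soon(lambda: loop.create_task(self._flush()))
            return await fut

        async def _flush(self):
            waiters, self.waiters = self.waiters, []
            version = await self.fetch_version()  # one sequencer round-trip
            for fut in waiters:
                fut.set_result(version)  # many clients share one answer
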
zinodaur · 2 years ago
Good point about the batching! Any idea what kind of ReadVersion qps throughput you can get this way? And yeah, 2-3s unavailability seems fine.
whizzter · 2 years ago
I haven't used FDB, but reading the article and considering the semantics, it shouldn't matter too much if the Sequencer "just" distributes recent read-versions to the proxies frequently enough (unless a given proxy has already received a read-version recently).

Worst case if there is heavy contention on the same keys then resolvers will eventually fail more transaction writes but for read-only transactions most applications should be fine with a slightly "old" version.

(Yes, all this will start to break down if there is high key contention and many conflicts.)

zinodaur · 2 years ago
My understanding of ReadVersion is that the only point of calling it is to be able to read your own writes - so staleness wouldn't be good. There was another sibling comment that says the ReadVersion requests are batched up before hitting the sequencer, I could definitely believe that would work.
lowbloodsugar · 2 years ago
Yeah, that seems like an untenable design choice. I was quite interested until I read that. What's the max TPS? And the MTTR when the sequencer inevitably shits itself?
foobiekr · 2 years ago
You can trivially scale fdb to tens of millions of tx/sec for write-heavy workloads without a hardcore cluster, for transactions of reasonable complexity (though with careful design on my part and the part of others to make collisions unlikely).

MTTR on failure is seconds. Really, there's no system I've used that is as robust and performant as fdb and I include s3 in that list - s3, for example, _routinely_ has operations with orders of magnitude latency variance and huge, correlated spikes.

richieartoul · 2 years ago
Replied above
alberth · 2 years ago
Great article.

Demystified a lot about FDB for me.

> ”Summary: FDB is probably the best k/v store for regional deployment out there.”

Why should someone use Memcache or Redis then?

Is it for the data types in Redis?

rapsey · 2 years ago
Because memcache and redis are in-memory. A write to fdb is only complete once it has been fsync'ed to disk.

Memcache is a cache. Fdb is an ordered kv store.

unmole · 2 years ago
Redis can be configured to persist and fsync every operation.
abhishekjha · 2 years ago
You can configure redis to flush to disk on write operations though you lose on performance.
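
For reference, the relevant redis.conf directives look like:

    appendonly yes        # enable the append-only file (AOF) log
    appendfsync always    # fsync after every write -- durable but slow
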
rullopat · 2 years ago
FDB is more of a framework to create your own distributed database by building what they call a "layer".

https://apple.github.io/foundationdb/layer-concept.html

dboreham · 2 years ago
Things like Redis and Memcache are not serious data stores. Don't put any data in them that you really need back out later.
c0balt · 2 years ago
Redis has a few features outside of k/v, like a good pub-sub implementation, that make it very useful in addition to a good DX and mature libraries.

Memcache on the other hand is just solid and mature. It also has some inertia as a solid k/v cache. For example: NextCloud supports, afaik, both Redis and Memcache as caching engines but doesn't have FDB support.

leentee · 2 years ago
I believe the author means "the best transactional k/v store"
AtlasBarfed · 2 years ago
The Sequencer:

- does not have a persistent/disk-backed state

- It is a singleton process

- it and only it determines ordering; the logs don't order anything

... if the singleton sequencer crashes, I do not see on this high level description how the system recovers, if the sequencer is the only one that knows write order but has no persistent write "log".

What am I missing?

This... does not appear to be something you run outside of a dedicated datacenter; AWS, with its awful networking and slow/silently-throttling storage, would probably muck this thing up at any substantive scale?

Dave_Rosenthal · 2 years ago
What you are missing is that the "tlogs" (transaction logs) actually hold the durable, fault-tolerant write log. The sequencer is just a big, fast in-memory data structure that checks whether the many transactions coming into the system pass isolation checks (the I in ACID). That is, it accepts a transaction so long as the keys that the transaction read haven't been modified in the meantime.

The reason it can fail without a correctness issue is that it can just reject all transactions in flight for the clients to retry. This is something the clients need to be prepared to do anyway because of optimistic concurrency.

It can run fine on AWS. Upon a failure, the sequencer role is very fast to re-elect onto another machine in the cluster because there is no persistent state at all.

richieartoul · 2 years ago
It runs fine in AWS; Snowflake and many others run it there. The most recent FoundationDB paper goes into a lot more detail on their recovery protocol. It's a lot more nuanced than you think, but it works extremely well.
angio · 2 years ago
How do people deploy FDB to the cloud? Is it possible to deploy it without EBS to take advantage of cheaper VM temporary storage?
manishsharan · 2 years ago
Yes, I would think so. FDB is distributed by default and the cluster is very easy to set up. As long as you have a sufficient number of VMs in the cluster, the loss of a single VM or disk will not matter, as you can spin up a new VM to join the cluster. On AWS, you can set up members of the cluster in different availability zones in the same region... so the outage of one zone will not impact your database.

I am running this set up in my dev (personal) environment on AWS.
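
For anyone curious, the cluster-side setup is a couple of fdbcli commands (a sketch; "triple" keeps three replicas, which, with processes spread across AZs, survives a zone outage):

    $ fdbcli
    fdb> configure new triple ssd
    fdb> coordinators auto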

dialogbox · 2 years ago
I wouldn't do that for production. Regional failure is not impossible, although it is rare. You would lose all of your data.