GeneralMayhem · 3 years ago
> I suppose the more fundamental question is: why are you not using a database that does sharding for you? Over the past few years the so-called “serverless” database has gotten a lot more traction. Starting with the infamous Spanner paper, many have been thinking about how running a distributed system should be native to the database itself, the foremost of which has been CockroachDB. You can even run cloud Spanner on GCP.

This is the most important paragraph in the article. In a world where things like Dynamo/Cassandra or Spanner/Cockroach exist, manually-sharded DB solutions are pretty much entirely obsolete. Spanner exists because Google got so sick of people building and maintaining bespoke solutions for replication and resharding, which would inevitably have their own set of quirks, bugs, consistency gaps, scaling limits, and manual operations required to reshard or rebalance from time to time. When it's part of the database itself, all those problems just... disappear. It's like a single database that just happens to spread itself across multiple machines. It's an actual distributed cloud solution.

My current employer uses sharded and replicated Postgres via RDS. Even basic things like deploying schema changes to every shard are an unbelievable pain in the ass, and changing the number of shards on a major database is a high-touch, multi-day operation. After having worked with Spanner in the past, it's like going back to the stone age, like we're only a step above babysitting individual machines in a closet. Nobody should do this.

grogers · 3 years ago
All the "auto-sharding" DBs have their own quirks, especially around hot key throughput. You often end up having to add "sharding bits" to the beginning of your keys to get enough throughput. The size of one partition is usually tiny compared to one partition of an RDBMS too. So it ends up being not nearly the panacea that it would seem to be.
bpicolo · 3 years ago
Software-side sharding keys seem significantly simpler than managing sharding infrastructure. That's the beauty of memcached, right?
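
Roughly, the client-side approach looks like this (a sketch; the cache addresses are made up, and memcached clients do something along these lines with a consistent-hash ring):

    import bisect
    import hashlib

    class HashRing:
        """Client-side consistent hashing: the app picks the cache node per key."""
        def __init__(self, nodes, vnodes=100):
            self._ring = []
            for node in nodes:
                # virtual nodes smooth out the key distribution
                for i in range(vnodes):
                    self._ring.append((self._hash(f"{node}#{i}"), node))
            self._ring.sort()

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def node_for(self, key):
            # first ring position clockwise from the key's hash (wraps around)
            i = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
            return self._ring[i][1]

    ring = HashRing(["cache1:11211", "cache2:11211", "cache3:11211"])
    print(ring.node_for("user:42"))  # the node that owns this key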
dagss · 3 years ago
Would love to be able to use Spanner...

What do you suggest one does if one has to run on Azure though?

PS Azure Cosmos is not a real product; it is a beta toy that Microsoft has simply made expensive enough that people think it cannot possibly be as bad as it is...

avinassh · 3 years ago
> PS Azure Cosmos is not a real product; it is a beta toy that Microsoft has simply made expensive enough that people think it cannot possibly be as bad as it is...

What kind of problems did you run into with Cosmos?

vinay_ys · 3 years ago
You could consider one of these three –

1. yugabytedb (postgres compatible)

2. cockroachdb (postgres compatible)

3. tidb (mysql compatible)

You could self-host these on Azure or subscribe to each vendor's managed service offering.

Of course, the sticker price for the managed service will be higher than for the self-hosted enterprise product, which itself will be more expensive than the self-hosted, community-supported, open-source option (which seems to be a real option with YugabyteDB).

yawgmoth · 3 years ago
Recently joined an Azure shop, so I'm new to Cosmos, but it's fairly abstract as a product. It seems to be a front for different database types like Mongo or Postgres or SQL Server, with some additional features such as subscriptions.

Would be interested to hear your challenges.

BiteCode_dev · 3 years ago
If it's one more man-week of pain a year, but the Spanner solution adds an additional cost that is 10 times that, there is an incentive for the employer to keep the painful solution.

That, plus lock-in, the open-source ecosystem, and so on.

It's not a black and white situation.

KaiserPro · 3 years ago
> In a world where things like Dynamo/Cassandra or Spanner/Cockroach exist, manually-sharded DB solutions are pretty much entirely obsolete.

Not really. Cassandra is a write-optimised, slow, network- and IO-heavy beast that's a pain in the arse to administer. We replaced a six-node m4.4xlarge cluster with a single db.m4.2xlarge on Postgres.

You need to pick your DB to match your data, not the other way around.

billythemaniam · 3 years ago
Spanner and Cassandra really shouldn't be in the same sentence. They are optimized for very different use cases. The "obsolete" part of that quote does apply to Spanner, TiDB, and CockroachDB in my experience. I haven't used Yugabyte, but the other sharded databases, including Vitess (TiDB is what Vitess is trying to be), actually make life harder.
foobiekr · 3 years ago
Most orgs, including mine, have also had the experience of a Cassandra cluster simply melting down. That was more common ten years ago, but it is pretty much chaos when it happens.

Of the non-SQL distributed databases, I found FoundationDB to be by far the most robust and lowest-overhead.

However, at least through version 6, it had absolutely terrible, service-destroying behavior if any of the nodes ran low on storage, dropping transaction throughput through the floor; and since transactions are used for recovery and cluster expansion... Since it does a very nice job of auto-sharding, if you're not paying attention this tends to happen to multiple nodes around the same time.

I wish it would get more attention though. It is a really amazing tool.

bsaul · 3 years ago
How does Cockroach compare in terms of performance to manually sharded databases?

My intuition is that a properly sharded database will perform faster-or-same as a non-sharded one in all scenarios, whereas an automatically sharded database will actually perform worse until you start reaching the critical traffic that a single instance won't handle no matter what.

Am I wrong?

arjunnarayan · 3 years ago
[disclosure, former cockroachdb engineer]

you can get expected "single shard" performance in CockroachDB by manually splitting the shards (called "ranges" in CockroachDB) along the lines of the expected single-shard queries (what you call a "properly sharded database"). This is easy to do with a single SQL command. (This is what we do today; we use CockroachDB for strongly consistent metadata.)
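
For example (table and split points hypothetical; CockroachDB speaks the Postgres wire protocol, so a stock Postgres driver works):

    import psycopg2  # or any Postgres driver

    conn = psycopg2.connect("postgresql://root@localhost:26257/defaultdb")
    with conn.cursor() as cur:
        # Pre-split the table's ranges at the boundaries that the app's
        # "single shard" query patterns are expected to respect.
        cur.execute("ALTER TABLE orders SPLIT AT VALUES (1000000), (2000000)")
    conn.commit()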

The difference between CockroachDB and a manually sharded database is that when you _do_ have to perform some cross-shard transactions (which you inevitably have to do at some point), in CockroachDB you can execute them (with a reasonable performance penalty) with strong consistency and 2PC between the shards, whereas in your manually sharded database... good luck! Hope you implement 2PC correctly.

crazygringo · 3 years ago
> My intuition is that a properly sharded database will perform faster-or-same as a non-sharded one in all scenarios.

No, because as soon as you have to JOIN across shards, performance plummets. I mean, you can't JOIN across shards, so you have to simulate that client-side, which is very slow.

The secret to performant manual sharding is to figure out a way to minimize JOINs across shards, and to make sure the ones you do have to do are very small. Or to start replicating data, which quickly becomes its own nightmare of consistency.
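
To make that concrete, here's a toy sketch of "simulating the JOIN client-side" (in-memory stand-ins where a real app would run one SQL query per shard):

    from collections import defaultdict

    # Stand-ins for per-shard data (hypothetical schema).
    ORDERS_SHARD = [(1, 42, 9.99), (2, 7, 25.00), (3, 42, 3.50)]  # (id, user_id, total)
    USER_SHARDS = {0: {42: "alice"}, 1: {7: "bob"}}               # shard -> {id: name}

    def shard_for_user(user_id, n_shards=2):
        return user_id % n_shards  # the manual sharding function

    def join_orders_with_users():
        """Simulate `orders JOIN users` in application code."""
        orders = ORDERS_SHARD  # 1. pull the driving rows from one shard
        # 2. group the needed user ids by owning shard (one batched lookup each)
        wanted = defaultdict(set)
        for _, user_id, _ in orders:
            wanted[shard_for_user(user_id)].add(user_id)
        # 3. fan out and merge by hand: the work a single-node JOIN does for free
        users = {}
        for shard_id, ids in wanted.items():
            users.update({uid: USER_SHARDS[shard_id][uid] for uid in ids})
        return [(oid, users[uid], total) for oid, uid, total in orders]

    print(join_orders_with_users())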

jandrewrogers · 3 years ago
Where the line gets blurry is that there can be layers of sharding inside the database even if it is never exposed to the end user. At the limit, disk pages are essentially shards and are treated similarly. Some database kernels do true sharding and resharding within a single server transparently for performance reasons; it isn't just for having data on multiple servers or cores.

There is no intrinsic performance penalty for automatic sharding; it can be as fast or faster than any other scheme. The "faster" cases are when the sharding structure must be adaptive to achieve uniform distribution of data and load, such as when the key distribution is intrinsically unpredictable or when you are doing multi-attribute sharding. In these cases, effective sharding can only be created by software at runtime.

hilbertseries · 3 years ago
> My intuition is that a properly sharded database will perform faster-or-same as a non-sharded one in all scenarios.

This is workload-dependent: if your queries only go to one shard, it's likely true. But if you have cross-shard queries, then it's no longer true.

varsketiz · 3 years ago
> In a world where things like Dynamo/Cassandra or Spanner/Cockroach exist, manually-sharded DB solutions are pretty much entirely obsolete.

Far from it. Plenty of companies that are now running MySQL or Postgres will shard manually when they need to scale.

dalyons · 3 years ago
It can still be an obsolete practice even if some continue to choose to do it.
throwawaaarrgh · 3 years ago
distributed decentralized databases are their own special nightmare. they are all full of hidden traps, footguns, pains, limits. the more you push them the more these emerge.

in some ways the stone age was terrible, and in some ways the stone age was far superior to the current age. we've gotten soft, lazy, stupid, comfortable. we used to be feral animals, but now we're more like house plants. if we got more used to doing things like dumb animals, life would be simpler, and we could appreciate what we have more. but it's hard to resist the temptation of feeling like what we have is never enough. the grass is always greener, and how terrible that it isn't ours.

game_the0ry · 3 years ago
I just have to say - thanks for your response.

I am a full stack dev but I end up doing a lot of front end and rarely have to touch SQL. Posts like yours help me to realize there's still a lot I do not know while still being informative. So thanks <3

cornel_io · 3 years ago
Solutions like Spanner are fantastic for some things (and within Google it's a no-brainer, since you're not paying an outrageous markup), but besides being expensive, they don't just let you drop them in as a scalable replacement for an existing usage pattern, and sharding usually first comes up when you're at a huge scale and already have a ton of app code and database design but have grown past what a single DB can support (that's a massive amount of traffic these days, well over 1M DAU for most apps). See for instance https://cloud.google.com/spanner/docs/schema-design; a lot of what you have to plan for is very different than if you were designing a normal (e.g.) Rails app with typical patterns.

It's a great option, as long as you're aware of the tradeoffs and don't expect it to act (and cost) exactly like a single Postgres server.

My order of ops generally goes:

- Can you get away with pure (or mostly) key/value for your hottest table(s) and mostly avoid queries? Use Dynamo or Google Cloud Datastore, but deeply understand their quirks (e.g. never index a timestamp field on Datastore or you're worse off than a weak-ass SQL server for write-heavy loads). These can scale basically forever with zero effort if you fit their constraints, and are cheap enough.

- Can you tolerate the price of Spanner and deal with it not being normal SQL? Go for it, expect some non-trivial migration cost

- If you have to shard/cluster yourself, can you make Redis work? It'll be easier operationally than managing N SQL DBs of any kind. I know, if you could you'd probably have been fine using Dynamo...

- Can you soft-shard, e.g. use different machines for different tables and spread load enough to get each table onto a single server? Do it.

- Can you minimize re-sharding ops? (If you're in a massive growth phase, the answer is no) Ok, fine, do SQL sharding, and make sure you have well-tested code to do it and you mostly use key/value access patterns.

- Consider a beefy AF bare metal SQL server, see how far that gets you.

- If none of this applies, re-evaluate the cost ($ and time) of Spanner, you're paying a big price one way or another

- Only bad ideas from this point on, just do sharding but it'll suck...

Nowhere on my list: running your own cluster for Cassandra, Mongo, or anything similar; it's such an ops nightmare, and there are hosted services that are better in every way but cost.

If you do end up writing your own sharding layer, best of luck, I have yet to see any reasonably production-ready ones that work well with pretty much any web framework or ORM. And writing these is so tough, so many edge cases to consider...not something I hope to ever do again.
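
FWIW, the routing core of such a layer is tiny; the hard parts are resharding, cross-shard queries, and transactions. A minimal sketch (DSNs made up), using fixed hash buckets so a reshard only reassigns buckets rather than rehashing every key:

    import hashlib

    N_BUCKETS = 256  # fixed forever; resharding moves buckets, not keys
    SHARD_DSNS = {
        0: "postgresql://db-shard-0/app",
        1: "postgresql://db-shard-1/app",
    }
    # bucket -> shard assignment: the only thing you edit when resharding
    BUCKET_TO_SHARD = {b: b % len(SHARD_DSNS) for b in range(N_BUCKETS)}

    def bucket_for(key: str) -> int:
        # Stable hash; never Python's built-in hash(), which is randomized
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % N_BUCKETS

    def dsn_for(key: str) -> str:
        return SHARD_DSNS[BUCKET_TO_SHARD[bucket_for(key)]]

    print(dsn_for("user:42"))  # connect here for everything about user 42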

GeneralMayhem · 3 years ago
> Consider a beefy AF bare metal SQL server, see how far that gets you.

Yep, this is of course the default, although you'll at least want replication for availability and disaster recovery; that's not too bad with PG/RDS.

> they don't just let you drop them in as a scalable replacement for an existing usage pattern, and sharding usually first comes up when you're at a huge scale and already have a ton of app code and database design but have grown past what a single DB can support

I think once you get beyond what fits on one primary box, you're almost certainly at the point of rethinking your data access models anyway. Manual sharding is disruptive too.

branko_d · 3 years ago
> Consider a beefy AF bare metal SQL server, see how far that gets you.

Quite far, possibly to millions of users.

Case in point: StackOverflow runs on SQL Server.

eternalban · 3 years ago
Because distributing a database is sensitive to CAP semantics (AP or CP) and to data dependencies (graph partitioning is hard*), and storage engine choices are driven by use cases, the general technical solution is highly complex.** Spanner uses atomic clocks, for example. Running CockroachDB yourself is [very likely] not the same thing as using a SaaS variant, either. Sight unseen, it cannot be 'trivial'. Same for Spanner. The general solution seems to require paying someone to provide the service. In sum, it is not a clear-cut yes/no situation.

BTW, IIRC [distributed] Postgres was never as stellar as the single-node deployment (the stuff we sing praises of). I'm sure it has improved significantly.

> "manual operations required to reshard or rebalance from time to time. When it's part of the database itself, all those problems just... disappear."

Not really correct.

* "Choosing the right keys can help Spanner evenly distribute data and processing to avoid hotspots"

https://cloud.google.com/spanner/docs/schema-design

https://cloud.google.com/blog/topics/developers-practitioner...

** https://static.googleusercontent.com/media/research.google.c...

[Spanner certainly did -not- start off as a distributed RDBMS. Because that project would have never been given a green light. Because it is understood just how complex that system would need to be. It started off as a distributed k/v. That's it.]

"[I]n many ways today’s Spanner is very different from what was described [in original Spanner whitepaper]"

...

"The initial focus of Spanner was on scalability and fault-tolerance, much as was the case for many systems at Google. In the last few years Spanner has increasingly become a fully-fledged database system. Production deployment with internal customers taught us a great deal about the requirements of web scale database applications, and shaped the techniques we presented in this paper. Aggressive focus on horizontal scalability enabled widespread deployment without investing too heavily in single machine performance. The latter is one of the areas where we are making continuous improvements, such as upgrading lower level storage to Ressi."

"The original API of Spanner provided NoSQL methods for point lookups and range scans of individual and interleaved tables. While NoSQL methods provided a simple path to launching Spanner, and continue to be useful in simple retrieval scenarios, SQL has provided significant additional value in expressing more complex data access patterns and pushing computation to the data."


phamilton · 3 years ago
A favorite resource: https://learn.microsoft.com/en-us/azure/architecture/pattern...

Microsoft's Azure Cloud Patterns is some of the best documentation out there. It's not Azure-centric. It focuses on why you might want to do something and describes the techniques that are commonly used to solve it. It's pretty great.

beebmam · 3 years ago
I also couldn't recommend this more highly. If you are designing a new application, read these docs. You will almost certainly learn something deeply useful that you'll carry with you for the rest of your career.
sahn44 · 3 years ago
Great resource and very well written articles. I'm always curious how these come to be when done really well. I wish we could generate high quality, useful, correct, internal documentation like this at my company, but I've never figured out how to get that done in practice. The people that know the concepts are not natural/gifted writers, or at least wouldn't be motivated or prioritize doing this. Even if they did, the style/tone/depth would vary greatly. For these msft docs, they somehow achieve consistent tone, depth, and quality/correctness across so many domains and I just don't know how that's achieved.
revskill · 3 years ago
The main issue is, I can't stand C# or OOP syntax to illustrate patterns. Why a class here? For God's sake, please use simple functions to prove the points.
cpurdy · 3 years ago
I think it only reasonable then for you to show us how to do it in COBOL, Lisp, or some other 1950s programming language.
yawgmoth · 3 years ago
They might supply Python examples if you ask. A lot of Microsoft documentation has Python as well as C# these days.
Douger · 3 years ago
Just want to say thanks for pointing out this resource. Will make for some great long weekend reading!
andirk · 3 years ago
Sometimes I wonder how many other industries allow a worker to stumble upon a document that they then read outside of work, despite the details often being kind of difficult to understand at first, frustrating even, but we do it literally as a pastime, for pleasure. I love that.
danjc · 3 years ago
Never come across this before. The collection of articles one level up from your link is an incredible resource.
shortrounddev · 3 years ago
MSDN is like the Wikipedia of coding problems. You find documentation there for things that have nothing to do with MS.
game_the0ry · 3 years ago
Super helpful, thanks!
dikei · 3 years ago
One thing I notice is that over-use of sharding, especially hash-based, might turn your relational database into just another key-value store, with consistency constraints moving into application code; you lose many of the advantages of a traditional RDBMS.
dehrmann · 3 years ago
Online use cases always(?) scale out to key-value stores. Offline use cases almost always scale to distributed, date-partitioned columnar stores.
paulryanrogers · 3 years ago
What's the alternative as things get too big? Sharding by date? By client? How do you prevent hotspots?
preseinger · 3 years ago
the only real answer here is to shard by customer

optimistically, you can try to shard by read use case, but that's never gonna be stable over time

if you need a true multi-tenant system you can only shard by individual entity and move all of the composition logic to the next layer up, there's no way to cheat

dikei · 3 years ago
Past a certain point, you ought to think about whether to keep using an RDBMS as a K-V store or switch to a real distributed K-V store like Cassandra, ScyllaDB, DynamoDB and the like.

As for hot spots, they have always been an issue with K-V stores, and the only real solution is good key design, though there are some tricks (sketched in code below):

* Use a uniformly distributed but deterministic key prefix. For example, instead of using the raw user_id as the key, attach a small hash before it: (hash(<user_id>), <user_id>). This can help with load distribution if your <user_id> is not uniformly distributed by itself, such as a phone or ID number.

* Add more data to your key to increase cardinality. For example, with time series data, instead of using object_id as the partition key, use (object_id, time_bucket) so the data for a busy object will get split into different partitions over time.
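
Both tricks as (hypothetical) key-building helpers:

    import hashlib

    SALT_BUCKETS = 16

    def salted_key(user_id: str) -> tuple:
        # Trick 1: a deterministic hash prefix spreads non-uniform ids
        h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % SALT_BUCKETS
        return (h, user_id)

    def timeseries_key(object_id: str, ts_epoch: int, bucket_secs: int = 3600) -> tuple:
        # Trick 2: a time bucket splits one busy object across partitions
        return (object_id, ts_epoch // bucket_secs)

    print(salted_key("555-0100"))                  # e.g. (9, '555-0100')
    print(timeseries_key("sensor-9", 1700000000))  # ('sensor-9', 472222)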

esafak · 3 years ago
Distributed databases like Yugabyte, Cockroach, TiDB, etc.
AdieuToLogic · 3 years ago
Pick a stable, guaranteed-to-exist shard key (composite or atomic properties) and use that. If composite, order the properties from most to least distinct value distribution.
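
E.g., a sketch with hypothetical fields, ordered most-to-least distinct:

    def shard_key(account_id: int, region: str) -> tuple:
        # account_id: millions of distinct values; region: a handful.
        # The high-cardinality property leads so key prefixes spread evenly.
        return (account_id, region)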
charcircuit · 3 years ago
Yes, this is where caching layers come in. Now your cache acts as the RDBMS.
darwinwhy · 3 years ago
The first link in the article [1], to a blog post about how the sense of "shard" that we now think of as the word's primary meaning came about, is super interesting.

The release of Ultima Online doesn't really seem too close to any inflection point of the word on Google Ngram, but I'm not sure exactly how close we should expect it to be to 1980 or 2000.

1. https://www.raphkoster.com/2009/01/08/database-sharding-came...

Dwedit · 3 years ago
Shards are the secret ingredient in the webscale sauce. They just work.
samsquire · 3 years ago
Thanks for this article planetscale. I've been enamoured with sharding recently but more for multithreaded performance.

I want multimaster postgres.

I started trying to write a Postgres synchronizer by sorting all the data by row and column, hashing the data of every column and row, and then computing a rolling hash over the whole thing.

In theory, two databases can synchronize by sending the final hash of their data and then doing a binary search backwards until the hashes match. This way you can work out which parts of the databases differ and need to be transmitted. If the databases are identical, very little data gets transferred.
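
A toy version of the compare-then-bisect idea (lists of per-row hashes standing in for the two databases; a real version would exchange only the hashes at each level over the network, not the rows):

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def row_hashes(rows):
        # rows: list of (key, value) tuples, sorted identically on both sides
        return [h(f"{k}:{v}".encode()) for k, v in rows]

    def diff_ranges(a, b, lo=0, hi=None):
        """Binary-search for the row ranges where two replicas disagree."""
        if hi is None:
            hi = max(len(a), len(b))
        if h(b"".join(a[lo:hi])) == h(b"".join(b[lo:hi])):
            return []              # identical range: transfer nothing
        if hi - lo <= 1:
            return [(lo, hi)]      # narrowed down to one differing row
        mid = (lo + hi) // 2
        return diff_ranges(a, b, lo, mid) + diff_ranges(a, b, mid, hi)

    a = row_hashes([(1, "x"), (2, "y"), (3, "z")])
    b = row_hashes([(1, "x"), (2, "Y"), (3, "z")])
    print(diff_ranges(a, b))  # [(1, 2)]: only row 2 needs to be exchanged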

One problem I've not worked out is how to decide which database is the winning database without having to change application queries.

If you synchronize two multimaster Postgres databases that have had independent writes to different sections, how do you identify which database is the source of truth for a column/row combination?

AdieuToLogic · 3 years ago
> One problem I've not worked out is how to decide which database is the winning database without having to change application queries.

I believe this is a case of the "Two Generals' Problem"[0], which implies that there is no provably correct solution to achieve this.

> If you synchronize two multimaster Postgres databases that have had independent writes to different sections, how do you identify which database is the source of truth for a column/row combination?

You can't without a quorum[1] and even that does not guarantee success.

0 - https://en.wikipedia.org/wiki/Two_Generals%27_Problem

1 - https://en.wikipedia.org/wiki/Split-brain_(computing)

setr · 3 years ago
> You can't without a quorum[1] and even that does not guarantee success.

Isn’t split-brain largely resolved by the rules:

1. Must have a strict majority of the total group agree to commit

2. Total should be an odd number
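
In code, the majority rule is just:

    def quorum_met(acks: int, cluster_size: int) -> bool:
        """Commit only with a strict majority; an odd size avoids ties."""
        return acks > cluster_size // 2

    assert quorum_met(2, 3)      # 2 of 3: majority, safe to commit
    assert not quorum_met(2, 4)  # 2 of 4: a 2-2 partition could diverge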

preseinger · 3 years ago
"a binary search backwards" is not well defined and doesn't guarantee any upper bound on consistency

sharding data based on individual rows within tables is tricky, you won't get reliable guarantees for queries in this way

"which database is the winning database" is a function of individual rows, a query that reads data from N different row-owners needs to query N different instances, or else accept that it will work against stale data

myzie · 3 years ago
Spock may be of interest to you: https://pgedge.github.io/spock/
brightball · 3 years ago
You need Citus
yuji · 3 years ago
Citus follows a single-master architecture for replication, just like standard Postgres. Citus would be a good solution for adding additional writable nodes by sharding tables across multiple databases.

Recently they were acquired by Microsoft and have started to be integrated into the Cosmos DB offering. Anybody have good experience with this DBaaS offering?

mamcx · 3 years ago
And the best sharding? Doing multi-tenant.

Sharding at the table level is very complex and expensive. Giving each tenant its own DB is very practical, and the cases that motivate sharding (like reporting) are where you do the other fancy things (like shipping events into Kafka, etc.).

Also, most scalability issues are dominated by a few tenants that consume most of the resources, and distributing the load is easier per tenant.

canadianhacker · 3 years ago
I think we need intuitive tooling to let developers continue just thinking in multi-tenant terms, with single-tenant behind the scenes.
bsaul · 3 years ago
I've been doing that as much as possible. However, you're still left with availability issues, such as replication / failover management, etc., which really are orthogonal to tenancy.

How do you manage that ?

mamcx · 3 years ago
You need at least 1 backup and/or replica.

Availability is kinda good: you could lose one or a few tenants, but that doesn't take down all the rest (if they are isolated properly).

It's MUCH worse if a DB built the fancy way goes down entirely, or worse, you get a cascade of latency and/or crashes from the interlinked nature of the "scalable" architecture that is in fact a share-everything global singleton.

And you can get a bit fancy (I haven't done it myself; my customers tolerate a bit of downtime) and fork the tenant and upload it to another node.

stn_za · 3 years ago
The issue is when your primary tenant(s) are much larger than the others... eventually vertical scaling is difficult even for only one tenant, depending on its size.
cryptonym · 3 years ago
I have been doing that in a previous project. One DB or one Schema per tenant can also simplify backup/restore, rollback, export, and data boundaries.


elsadek · 3 years ago
I haven't heard about the DB sharding technique since before 2010. The technique was used to separate, within a given table, the most heavily used fields from the less used ones, with a different server for each shard. With the rise of in-memory databases like Redis, sharding was abandoned.