I'm at a company that is weighing a very similar decision (we are on RDS Postgres with a rapidly growing database that will require some horizontal partitioning). There really isn't an easy solution. We spoke to people who have done sharding in-house (Figma, Robinhood) as well as others who migrated to natively distributed systems like Cockroach (Doordash).
If you decide to move off of RDS but stay on Postgres, you can run your own Postgres, but you then lose all the benefits of a managed service. You could move off of AWS (e.g. to Azure), but moving to a different cloud is a huge lift. That, by the way, would also be required if you want to try something like Spanner (move to GCP). Moving off of Postgres to another database is also risky. The migration will obviously take some effort, but you're also probably talking about lots of code changes, schema changes, etc., as well as unknown operational risks that you'll discover on the new system. This applies whether you're talking about something like Cockroach or even moving to MySQL.
That said, rolling your own sharding is a MASSIVE undertaking. Limitless looks promising, since it takes care of a lot of the issues that Figma ended up spending time on (basically, you shouldn't need something like Figma's DBProxy, as shard routing, shard splitting, etc will be taken care of). It's still in preview though, and like the article mentioned, the costs may be high.
Overall, no easy decisions on this one, unfortunately.
> That said, rolling your own sharding is a MASSIVE undertaking.
It's a large challenge, but it's absolutely doable. A ton of companies did this 10-15 years ago, basically every successful social network, user generated content site, many e-commerce sites, massively multiplayer games, etc. Today's pre-baked solutions didn't exist then, so we all just rolled our own, typically on MySQL back then.
With DIY, the key thing is to sidestep any need for cross-shard joins. This is easier if you only use your relational DB for OLTP, and already have OLAP use-cases elsewhere.
Storing "association" type relation tables on 2 shards helps tremendously too: for example, if user A follows user B, you want to record this on both user A and user B's shards. This way you can do "list all IDs of everyone user A follows" as well as "list all IDs of users following user B" without crossing shard boundaries. Once you have the IDs, you have to do a multi-shard query to get the actual user rows, but that's a simple scatter-gather by ID and easy to parallelize.
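The dual-write pattern above can be sketched in a few lines. This is a minimal illustration, not anyone's production design: the shard count, the modulo routing, and the in-memory dicts standing in for per-shard databases are all assumptions.

```python
NUM_SHARDS = 4

# Per-shard "tables": user rows, plus follow edges stored in BOTH directions.
users = [dict() for _ in range(NUM_SHARDS)]      # user_id -> row
following = [dict() for _ in range(NUM_SHARDS)]  # user_id -> set of followee ids
followers = [dict() for _ in range(NUM_SHARDS)]  # user_id -> set of follower ids

def shard_for(user_id: int) -> int:
    return user_id % NUM_SHARDS  # trivial routing; real systems hash the key

def follow(a: int, b: int) -> None:
    # Write the edge on both A's and B's shards so either direction is local.
    following[shard_for(a)].setdefault(a, set()).add(b)
    followers[shard_for(b)].setdefault(b, set()).add(a)

def followee_ids(a: int) -> set:
    return following[shard_for(a)].get(a, set())  # single-shard read

def follower_ids(b: int) -> set:
    return followers[shard_for(b)].get(b, set())  # single-shard read

def fetch_users(ids):
    # Scatter-gather by ID: group ids by shard, then one lookup per shard.
    by_shard = {}
    for uid in ids:
        by_shard.setdefault(shard_for(uid), []).append(uid)
    rows = []
    for s, uids in by_shard.items():
        rows.extend(users[s][u] for u in uids if u in users[s])
    return rows
```

The catch is that `follow` now writes to two shards, so a real system needs idempotent retries or a background reconciliation job for the case where one write succeeds and the other fails.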
Implementing shard splitting is hard, but again definitely doable. Or avoid it entirely by putting many smaller shards on each physical DB server -- then instead of splitting a big server, you can just move an entire shard to another server, which avoids any row-by-row operations.
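That layout can be sketched as a two-level mapping. The logical shard count, the server names, and the initial placement are all hypothetical; the point is that keys hash to a *fixed* set of logical shards, while the logical-to-physical placement stays mutable.

```python
NUM_LOGICAL = 64  # fixed for the life of the system; keys always hash here

# Mutable placement map: logical shard -> physical server. Rebalancing a
# hot server means reassigning whole logical shards to another server,
# never splitting a shard row by row.
placement = {i: f"db{i % 4}" for i in range(NUM_LOGICAL)}  # 4 servers to start

def logical_shard(key: int) -> int:
    return key % NUM_LOGICAL  # stable mapping; never changes

def server_for(key: int) -> str:
    return placement[logical_shard(key)]

def move_shard(shard_id: int, new_server: str) -> None:
    # In practice: replicate the entire logical shard to new_server
    # (e.g. via logical replication), then flip the pointer.
    placement[shard_id] = new_server
```

The cost of this trick is paid up front: you must pick a logical shard count high enough that you never outgrow it, since changing `NUM_LOGICAL` later would re-route every key.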
Many other tricks like this. It's a lot of tribal knowledge scattered across database conference talks from a decade ago :)
It's definitely doable. I was at Google circa 2006, pre-Spanner, with sharded MySQL. Ads ran on top of it. It was a pain.
And yes, there are many tricks, like having more logical shards than physical ones, colocating tables by the same shard_id, etc. It's still difficult. You need tooling for everything from shard splitting (even if that is just moving a logical shard) to schema migrations, not to mention cross-shard transactions or cross-shard joins if you end up needing them.
Generally, you'd need a team of very strong infrastructure engineers. Most companies don't have the resources for that. There are definitely some engineers out there that could whip this all together.
What I do not understand is that they say "we explored CockroachDB, TiDB, Spanner, and Vitess". Those are not compatible with PostgreSQL beyond the wire protocol, and migration would require massive rewrites and testing to get the same behavior. YugabyteDB uses PostgreSQL for the SQL processing layer, providing the same features and behavior, and distributes storage with a Spanner-like architecture.
I'm not saying there's no risk or effort, but they are limited. And testing is easy since you don't have to change the application code.
I don't understand why they didn't spend a few days on a proof of concept with YugabyteDB, and instead explored only solutions where the application cannot work as-is.
I think Denis addressed this in his post: "Overall, as an engineer, you will never regret taking part in the development of a sharding solution. It's a complex engineering problem with many non-trivial tasks to solve". In other words, it might be not-invented-here syndrome (with all due respect to the Figma team). Or there might be more nuances we are unaware of.
Maybe it’s just a matter of it being difficult to list all the things they didn’t use. The Figma article itself is a little clearer about their goals…
It’s not really just postgres compatibility they are after, but compatibility with the Amazon RDS version of postgres. They also wanted to have something they could adopt incrementally and back out of when something unanticipated goes wrong.
Also, I think yugabyte uses an older version of the postgres processing engine, which may or may not be a big deal, depending on what they are using.
We use Citus. Very similar performance properties to DIY sharding, but much more polished. Currently at 7 TB, self-hosted. Growing at roughly 100% per year, write-heavy. Works fine for us.
Unfortunately this runs into their "have to move off AWS for a managed service" point, since the managed service for Citus is now on Azure post-acquisition, as Azure Cosmos DB for PostgreSQL.
Pitching Citus ran into issues where people were hoping it would handle sharding transparently, which isn't the case. But for someone who's evaluating rolling their own sharding, managing sharding keys explicitly is exactly how Citus enables efficient joins for your workload. So yes, if you're looking to roll an unmanaged sharded PostgreSQL cluster, consider starting with Citus.
Curious why you needed to shard at 7TB? I can imagine for some workloads, especially if it's write-heavy, you might start hitting constraints around vacuuming and things like that? But 7TB should be manageable on a (somewhat large and beefy) single machine.
I’m definitely of the opinion that what Figma[0] (and earlier, Notion[1]) did is what I’d call “actual engineering”.
Both of these companies are very specific about their circumstances and requirements
- Time is ticking, and downtime is guaranteed if they don’t do anything
- They are not interested in giving up the massive number of features AWS supports via RDS, especially around data recovery (anyone involved with Business Continuity planning will understand the importance of this)
- They need to be able to do it safely, incrementally, and without closing the door on reverting their implementation/rollout
- The impact on Developers should be minimal
“Engineering” at its core is about navigating the solution space given your requirements, and they did it well!
Both Figma and Notion meticulously focused on the minimal feature set they needed, in the most important order, to prevent disastrous downtime (e.g. Figma didn’t need to support all of SQL for sharding, just their most-used subset).
Both companies call out (rightfully so) that they have extensive experience operating RDS at this point, and that existing solutions either didn’t give them the control they needed (Notion) or required a significant rewrite of their data structures (Figma), which was not acceptable.
I think many people also completely underestimate how important operational experience with your solution is at this scale. Switch to Citus/Vitess? You’ll now find out the hard way all the “gotchas” of running those solutions at scale, and it would almost certainly have resulted in significant downtime while they acquired this knowledge.
They’d also have to spend a ton of time catching up to RDS features they were suddenly lacking, which I would wager would take much more time than the time it took implementing their solutions.
The right way to look at it, IMHO, is to interpret "lots of RDS experience" as a complete lack of run-your-own-PostgreSQL experience. Given that, it's not surprising that their cost-benefit math gives them the answer of "invest in custom middleware" instead of "run our own PostgreSQL plus some sharding thing on top".
of course it's not cheap, but probably they are deep into the AWS money pit anyway (so running Citus/whatever would be similar TCO)
and it's okay, AWS is expensive for a lot of bootstrapped startups with huge infra requirements for each additional user, but Figma and Notion are virtually on the exact opposite of that spectrum
also it shows that there's no trivial solution in this space, sharding SQL DBs is very hard in general, and the extant solutions have sharp edges
Good points. Although, having worked on many high scale architectures before, I always err on the side of thinking that any technical solution of this magnitude would have far too many nuances for a blog post to capture. And I believe Figma’s post also talks mainly about the common denominator that’s easier to communicate to the external world.
For me to understand their choices, I’ll first have to understand their system, which can be a task in itself. Without that, I’d not become aware of the nuances, only general patterns.
Aren't there any good managed Postgres solutions supporting Citus? The decision here seems to have been to invent a whole new sharding solution instead of building enough in-house DBA expertise to self-host Postgres (if you want to stay on Amazon, you can use any extension you want on EC2). Speaks to the state of engineering right now.
It's AGPL, so you can self-host. You can even pay someone to host it for you if you don't like Microsoft's offering. I find it easier than rolling your own AND self-hosting.
Third-party solutions can also add complexity you don't want. You need to keep up with their release schedules to get bug and security fixes, even though, feature-wise, you would be happy with an older version.
They can also add unnecessary complexity by having features you don't need, or they might be missing features you do need. Contributing upstream can be difficult, and there may be conflicts of interest, especially for projects with a separate paid version.
Yes. It may not fit your needs, but take a look at PlanetScale (based on MySQL and Vitess, but I have seen quite a few people moving from Postgres).
Great job to both teams!
[0] https://www.figma.com/blog/how-figmas-databases-team-lived-t...
[1] https://www.notion.so/blog/sharding-postgres-at-notion
https://learn.microsoft.com/en-us/azure/cosmos-db/postgresql...
Obviously it could interfere with demand for aurora limitless
I don't think so because Microsoft bought Citus.