simonw · 9 years ago
Citus are doing a fantastic job on content marketing. Every single piece they publish on https://www.citusdata.com/blog/ is a case study in how to write content (and headlines) that appeal to the kinds of developers their product targets.

"How to Scale PostgreSQL on AWS: Learnings from Citus Cloud" - seriously, how am I as a PostgreSQL-liking developer who cares about scalability NOT going to click through to that article?

craigkerstiens · 9 years ago
Simon, thanks so much for the kind words. It's nice to know that the content is useful. In general we're focused on sharing both content about how to use Citus and what it's good at, and also good and interesting engineering content, since we want to add value to our users and the broader community. It's great to hear that we're doing alright from your perspective on that front.
Artemis2 · 9 years ago
Really appreciate the quality of the content, there is some real value to it.
kornish · 9 years ago
Citus Cloud is perhaps most exciting to me because it has tremendous momentum: as the combined product of deep technical expertise, top-flight open source software, and tons of end-user experience, it's quickly outpacing platforms that are locked-in anachronisms. Take Redshift: Postgres 8.4? After you've used some of the features in 9.6, it's hard to go back. It'd be interesting to see some numbers around Citus Cloud's battle-tested deployments.

As a side note, these blog posts on high-level techniques and open source tools (e.g. PgBouncer, wal-e) are useful for anyone considering deploying an on-prem version of Citus as part of a product – thanks, Ozgun!

Usual disclaimers apply: not an employee, but big fan of the team and technology and it's great to see them gaining well-deserved mindshare.

pjungwir · 9 years ago
I saw the section on EBS, but it didn't offer much advice. Getting good performance on networked storage is the biggest challenge for me. The last time I asked about that here [1], I got this answer:

    nasalgoat 161 days ago [-]

    The secret to EBS is to use General SSD,
    not Provisioned, but use a RAID stripe.

    The reason this works is because IOPS are provisioned
    per EBS drive and by the size of the drive. So a RAID0 
    stripe of, say, ten General SSD drives will outperform 
    the more expensive PIOPS single drive.
That sounds like a great approach, although I haven't had time to try it out yet. I'm curious if anyone else has done anything like that.

[1] https://news.ycombinator.com/item?id=12609172
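
For a rough sense of why the stripe wins, here is a back-of-the-envelope sketch in Python using the published gp2 formula (baseline of 3 IOPS per GiB with a 100 IOPS floor, burst to 3,000 IOPS per volume from a credit bucket); the sizes and volume count are illustrative assumptions, not a recommendation:

  # gp2 baseline is 3 IOPS per GiB (floor of 100); 10,000 was the
  # per-volume cap of that era. Every volume can also burst to 3,000
  # IOPS from its credit bucket, regardless of size.
  def gp2_baseline_iops(size_gib):
      return max(100, min(3 * size_gib, 10000))

  # One 1 TiB volume vs. ten 100 GiB volumes striped with RAID0:
  single_baseline = gp2_baseline_iops(1000)      # 3,000 IOPS; no burst headroom left
  stripe_baseline = 10 * gp2_baseline_iops(100)  # 10 * 300 = 3,000 IOPS
  stripe_burst = 10 * 3000                       # ten burst buckets = 30,000 IOPS

Same total storage, and the same gp2 bill since gp2 is priced per GiB, but ten independent burst buckets instead of one: that, plus avoiding the per-IOPS surcharge on provisioned-IOPS volumes, is why the stripe comes out ahead for bursty workloads.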

fdr · 9 years ago
Citus Cloud uses that approach. I designed it with the arithmetic from the page about gp2, http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolume....

A useful inflection point is the "minimum size for max throughput" at 214 GiB. That same disk also offers 38 minutes at the maximum rate of 3,000 IOPS. After that, performance will drop to about a fifth: that's the baseline for a 214 GiB disk. Another important thing to consider is that large contiguous reads and writes on EBS are up to 256 KiB per IOP: most bulk transfers take far fewer IOPS than people normally calculate.
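
Spelling that arithmetic out, using the credit-bucket model from the linked page (each gp2 volume holds 5.4 million I/O credits, spends them at the actual I/O rate, and earns them back at its baseline rate of 3 IOPS per GiB), as a quick Python check:

  # gp2 burst math for a 214 GiB volume, per the EBS docs' credit model
  BUCKET_CREDITS = 5400000     # credits a gp2 volume holds when full
  BURST_IOPS = 3000            # spend rate while bursting

  size_gib = 214
  baseline = 3 * size_gib      # 642 IOPS earned back per second

  burst_seconds = BUCKET_CREDITS / (BURST_IOPS - baseline)
  print(round(burst_seconds / 60))        # ~38 minutes at full burst
  print(round(baseline / BURST_IOPS, 2))  # ~0.21, i.e. about a fifth of peak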

For most workloads, Burst Is Good: for a fixed amount of money, you'll be able to obtain meaningfully better responsiveness in a program. Few workloads push the maximum IOPS all the time. If you need more baseline IOPS, you can make gp2 disks bigger or add more disks, depending on whether you've saturated bandwidth or not.

jacobscott · 9 years ago
Are Citus Cloud customers generally IOPS constrained rather than throughput constrained? You're also constrained by the EBS-optimized interconnect, right? http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimi...
user5994461 · 9 years ago
Use multiple big* SSD volumes in RAID 0 on big* instances.

* IOPS and bandwidth are proportional to the size of volumes

* instance types have a hard cap on network and IO, proportional to the size of the instance.

Otherwise, you can try local SSD instances but there are only a few and they are incredibly expensive.
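
A hedged sketch of that instance-side cap (the 4,000 Mbps figure is an assumed example for a large EBS-optimized instance of that generation; the real per-type numbers are in the EBSOptimized page linked above):

  # When does a RAID0 stripe saturate the instance's dedicated EBS link?
  # 4,000 Mbps is an assumed example cap; check the AWS docs per type.
  # (Rough math: megabit/megabyte distinctions are ignored here.)
  instance_cap_mbps = 4000
  instance_cap_mb_s = instance_cap_mbps / 8    # ~500 MB/s

  gp2_volume_max_mb_s = 160                    # per-volume gp2 throughput cap of the era

  volumes_at_saturation = instance_cap_mb_s / gp2_volume_max_mb_s
  print(round(volumes_at_saturation, 1))       # ~3.1: beyond that, buy a bigger instance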

kinghajj · 9 years ago
I'm working on an approach using ZFS on K8s. The idea is to write a custom "flexvolume" driver that takes a set of EBS volumes and options, attaches them, creates a zpool, sets options, etc. In tandem, I'm adding to each node host a systemd service to create dm-crypt+LVM devices atop each NVMe drive (as in the i3 instance types). The flexvolume driver will support adding SLOG and/or L2ARC devices by provisioning LVM volumes from those.

Something I'm still unclear on is whether EBS does any data checksumming, in which case I could disable ZFS's to spare CPU cycles (but keep compression, of course).

illamint · 9 years ago
> I'm adding to each node host a systemd service to create dm-crypt+LVM devices atop of each NVMe drive

That sounds cool; can you describe how you're using systemd for this in a bit more detail? I do this with an Ansible playbook when the machines are provisioned.

weitzj · 9 years ago
Sounds great. Is this approach somehow related to ClusterHQ's?

Deleted Comment

manigandham · 9 years ago
We use MemSQL, and it has the best replication setup process of any relational database, with one line:

  REPLICATE DATABASE db_name FROM master_user[:master_password]@master_host[:master_port][/master_db_name]
Why is it that in 2017 we still don't have any other database that can come close to this? Basic replication is very well understood and used everywhere, but it seems like database creators just don't understand what should be prioritized.

anarazel · 9 years ago
The stuff in postgres 10 isn't that much different (for logical replication). On the master you do

  CREATE PUBLICATION all_tables FOR ALL TABLES;
  -- or
  CREATE PUBLICATION important_tables FOR TABLE transactions, users;
and then on the node you want to replicate to

  CREATE SUBSCRIPTION backup_important CONNECTION 'dbname=foo host=bar user=repuser' PUBLICATION important_tables;

manigandham · 9 years ago
What about editing the postgresql.conf, pg_hba.conf, and recovery.conf files, setting up archiving, and doing the first base backup transfer? That's the actual hard part.

Logical replication doesn't replicate DDL commands either, from what I know so far, so that's another major problem.

chousuke · 9 years ago
When you issue that command in MemSQL, what does it actually do?

Physical replication in PostgreSQL is a bit more work to set up, but its behaviour is rather straightforward. As far as I understand it, a standby does the same thing as a standalone database, except that instead of generating the WAL segments it applies in response to user queries, it gets them by streaming them from the master.

It's easy to see how the system behaves when nodes fail.

Similarly, a PITR is achieved by telling PostgreSQL to (optionally) retrieve WAL at startup from an archive and apply it until the desired point. The mechanism is identical to streaming replication, except non-continuous.
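
To make the "identical mechanism" point concrete, here is a toy replay loop (plain Python, purely illustrative, nothing like the real Postgres internals): streaming replication and PITR differ only in where the WAL records come from and when the loop stops.

  # Toy model: recovery, streaming replication, and PITR are the same
  # replay loop; only the record source and the stopping rule differ.
  def replay(wal_records, stop_when=lambda rec: False):
      state = []
      for rec in wal_records:
          if stop_when(rec):
              break
          state.append(rec)  # "apply" the record
      return state

  archive = [("t1", "INSERT a"), ("t2", "INSERT b"), ("t3", "DELETE a")]

  # Streaming replication: apply everything the master sends.
  standby = replay(archive)

  # PITR: same loop over archived WAL, stopped at the recovery target.
  restored = replay(archive, stop_when=lambda rec: rec[0] > "t2")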

Referring to your comment below: postgresql.conf, pg_hba.conf, and recovery.conf hold data that can't really live in the database, because it's needed before the database can even start up.

With logical replication in PostgreSQL 10, you could feasibly implement a command identical to MemSQL's, but the problem is that its behaviour is no longer quite so straightforward to understand compared to physical replication.

manigandham · 9 years ago
It does everything. There's nothing else to configure. MemSQL is a distributed SQL database so a "cluster" is multiple machines supporting a single logical interface to multiple databases.

Another cluster in another datacenter, made up of multiple machines, can then issue this command, which will automatically contact the master cluster, set up connections in parallel between several machines, create the replica database, ask the master to create a new snapshot, stream the snapshot over, build the database, and then subscribe to the change-log stream, including data + schema changes, all async and very fast.

Yes, this is most commonly referred to as "physical replication", but it covers about 90% of the use cases where a database needs a replica somewhere else.

Postgres is great, but too often its features get blown out into ridiculous complexity instead of just making the basics work well without tweaking anything, then building from there with optional configuration for greater functionality.

cromulent · 9 years ago
I was looking for the "I want my database to be performant under high random load" question. PIOPS can hurt.

Anyone have any experience running PostgreSQL on the new I3 instances?

kinghajj · 9 years ago
https://news.ycombinator.com/item?id=13842222

Not yet, but you're not the only one with that idea!

agentgt · 9 years ago
I have mentioned this on some previously posted articles, but we are really happy users of both Citus and PipelineDB.

Check out PipelineDB if you are a Postgres fan (obviously it is for a different use case than Citus).

The only thing I don't like about PipelineDB is that it is currently a fork and not an extension, but that is supposed to change.

Consequently, we syndicate to Citus and PipelineDB through RabbitMQ and Kafka.

We use Google Cloud as well. I'm contemplating writing a post on what we have learned (and not :)) but I don't think I could ever match the quality of this article.

And yes, invariably someone will mention that MemSQL does both, but it is proprietary and not Postgres. I probably should have spent more time investigating it, though (and eventually will).

jacobscott · 9 years ago
Does Citus (Cloud?) have features that offer better high availability and failover functionality than what RDS provides? Managed Patroni and packaged workflows for zero-downtime failover would be quite interesting, but I don't see anything like that mentioned on https://www.citusdata.com/product/cloud.
craigkerstiens · 9 years ago
We don't use Patroni or any of the other off-the-shelf options. We rolled our own, drawing primarily on our years of experience running Heroku Postgres. We're actually working on a detailed post on how HA and disaster recovery work for Citus Cloud, though the core mechanism powering it all under the covers is our state machine. You can read a bit about how it works in this post: https://www.citusdata.com/blog/2016/08/12/state-machines-to-...
jacobscott · 9 years ago
Thanks! I look forward to reading about implementation details.

It does seem like documentation of Citus Cloud HA and disaster recovery behavior is a bit light, compared to e.g.

- http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concep...

- https://devcenter.heroku.com/articles/heroku-postgres-ha

"In addition to continuous protection which is explained above, high availability is available if your application requires less exposure to downtime. We provision stand-bys if you select high availability at provisioning time. This can be for your primary node, or for your distributed nodes." https://docs.citusdata.com/en/v6.1/cloud/features.html

Please let me know if I've missed any resources on this topic!

forgotpwtomain · 9 years ago
Why is it seemingly impossible to read a technical blog post on a company blog without some seven-year-old-humor-type meme mixed in?