We did a similar migration (somewhat larger database) with ~20 seconds of downtime and much less work... using the magic of AWS RDS Blue-Green Deployments [1]. Surprised they aren't mentioned in the thread yet.
Basically, you spin up a new Blue/Green deployment with any desired changes (in our case, a Postgres major-version upgrade from 13 to 15). While your blue configuration continues to serve traffic, AWS uses logical replication to keep the "green" deployment in sync. You can keep modifying (or testing) the "green" deployment (e.g. you could load test it if you wanted to), as long as you don't do any writes to it (writes still have to go to your live, blue configuration, and are replicated to green).
When you're ready, you run the "switch" command, and AWS does a few things for you: runs checks to ensure blue/green are in sync, stops writes and connections, waits a few seconds to ensure replication is caught up, renames your database, then allows connections/writes again. We had less than 20 seconds of downtime, by our count. And we had a primary and several read replicas, and AWS switched the full configuration over with no hiccups. You don't even need to change your application configuration, because AWS swaps the database URLs for you. Green becomes blue, blue becomes old blue, and when you're ready, you delete "old blue".
Highly recommend! They do have some restrictions (for instance, I'm not sure it would work if you're switching accounts).
[1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/blue-...
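For reference, a rough sketch of the equivalent boto3 calls (deployment name, source ARN, engine version, and status handling are placeholders; check the docs in [1] for the parameters your engine actually needs):

```python
# Hypothetical sketch of an RDS Blue/Green rollout with boto3.
# The name, source ARN, and target version below are placeholders.
import time
import boto3

rds = boto3.client("rds")

bg = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="pg13-to-pg15",                     # placeholder
    Source="arn:aws:rds:eu-west-1:123456789012:db:my-primary",  # placeholder ARN
    TargetEngineVersion="15.5",
)
bg_id = bg["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]

# Wait until the green environment is provisioned and replicating, test it
# (reads/load tests only), then switch over.
while True:
    status = rds.describe_blue_green_deployments(
        BlueGreenDeploymentIdentifier=bg_id
    )["BlueGreenDeployments"][0]["Status"]
    if status == "AVAILABLE":
        break
    time.sleep(30)

rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=bg_id,
    SwitchoverTimeout=300,  # seconds; the switchover aborts if it can't finish in time
)
```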
+1 for B/G. That said, I imagine they couldn’t use it due to the cross-account shift. I’ve used it for both MySQL (with much higher QPS than TFA, by two orders of magnitude) and Postgres, and they both went flawlessly.
Read the docs, especially the limitations. Re-read them. Do a test run in the dev environment, under load. Do it again in staging.
Or just YOLO into prod ‘cause it’ll probably be fine, I guess.
I also used an RDS Blue/Green deployment to apply a MySQL major engine version upgrade from 5.7 to 8.0. With respect to downtime it worked fantastically; I think we measured 13 seconds of observable downtime from the API.
However we did learn the hard way that RDS Blue/Green cannot be used to apply arbitrary changes. In our case, we discovered RDS Blue/Green can only be used to move up engine versions, not down.
We discovered on MySQL 8.0 one of our stored procedures had very occasional failures, and considered the option of using RDS Blue/Green again to move back down to 5.7. Turns out that's not an option.
Database changes are typically one-way. If your change creates or modifies a table such that there are new columns, and you populate those with data, then downgrading would destroy the changed columns and the data in them. Hence you can't downgrade once you upgrade, or you'd potentially be breaking things. To downgrade safely you'd need to back up or snapshot the old database and then restore back to that backup/snapshot, but that's not blue/green.
I wonder if that could be because MySQL 8 can replicate from a 5.7 source (backwards compatible), but 5.7 can't replicate from an 8.0 source (not forwards compatible). If so, it makes sense that you're only able to move forward.
We did the exact thing not too long ago, but we couldn't do it using blue/green. We were able to minimize downtime quite a bit, but it was on the order of minutes, not seconds. I wrote a little bit about the process here. I spent a lot of time prepping and running the migration, so happy to share any details if it's helpful.
https://phizzle.space/dbadmin/aws/postgres/2023/12/30/rds-en...
We recently did this on my team over Christmas. We opted not to use Blue/Green for this, but instead spun up an encrypted snapshot and began replication from the old database to the new database using MySQL’s tools. Once traffic on our platform was sufficiently low, we cut connections to the database, waited for replica lag to reach 0, and relaunched the servers with the new database’s host info. Our downtime was around a minute.
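Not their exact tooling, but the "wait for replica lag to reach 0" gate might look roughly like this with PyMySQL (host and credentials are made up):

```python
# Hypothetical cutover gate: poll the new (replica) database until it has
# fully caught up with the old primary, then proceed with the switch.
import time
import pymysql

replica = pymysql.connect(
    host="new-encrypted-db.example.internal",  # placeholder host
    user="admin", password="...",
    cursorclass=pymysql.cursors.DictCursor,
)

while True:
    with replica.cursor() as cur:
        # MySQL 8.0.22+ spells this SHOW REPLICA STATUS / Seconds_Behind_Source.
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
    lag = row["Seconds_Behind_Master"] if row else None
    if lag == 0:
        break  # safe to repoint the app servers at the new host
    time.sleep(5)
```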
Good question. This was a pain point for my small team (me, myself, and I) a little while back. We had an unencrypted db deployed with CDK, and then tried to get it encrypted without losing data.
How did you go about stopping and restarting applications which reach out to the database? We have a number of tasks running in ECS which can take a minute to spin down and a few minutes to spin back up.
For our web service, we didn't stop anything. They had a few seconds of errors, though it seems like some sessions were just buffered or paused and experienced high latency.
We also had background worker services. For the very high throughput ones, we spun down the # of tasks to a bare minimum for <5 minutes and let the queue build up, rather than have a massive amount of errors and retries. For the other ones where throughput wasn't high, we just let them be, and during the downtime they errored and retried and the retries mostly succeeded.
+1 for Route53 Groups and B/G setups. We did something similar for PG upgrades: no downtime, using AWS R53 groups and retrying in-flight queries with a custom Rails ActiveRecord transaction patch.
Trade off: For a few seconds some requests were slower.
DNS Groups w/ retries is a nifty mechanism for these things.
Tool used: https://github.com/shayonj/pg_easy_replicate
Am waiting for them to support upgrades from RDS Postgres to Aurora.
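Roughly what the weighted-record flip looks like with boto3 (zone ID, record name, and endpoints are all placeholders); the idea is just to move 100% of the weight to the new endpoint and let the short TTL do the rest:

```python
# Hypothetical DNS cutover: shift all weight from the old DB endpoint to the
# new one on a weighted CNAME with a 1-second TTL.
import boto3

r53 = boto3.client("route53")

def weighted_cname(identifier: str, target: str, weight: int) -> dict:
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "database.example.internal.",  # placeholder record name
            "Type": "CNAME",
            "SetIdentifier": identifier,
            "Weight": weight,
            "TTL": 1,
            "ResourceRecords": [{"Value": target}],
        },
    }

r53.change_resource_record_sets(
    HostedZoneId="Z0000000000000000000",  # placeholder hosted zone
    ChangeBatch={
        "Changes": [
            weighted_cname("old-db", "old.rds.amazonaws.com", 0),
            weighted_cname("new-db", "new.rds.amazonaws.com", 100),
        ]
    },
)
```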
I'm planning to do this soon. Is there any problem skipping versions when upgrading? I would like to go from Postgres 12 to 16. Does Postgres handle that without worries?
There are various ways to 'pause' incoming Postgres queries, for example using pgbouncer: i.e. don't fail them, simply delay them until the replication has caught up, then let them continue on the new database.
If anything goes wrong and replication doesn't catch up, you can unpause and let those queries happen on the old database.
Therefore, your 11 seconds of downtime becomes 0 to 11 seconds of added page load time. More importantly, for the thousands of database clients that have never seen a query fail before (and that might have buggy error-handling codepaths, or a batch job that one failed query would ruin), this approach causes a lot less collateral damage.
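A minimal sketch of that pause/resume dance against PgBouncer's admin console (psycopg2; host, database name, and credentials are placeholders, and the actual re-pointing of PgBouncer at the new primary, editing [databases] and issuing RELOAD, is left out):

```python
# Hypothetical "hold the traffic" switchover: PAUSE makes PgBouncer finish
# in-flight queries and hold new ones; RESUME releases them against whatever
# backend PgBouncer points at by then.
import time
import psycopg2

admin = psycopg2.connect(
    host="pgbouncer.example.internal", port=6432,   # placeholder host/port
    dbname="pgbouncer", user="pgbouncer_admin", password="...",
)
admin.autocommit = True  # the admin console does not support transactions

with admin.cursor() as cur:
    cur.execute("PAUSE mydb")   # clients now queue instead of failing
    # ... here: confirm replication is caught up, repoint the [databases]
    # entry for mydb at the new primary, and run RELOAD. If anything looks
    # wrong, skip those steps and just RESUME: the held queries continue
    # against the old primary as if nothing happened.
    time.sleep(1)               # stand-in for the real checks
    cur.execute("RESUME mydb")
```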
A lot of the discussion boiled down to 'this is a lot of complexity to avoid a few minutes of downtime'. I guess this is the proof: just use AWS Data Migration Service, swap the DNS entries to go live, and live with 11 seconds of downtime.
There is no "just" about it. The absolute key takeaway is in "what we learned":
> We chose to use DMS because it was well supported by the GOV.UK PaaS and we could also get support from AWS. If we were doing a PostgreSQL to PostgreSQL database migration in the future, we would invest more time in trying alternative tools such as pglogical. DMS potentially added more complexity, and an unfamiliar replication process than what we may have found with other tools. This backs up what AWS say themselves on PostgreSQL to PostgreSQL migrations.
The message here is not "just use DMS".
Even AWS's own docs say to use the native tools when migrating from postgres to postgres [1]. They don't go into the details too much and point to pg_dump rather than pglogical, but it's interesting to see that they don't recommend using DMS for it.
[1] https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source...
There are a lot of gotchas with using DMS (which seems to use pglogical under the hood). Since it’s not hardware-level replication, you can run into issues with large rows/columns/tables and it doesn’t really handle foreign keys. It may not handle some special data types at all. You also need to update the sequences after the migration or you’ll get errors about duplicate primary keys. You can also have issues if you don’t have proper primary keys, because it doesn’t always copy the entire row at once.
If the databases are within the same AWS account, it’s likely easier to use hardware-level replication with global database or snapshots to do migrations if you’re ok with 4-5 mins of downtime.
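On the sequences point: after a logical copy the target's sequences still sit at their initial values, so something like this generic fix-up (not DMS-specific; assumes everything lives in the public schema) has to run before writes resume:

```python
# Hypothetical post-migration fix-up: bump every sequence to the MAX of the
# column that owns it, so new inserts don't collide with migrated rows.
import psycopg2

conn = psycopg2.connect("dbname=target user=admin")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    # Standard catalog query: sequences ('S') and the table/column that owns them.
    cur.execute("""
        SELECT s.relname AS seq, t.relname AS tbl, a.attname AS col
        FROM pg_class s
        JOIN pg_depend d ON d.objid = s.oid AND d.deptype = 'a'
        JOIN pg_class t ON t.oid = d.refobjid
        JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = d.refobjsubid
        WHERE s.relkind = 'S'
    """)
    for seq, tbl, col in cur.fetchall():
        # Identifiers come straight from the catalog here; quoting like this is
        # sketch-level only.
        cur.execute(
            f'SELECT setval(%s, COALESCE((SELECT MAX("{col}") FROM "{tbl}"), 1))',
            (seq,),
        )
```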
There are many options available with PostgreSQL; you could also do a physical full backup plus WAL-level replication to keep the target in sync and still get low downtime.
What might have guided their choice is that they wanted to upgrade from major version 11 to 15 during the migration. That is only possible using logical replication; otherwise you'd have to chain the upgrade process through each major version (and possibly the OS, because 11 is EOL on some architectures), and that is neither trivial nor quick.
We recently completed the migration of a self-hosted 3 TB PostgreSQL database from version 12 to 16, transitioning from Ubuntu 18 to Ubuntu 22. Concurrently, we had to upgrade various extensions, most notably Timescale, for which a compatible version did not exist across all scenarios. We performed the upgrade by updating a replica in the following sequence:
- Start: PG12, Ubuntu 18, TS2.9
- Step 1: Set up a read-only replica with PG12 on Ubuntu 22, maintaining TS2.9.
- Step 1.5: Enter maintenance mode and halt all services.
- Step 2: Detach the read-only replica, upgrading from PG12 to PG15 on Ubuntu 22 with TS2.9.
- Step 3: Upgrade from PG15 with TS2.9 to TS2.13 on Ubuntu 22.
- Step 4: Upgrade from PG15 to PG16 on Ubuntu 22 with TS2.13.
- Step 4.5: Reconnect services to the new database server, resume all services, and exit maintenance mode.
All the database upgrade steps were well-tested and automated using Ansible. Nonetheless, we did encounter an issue that had not arisen during testing. This extended our downtime to approximately half an hour, which, for our use case, was perfectly acceptable.
Employing logical replication could have mitigated the last-minute surprise, so we will consider that approach for our next upgrade cycle.
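For context, the core of steps 2 and 4 above is pg_upgrade; a generic sketch of that invocation (Debian-style paths are placeholders, and the real run was Ansible-driven and Timescale-aware):

```python
# Hypothetical shape of the pg_upgrade hop in step 2 (PG12 -> PG15). Paths are
# Debian-style placeholders; the detached replica must already be promoted and
# both clusters stopped before this runs.
import subprocess

subprocess.run(
    [
        "/usr/lib/postgresql/15/bin/pg_upgrade",
        "--old-bindir", "/usr/lib/postgresql/12/bin",
        "--new-bindir", "/usr/lib/postgresql/15/bin",
        "--old-datadir", "/var/lib/postgresql/12/main",
        "--new-datadir", "/var/lib/postgresql/15/main",
        "--link",  # hard-link data files instead of copying them: much faster
    ],
    check=True,
)
```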
We recently followed an almost identical upgrade path (however, at the time PG16 wasn't yet supported by Timescale, so we stopped at PG15 + TS 2.12).
We did look into using logical replication to reduce the downtime of the upgrade, but because database schema and DDL commands aren't replicated, it seems it isn't recommended with Timescale in the loop. (I suppose the underlying schema changes that Timescale makes under the hood are mostly a function of your hypertable chunk sizing and what your incoming writes look like, so this could be planned around / timed well, but we felt it added too much potential complexity & risk compared to simply opting for a small maintenance window while pg_upgrade completed.)
Ie. a single update query which takes 30 mins.
You either have to kill and roll back that query, or suffer 30 mins of unavailability.
As far as I know, there is no way to migrate a currently in progress query.
For a software engineering project, you probably want to limit your transactions to much less than that (set statement_timeout is your friend). If you've got extremely long transactions, you can probably avoid doing the switch-over when they run (hopefully they are not a random occurrence but a result of a scheduled job or similar).
In combination with transaction time limit and fail-over configuration (where you fail the old primary), you can control the slowdown (instead of downtime, eg. with pgbouncer) very precisely.
I would be more concerned with the DNS TTL being respected in the entire stack (and external caching DNS servers you rely on), tbh.
But it is usually not critical to avoid a dozen seconds of downtime for an app, so whatever is simpler for you should be your go-to solution.
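On the statement_timeout suggestion: a minimal psycopg2 example of setting it per connection (the 30-second cap is an arbitrary choice; it can equally be set per role or in postgresql.conf):

```python
# Hypothetical example of capping query time so a switchover never has to wait
# on (or kill) something that runs for half an hour.
import psycopg2

conn = psycopg2.connect(
    "dbname=app user=app",  # placeholder DSN
    options="-c statement_timeout=30000 -c idle_in_transaction_session_timeout=60000",
)

with conn, conn.cursor() as cur:
    cur.execute("SHOW statement_timeout")
    print(cur.fetchone()[0])  # expected: '30s'
```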
Well, you’ve certainly introduced a teaching moment to me! What is the nature of the writes you deal with that last 30+ minutes? What kind of data / what kind of people are involved with such DB writes, where you need to rely on the DB engine to work so hard instead of something more split up by queues at a higher layer?
Postgres queries are streaming. I.e. if the query result has 1 billion rows but the client only has enough RAM for 1 million rows, then the query will be slowed down until the client is ready to accept more rows.
If the client is slowly processing through the result set, then the query can take many minutes/hours.
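For example, with psycopg2 a plain cursor pulls the whole result set into client memory, while a named (server-side) cursor streams it in batches, and the query stays open for as long as the consuming loop runs (DSN and table name are placeholders):

```python
# Hypothetical illustration: a server-side cursor fetches the result in
# batches, so a slow consumer keeps the query (and its snapshot) alive.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")   # placeholder DSN

with conn, conn.cursor(name="big_scan") as cur:  # named => server-side cursor
    cur.itersize = 10_000                        # rows fetched per round trip
    cur.execute("SELECT * FROM events")          # placeholder table
    for row in cur:
        pass  # slow per-row work here is what keeps the query alive for hours
```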
It's usually badly engineered things. However, just because it's badly engineered doesn't mean it's fine for it to break :-P
Things like a super complex O(n^3) query to find all users who were 3 hops away from a known fraudster, where a 'hop' means 'shared any piece of account data in common'.
There are a lot of the former in the wild, to be sure. I've had a lot of fun turning minutes-hours into milliseconds. :)
> The second change was to create a DNS record in AWS Route53 for `database.notifications.service.gov.uk` with a 1 second TTL (time to live) [..] our migration script just needed to update the DNS weighting in AWS to 100% of results being sent to the target database location and wait 1 second for the TTL to expire. Then, when our apps next try to query our database they will be querying our target database.
Wait. Their (or Python's default?) db orm blocks while it performs a DNS lookup for each and every query!? It doesn't cache resolved addresses for any length of time? No connections are pooled and reused?
It would probably be the OS' `getaddrinfo` or `gethostname` that does this: Python rarely reimplements system level calls, which means it relies on the system's configuration.
If TTL of 1s was respected, they would be cached for 1s, but it's not uncommon for DNS query libraries and especially caching DNS servers to not fully respect TTL anyway: tbh, that might explain some of the downtime they've seen.
I didn't mean it was directly implementing the networking call to the dns server -- just that it wasn't directly caching the result.
getaddrinfo(3) and getnameinfo(3) (guessing that's what you meant) don't implement caching, at least not explicitly in the spec and not normally in practice. On Windows, DNS results are cached by the OS, but on Linux that would be distro-dependent behavior and usually requires setting up a local caching DNS server (Ubuntu uses unbound out-of-the-box, iirc. Other choices include nscd and dnsmasq).
Even if they implemented caching at the syscall level, this still assumes no connection stays open for more than 1s or is reused except per query. It seems like a big assumption (at least I hope it is, because I certainly wouldn't want my app to initialize a new db connection, let alone perform a DNS lookup, for every query).
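In practice the usual pattern is pooled, long-lived connections plus reconnect-on-failure; the fresh connect is what picks up the new DNS answer. A hand-wavy sketch with psycopg2 (DSN and retry policy are made up):

```python
# Hypothetical retry wrapper: pooled connections only resolve DNS when they are
# (re)opened, so a DNS-based cutover shows up as one failed query followed by a
# reconnect, not as a lookup per query.
import time
import psycopg2
from psycopg2 import OperationalError, pool

db_pool = pool.SimpleConnectionPool(
    1, 10, dsn="host=database.internal.example dbname=app user=app"  # placeholder
)

def run_query(sql, params=None, retries=3):
    for attempt in range(retries):
        conn = db_pool.getconn()
        try:
            with conn, conn.cursor() as cur:
                cur.execute(sql, params)
                rows = cur.fetchall()
            db_pool.putconn(conn)
            return rows
        except OperationalError:
            # The connection died (e.g. during the cutover): discard it so the
            # next attempt opens a fresh socket and re-resolves the DNS name.
            db_pool.putconn(conn, close=True)
            time.sleep(0.5 * (attempt + 1))
    raise RuntimeError("query failed after retries")
```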
Lovely! We just migrated from postgres 14 to 16 for 3 postgres clusters (servers) on RDS containing about 2TB of data across 8 databases. We were down from 00:00 to 04:00. Steps we took:
* enabled our fallback "maintenance mode" site. It's a super lightweight version of our site running on CF workers.
* scaled down all apps using the db to 0 in terraform
* hit the upgrade button in the aws web ui, which runs pg_upgrade. 14->15
* waited for it to finish
* hit the upgrade button again. 15->16
* waited for the dbs to start accepting connections (they do before they're marked as ready, I think aws does more stuff than pg_upgrade)
* Started `VACUUM ANALYZE; REINDEX DATABASE CONCURRENTLY`. The idea is to avoid performance issues between versions and make use of performance improvements from new versions.
* Started spinning up the apps again
* Waited until all apps had a handful of containers running
* Started accepting traffic (disabled maintenance site)
* Went to bed
The REINDEX CONCURRENTLY happily chugged along for the biggest db for another 18 hours without blocking anything. Next time we're doing aws blue/green deploys to avoid downtime. We didn't this time since we weren't on 14.9 yet (the minimum minor version of 14 supported by blue green).
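One wrinkle with the VACUUM/REINDEX step above: both commands refuse to run inside a transaction block, so a script driving them needs autocommit. Roughly (psycopg2, database name is a placeholder):

```python
# Hypothetical post-pg_upgrade maintenance: refresh planner statistics, then
# rebuild indexes without blocking writes. Neither command can run inside a
# transaction block, hence autocommit and separate execute() calls.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=admin")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("VACUUM ANALYZE")                      # statistics for the new planner
    cur.execute("REINDEX DATABASE CONCURRENTLY mydb")  # slow, but non-blocking
```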
If I was doing this myself I wouldn't pay the AWS tax; instead I'd do blue/green myself with logical replication and a load balancer.
Have done 2 TB dbs in less than a minute.
We were running our own Postgres instances on-prem.
FWIW GOV.UK Notify is part of a suite of services offered by GDS to UK public sector bodies (along with GOV.UK Pay and GOV.UK PaaS) that was originally known as "Government As A Platform".