Zero-Downtime PostgreSQL Cutovers

tech-at-instacart · Jul 17, 2023

Authors: Brad Nicholson, Andrew Tanner, Marco Montagna

Instacart runs one of the largest PostgreSQL fleets on top of AWS RDS, and part of managing a fleet like this is keeping it up to date. Taking our systems offline for extended maintenance operations like upgrades or switching to new instance classes is unacceptable. Downtime must be measured in seconds and must be predictable to keep orders flowing and Instacart shoppers shopping.

We are going to talk about how we have solved this problem at Instacart with a Blue/Green approach using Logical Replication. We will then dive into the tooling we built to simplify the process.

Challenges with Upgrades and other Maintenance Operations

Maintaining a fleet of PostgreSQL databases with stringent downtime requirements presents numerous challenges. At Instacart, we strive to ensure that maintenance operations on our core databases are completed within seconds, rather than minutes or hours.

One of the key challenges is performing major version upgrades of PostgreSQL. Historically, such upgrades have involved prolonged periods of downtime and provided limited rollback paths; rolling back often means restoring from a backup taken before the upgrade.

Logical replication is commonly used to implement a Blue/Green approach to upgrades, providing a seamless transition from the old version to the new one while the database remains accessible throughout the upgrade process. The replica can then be promoted to complete the upgrade with minimal downtime, typically a few seconds. By reversing the replication at cutover time, we also get a quick path back to the old version should any issues be encountered.
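
To make the approach concrete, here is a minimal sketch of a conventional logical replication setup between a Blue (source) and Green (target) instance. The connection strings, role names, and publication/subscription names are hypothetical, and with the default options the subscription performs the initial table copy discussed below.

```python
# Minimal sketch of a conventional Blue/Green logical replication setup.
# DSNs, roles, and publication/subscription names are hypothetical.
import psycopg2

BLUE_DSN = "host=blue.example.internal dbname=app user=admin"
GREEN_DSN = "host=green.example.internal dbname=app user=admin"

# On the Blue (source) instance: publish every table.
blue = psycopg2.connect(BLUE_DSN)
blue.autocommit = True
with blue.cursor() as cur:
    cur.execute("CREATE PUBLICATION blue_green_pub FOR ALL TABLES")
blue.close()

# On the Green (target) instance: subscribe. With the default options this
# triggers the initial table copy, which is the problematic step for very
# large databases (see below).
green = psycopg2.connect(GREEN_DSN)
green.autocommit = True
with green.cursor() as cur:
    cur.execute("""
        CREATE SUBSCRIPTION blue_green_sub
        CONNECTION 'host=blue.example.internal dbname=app user=replicator'
        PUBLICATION blue_green_pub
    """)
green.close()
```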

It is not just upgrades that pose a problem. Many other infrastructure-level operations, like changing instance classes or applying system-level patches, also can’t be done without downtime. Several user-level operations can block access to tables or consume significant resources, which causes problems for large, busy databases. Examples include table migrations, such as converting integer columns to bigints, or running pg_repack on bloated tables. Blue/Green can also be used to minimize downtime in these cases.

While Blue/Green solves the downtime problem, there are numerous issues to overcome. It requires a high level of PostgreSQL expertise, highly privileged database access, and careful orchestration of many pieces. This results in a process that is complex and error-prone.

The biggest challenge with logical replication, and thus Blue/Green rollouts, for large, active databases like many in the Instacart fleet is actually setting up the logical replica in the first place. When initializing logical replication, PostgreSQL copies the data from the primary to the logical replica. This process can take many hours, even days, and transactions are held open on the primary for the duration of the initial copy. The WAL files (transaction logs) required to sync the replica after the initial copy completes accumulate on the primary and can put pressure on disk space. For high-write databases, this WAL accumulation can be large enough that the replica is never able to catch up after the copy.
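
For a sense of that disk pressure, a query along the following lines (sketched here with a hypothetical connection string) shows how much WAL each replication slot is forcing the primary to retain:

```python
# Sketch: how much WAL is each replication slot forcing the primary to keep?
# The connection string is hypothetical.
import psycopg2

conn = psycopg2.connect("host=blue.example.internal dbname=app user=admin")
with conn.cursor() as cur:
    cur.execute("""
        SELECT slot_name,
               active,
               pg_size_pretty(
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
               ) AS retained_wal
        FROM pg_replication_slots
    """)
    for slot_name, active, retained_wal in cur.fetchall():
        print(f"{slot_name}: active={active}, retained WAL={retained_wal}")
conn.close()
```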

Introducing: Zero Downtime Cutover

Historically Instacart has leveraged the techniques outlined in this blog post to create a logical replica from an RDS database snapshot, bypassing the problems with the initial COPY mentioned above. Once the snapshot is restored, logical replication can be set up by manipulating the LSN (Log Sequence Number, or pointer to the location in the transaction log) in order to have replication start at the correct point. In the past we have accomplished this with complex cutover plans and highly manual execution.
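
The core of that LSN manipulation is fast-forwarding the slot on the Blue instance to the point the snapshot was restored to. A minimal sketch, assuming the restored LSN has already been read from the Green instance and using hypothetical names:

```python
# Sketch: fast-forward the pre-created logical slot on Blue to the LSN the
# Green snapshot was restored to, so replication resumes from that point
# instead of doing the initial copy. The DSN, slot name, and LSN value are
# hypothetical; in practice the LSN is read from the restored Green instance.
import psycopg2

RESTORED_LSN = "5A/228290F8"  # hypothetical recovery end point of the Green instance

conn = psycopg2.connect("host=blue.example.internal dbname=app user=admin")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(
        "SELECT pg_replication_slot_advance(%s, %s::pg_lsn)",
        ("zdc_slot", RESTORED_LSN),
    )
conn.close()
```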

We have taken lessons learned across many past upgrades and built Zero Downtime Cutover, or ZDC, a comprehensive Blue/Green tooling solution that simplifies all aspects of the process. ZDC allows us to easily create a logical replica of any production PostgreSQL database using the snapshot approach, as long as it meets the underlying requirements of logical replication (such as all tables having primary keys and no schema changes being made during the process).

The replica can be on a new major (or minor) version of PostgreSQL. We can perform maintenance operations and migrations on the logical replica without impacting the primary. When ready, we can cut over to the logical replica with low downtime, typically 5 to 10 seconds. And when we do, we convert the old primary to a subscriber of the new primary to give us a fast rollback path should any significant problems occur.

We have now utilized ZDC many times for many tasks including major version upgrades, adopting modern instance classes like Graviton, migrating to 64-bit file systems, converting data types from int to bigint, and running pg_repack to address significant table bloat.

System Overview

ZDC is a Python CLI application that streamlines the process by organizing each step into safe, idempotent commands. By abstracting the intricate details of each step, it minimizes the need for in-depth PostgreSQL knowledge. Additionally, ZDC ensures safety by implementing a range of guardrails at every stage, mitigating the risk of human error and safeguarding our databases.
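
ZDC itself is internal, but the general shape described here, a CLI of idempotent commands that check their guardrails before acting, might look something like the sketch below. The command, option, and checks are invented for illustration and are not ZDC’s actual code.

```python
# Hypothetical illustration only: a CLI of idempotent commands that check
# guardrails before acting. This is not ZDC's actual code.
import sys
import click


def guardrail(condition: bool, message: str) -> None:
    """Abort the command if a safety condition is not met."""
    if not condition:
        click.echo(f"GUARDRAIL FAILED: {message}", err=True)
        sys.exit(1)


@click.group()
def cli() -> None:
    """Blue/Green cutover helper (illustrative skeleton)."""


@cli.command(name="pre-checks")
@click.option("--blue", required=True, help="Blue (source) instance identifier")
def pre_checks(blue: str) -> None:
    """Run pre-flight checks; safe to re-run any number of times."""
    guardrail(bool(blue), "a Blue instance identifier is required")
    click.echo(f"running pre-checks against {blue}...")


if __name__ == "__main__":
    cli()
```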

Let’s dive into what using ZDC looks like. The workflow is a sequence of individual ZDC commands, with a manual stage in the middle where we can run those table migrations and pg_repacks as needed.

Pre-Checks

The pre-checks command runs an extensive set of pre-flight checks to surface problem areas before proceeding with the setup. These checks cover conditions such as:

  • Tables missing primary keys or usable unique indexes (a sample catalog query appears after this list)
  • Schema incompatibilities between the old and new PostgreSQL major versions
  • Ensuring the databases are configured properly for logical replication
  • Ensuring that the schema matches between primary and replica
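
As an example of the first check, one way to find ordinary tables with no primary key is a catalog query along these lines. This is only a sketch with a hypothetical connection string; ZDC’s real checks are more extensive.

```python
# Sketch of one pre-check: ordinary tables outside the system schemas that
# have no primary key. The connection string is hypothetical.
import psycopg2

MISSING_PK_SQL = """
    SELECT n.nspname AS schema, c.relname AS table_name
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
      AND n.nspname NOT IN ('pg_catalog', 'information_schema')
      AND NOT EXISTS (
          SELECT 1 FROM pg_constraint p
          WHERE p.conrelid = c.oid AND p.contype = 'p'
      )
    ORDER BY 1, 2
"""

conn = psycopg2.connect("host=blue.example.internal dbname=app user=admin")
with conn.cursor() as cur:
    cur.execute(MISSING_PK_SQL)
    for schema, table_name in cur.fetchall():
        print(f"missing primary key: {schema}.{table_name}")
conn.close()
```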

Setup

The setup command runs all the steps needed to spin up the Green instance and set up logical replication. It can also perform several other optional tasks, like upgrading to a new PostgreSQL version and changing the instance class. The steps taken are the following:

  1. Create a logical replication slot on the Blue instance.
  2. Create a logical replication publication for all tables on the Blue instance.
  3. Take a snapshot of the Blue instance.
  4. Create the Green Instance by restoring the snapshot.
  5. Retrieve the last LSN from the Green instance.
  6. Upgrade the Green instance to a new major or minor version if desired. PostgreSQL Extensions are automatically upgraded if needed.
  7. Change the instance class if desired.
  8. Create the logical replication subscription on the Green instance in disabled mode.
  9. Fast-forward the logical replication slot on the Blue instance to the LSN retrieved in step 5, avoiding the initial data copy and the problems outlined in the introduction above. (Steps 8 through 10 are sketched after this list.)
  10. Enable the subscription on the Green instance and allow replication to catch up.
  11. Run automated data validation to ensure a clean replication setup.
  12. Optionally re-index the database to take advantage of index de-duplication on the newer PostgreSQL versions.
  13. Create the equivalent Green RDS read replicas subscribed to the Green instance.
  14. Run ANALYZE on the Green instance to re-create the PostgreSQL query planner statistics.
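
To illustrate steps 8 and 10 (with the step 9 slot advance sketched earlier), here is a sketch of creating the subscription against the pre-created slot without an initial copy and then enabling it. Names, DSNs, and credentials are hypothetical.

```python
# Sketch of steps 8 and 10: create the subscription on Green against the
# slot that already exists on Blue, skipping the copy and starting disabled,
# then enable it once the slot has been fast-forwarded (step 9).
import psycopg2

green = psycopg2.connect("host=green.example.internal dbname=app user=admin")
green.autocommit = True
with green.cursor() as cur:
    # Step 8: point at the pre-created slot, no initial copy, disabled.
    cur.execute("""
        CREATE SUBSCRIPTION zdc_sub
        CONNECTION 'host=blue.example.internal dbname=app user=replicator'
        PUBLICATION zdc_pub
        WITH (create_slot = false, slot_name = 'zdc_slot',
              copy_data = false, enabled = false)
    """)

    # ... step 9 runs on Blue: pg_replication_slot_advance('zdc_slot', <LSN>) ...

    # Step 10: start streaming changes from the fast-forwarded slot.
    cur.execute("ALTER SUBSCRIPTION zdc_sub ENABLE")
green.close()
```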

Maintenance Operations

At this stage everything is set up and ready for cutover. We pause here so we can perform any of those other maintenance operations we discussed earlier, like pg_repack, bigint migrations, and so on.

Read Replica Cutovers

The cutover-replicas command is used to cut over the RDS Read Replicas (read-only physical replicas). Because all of our application traffic goes through pgBouncer or PgCat proxies, this command simply repoints traffic at the proxy layer from the Blue to the Green instances, one replica at a time. Since the proxies can change configuration without disconnecting clients, there is minimal impact on the applications.
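
A sketch of what repointing a single replica might look like with PgBouncer: rewrite the relevant [databases] entry and ask PgBouncer to reload its configuration through the admin console. The file path, hostnames, and credentials are hypothetical, and the exact mechanics of our proxy layer differ in practice.

```python
# Sketch: repoint one replica's traffic at the proxy layer by rewriting the
# PgBouncer [databases] entry and reloading PgBouncer through its admin
# console, which keeps existing client connections.
import psycopg2

PGBOUNCER_INI = "/etc/pgbouncer/pgbouncer.ini"
OLD_HOST = "host=blue-replica-1.example.internal"
NEW_HOST = "host=green-replica-1.example.internal"

# Repoint the database entry in the config file.
with open(PGBOUNCER_INI) as f:
    config = f.read()
with open(PGBOUNCER_INI, "w") as f:
    f.write(config.replace(OLD_HOST, NEW_HOST))

# Ask PgBouncer to re-read its config without dropping client connections.
admin = psycopg2.connect("host=127.0.0.1 port=6432 dbname=pgbouncer user=pgbouncer")
admin.autocommit = True
with admin.cursor() as cur:
    cur.execute("RELOAD")
admin.close()
```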

Primary Cutover

The final step is the cutover command which cuts over the primary by coordinating and executing numerous things that must happen quickly and safely. If anything goes wrong during the process, we automatically roll everything back to the initial state with traffic served off the Blue instance.

The cutover command does the following:

  1. Run a final set of checks. Abort before starting the cutover if we find problems such as schema differences or replication problems.
  2. Reconfigure the proxies to stage the move to the Green instance but do not reload them. We don’t want to repoint the traffic just yet.
  3. Pause writes on the Blue instance by revoking the LOGIN privilege from all application users.
  4. Cancel all running application queries on the Blue instance.
  5. Ensure that replication is up-to-date. Write to a marker table on the Blue instance and wait for that row to replicate to the Green instance.
  6. Increment the primary key sequences. Set the value of each sequence on the Green instance to be one greater than the value on the Blue instance; failing to do this will result in primary key collisions. (Steps 3 through 6 are sketched after this list.)
  7. Set up reverse replication from the Green instance to the Blue instance, enabling fast, zero-data-loss rollback if needed.
  8. Reload the proxies to direct traffic to the Green instance, completing the cutover.
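
As an illustration of steps 3 through 6, the sketch below pauses writes, terminates in-flight application sessions, waits for a marker row to replicate, and bumps the Green sequences. The role names, marker table, and connection strings are hypothetical, and error handling is omitted.

```python
# Sketch of steps 3 through 6. Role names, the marker table, and DSNs are
# hypothetical, and error handling is omitted.
import time
import psycopg2

BLUE_DSN = "host=blue.example.internal dbname=app user=admin"
GREEN_DSN = "host=green.example.internal dbname=app user=admin"
APP_ROLES = ["app_rw", "app_ro"]

blue = psycopg2.connect(BLUE_DSN)
blue.autocommit = True
green = psycopg2.connect(GREEN_DSN)
green.autocommit = True

with blue.cursor() as cur:
    # Step 3: pause writes by revoking LOGIN from the application roles.
    for role in APP_ROLES:
        cur.execute(f"ALTER ROLE {role} NOLOGIN")
    # Step 4: stop in-flight application queries. Terminating the backends
    # also blocks further writes on connections that are already open.
    cur.execute("""
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE usename = ANY(%s) AND pid <> pg_backend_pid()
    """, (APP_ROLES,))
    # Step 5: write a marker row and remember its id.
    cur.execute("INSERT INTO zdc_markers DEFAULT VALUES RETURNING id")
    marker_id = cur.fetchone()[0]

# Step 5 (continued): wait until the marker row appears on Green.
with green.cursor() as cur:
    while True:
        cur.execute("SELECT 1 FROM zdc_markers WHERE id = %s", (marker_id,))
        if cur.fetchone():
            break
        time.sleep(0.1)

# Step 6: set each Green sequence one ahead of its Blue counterpart.
with blue.cursor() as bcur, green.cursor() as gcur:
    bcur.execute("SELECT schemaname, sequencename, last_value FROM pg_sequences")
    for schema, seq, last_value in bcur.fetchall():
        if last_value is not None:
            gcur.execute("SELECT setval(%s, %s)", (f"{schema}.{seq}", last_value + 1))

blue.close()
green.close()
```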

Final Steps: Teardown or Rollback

At this point, if the application is healthy and performance is good, all that is left is to tear down the Blue instances. We do this with yet another ZDC command called reset. This removes the reverse replication setup and deletes the Blue instances.

In the event of problems such as an unforeseen incompatibility with the new PostgreSQL version or a major performance regression, swift rollback becomes crucial. Fortunately, our reverse replication setup allows us to seamlessly revert to the Blue instances, and we have a set of ZDC rollback commands to make this process quick and easy. Rollback can typically be done in minutes.

To begin the rollback, we first adjust the Read Replicas to point back to the Blue instances. This is achieved by following the same reconfiguration and proxy reloading steps used during the initial replica cutover.

Next we retrace steps 1 through 6 outlined in the primary cutover section above to smoothly shift write traffic from the Green instance back to the Blue instance. Finally, we reload the proxies, redirecting all applications to once again utilize the Blue instance.

Impact

We have performed dozens of cutovers using ZDC over the past year and a half, and it is now a core tool in our arsenal for keeping our fleet up-to-date and healthy. We have upgraded to new major PostgreSQL versions, performed bigint migrations on massive tables, switched to more performant instance classes like Graviton, reclaimed significant bloat left behind from inefficiencies in older PostgreSQL versions, and migrated off 32-bit filesystems.

It has reduced our downtime from close to an hour in many cases to a handful of seconds, which is almost imperceptible to our applications. It has given us confidence as we improve our fleet of databases, knowing rollback is quick and easy.

When we consider how databases are central to everything we do, ZDC has improved life for customers, Instacart shoppers, partners, and engineers across Instacart.

Where are we going?

Though having such tooling is incredibly valuable, executing a cutover still demands careful planning and a substantial amount of time. The engineer responsible for the cutover must devote significant attention to the task and possess privileged system access. An end-to-end ZDC cutover typically requires several dedicated days of engineering effort.

To address these challenges, we have developed automated tooling that allows engineers across Instacart to execute the entire process easily and safely. Through our internal UI, engineers can perform a ZDC cutover with a few clicks, powered by Temporal workflows. This will save time and effort, ideally bringing the engineering time involved down to a few hours. Watch for a future blog post on this.

Beyond this, we have even more big plans for ZDC. We plan to add Aurora PostgreSQL support with the same snapshot-based setup and upgrade capabilities that we have for RDS. We plan to add support to downsize allocated storage size on RDS Instances. And as we continue to shard PostgreSQL instances at Instacart, we plan to extend ZDC to simplify the process of creating sharded members.

One more thing…

Last but certainly not least, we need to extend a huge thank you to Praveen Burgu, Peter Chen, Lev Kokotov and Matt Larraz who have all been key contributors to the ZDC project.
