Multi-Region Disaster Recovery Strategy on AWS

Most organizations these days are looking for ways to implement a working disaster recovery (DR) solution. Often the DR plan is a manual process that depends on approvals at every level, from Incident Management teams to Product Ownership teams.

What if there was a way to keep an active DR site running, so that a failure of the primary site would seamlessly transfer traffic to the secondary site? In fact, if the DR site is active or warm, the only manual step would be disabling the primary site via DNS.

Here are a few tips and tricks that help with a simple implementation of a warm DR for a stateless application hosted on AWS.

Machine Images (AMIs)

  1. Make sure that the AMIs are built and available in both the primary and secondary regions.
  2. A fast way to build multi-region AMIs is to run a separate build for each region using the same configuration management code, changing only the REGION parameter each time. This keeps the AMIs consistent and available in each region.
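
The per-region build described above can be sketched as follows. This is only an illustration: the `BASE_BUILD` keys, the AMI name, the playbook name, and the two region names are assumptions, not part of any particular build tool. The point is that every job is generated from the same settings and differs only in REGION.

```python
from copy import deepcopy

# Hypothetical shared build settings; the keys are illustrative and do not
# follow any real build tool's schema.
BASE_BUILD = {
    "ami_name": "webapp-base",
    "config_management_playbook": "site.yml",
    "REGION": None,
}


def build_jobs(base, regions):
    """Return one build job per region, identical except for REGION."""
    jobs = []
    for region in regions:
        job = deepcopy(base)  # never mutate the shared settings
        job["REGION"] = region
        jobs.append(job)
    return jobs


# Assumed primary and secondary regions.
jobs = build_jobs(BASE_BUILD, ["us-east-1", "us-west-2"])
for job in jobs:
    print(job["REGION"], job["ami_name"])
```

Each job would then be handed to whatever tool actually bakes the image in that region.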

Package Repository

  1. The application packages/RPMs should be available and ready to use in both regions.
  2. If the packages are hosted in private S3 buckets, then the application packages need to be uploaded to the S3 buckets in each region.
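
A quick way to catch a package that never reached the secondary region is to compare what each regional bucket holds against what the release needs. The sketch below assumes the bucket listings have already been fetched by some other means; the package names, region names, and the `missing_packages` helper are hypothetical.

```python
def missing_packages(required, listing_by_region):
    """Map each region to the sorted list of required packages it lacks."""
    return {
        region: sorted(set(required) - set(present))
        for region, present in listing_by_region.items()
    }


# Hypothetical package names and per-region bucket listings.
required = ["webapp-1.4.2.rpm", "webapp-config-1.4.2.rpm"]
listings = {
    "us-east-1": ["webapp-1.4.2.rpm", "webapp-config-1.4.2.rpm"],
    "us-west-2": ["webapp-1.4.2.rpm"],
}

gaps = missing_packages(required, listings)
for region, missing in gaps.items():
    print(region, "missing:", missing or "nothing")
```

A non-empty list for any region is a signal to upload the package there before deploying.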

Deployment

  1. The application deployment code should take REGION as one of its parameters.
  2. Every scheduled deployment should deploy to each region. A deployment counts as successful only if it succeeds in both regions.
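
The "successful only if both regions succeed" rule can be sketched as a small wrapper around the real deployment step. Here `deploy_all` and `fake_deploy` are hypothetical names, and `fake_deploy` is a stand-in that simulates a failure in the assumed secondary region.

```python
def deploy_all(regions, deploy_fn, version):
    """Deploy to every region; report success only if all regions succeed."""
    results = {}
    for region in regions:
        try:
            deploy_fn(region=region, version=version)
            results[region] = "ok"
        except Exception as exc:  # record the failure and keep going
            results[region] = f"failed: {exc}"
    success = all(status == "ok" for status in results.values())
    return success, results


def fake_deploy(region, version):
    """Stand-in for the real deployment code; fails in one region."""
    if region == "us-west-2":
        raise RuntimeError("health check timed out")


success, results = deploy_all(["us-east-1", "us-west-2"], fake_deploy, "1.4.2")
print(success, results)
```

Continuing past a failed region, rather than stopping at the first error, is a design choice made here so that the report shows the state of every region.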

DNS

  1. The best way to implement DNS in this case is weighted routing.
  2. Ideally, equal weights should be applied to both sites, but this can be left to the discretion of the business owners.
  3. Each DNS entry should have a health check that determines whether the site is healthy enough to receive traffic. This allows traffic to shift between sites automatically based on the health check.
  4. If health checks aren’t configured, then the DNS switch is a manual process.
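
To see how weights and health checks interact, the sketch below estimates each site's share of traffic. It models the behaviour in a simplified way and is not a DNS implementation: unhealthy records are skipped, and if every record is unhealthy they are all treated as eligible. The record layout and the `traffic_share` helper are assumptions made for illustration.

```python
def traffic_share(records):
    """Return each site's fraction of traffic given weights and health."""
    # Skip unhealthy records; if none are healthy, fall back to all of them.
    eligible = [r for r in records if r["healthy"]] or list(records)
    total = sum(r["weight"] for r in eligible)
    eligible_sites = {r["site"] for r in eligible}
    return {
        r["site"]: (r["weight"] / total
                    if r["site"] in eligible_sites and total else 0.0)
        for r in records
    }


# Equal weights, both sites healthy: traffic is split evenly.
both_up = traffic_share([
    {"site": "primary", "weight": 50, "healthy": True},
    {"site": "secondary", "weight": 50, "healthy": True},
])

# Primary fails its health check: the secondary takes all traffic.
primary_down = traffic_share([
    {"site": "primary", "weight": 50, "healthy": False},
    {"site": "secondary", "weight": 50, "healthy": True},
])

print(both_up)
print(primary_down)
```

Without health checks, the same shift would need someone to set the primary's weight to zero by hand.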

Database Replication

  1. Last but not least, keeping the databases replicated is the most important step in the DR strategy.
  2. Create cross-region replicas of your databases, or maintain database backups in your secondary region, such that your Recovery Point Objectives (RPO) are not breached.
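
Whether a replica or a backup schedule keeps you within the RPO comes down to simple arithmetic: the age of the newest data recoverable in the secondary region must not exceed the RPO. A minimal sketch, assuming a 15-minute RPO, a replica that is 30 seconds behind, and a backup taken 6 hours ago (all hypothetical figures, as is the `rpo_breached` helper):

```python
from datetime import datetime, timedelta, timezone


def rpo_breached(last_recovery_point, rpo, now=None):
    """True if the newest recoverable data is older than the RPO allows."""
    now = now or datetime.now(timezone.utc)
    return (now - last_recovery_point) > rpo


# Fixed reference time so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rpo = timedelta(minutes=15)

replica_point = now - timedelta(seconds=30)  # assumed replica lag
backup_point = now - timedelta(hours=6)      # assumed last backup

replica_ok = not rpo_breached(replica_point, rpo, now)
backup_ok = not rpo_breached(backup_point, rpo, now)
print("replica within RPO:", replica_ok)
print("backup within RPO:", backup_ok)
```

With these figures the replica stays within the RPO while the 6-hourly backup does not, which is why the choice between replicas and backups should follow from the RPO rather than the other way round.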