To DR or not to DR

Most organizations these days are looking for ways to implement a "working" disaster recovery (DR) solution. Often the DR plan is a manual process that depends on multiple approvals at each level, from Incident Management teams to Product Ownership teams.

What if there were a way to keep an active DR site running, so that a failure of the primary site would seamlessly transfer traffic to the secondary site? In fact, if the DR site is active or warm, the only manual step would be disabling the primary site via DNS.

Here are a few tips and tricks that help with a simple implementation of a warm DR for a stateless application hosted on AWS.

  • Make sure that the AMIs are built in both the primary and secondary regions.
  • The fastest way to build multi-region AMIs is to run a separate build for each region from the same configuration management code, changing only the "REGION" parameter each time. This keeps the AMIs consistent and available in each region.
  • The application packages/RPMs should be available and ready to use in both regions.
  • If the packages are hosted in private S3 buckets, they need to be uploaded to the S3 buckets in each region.
  • The application deployment code should take REGION as one of its parameters.
  • Every scheduled deployment should deploy to each region. A deployment counts as successful only if both regions are deployed to successfully.
  • The best way to implement DNS in this case is weighted routing.
  • Ideally, equal weights should be applied to both sites, but this can be left to the discretion of the business owners.
  • Each DNS entry should have a health check that determines whether the site is healthy enough to receive traffic. This allows traffic to shift between sites automatically based on the health check.
  • If health checks aren't configured, the DNS switch becomes a manual process.
  • Last but not least, keeping the databases replicated is the most important step in the DR strategy.
  • MySQL databases are easier to replicate because AWS provides automatic replication for MySQL, which makes it easier to create an active DR.
  • PostgreSQL is trickier. Since, at the time of writing, AWS does not provide automatic replication for PostgreSQL databases, the replication has to be done manually in the event of a DR. It is a two-way process: first from the primary site to the secondary site to bring the secondary up to date, and then, once the primary site comes back up, from the secondary site back to the primary.
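
The deployment rule from the list above (deploy to each region, and call it a success only if every region succeeds) could be sketched roughly as follows. This is a minimal illustration rather than a real pipeline: the function names and the `deploy` callable are hypothetical stand-ins for whatever deployment code takes REGION as a parameter.

```python
from typing import Callable, Dict, Iterable


def deploy_all_regions(
    regions: Iterable[str], deploy: Callable[[str], bool]
) -> Dict[str, bool]:
    """Run the (hypothetical) deployment step once per region and record each outcome."""
    results: Dict[str, bool] = {}
    for region in regions:
        try:
            # REGION is passed in as a parameter, as the list above suggests.
            results[region] = bool(deploy(region))
        except Exception:
            # A crash in one region is recorded as a failure; the
            # remaining regions are still attempted.
            results[region] = False
    return results


def deployment_succeeded(results: Dict[str, bool]) -> bool:
    """The overall deployment passes only if every region passed."""
    return bool(results) and all(results.values())
```

Calling `deployment_succeeded(deploy_all_regions(["us-east-1", "us-west-2"], my_deploy))` with your own `my_deploy` function would then gate the release on both regions. The region names here are only examples.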
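
The weighted DNS entries with health checks described in the list above might look something like this if the DNS is hosted in Amazon Route 53, which is an assumption here since the post does not name a DNS service. The sketch only builds the change batch as a plain dictionary; the record name, load balancer hostnames, and health check IDs are placeholders, and the helper names are hypothetical.

```python
from typing import Any, Dict, List


def weighted_record_change(
    record_name: str,
    region: str,
    target: str,
    health_check_id: str,
    weight: int = 50,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build one UPSERT change for a weighted CNAME tied to a health check."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": record_name,
            "Type": "CNAME",
            # SetIdentifier distinguishes records that share a name and type.
            "SetIdentifier": region,
            "Weight": weight,
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
            # A record whose health check fails stops receiving traffic.
            "HealthCheckId": health_check_id,
        },
    }


def build_change_batch(
    record_name: str, sites: List[Dict[str, str]]
) -> Dict[str, Any]:
    """Build a change batch with one equally weighted record per site."""
    return {
        "Comment": "Weighted records for primary and secondary sites",
        "Changes": [
            weighted_record_change(
                record_name,
                site["region"],
                site["target"],
                site["health_check_id"],
            )
            for site in sites
        ],
    }
```

With boto3, a batch built this way could be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the `HostedZoneId` of your zone. Equal weights of 50 mirror the "equal weights" suggestion; adjust them if the business owners prefer a different split.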