Using Amazon Simple Workflow (SWF) to Orchestrate Cross-region Deployment of RDS Resources

Summary: In this article we look at how Amazon Simple Workflow (SWF) can be used to orchestrate the cross-region deployment of RDS resources.

Recently, I have been investigating highly available (multi-AZ) cross-region configurations of RDS. The RDS API makes the task quite simple. It’s little more than:

master_db = conn.create_db_instance(db_instance_identifier='master_db_name',
                                    allocated_storage=10,
                                    db_instance_class='db.t2.micro',
                                    engine='MySQL',
                                    multi_az=True,
                                    master_username='my_user_name',
                                    master_user_password='secret')

followed by polling the status of the database, waiting for it to become ‘available’:

import time

while (conn.describe_db_instances(db_instance_identifier='master_db_name')
       ['DescribeDBInstancesResponse']['DescribeDBInstancesResult']['DBInstances'][0]
       ['DBInstanceStatus'] != 'available'):
    time.sleep(30)  # pause between polls rather than hammering the API

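then creating the asynchronous replica in the secondary region. The call has to be made against the secondary region's endpoint, with the master identified by its ARN: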

rr_db = conn.create_db_instance_read_replica(db_instance_identifier='rr_db_name',
                                             source_db_instance_identifier=master_db_arn,
                                             db_instance_class='db.t2.micro')

and once again waiting until the status changes to ‘available’:

import time

while (conn.describe_db_instances(db_instance_identifier='rr_db_name')
       ['DescribeDBInstancesResponse']['DescribeDBInstancesResult']['DBInstances'][0]
       ['DBInstanceStatus'] != 'available'):
    time.sleep(30)  # pause between polls rather than hammering the API

This approach is a good start, but there are a number of improvements we can make to ensure that deployments succeed and that, in the event of failure at any stage, we recover automatically. The first thing we might consider is using AWS CloudFormation for the highly available instance in the primary region. CloudFormation allows us to define what an environment should look like using a JSON-formatted text file, known as a template. We then submit the template to CloudFormation and ask it to create a stack based on it. CloudFormation hides all the orchestration mess and can be used to deploy very complex multi-tier environments.
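As a rough illustration of what such a template might contain, the sketch below builds a minimal one as a Python dictionary and serialises it to JSON. The resource type and property names are CloudFormation's own. The logical name MasterDB, the helper function, and the property values (copied from the earlier create_db_instance call) are assumptions made for this example. A real template would more likely take the password as a parameter than hard-code it.

```python
import json


def build_master_db_template(username, password):
    """Return a minimal CloudFormation template (JSON text) for one Multi-AZ MySQL instance."""
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Multi-AZ RDS instance for the primary region (illustrative)",
        "Resources": {
            # "MasterDB" is a hypothetical logical name chosen for this sketch.
            "MasterDB": {
                "Type": "AWS::RDS::DBInstance",
                "Properties": {
                    "AllocatedStorage": "10",
                    "DBInstanceClass": "db.t2.micro",
                    "Engine": "MySQL",
                    "MultiAZ": True,
                    "MasterUsername": username,
                    "MasterUserPassword": password,
                },
            }
        },
    }
    return json.dumps(template, indent=2)


# Same placeholder credentials as the create_db_instance example.
template_body = build_master_db_template("my_user_name", "secret")
```

The resulting template_body is the text we would submit to CloudFormation when asking it to create the stack.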

However, CloudFormation does not currently support the creation of a read-replica RDS instance in a different region to the master, so unfortunately we cannot use it to manage the entire orchestration. We could, however, still use it for the highly available instance in the primary region if we wanted to.

While CloudFormation's DSL is a powerful tool that allows us to version control our infrastructure, I find its failure detection and rollback features just as valuable. If any resource in a stack fails to be created or updated, CloudFormation can be configured to roll back all changes. This ensures changes are discrete and well understood.

Thankfully, for the parts CloudFormation cannot handle, Amazon’s Simple Workflow (SWF) service allows us to create our own complex orchestration workflows and manage state. If you have never heard of SWF, Amazon describes it as:

"Amazon SWF helps developers build, run, and scale background jobs that have parallel or sequential steps. You can think of Amazon SWF as a fully-managed state tracker and task coordinator in the Cloud."

I believe that SWF is a hidden gem within the AWS services catalogue and is quite possibly underutilised compared to other AWS services.

Instead of placing our four blocks of code in a single Python script and executing it on a single host with all the state in memory, we define a series of activity tasks:

create_multiaz_rds_instance
wait_for_rds_instance_to_become_available
create_read_replica_rds_instance

We then define the decider logic (in this case, executing each task sequentially) and create a workflow. Taking the state out of an instance becomes increasingly important as the workflow execution time increases. Deploying a multi-region RDS configuration can take upwards of 10 minutes, so it is a good candidate for SWF, especially if we need to perform the task frequently or execute multiple tasks in parallel.
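To make the sequential decider concrete, here is a minimal sketch of its core as a pure function: given the workflow history, it returns the next decision. The event and decision type names are those of the SWF API. The STEPS list, the decide function, the version string and the activityId format are assumptions for illustration. It also assumes each activity type was registered with a default task list and default timeouts, and it does not retry failed activities.

```python
# Ordered steps of the workflow; the wait activity is reused after each create.
STEPS = [
    "create_multiaz_rds_instance",
    "wait_for_rds_instance_to_become_available",
    "create_read_replica_rds_instance",
    "wait_for_rds_instance_to_become_available",
]

FAILURE_EVENTS = {"ActivityTaskFailed", "ActivityTaskTimedOut"}


def decide(events, version="1.0"):
    """Return the list of SWF decisions for the given workflow history events."""
    event_types = [event["eventType"] for event in events]

    # Give up on the first failed or timed-out activity (no retries in this sketch).
    if any(t in FAILURE_EVENTS for t in event_types):
        return [{
            "decisionType": "FailWorkflowExecution",
            "failWorkflowExecutionDecisionAttributes": {
                "reason": "an activity task failed or timed out",
            },
        }]

    # The number of completed activities tells us which step comes next.
    completed = event_types.count("ActivityTaskCompleted")
    if completed >= len(STEPS):
        return [{
            "decisionType": "CompleteWorkflowExecution",
            "completeWorkflowExecutionDecisionAttributes": {"result": "deployed"},
        }]

    return [{
        "decisionType": "ScheduleActivityTask",
        "scheduleActivityTaskDecisionAttributes": {
            "activityType": {"name": STEPS[completed], "version": version},
            # The step index keeps the id distinct when an activity type repeats.
            "activityId": "step-{}-{}".format(completed, STEPS[completed]),
        },
    }]
```

A decider worker would poll SWF for decision tasks, pass each task's history events to decide, and send the returned decisions back; that polling loop is left out here. Because the function keeps no state of its own, any host can pick up the next decision task.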

The Amazon SWF team provide a useful workflow tutorial that can be easily modified to accommodate our highly available multi-AZ cross-region example. In particular, see how they define WaitForConfirmationActivity, the activity that we can base our wait_for_rds_instance_to_become_available on.
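The polling part of wait_for_rds_instance_to_become_available could look something like the sketch below. This is not the tutorial's WaitForConfirmationActivity, only a minimal Python illustration of the idea. The get_status callable, the timeout and interval defaults, and the injectable sleep and clock parameters are all assumptions made so the function can be exercised without touching AWS.

```python
import time


def wait_for_rds_instance_to_become_available(get_status, timeout=1800, interval=30,
                                              sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns 'available'; return False on timeout."""
    deadline = clock() + timeout
    while True:
        if get_status() == "available":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)  # wait before asking again


# Example with a canned sequence of statuses instead of a real RDS call.
statuses = iter(["creating", "backing-up", "available"])
ready = wait_for_rds_instance_to_become_available(lambda: next(statuses),
                                                  sleep=lambda seconds: None)
```

In a real activity worker, get_status would wrap the describe_db_instances lookup shown earlier, and the boolean result would decide whether the activity task is reported to SWF as completed or failed.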