Summary: In part one of Experimenting with Hadoop MapReduce on the AWS Platform, we developed an Amazon VPC environment suitable as the foundation for deploying Hadoop MapReduce. In part two, we present an Amazon CloudFormation template that can be used to automate the creation of that environment, along with a collection of bash shell script commands that complete the necessary prerequisites before installing the Hortonworks Data Platform.

Amazon CloudFormation

Amazon CloudFormation is one of Amazon's deployment and management tools. Put simply, it allows you to create and update a collection of AWS resources (a CloudFormation stack) using a text file of JSON objects as input. Each JSON object is an Input Parameter, Mapping Function, Resource, or Output object. Refer to the AWS CloudFormation User Guide for a comprehensive list of supported objects and their associated properties.
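
For orientation, the minimal skeleton below shows where each of these object types lives in a template. The section names are the standard CloudFormation keys; the empty bodies are purely illustrative (a real template must declare at least one resource), and hortonworksDemo.template fills in all four sections.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Skeleton illustrating the object types used in hortonworksDemo.template",
  "Parameters": { },
  "Mappings": { },
  "Resources": { },
  "Outputs": { }
}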

While Amazon CloudFormation is a useful web service, it can take some effort to develop templates that define an entire environment stack, so it doesn't make sense to use it as the source of every environment you create on AWS. However, if you expect to deploy a particular environment, or a variation of it (e.g. Development, Test, Production), more than a couple of times, and you value efficiency, consistency or speed of deployment, then investing the time to develop an Amazon CloudFormation template makes sense.

Before we explore some of the key sections of the Amazon CloudFormation template hortonworksDemo.template, some background understanding of the web service will be helpful. If you are new to Amazon CloudFormation, I recommend that you take a look at the Getting Started Guide before reading on.

The following sections are not intended to be a detailed walkthrough of each section of the template, but rather highlight the more pertinent snippets of JSON.

Input Parameters

I wanted the template to be flexible and configurable at stack creation time. Therefore, each of the network configuration details is configurable as an input parameter, with defaults that reflect those presented in Experimenting with Hadoop MapReduce on the AWS Platform. For example, we define an input parameter VpcId that represents the CIDR address space of our VPC. The default value is set to 10.100.0.0/16 and we restrict the AllowedPattern to a generous superset of valid octets, as follows:

"VpcId": {
  "Type": "String",
  "Description": "VPC CIDR address space",
  "AllowedPattern": "^[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}/16$",
  "Default": "10.100.0.0/16"
}

With security in mind, we want the environment to be locked down to a single public IP address at creation time. We achieve this by allowing you to define a trusted public IP address, typically the IP address of your firewall, as an input parameter.

"MyPublicIP": {
  "Type": "String",
  "Description": "The source IP address used to access to the hadoop cluster manager",
  "AllowedPattern": "^[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}/32$",
  "Default": "42.43.44.45/32",
  "ConstraintDescription": "must be a valid IP"
}
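
Both parameters can be overridden at stack creation time. As a sketch, assuming the AWS CLI and an illustrative stack name of hortonworks-demo (the source address 203.0.113.10/32 is likewise illustrative), the defaults could be overridden like this:

aws cloudformation create-stack \
  --stack-name hortonworks-demo \
  --template-body file://hortonworksDemo.template \
  --parameters ParameterKey=VpcId,ParameterValue=10.100.0.0/16 \
               ParameterKey=MyPublicIP,ParameterValue=203.0.113.10/32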

Mapping Functions

Our use of Mapping Functions is very simple: we only use them to specify the correct Amazon Machine Image (AMI) for Red Hat Enterprise Linux 6.4 (x86_64) and for the NAT instance in each AWS region. Below is the relevant JSON used to perform the mapping for the RHEL AMI.

"Mappings": {
  "RHELRegionMap": {
    "us-east-1": {
      "AMI": "ami-a25415cb"
      },
    "us-west-1": {
      "AMI": "ami-6283a827"
      },
    "us-west-2": {
      "AMI": "ami-b8a63b88"
      },
    "sa-east-1": {
      "AMI": "ami-fd73d7e0"
      },
    "eu-west-1": {
      "AMI": "ami-75342c01"
      },
    "ap-southeast-1": {
      "AMI": "ami-80bbf3d2"
      },
    "ap-southeast-2": {
      "AMI": "ami-1d62f027"
      },
    "ap-northeast-1": {
      "AMI": "ami-5769f956"}
}

Using a mapping function to determine the AMI ID allows the same template to deploy resources in any region we choose.
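
The lookup itself is performed with the Fn::FindInMap intrinsic function, keyed on the AWS::Region pseudo parameter. The snippet below is a sketch rather than a verbatim extract from the template; the logical name ClusterManagerInstance is illustrative.

"ClusterManagerInstance": {
  "Type": "AWS::EC2::Instance",
  "Properties": {
    "ImageId": {
      "Fn::FindInMap": [ "RHELRegionMap", { "Ref": "AWS::Region" }, "AMI" ]
    }
  }
}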

Resources

There are few surprises within the Resources section of the template, with the possible exception of some of the Security Group rules. You will notice that some of the rules are defined inline when the Security Group itself is defined, as illustrated below:

"GroupDescription": "Security Group containing the Hadoop cluster manager resources",
"SecurityGroupIngress": [
{
  "IpProtocol": "tcp",
  "FromPort": "22",
  "ToPort": "22",
  "CidrIp": {
    "Ref": "MyPublicIP"
  }
},
{
  "IpProtocol": "tcp",
  "FromPort": "8080",
  "ToPort": "8080",
  "CidrIp":
    {
    "Ref": "MyPublicIP"
    }
},
{
 "IpProtocol": "tcp",
  "FromPort": "0",
  "ToPort": "65535",
  "SourceSecurityGroupId": {
    "Ref" : "HadoopClusterSecurityGroup"
    }
}
]

whereas some of the rules are defined as distinct resources of type AWS::EC2::SecurityGroupIngress.

"HadoopClusterSecurityGroupIngress01": {
  "Type": "AWS::EC2::SecurityGroupIngress",
  "Properties": {
    "GroupId": {
      "Ref": "HadoopClusterSecurityGroup"
    },
    "IpProtocol": "tcp",
    "FromPort": "0",
    "ToPort": "65535",
    "SourceSecurityGroupId": {
      "Ref": "HadoopClusterManagerSecurityGroup"
    }
  }
}

My preference is to use the former approach when defining Security Groups and their rules; however, the latter is necessary when there is a cyclic dependency between Security Groups, e.g. Security Group A has a rule that refers to Security Group B and Security Group B has a rule that refers to Security Group A. If both rules were defined inline, each Security Group would reference the other before it exists and CloudFormation would report a circular dependency.

Output Parameters

Output parameters are a useful way to capture pertinent information that you will want to know once the stack has been created. Common examples include EIP addresses created and assigned to hosts during the creation process; other examples include subnet IDs and host fully qualified domain names (FQDNs). As shown below, I am interested in the subnet IDs and the EIP address assigned to the cluster manager instance.

"Outputs": {
  "HadoopClusterSubnetAz1": {
    "Description": "Subnet ids for hadoop cluster subnets",
    "Value": {
      "Fn::Join": [ ",", [ {
      "Ref": "HadoopClusterSubnetAz1"
      },
      {
      "Ref": "HadoopClusterSubnetAz2"
      } ] ]
      }
    },
    "HadoopClusterManagerSubnetAz2": {
      "Description": "Subnet ids for hadoop manager subnet",
      "Value": {
        "Ref": "HadoopClusterManagerSubnetAz2"
      }
    },
    "HadoopClusterManagerEIP": {
      "Value": {
        "Ref": "HadoopClusterManagerEIP"
      },
      "Description": "Public IP address of hadoop cluster manager"
    }
}
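
Assuming the AWS CLI and the illustrative stack name used earlier, the outputs can be read back once stack creation completes:

aws cloudformation describe-stacks \
  --stack-name hortonworks-demo \
  --query 'Stacks[0].Outputs'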

Hortonworks Environmental Prerequisites

If you create a CloudFormation stack using the template located at hortonworksDemo.template, you are almost ready to start the installation of the Hortonworks Data Platform. All that remains is to perform the necessary host OS prerequisite steps. Conveniently, these steps can be automated with a series of shell script commands, so I have created an accompanying post installation script with the commands to be executed on each of the three EC2 instances (cluster manager, master, slave).
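
To give a flavour of those steps, the commands below are a minimal sketch of typical Ambari/HDP host preparation on RHEL 6.4 (disabling SELinux and iptables, enabling NTP); the accompanying post installation script remains the definitive list.

#!/bin/bash
# Disable SELinux immediately and across reboots (Ambari prerequisite)
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Stop the firewall so cluster hosts can communicate freely
service iptables stop
chkconfig iptables off

# Install and enable NTP to keep cluster clocks in sync
yum install -y ntp
service ntpd start
chkconfig ntpd on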

I have purposely omitted using cloud-init as part of the CloudFormation template and left that as an exercise for the reader. If you wish to automate the entire installation, you can either break the accompanying post installation script down into its constituent parts, such as the installation and configuration of services and files and the execution of shell commands, and drive those through cloud-init, or simply pass the entire shell script in as user-data. More likely, however, you will move the post installation configuration into your lifecycle management tool of choice, e.g. Puppet or Chef.
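
For the user-data route, the snippet below is a hedged sketch of how a few of those commands could be attached to an instance's UserData property within the template; the commands shown are illustrative, not the full script.

"UserData": {
  "Fn::Base64": {
    "Fn::Join": [ "", [
      "#!/bin/bash\n",
      "setenforce 0\n",
      "chkconfig iptables off && service iptables stop\n",
      "yum install -y ntp && service ntpd start\n"
    ] ]
  }
}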

As CloudFormation templates are nothing more than JSON, you can use your favourite text editor to develop them. I typically use the CloudFormation plugin for Eclipse, which provides syntax highlighting, autocompletion within the IDE, and the ability to launch and update a CloudFormation stack from within the development environment.

Whether you choose to create your CloudFormation stack directly from Eclipse, from the AWS Management Console, using the AWS CLI tools or via an API call in your language of choice, once you have applied the respective post installation script commands to each host you are ready to point a web browser at http://HadoopClusterManagerEIP:8080, where HadoopClusterManagerEIP is the IP address returned as an output of the CloudFormation template. You can then follow the Hortonworks installation instructions described in the article and blog post.