Summary: In this article, we present a solution architecture suitable as a foundation for a Hadoop MapReduce pilot on the AWS platform. We start by considering under what circumstances an organisation may want to manage their own Hadoop MapReduce platform on EC2 instead of simply using Amazon Elastic MapReduce (EMR). We then present a network architecture consisting of a Virtual Private Cloud (VPC), subnets, security groups and network ACLs. We conclude by considering compute and storage requirements.

Deploying Amazon Elastic MapReduce (EMR) vs Managing Your Own Stack

Let us begin by considering why an organisation might choose to use the Amazon Elastic MapReduce (EMR) web service instead of managing their own Hadoop MapReduce stack. Hadoop MapReduce is a complex software stack that can be challenging to optimise for performance, operational efficiency and stability, all while achieving a desired security posture. While there are a number of Enterprise Data Platforms based on Hadoop, such as Hortonworks Data Platform, Cloudera Enterprise and MapR Enterprise Database Edition, none of them is a case of deploy and forget; the operational overhead often adds a significant cost to the business. Amazon EMR, by contrast, lets you programmatically deploy a secure, stable, performant Hadoop MapReduce environment (network, compute, storage and the Hadoop MapReduce platform itself) in only a few minutes. These are just a few reasons to consider a managed platform such as Amazon EMR. So why would you possibly consider the alternative?

At the time of writing, Amazon EMR provides a managed Hadoop MapReduce platform based on either Amazon's own Hadoop distribution or MapR's Enterprise Database Edition. If your MapReduce stack of choice is, say, Hortonworks or Cloudera, you may decide that consistency of management across environments outweighs the additional overhead of managing your own stack. Another consideration that may sway you towards managing your own environment is the need to federate multiple remote clusters, or to provide a consistent secure gateway across each cluster using Apache Knox. You may also prefer the ability to install one of the myriad related Hadoop MapReduce services, such as Ambari, Hue, Spark, Shark or Storm, on a self-managed OS stack rather than relying on Amazon EMR bootstrap scripts. As a final consideration, you may want to explore the Apache Mesos resource manager instead of YARN, to compare how the Berkeley Data Analytics Stack (BDAS) performs against Apache Hadoop MapReduce.

Whatever your motivation for deploying your own Hadoop MapReduce stack on AWS, bear in mind that the Hadoop ecosystem is evolving rapidly; it is a melting pot of innovation, consolidation and investment. Keep a close eye on the managed platforms and revisit, as they evolve, whether your reasons for managing your own stack still make sense.

Amazon Virtual Private Cloud (VPC) Solution Architecture

The solution architecture presented below consists of a flexible design using the Amazon Virtual Private Cloud (VPC) web service. If you have little or no understanding of Amazon VPC, now would be a good time to read the Getting Started Guide before proceeding. The network architecture presented below is a little more complex than simply deploying your resources into a single subnet within a default VPC, as described in the referenced article and blog post. The additional complexity provides additional levels of control, and therefore security, within your environment. While I would harden the security posture even further for a production deployment holding sensitive data, this design should be sufficient for most pilot purposes.

[Figure: Virtual Private Cloud (VPC) Architecture]

VPC Subnets

To keep the subnet arithmetic simple and easy to follow, we will create an Amazon VPC and subnets with the following properties (a scripted sketch follows the list):

  • HadoopClusterVpc (10.0.0.0/16)
  • HadoopClusterManagerSubnetAz2 (10.0.0.0/24) - this subnet will contain our management services and Network Address Translation (NAT) host
  • HadoopClusterSubnetAz1 (10.0.1.0/24) - this subnet will contain our Hadoop master server
  • HadoopClusterSubnetAz2 (10.0.2.0/24) - this subnet will contain our Hadoop slave server
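
To make the addressing plan concrete, here is a minimal sketch of how these resources could be created with the AWS SDK for Python (boto3). The region, availability zone names and variable names are illustrative assumptions, not prescribed by the architecture:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

    # The VPC that will contain all cluster resources
    vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

    # One /24 per subnet, matching the addressing plan above;
    # the manager and slave subnets share the second availability zone
    manager = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.0.0/24",
                                AvailabilityZone="us-east-1b")
    master = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24",
                               AvailabilityZone="us-east-1a")
    slave = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.2.0/24",
                              AvailabilityZone="us-east-1b")

    # The manager subnet is public, so the VPC needs an Internet Gateway
    igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

The /16 VPC range leaves ample room to add further /24 slave subnets as the pilot cluster grows.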

While an Amazon VPC can be connected to an enterprise data centre using an IPsec VPN connection (hardware or software) or AWS Direct Connect, for simplicity we will assume that access to the VPC will be via SSH to the cluster manager host residing in subnet HadoopClusterManagerSubnetAz2.

We place the Hadoop MapReduce master and slave resources into separate subnets from the cluster manager to remove the need to associate each with a public IP address. Access to the Internet, whether for installing and managing software packages or for accessing other Amazon services, notably Amazon S3, is provided by a Network Address Translation (NAT) instance. As shown in the diagram, the NAT instance resides within the only public subnet, HadoopClusterManagerSubnetAz2, along with the cluster manager.

At the time of writing, Amazon does not provide a managed NAT resource similar to its Internet Gateway (IGW) resource. While a NAT host consists of little more than an EC2 instance with a few simple firewall rules, it does introduce additional management and operational overhead, e.g. High Availability (HA), security, performance management and running costs. For example, imagine the case where our Hadoop MapReduce cluster needs to download gigabytes of data per second from S3. While Amazon S3 can easily meet these scaling requirements, a single NAT instance will be insufficient; we would need to consider multiple NAT instances, and the architecture becomes more complex and costly. For clusters with demanding network throughput requirements, a compromise must be made between the additional security controls provided by multiple subnets and private-only IP addresses, versus assigning hosts public IP addresses and removing the need for the NAT instance. For the purpose of this pilot, we will assume that the additional security controls outweigh the restrictions of a NAT instance, and hope that Amazon provides a fully managed NAT resource in the future.
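
A minimal sketch of the NAT plumbing, continuing the boto3 example above; nat_instance_id is assumed to refer to an instance already launched from one of Amazon's NAT AMIs into the public subnet:

    # A NAT instance forwards traffic on behalf of other hosts, so the
    # default source/destination check must be disabled
    ec2.modify_instance_attribute(InstanceId=nat_instance_id,
                                  SourceDestCheck={"Value": False})

    # Route Internet-bound traffic from the private subnets via the NAT
    rt_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
    ec2.create_route(RouteTableId=rt_id, DestinationCidrBlock="0.0.0.0/0",
                     InstanceId=nat_instance_id)
    for subnet in (master, slave):
        ec2.associate_route_table(RouteTableId=rt_id,
                                  SubnetId=subnet["Subnet"]["SubnetId"])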

Network Access Control Lists (ACLs) and Security Groups

The next two components to consider are Security Groups and network Access Control Lists (ACLs). Another benefit of subdividing our environment into multiple subnets is that we can use stateless network ACLs to provide coarse-grained restrictions between subnets. Once the outer perimeter is secured with network ACLs, we use Security Groups to provide stateful access control between logical groups of hosts, in this case the Hadoop MapReduce nodes, the management node and the NAT instance. The allowed traffic flows between resources are presented in the solution architecture and are based on the referenced article and blog post. For example, access to the Hadoop cluster manager is restricted to TCP ports 8080 and 22 from a single trusted public IP address, e.g. the public IP address of your organisation's firewall.
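
As an illustration, the boto3 sketch below creates the cluster manager's security group with exactly those two ingress rules, plus a single coarse-grained network ACL entry. The address 203.0.113.10/32 is a documentation placeholder standing in for your organisation's trusted public IP:

    # Stateful security group for the cluster manager host
    sg_id = ec2.create_security_group(GroupName="hadoop-cluster-manager",
                                      Description="Cluster manager access",
                                      VpcId=vpc_id)["GroupId"]
    ec2.authorize_security_group_ingress(
        GroupId=sg_id,
        IpPermissions=[
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "203.0.113.10/32"}]},    # SSH
            {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
             "IpRanges": [{"CidrIp": "203.0.113.10/32"}]},    # manager UI
        ])

    # Stateless network ACL rule allowing TCP from the manager subnet;
    # because ACLs are stateless, a matching egress rule is also required
    acl_id = ec2.create_network_acl(VpcId=vpc_id)["NetworkAcl"]["NetworkAclId"]
    ec2.create_network_acl_entry(NetworkAclId=acl_id, RuleNumber=100,
                                 Protocol="6", RuleAction="allow", Egress=False,
                                 CidrBlock="10.0.0.0/24",
                                 PortRange={"From": 0, "To": 65535})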

Compute and Storage

For the purpose of this pilot, we keep the Hadoop MapReduce components to a management host, a master server and a single slave server; this can be extended with little additional effort. Vanilla Hadoop MapReduce runs on both Linux and Microsoft Windows OS platforms; however, Enterprise versions such as Hortonworks Data Platform understandably do not support every possible distribution of Linux. While Amazon Linux may seem the logical choice for our pilot, it is not supported by Hortonworks, so we use Red Hat Enterprise Linux 6.4 (x86, 64-bit) for the master, slave and management hosts.

Choosing a suitable Amazon Elastic Compute Cloud (EC2) instance type for the cluster manager, master, slave and NAT instance depends on the tests you intend to perform during your pilot, along with the minimum requirements recommended by Hortonworks. For the purpose of the pilot, we choose an m1.large instance for the master and slave hosts. It is common for Hadoop MapReduce clusters to utilise EC2 instances from the high-CPU instance family, but this depends on the cluster's I/O and storage requirements, and an m1.large instance is often a good starting point. The choice of instance for the NAT depends on the Internet bandwidth the management, master and slave nodes require. For the initial installation of software, a t1.micro is often sufficient; however, if you intend to conduct performance tests as part of the pilot, you will want to change the instance type to one with higher I/O throughput, such as an m1.large. The requirements of a cluster manager node are typically far lower than those of a Hadoop master or slave instance, so we will use an m1.small instance.
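
A sketch of launching the three hosts with those instance types follows; the AMI ID and key pair name are placeholders, and the subnet_ids and sg_ids dictionaries are assumed to map each role to the subnet and security group IDs created earlier:

    sizes = {"manager": "m1.small", "master": "m1.large", "slave": "m1.large"}
    for role, instance_type in sizes.items():
        ec2.run_instances(
            ImageId="ami-00000000",         # placeholder RHEL 6.4 x86_64 AMI
            InstanceType=instance_type,
            MinCount=1, MaxCount=1,
            KeyName="hadoop-pilot",         # assumed existing key pair
            SubnetId=subnet_ids[role],      # role-to-subnet mapping
            SecurityGroupIds=[sg_ids[role]])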

Finally, we consider storage, in particular the storage medium used to hold the Hadoop Distributed File System (HDFS). Given typical I/O patterns within a Hadoop cluster and the expected duration of the cluster, we can often store the source data on a highly durable data store such as Amazon S3 and use local ephemeral storage on each of the data nodes to hold the HDFS volumes. It should be noted that local ephemeral volumes have different performance characteristics to both Amazon Elastic Block Store (EBS) standard and Provisioned IOPS volumes. We will assume that the local ephemeral storage volumes will be sufficient during the pilot.
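
For the instance-store disks to be usable by HDFS they must be mapped at launch time. A sketch for an m1.large slave, which provides two ephemeral volumes; the device names follow the conventional /dev/sdX scheme and the AMI ID remains a placeholder:

    # Map both instance-store volumes so the OS sees them as block devices
    ec2.run_instances(
        ImageId="ami-00000000",
        InstanceType="m1.large",
        MinCount=1, MaxCount=1,
        SubnetId=slave["Subnet"]["SubnetId"],
        BlockDeviceMappings=[
            {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
            {"DeviceName": "/dev/sdc", "VirtualName": "ephemeral1"},
        ])

Once mounted, the two devices would be listed in dfs.data.dir (dfs.datanode.data.dir in Hadoop 2) in hdfs-site.xml so that HDFS stripes blocks across both disks.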

Summary

In this article we presented an Amazon VPC environment suitable as the foundation for deploying Hadoop MapReduce. We started by considering under what circumstances an organisation may want to deploy their own Hadoop platform on AWS instead of simply using Amazon Elastic MapReduce (EMR). We then presented a high-level solution architecture of the virtual resources (network, compute and storage) required before installing the Hadoop MapReduce stack. In the next article, we will show you how to automate the deployment of this environment in a consistent and efficient manner using Amazon CloudFormation.