How to Run Your Hadoop Cluster in 10 Minutes
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Vladimir Simek, Solutions Architect @ AWS
22/03/2016
Amazon Elastic MapReduce
How to Run Your Hadoop Cluster in 10 Minutes
Agenda
• Two different companies – two stories
• Challenges with Big Data on premises
• Technical introduction to Amazon EMR
• Amazon EMR features and benefits
• Use case of AOL – moving a 2 PB on-prem Hadoop cluster to the AWS cloud
• Short demos
In the beginning – 2 different stories
• In 2007, the New York Times decided to create a digital archive on the web – all articles from 1851–1922
• 11 million articles (4 TB of data) composed of:
  • 405,000 large TIFF images
  • 405,000 XML files
  • 3.3 million SGML files
• Used Amazon EC2 and Hadoop to process the data
Time to process? Less than 24 hours
Cost? About $240
(Undisclosed international company) – subsidiary in France
• In 2014, they decided to run a POC on Big Data analytics
• What was the first step they took? They invested €7M in a server purchase
“Want to increase innovation?Lower the cost of failure.”
Joi Ito, Director of MIT Media Lab
How many big ticket technology ideas can your budget tolerate?
€7M
(Big) Data for Competitive Advantage
Customer segmentation
Marketing spend optimization
Financial modeling & forecasting
Ad targeting & real-time bidding
Clickstream analysis
Fraud detection
Security threat detection
Challenges with In-House Infrastructure
• Fixed cost
• Slow deployment cycle
• Always on
• Self-serve upgrades
• Static: not scalable
• Outages impact production
(Storage and compute tightly coupled)
What is Amazon EMR, and how does it address these issues?
Amazon EMR
• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR distribution
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot
• Flexibility to customize
Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud
What Do I Need to Build a Cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
Choice of Multiple Instances
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge – machine learning
• Memory: m2 family, r3 family – in-memory (Spark & Presto)
• Disk/IO: d2 family, i2 family – large HDFS
• General: m1 family, m3 family – batch processing
Select an Instance
Choose Your Software (Quick Bundles)
Choose Your Software – Custom
Hadoop Applications Available in Amazon EMR
Choose Security and Access Control
You Are Up and Running!
• Master node DNS
• Information about the software you are running, logs, and features
• Infrastructure for this cluster
• Security groups and roles
Use the CLI
aws emr create-cluster --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
Or use your favorite SDK
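As a sketch of the SDK route, the same cluster can be created with the Python SDK (boto3). The cluster name is illustrative, and the default EMR roles (EMR_DefaultRole, EMR_EC2_DefaultRole) are assumed to already exist in the account:

```python
# Sketch: create the same cluster as the CLI example via boto3.
# Cluster name and IAM role names are assumptions about your account.

def build_instance_groups(core_count=2, instance_type="m3.xlarge"):
    """Mirror the CLI's --instance-groups argument: 1 master + N core nodes."""
    return [
        {"InstanceRole": "MASTER", "InstanceType": instance_type, "InstanceCount": 1},
        {"InstanceRole": "CORE", "InstanceType": instance_type, "InstanceCount": core_count},
    ]

def create_cluster(name="demo-cluster"):
    # Imported here so build_instance_groups() works even without the SDK installed.
    import boto3
    emr = boto3.client("emr")
    return emr.run_job_flow(
        Name=name,
        ReleaseLabel="emr-4.0.0",
        Instances={
            "InstanceGroups": build_instance_groups(),
            "KeepJobFlowAliveWhenNoSteps": True,  # long-running cluster
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
```

Calling `create_cluster()` requires AWS credentials configured in the environment; the response contains the new cluster's `JobFlowId`.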
Demo – Build EMR cluster
Now that I have a cluster, I need to process some data
Amazon EMR can process data from multiple sources
• Hadoop Distributed File System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB
• Amazon Kinesis
In an On-Premises Environment
Tightly coupled
Compute and Storage Grow Together
Tightly coupled
• Storage grows along with compute
• Compute requirements vary
Underutilized or Scarce Resources
[Chart: cluster utilization over 26 weeks – steady state, weekly peaks, and re-processing spikes]
Underutilized or Scarce Resources
[Chart: provisioned capacity vs. underutilized capacity over the same period]
Contention for Same Resources
Compute-bound vs. memory-bound workloads
Separation of Resources Creates Data Silos
Team A
Replication Adds to Cost
3x replication in a single datacenter
So how does Amazon EMR solve these problems?
Decouple Storage and Compute
Amazon S3 is Your Persistent Data Store
• Designed for 11 9s of durability
• $0.03 / GB / month in Ireland
• Lifecycle policies
• Versioning
• Distributed by default
• EMRFS
Amazon S3
The Amazon EMR File System (EMRFS)
• Allows you to leverage Amazon S3 as a file system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than open source components
• Consistent view – consistency for read-after-write
• Support for encryption
• Fast listing of objects
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
Benefit 1: Switch Off Clusters
Amazon S3Amazon S3 Amazon S3
Auto-Terminate Clusters
You Can Build a Pipeline
Run Transient or Long-Running Clusters
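A transient cluster can be sketched with the same CLI: add a step and `--auto-terminate` so the cluster shuts down when the work finishes. The bucket and script path below are placeholders, not real sample locations:

```shell
# Sketch: a transient cluster that runs one Hive step and terminates itself.
# s3://my-bucket/query.q is a hypothetical script location.
aws emr create-cluster --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
  --steps Type=HIVE,Name="nightly-job",ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://my-bucket/query.q] \
  --auto-terminate
```

Because the persistent data lives in Amazon S3, nothing is lost when the cluster terminates.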
Benefit 2: Resize Your Cluster
Resize the Cluster
Scale up, scale down, stop a resize in progress, or issue a resize on another instance group
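Resizing a running cluster is one CLI call; the instance group ID below is a placeholder you would read from `aws emr describe-cluster`:

```shell
# Sketch: grow the core instance group of a running cluster to 5 nodes.
# ig-XXXXXXXXXXXX is a hypothetical instance group ID.
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=5
```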
How do you scale up and save cost?
[Chart: Spot instance price fluctuating below your bid price and the on-demand (OD) price]
Spot Integration
aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
Spot Integration with Amazon EMR
• Can provision instances from the Spot market
• Impact of interruption:
  • Master node – can lose the cluster
  • Core node – can lose intermediate data
  • Task nodes – jobs will restart on other nodes (application dependent)
Scale up with Spot Instances
10-node cluster running for 14 hours
Cost = $1.00 × 10 × 14 = $140
Resize Nodes with Spot Instances
Add 10 more nodes on Spot
Resize Nodes with Spot Instances
20-node cluster running for 7 hours
Cost = $1.00 × 10 × 7 = $70 (on-demand)
     + $0.50 × 10 × 7 = $35 (Spot)
Total: $105
Resize Nodes with Spot Instances
50% less run time (14 → 7 hours)
25% less cost ($140 → $105)
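The arithmetic above can be checked in a few lines of Python; the $1.00 on-demand and $0.50 Spot hourly prices are the slide's illustrative figures, not real EC2 quotes:

```python
# Sketch of the slide's cost math; prices are illustrative, not real EC2 quotes.
def cluster_cost(od_nodes, spot_nodes, hours, od_price=1.0, spot_price=0.5):
    """Hourly billing: on-demand nodes plus Spot nodes running for the same duration."""
    return (od_nodes * od_price + spot_nodes * spot_price) * hours

baseline = cluster_cost(10, 0, 14)   # 10 on-demand nodes, 14 hours
resized  = cluster_cost(10, 10, 7)   # add 10 Spot nodes, finish in 7 hours
print(baseline, resized)             # 140.0 105.0
savings = 1 - resized / baseline     # 0.25 -> 25% cheaper, at half the run time
```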
Intelligent Scale Down
Effectively Utilize Clusters
[Chart: cluster capacity tracking the workload through elastic resizing]
Benefit 3: Logical Separation of Jobs
Prod cluster (Hive, Pig, Cascading) and an ad-hoc Presto cluster, both reading from Amazon S3
Benefit 4: Disaster Recovery Built In
Clusters 1–4 spread across Availability Zones, all backed by Amazon S3
Demo 2 – Word Count Example
Case study: How AOL moved a 2 PB cluster to the AWS cloud
AOL Data Platforms Architecture 2014
AOL
Source Systems In-house Hadoop Cluster
Database
Reporting Tools
Users
Data Stats & Insights
In-house cluster:
• Cluster size: 2 PB
• 100 nodes
• Raw data/day: 2–3 TB
• Data retention: 13–24 months
Challenges with In-House Infrastructure
• Fixed cost
• Slow deployment cycle
• Always on
• Self-serve upgrades
• Static: not scalable
• Outages impact production
(Storage and compute tightly coupled)
AOL Data Platforms Architecture 2015
Source Systems
Amazon S3
Amazon EMR
Cluster Watchdog
Amazon SNS
Amazon IAM
AOL
AWS Direct Connect
Reporting Tools
Database
Users
EMR Design Options
• Transient vs. persistent cluster
• Amazon S3 vs. local HDFS
• Elastic vs. static cluster
• On-Demand vs. Reserved vs. Spot
• Core nodes vs. task nodes
AWS vs. In-House Cost
[Chart: cost comparison by service – AWS vs. in-house]
Source: AOL & AWS billing tool
Monthly cost: in-house 4x vs. AWS 1x
* In-house cluster cost includes storage, power, and network.
[Chart: core node demand over 24 hours – 06/01/2015]
Restatement use case
• Restate historical data going back 6 months
• 10 Availability Zones
• 550 EMR clusters
• 24,000 Spot EC2 instances
[Chart: timing comparison – in-house vs. AWS]
Any questions?
Thank you!