How to run your Hadoop Cluster in 10 minutes

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Vladimir Simek, Solutions Architect @ AWS 22/03/2016 Amazon Elastic MapReduce How to run your Hadoop Cluster in 10 minutes

Upload: vladimir-simek

Post on 21-Mar-2017


Category:

Software



TRANSCRIPT

Page 1: How to run your Hadoop Cluster in 10 minutes

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Vladimir Simek, Solutions Architect @ AWS

22/03/2016

Amazon Elastic MapReduce

How to run your Hadoop Cluster in 10 minutes

Page 2: How to run your Hadoop Cluster in 10 minutes

Agenda

• Two different companies – 2 stories
• Challenges with Big Data on premises
• Technical introduction to Amazon EMR
• Amazon EMR features and benefits
• Use case of AOL – moving a 2 PB on-prem Hadoop cluster to the AWS cloud
• Short demos

Page 3: How to run your Hadoop Cluster in 10 minutes

In the beginning – 2 different stories

Page 4: How to run your Hadoop Cluster in 10 minutes

• In 2007, The New York Times decided to create a digital archive on the web – all articles from 1851-1922

• 11 million articles (4 TB of data) composed of:
  • 405,000 large TIFF images
  • 405,000 XML files
  • 3.3 million SGML files

• Used Amazon EC2 and Hadoop to process the data

Page 5: How to run your Hadoop Cluster in 10 minutes

Time to process? Less than 24 hours

Costs? About $240

Page 6: How to run your Hadoop Cluster in 10 minutes
Page 7: How to run your Hadoop Cluster in 10 minutes

(Undisclosed international company) – subsidiary in France

• In 2014, decided to run a POC on Big Data analytics

• What was their first step? They invested €7M in server purchases

Page 8: How to run your Hadoop Cluster in 10 minutes

“Want to increase innovation? Lower the cost of failure.”

Joi Ito, Director of MIT Media Lab

Page 9: How to run your Hadoop Cluster in 10 minutes

How many big ticket technology ideas can your budget tolerate?

€7M

Page 10: How to run your Hadoop Cluster in 10 minutes

(Big) Data for Competitive Advantage

Customer segmentation

Marketing spend optimization

Financial modeling & forecasting

Ad targeting & real-time bidding

Clickstream analysis

Fraud detection

Security threat detection

Page 11: How to run your Hadoop Cluster in 10 minutes

Challenges with In-House Infrastructure

Fixed Cost

Slow Deployment Cycle

Always On

Self Serve

Static: Not Scalable

Outages Impact

Production Upgrade

Storage

Compute

Page 12: How to run your Hadoop Cluster in 10 minutes

What is Amazon EMR and how does it address these issues?

Page 13: How to run your Hadoop Cluster in 10 minutes

Amazon EMR

• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR distribution
• Leverage the elasticity of the cloud
• Baked in security features
• Pay by the hour and save with Spot
• Flexibility to customize

Page 14: How to run your Hadoop Cluster in 10 minutes

Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud

Page 15: How to run your Hadoop Cluster in 10 minutes

What Do I Need to Build a Cluster?

1. Choose instances
2. Choose your software
3. Choose your access method

Page 16: How to run your Hadoop Cluster in 10 minutes

Choice of Multiple Instances

CPU: c3 family, cc1.4xlarge, cc2.8xlarge

Memory: m2 family, r3 family

Disk/IO: d2 family, i2 family

General: m1 family, m3 family

Machine Learning

Batch Processing

In-memory (Spark & Presto)

Large HDFS

Page 17: How to run your Hadoop Cluster in 10 minutes

Select an Instance

Page 18: How to run your Hadoop Cluster in 10 minutes

Choose Your Software (Quick Bundles)

Page 19: How to run your Hadoop Cluster in 10 minutes

Choose Your Software – Custom

Page 20: How to run your Hadoop Cluster in 10 minutes

Hadoop Applications Available in Amazon EMR

Page 21: How to run your Hadoop Cluster in 10 minutes

Choose Security and Access Control

Page 22: How to run your Hadoop Cluster in 10 minutes

You Are Up and Running!

Page 23: How to run your Hadoop Cluster in 10 minutes

You Are Up and Running!

Master Node DNS

Page 24: How to run your Hadoop Cluster in 10 minutes

You Are Up and Running!

Information about the software you are running, logs and features

Page 25: How to run your Hadoop Cluster in 10 minutes

You Are Up and Running!

Infrastructure for this cluster

Page 26: How to run your Hadoop Cluster in 10 minutes

You Are Up and Running!

Security Groups and Roles

Page 27: How to run your Hadoop Cluster in 10 minutes

Use the CLI

aws emr create-cluster --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

Or use your favorite SDK
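
As one illustration of the SDK route, here is a minimal Python sketch that mirrors the CLI call on this slide using boto3's `run_job_flow`. The cluster name, the region, and the default role names (`EMR_EC2_DefaultRole`, `EMR_DefaultRole`) are assumptions and are not taken from the slides. The request is built by a separate function so it can be inspected without AWS credentials.

```python
def build_cluster_request(name="demo-cluster", release_label="emr-4.0.0",
                          instance_type="m3.xlarge", core_count=2):
    """Build keyword arguments for the EMR RunJobFlow API call."""
    return {
        "Name": name,  # assumed name, not from the slides
        "ReleaseLabel": release_label,
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": instance_type,
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": instance_type,
                 "InstanceCount": core_count},
            ],
            # Keep the cluster running after launch so it can be used interactively.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed: the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(region="eu-west-1"):
    """Submit the request; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without the SDK

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_request())
    return response["JobFlowId"]


if __name__ == "__main__":
    # Only prints the request; call launch_cluster() to actually start a cluster.
    print(build_cluster_request())
```

Calling `launch_cluster()` starts billable resources, so the sketch only prints the request by default.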

Page 28: How to run your Hadoop Cluster in 10 minutes

Demo – Build EMR cluster

Page 29: How to run your Hadoop Cluster in 10 minutes

Now that I have a cluster, I need to process some data

Page 30: How to run your Hadoop Cluster in 10 minutes

Amazon EMR can process data from multiple sources

• Hadoop Distributed File System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB
• Amazon Kinesis

Page 31: How to run your Hadoop Cluster in 10 minutes

Amazon EMR can process data from multiple sources

• Hadoop Distributed File System (HDFS)
• Amazon S3 (EMRFS)
• Amazon DynamoDB
• Amazon Kinesis

Page 32: How to run your Hadoop Cluster in 10 minutes

On an On-premises Environment

Tightly coupled

Page 33: How to run your Hadoop Cluster in 10 minutes

Compute and Storage Grow Together

Tightly coupled

Storage grows along with compute

Compute requirements vary

Page 34: How to run your Hadoop Cluster in 10 minutes

Underutilized or Scarce Resources

[Chart: x-axis 1-26, y-axis 0-120; labels: Re-processing, Weekly peaks, Steady state]

Page 35: How to run your Hadoop Cluster in 10 minutes

Underutilized or Scarce Resources

[Chart: x-axis 1-26, y-axis 0-120; labels: Underutilized capacity, Provisioned capacity]

Page 36: How to run your Hadoop Cluster in 10 minutes

Contention for Same Resources

Compute bound

Memory bound

Page 37: How to run your Hadoop Cluster in 10 minutes

Separation of Resources Creates Data Silos

Team A

Page 38: How to run your Hadoop Cluster in 10 minutes

Replication Adds to Cost

3x

Single datacenter

Page 39: How to run your Hadoop Cluster in 10 minutes

So how does Amazon EMR solve these problems?

Page 40: How to run your Hadoop Cluster in 10 minutes

Decouple Storage and Compute

Page 41: How to run your Hadoop Cluster in 10 minutes

Amazon S3 is Your Persistent Data Store

• Designed for 11 9’s durability
• $0.03 / GB / month in Ireland
• Lifecycle policies
• Versioning
• Distributed by default
• EMRFS

Amazon S3

Page 42: How to run your Hadoop Cluster in 10 minutes

The Amazon EMR File System (EMRFS)

• Allows you to leverage Amazon S3 as a file-system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than open source components
• Consistent view – consistency for read after write
• Support for encryption
• Fast listing of objects

Page 43: How to run your Hadoop Cluster in 10 minutes

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/'

Page 44: How to run your Hadoop Cluster in 10 minutes

Going from HDFS to Amazon S3

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
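
The point of the two "Going from HDFS to Amazon S3" slides is that only the table location changes. A small Python sketch can make that visible by rendering both statements from one template and comparing them line by line. The helper name `table_ddl` is hypothetical, and the template is the abbreviated statement from the slides, not a complete Hive definition.

```python
# Abbreviated DDL as shown on the slides; the location is the only parameter.
DDL_TEMPLATE = (
    "CREATE EXTERNAL TABLE serde_regex(host STRING, referer STRING, agent STRING)\n"
    "ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'\n"
    "LOCATION '{location}'"
)


def table_ddl(location):
    """Render the table definition for a given storage location."""
    return DDL_TEMPLATE.format(location=location)


hdfs_ddl = table_ddl("samples/pig-apache/input/")
s3_ddl = table_ddl("s3://elasticmapreduce.samples/pig-apache/input/")

# Collect the lines that differ between the HDFS and the S3 variant.
changed = [
    (old, new)
    for old, new in zip(hdfs_ddl.splitlines(), s3_ddl.splitlines())
    if old != new
]

if __name__ == "__main__":
    for old, new in changed:
        print("-", old)
        print("+", new)
```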

Page 45: How to run your Hadoop Cluster in 10 minutes

Benefit 1: Switch Off Clusters

Amazon S3 Amazon S3 Amazon S3

Page 46: How to run your Hadoop Cluster in 10 minutes

Auto-Terminate Clusters

Page 47: How to run your Hadoop Cluster in 10 minutes

You Can Build a Pipeline

Page 48: How to run your Hadoop Cluster in 10 minutes

Run Transient or Long-Running Clusters
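
In the EMR API, the difference between a transient and a long-running cluster comes down largely to one flag on the launch request. The sketch below shows one way to toggle it on a boto3 `run_job_flow` request. The helper name `apply_lifecycle`, the step, and its S3 path are hypothetical, and pairing each mode with a particular `ActionOnFailure` value is a design assumption, not something stated in the slides.

```python
import copy


def apply_lifecycle(request, transient):
    """Return a copy of a RunJobFlow request set up as transient or long-running."""
    updated = copy.deepcopy(request)
    # A transient cluster terminates itself once all steps have finished.
    updated.setdefault("Instances", {})["KeepJobFlowAliveWhenNoSteps"] = not transient
    for step in updated.get("Steps", []):
        # Assumed policy: transient clusters shut down on failure,
        # long-running clusters stay up and wait for the next step.
        step["ActionOnFailure"] = "TERMINATE_CLUSTER" if transient else "CANCEL_AND_WAIT"
    return updated


# Hypothetical request with a single step; the bucket and script do not exist.
base_request = {
    "Name": "nightly-job",
    "ReleaseLabel": "emr-4.0.0",
    "Instances": {},
    "Steps": [{
        "Name": "example step",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/job.py"],
        },
    }],
}

transient_request = apply_lifecycle(base_request, transient=True)
long_running_request = apply_lifecycle(base_request, transient=False)
```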

Page 49: How to run your Hadoop Cluster in 10 minutes

Benefit 2: Resize Your Cluster

Page 50: How to run your Hadoop Cluster in 10 minutes

Resize the Cluster

Scale Up, Scale Down, Stop a resize, issue a resize on another
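
A resize can also be issued programmatically. The following Python sketch builds the arguments for boto3's `modify_instance_groups` call, which changes the target size of one instance group. The cluster and instance group IDs are placeholders, and the helper names are hypothetical.

```python
def resize_request(cluster_id, instance_group_id, new_count):
    """Build arguments for ModifyInstanceGroups to set a group's target size."""
    if new_count < 0:
        raise ValueError("instance count cannot be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count},
        ],
    }


def resize(emr_client, cluster_id, instance_group_id, new_count):
    """Send the resize request through an existing boto3 EMR client."""
    emr_client.modify_instance_groups(
        **resize_request(cluster_id, instance_group_id, new_count)
    )


if __name__ == "__main__":
    # Placeholder IDs; a real call would use boto3.client("emr") as emr_client.
    print(resize_request("j-EXAMPLE", "ig-EXAMPLE", 20))
```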

Page 51: How to run your Hadoop Cluster in 10 minutes

How do you scale up and save cost?

Page 52: How to run your Hadoop Cluster in 10 minutes

Spot Instance

Bid Price

OD Price

Page 53: How to run your Hadoop Cluster in 10 minutes

Spot Integration

aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3

Page 54: How to run your Hadoop Cluster in 10 minutes

Spot Integration with Amazon EMR

• Can provision instances from the Spot market
• Impact of interruption
  • Master node – Can lose the cluster
  • Core node – Can lose intermediate data
  • Task nodes – Jobs will restart on other nodes (application dependent)
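
Given these interruption risks, one conservative layout keeps the master and core groups On-Demand and bids on Spot only for task nodes. The Python sketch below builds such instance groups in the shape boto3's `run_job_flow` accepts. The helper name is hypothetical, the counts and bid price are example values, and the On-Demand choice for core nodes is an assumption.

```python
def instance_group(role, count, instance_type="m3.xlarge", bid_price=None):
    """Describe one EMR instance group; a bid price switches it to the Spot market."""
    group = {
        "Name": role.title(),
        "InstanceRole": role,  # "MASTER", "CORE" or "TASK"
        "InstanceType": instance_type,
        "InstanceCount": count,
        "Market": "SPOT" if bid_price is not None else "ON_DEMAND",
    }
    if bid_price is not None:
        # The API expects the bid as a string in USD.
        group["BidPrice"] = f"{bid_price:.2f}"
    return group


# Assumed layout: only task nodes run on Spot, since losing them
# restarts work elsewhere instead of losing the cluster or its data.
groups = [
    instance_group("MASTER", 1),
    instance_group("CORE", 2),
    instance_group("TASK", 3, bid_price=0.10),
]
```

The resulting list can be passed as `Instances["InstanceGroups"]` in a `run_job_flow` request.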

Page 55: How to run your Hadoop Cluster in 10 minutes

Scale up with Spot Instances

10 node cluster running for 14 hours

Cost = 1.0 * 10 * 14 = $140

Page 56: How to run your Hadoop Cluster in 10 minutes

Resize Nodes with Spot Instances

Add 10 more nodes on Spot

Page 57: How to run your Hadoop Cluster in 10 minutes

Resize Nodes with Spot Instances

20 node cluster running for 7 hours

Cost = 1.0 * 10 * 7 = $70 (On-Demand) + 0.5 * 10 * 7 = $35 (Spot)

Total $105

Page 58: How to run your Hadoop Cluster in 10 minutes

Resize Nodes with Spot Instances

50% less run-time (14 → 7 hours)

25% less cost ($140 → $105)
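
The arithmetic behind this comparison can be checked with a few lines of Python. The hourly rates (1.0 for On-Demand, 0.5 for Spot) are the illustrative figures from the slides, and the sketch assumes the job scales linearly, so doubling the node count halves the run-time.

```python
def cluster_cost(groups, hours):
    """Sum hourly_rate * node_count * hours over all instance groups."""
    return sum(rate * nodes * hours for rate, nodes in groups)


# Baseline: 10 On-Demand nodes at $1.00 per hour for 14 hours.
baseline = cluster_cost([(1.0, 10)], hours=14)

# Resized: the same 10 nodes plus 10 Spot nodes at $0.50 per hour, for 7 hours.
resized = cluster_cost([(1.0, 10), (0.5, 10)], hours=7)

time_saving = 1 - 7 / 14            # fraction of run-time saved
cost_saving = 1 - resized / baseline  # fraction of cost saved

if __name__ == "__main__":
    print(f"baseline ${baseline:.0f}, resized ${resized:.0f}")
    print(f"run-time saved {time_saving:.0%}, cost saved {cost_saving:.0%}")
```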

Page 59: How to run your Hadoop Cluster in 10 minutes

Intelligent Scale Down

Page 60: How to run your Hadoop Cluster in 10 minutes

Effectively Utilize Clusters

[Chart: x-axis 1-26, y-axis 0-120]

Page 61: How to run your Hadoop Cluster in 10 minutes

Benefit 3: Logical Separation of Jobs

Hive, Pig, Cascading

Prod

Presto Ad-Hoc

Amazon S3

Page 62: How to run your Hadoop Cluster in 10 minutes

Benefit 4: Disaster Recovery Built In

Cluster 1 Cluster 2

Cluster 3 Cluster 4

Amazon S3

Availability Zone Availability Zone

Page 63: How to run your Hadoop Cluster in 10 minutes

Demo 2 – Word Count Example
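
The transcript does not include the demo itself. As a rough idea of what a word count computes, here is a minimal Python sketch of the map and reduce phases running locally. It is not the demo's code, and it does not use Hadoop or EMR.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(pairs):
    """Reduce phase: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run both phases over an iterable of text lines."""
    return reduce_counts(pair for line in lines for pair in map_words(line))


if __name__ == "__main__":
    print(word_count(["to be or not to be", "that is the question"]))
```

On a real cluster the framework distributes the map output across reducers by key; here a single dictionary plays that role.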

Page 64: How to run your Hadoop Cluster in 10 minutes

Case study: How AOL moved a 2 PB cluster to the AWS cloud

Page 65: How to run your Hadoop Cluster in 10 minutes

AOL Data Platforms Architecture 2014

AOL

Source Systems In-house Hadoop Cluster

Database

Reporting Tools

Users

Page 66: How to run your Hadoop Cluster in 10 minutes

Data Stats & Insights

Cluster Size: 2 PB

In-House Cluster: 100 Nodes

Raw Data/Day: 2-3 TB

Data Retention: 13-24 Months

Page 67: How to run your Hadoop Cluster in 10 minutes

Challenges with In-House Infrastructure

Fixed Cost

Slow Deployment Cycle

Always On

Self Serve

Static: Not Scalable

Outages Impact

Production Upgrade

Storage

Compute

Page 68: How to run your Hadoop Cluster in 10 minutes

AOL Data Platforms Architecture 2015


Source Systems

Amazon S3

Amazon EMR

Cluster Watchdog

Amazon SNS

Amazon IAM

AOL

AWS Direct Connect

Reporting Tools

Database

Users

Page 69: How to run your Hadoop Cluster in 10 minutes

EMR Design Options

Amazon EMR

• Transient vs. Persistent Cluster
• Amazon S3 vs. local HDFS
• Elastic Cluster vs. Static Cluster
• On-Demand vs. Reserved vs. Spot
• Core Nodes vs. Task Nodes

Page 70: How to run your Hadoop Cluster in 10 minutes

AWS vs. In-House Cost

[Chart: Cost Comparison by service, AWS vs. In-House]

4x In-House / Month

1x AWS / Month

Source: AOL & AWS Billing Tool

** In-House cluster includes Storage, Power and Network cost.

Page 71: How to run your Hadoop Cluster in 10 minutes

Restatement Use Case

• Restate historical data going back 6 months

10 Availability Zones

550 EMR Clusters

24,000 Spot EC2 Instances

[Chart: Cores Nodes Demand - 06/01/2015, x-axis 1-24; legend: Cores...]

[Chart: Timing Comparison, In-House vs. AWS, scale 0-70]

Page 72: How to run your Hadoop Cluster in 10 minutes

Any questions?

Page 73: How to run your Hadoop Cluster in 10 minutes

Thank you!