Hadoop in the cloud with AWS' EMR

Upload: rich-morrow

Post on 17-Jan-2015

DESCRIPTION

A quick introduction to, and walkthrough of, the AWS Elastic MapReduce (EMR) service. Part of a larger course at http://bit.ly/get-hadoop

TRANSCRIPT

Page 1: Hadoop in the cloud with AWS' EMR

Hadoop in the Cloud: AWS Elastic MapReduce

• What is EMR?
• How does EMR compare to Hadoop?
• Use cases

Page 2: Hadoop in the cloud with AWS' EMR

EMR is an AWS Service

• AWS review helpful to understand
• Infiniteskills offers a course!
– http://bit.ly/learn-aws

• AWS constantly changing and evolving

http://aws.amazon.com/documentation/elasticmapreduce/

Page 3: Hadoop in the cloud with AWS' EMR

EMR Overview

• Abstracts out cluster setup & management
– Integrated provisioning, tooling, debug, monitoring
– AWS constantly tuning and optimizing
– Failed nodes automatically re-provisioned by AWS

• Reduced costs
– Clusters shut down automatically by default
– Excellent for sporadic MapReduce needs

• Integration with AWS
– Leverage cost-effective EC2 instances for processing, S3 for storage
– Monitoring done via CloudWatch
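The auto-shutdown behaviour listed above can be sketched as a cluster request. This is a minimal sketch, assuming the resulting dict would later be passed as keyword arguments to boto3's EMR `run_job_flow` call. The function name, bucket layout, release label, instance type, and role names are illustrative assumptions, not values from the slides.

```python
def transient_cluster_request(name, log_bucket, node_count):
    """Build a request for a cluster that terminates once its steps finish."""
    return {
        "Name": name,
        # Assumption: logs go to a hypothetical prefix in your own S3 bucket.
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        # Assumption: substitute whichever release label your account supports.
        "ReleaseLabel": "emr-5.36.0",
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # hypothetical instance type
            "SlaveInstanceType": "m5.xlarge",   # hypothetical instance type
            "InstanceCount": node_count,
            # False means the cluster shuts down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Assumption: the default EMR role names exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Steps": [],  # add the MapReduce steps here before submitting
    }


request = transient_cluster_request("nightly-report", "example-bucket", 3)
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

Setting the keep-alive flag to `True` instead would give a long-running cluster that has to be terminated explicitly.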

Page 4: Hadoop in the cloud with AWS' EMR

EMR Architecture

[Diagram: Master Instance Group (EC2), Core Instance Group (EC2 with HDFS), Task Instance Group (EC2), and S3]

• Master group controls cluster
• Core group runs DataNode & TaskTracker daemons
• Task group runs tasks
• Can be added & removed
• S3 can be used for data input / output
• Master group coordinates core + task activities and manages cluster state
• Core + task instances read / write to / from S3
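The three instance groups above can be sketched as configuration data. This is a minimal sketch, assuming the list is meant for the `Instances["InstanceGroups"]` field of boto3's EMR `run_job_flow` call. The helper name, group names, and instance type are hypothetical, and making the task group optional is an assumption of this sketch.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Describe the master, core, and (optional) task instance groups."""
    groups = [
        # The master group coordinates the cluster; a single node here.
        {"Name": "master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes hold HDFS data and also run tasks.
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes only run tasks and store no HDFS data.
        groups.append({"Name": "task", "InstanceRole": "TASK",
                       "InstanceType": instance_type,
                       "InstanceCount": task_count})
    return groups


for group in build_instance_groups(core_count=2, task_count=4):
    print(group["InstanceRole"], group["InstanceCount"])
```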

Page 5: Hadoop in the cloud with AWS' EMR

EMR AWS Integration

• Datastore pull / push to
– RDS
– DynamoDB
– S3

• Derived data can be stored in Redshift
– Via AWS Data Pipeline
– Further post-processing

• Data can be pre-processed with Kinesis
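The S3 pull / push case can be sketched as a single job step whose input and output are S3 locations. This is a minimal sketch, assuming a Hadoop streaming job launched through `command-runner.jar` (available on EMR release labels, not on the older AMI versions) and a dict shaped for the `Steps` list of boto3's `run_job_flow` call. The function name, bucket, script names, and paths are hypothetical.

```python
def s3_streaming_step(bucket, input_prefix, output_prefix):
    """Build a Hadoop streaming step that reads from and writes to S3."""
    base = f"s3://{bucket}"
    return {
        "Name": "streaming-wordcount",           # hypothetical step name
        "ActionOnFailure": "TERMINATE_CLUSTER",  # stop paying if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                # Hypothetical mapper/reducer scripts stored in the same bucket.
                "-files", f"{base}/code/mapper.py,{base}/code/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", f"{base}/{input_prefix}",
                "-output", f"{base}/{output_prefix}",
            ],
        },
    }


step = s3_streaming_step("example-bucket", "raw/2015-01/", "derived/2015-01/")
print(step["HadoopJarStep"]["Args"][-1])
```

The RDS, DynamoDB, Redshift, and Kinesis paths on this slide go through other connectors and services and are not covered by this sketch.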

Page 6: Hadoop in the cloud with AWS' EMR

What you give up with EMR

• Control
– Always 2-3 months behind Hadoop releases
– Cannot use CDH or HDP releases (although MapR is supported)

• Speed (if you’re not an AWS customer)
• Vendor lock-in

Page 7: Hadoop in the cloud with AWS' EMR

EMR Use Cases

• Already AWS customer
– Lots of data in S3 / DynamoDB / RDS

• Sporadic MapReduce needs
• Proof-of-concepting Hadoop
• Ease of use
– Seamless, near-infinite scale
– Simple administration
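The "sporadic MapReduce needs" case comes down to simple arithmetic: you pay only for the hours a cluster actually runs. This is a rough sketch with made-up numbers. The hourly rate, node count, and usage pattern are hypothetical and are not AWS prices.

```python
def monthly_cost(nodes, hourly_rate, hours_per_month):
    """Cost of running `nodes` instances for the given hours in a month."""
    return nodes * hourly_rate * hours_per_month


NODES = 10          # hypothetical cluster size
HOURLY_RATE = 0.50  # hypothetical per-node price, not a real AWS rate

# An always-on cluster runs 24 hours a day for a 30-day month.
always_on = monthly_cost(NODES, HOURLY_RATE, 24 * 30)

# A transient cluster runs a 2-hour job once a day, then shuts down.
transient = monthly_cost(NODES, HOURLY_RATE, 2 * 30)

print(always_on, transient)  # 3600.0 300.0
```

Under these assumptions the transient cluster costs one twelfth as much, because it runs 2 of every 24 hours.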

Page 8: Hadoop in the cloud with AWS' EMR

Hadoop in the Cloud: AWS Elastic MapReduce

• What is EMR?
• How does EMR compare to Hadoop?
• Benefits & downsides
• Use cases