
Page 1: Deep Dive: Amazon Elastic MapReduce

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Deep Dive: Amazon EMR

Rahul Pathak—Sr. Mgr. Amazon EMR (@rahulpathak)

Jason Timmes—AVP Software Development, Nasdaq

Page 2: Deep Dive: Amazon Elastic MapReduce

Why Amazon EMR?

• Easy to use: launch a cluster in minutes

• Low cost: pay an hourly rate

• Elastic: easily add or remove capacity

• Reliable: spend less time monitoring

• Secure: managed firewalls

• Flexible: you control the cluster

Page 3: Deep Dive: Amazon Elastic MapReduce

Easy to deploy

AWS Management Console, Command Line Interface (CLI),

or use the EMR API with your favorite SDK
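
For example, a minimal AWS CLI sketch for launching a cluster (the name, key pair, AMI version, and instance settings are illustrative, and the command assumes the default EMR IAM roles already exist):

aws emr create-cluster \
  --name "dev-cluster" \
  --ami-version 3.8.0 \
  --applications Name=Hive Name=Pig \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --instance-type m3.xlarge \
  --instance-count 3

On newer releases, --ami-version is replaced by --release-label (for example, emr-4.0.0, discussed later in this deck).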

Page 4: Deep Dive: Amazon Elastic MapReduce

Easy to monitor and debug

Integrated with Amazon CloudWatch: monitor cluster, node, and I/O metrics.

Page 5: Deep Dive: Amazon Elastic MapReduce

Try different configurations to find your optimal architecture.

Choose your instance types:

• CPU: c3 family, cc1.4xlarge, cc2.8xlarge

• Memory: m2 family, r3 family

• Disk/IO: d2 family, i2 family

• General: m1 family, m3 family

Typical workloads: batch processing, machine learning, Spark and interactive analysis, large HDFS.

Page 6: Deep Dive: Amazon Elastic MapReduce

Resizable clusters

Easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing.

Page 7: Deep Dive: Amazon Elastic MapReduce

Easy to use Spot Instances

• Spot Instances for task nodes: up to 90% off EC2 on-demand pricing

• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity

Meet SLA at predictable cost; exceed SLA at lower cost.

Page 8: Deep Dive: Amazon Elastic MapReduce

Amazon EMR integration with Amazon Kinesis

• Read data directly into Hive, Pig, Streaming, and Cascading from Amazon Kinesis streams

• No intermediate data persistence required

• A simple way to introduce real-time sources into batch-oriented systems

• Multi-application support and automatic checkpointing

Page 9: Deep Dive: Amazon Elastic MapReduce

The Hadoop ecosystem can run in Amazon EMR

Page 10: Deep Dive: Amazon Elastic MapReduce

Hue

Amazon S3 and HDFS

Page 11: Deep Dive: Amazon Elastic MapReduce

Hue

Query Editor

Page 12: Deep Dive: Amazon Elastic MapReduce

Hue

Job Browser

Page 13: Deep Dive: Amazon Elastic MapReduce

Leverage Amazon S3 with EMRFS

Page 14: Deep Dive: Amazon Elastic MapReduce

Amazon S3 as your persistent data store

• Separate compute and storage

• Resize and shut down Amazon EMR clusters with no data loss

• Point multiple Amazon EMR clusters at the same data in Amazon S3


Page 15: Deep Dive: Amazon Elastic MapReduce

EMRFS makes it easier to use Amazon S3

• Read-after-write consistency

• Very fast list operations

• Error handling options

• Support for Amazon S3 encryption

• Transparent to applications: s3://
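
As a small illustration of that transparency (bucket and prefix names are placeholders), standard Hadoop filesystem commands and job inputs can point directly at Amazon S3 paths, and EMRFS handles the S3 calls underneath:

hadoop fs -ls s3://my-bucket/logs/
hadoop fs -cp s3://my-bucket/logs/2015-07-01.gz hdfs:///staging/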

Page 16: Deep Dive: Amazon Elastic MapReduce

EMRFS client-side encryption

Diagram: Amazon S3 encryption clients, with EMRFS enabled for Amazon S3 client-side encryption. A key vendor (AWS KMS or your custom key vendor) supplies the keys; the client-side encrypted objects are stored in Amazon S3.

Page 17: Deep Dive: Amazon Elastic MapReduce

HDFS is still there if you need it

• Iterative workloads

– If you’re processing the same dataset more than once

– Consider using Spark & RDDs for this too

• Disk I/O intensive workloads

• Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing
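
For example (bucket names and paths are placeholders; the jar location varies by EMR release), S3DistCp can stage a dataset from Amazon S3 into HDFS before an I/O-intensive or iterative job, and copy the results back afterward:

hadoop jar /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar \
  --src s3://my-bucket/input/ \
  --dest hdfs:///data/input/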

Page 18: Deep Dive: Amazon Elastic MapReduce

Amazon EMR—Design Patterns

Page 19: Deep Dive: Amazon Elastic MapReduce

EMR example #1: Batch Processing

GBs of logs pushed to S3 hourly. A daily EMR cluster uses Hive to process the data. Input and output are stored in S3.

250 Amazon EMR jobs per day, processing 30 TB of data

http://aws.amazon.com/solutions/case-studies/yelp/

Page 20: Deep Dive: Amazon Elastic MapReduce

EMR example #2: Long-running cluster

Data is pushed to S3. A daily EMR cluster ETLs the data into a database. A 24/7 EMR cluster running HBase holds the last 2 years of data. A front-end service uses the HBase cluster to power a dashboard with high concurrency.

Page 21: Deep Dive: Amazon Elastic MapReduce

EMR example #3: Interactive query

TBs of logs sent daily. Logs stored in Amazon S3, with the Hive metastore on Amazon EMR. Interactive query using Presto on a multi-petabyte warehouse.

http://nflx.it/1dO7Pnt

Page 22: Deep Dive: Amazon Elastic MapReduce

EMR example #4: Streaming data processing

TBs of logs sent daily. Logs stored in Amazon Kinesis.

Processing options: Amazon Kinesis Client Library, AWS Lambda, Amazon EMR, Amazon EC2.

Page 23: Deep Dive: Amazon Elastic MapReduce

Optimizations for storage

Page 24: Deep Dive: Amazon Elastic MapReduce

File formats

• Row oriented

– Text files

– Sequence files (writable objects)

– Avro data files (described by a schema)

• Columnar format

– Optimized row columnar (ORC)

– Parquet

Diagram: a logical table stored in row-oriented vs. column-oriented layout.

Page 25: Deep Dive: Amazon Elastic MapReduce

Choosing the right file format

• Processing and query tools: Hive, Impala, and Presto

• Evolution of schema: Avro for schema and Parquet for storage

• File format “splittability”: avoid JSON/XML files, or treat each whole file as a single record

• Compression—block or file

Page 26: Deep Dive: Amazon Elastic MapReduce

File sizes

• Avoid small files

– Anything smaller than 100 MB

• Each mapper is a single JVM

– CPU time is required to spawn JVMs/mappers

• Fewer files, matching closely to block size

– Fewer calls to S3

– Fewer network/HDFS requests

Page 27: Deep Dive: Amazon Elastic MapReduce

Dealing with small files

• Reduce HDFS block size, e.g. 1 MB (default is 128 MB)

– --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"

• Better: use S3DistCp to combine smaller files together

– S3DistCp takes a pattern and a target path to combine smaller input files into larger ones

– Supply a target size and compression codec
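
For illustration, here is a hedged sketch of an S3DistCp run on the master node (bucket names, the grouping regex, and the jar path are placeholders; the jar location varies by EMR release):

hadoop jar /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar \
  --src s3://my-bucket/small-logs/ \
  --dest s3://my-bucket/combined-logs/ \
  --groupBy '.*/(\w+)-.*\.log' \
  --targetSize 128 \
  --outputCodec gz

--groupBy concatenates files whose names match the same capture group, --targetSize sets the approximate output file size in MiB, and --outputCodec recompresses the combined output.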

Page 28: Deep Dive: Amazon Elastic MapReduce

Compression

• Always compress data files on Amazon S3

– Reduces network traffic between Amazon S3 and

Amazon EMR

– Speeds up your job

• Compress mapper and reducer output

Amazon EMR compresses inter-node traffic with LZO on Hadoop 1 and with Snappy on Hadoop 2
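
As an illustrative sketch (the property names are the stock Hadoop 2 settings; the codec choices are examples), intermediate and final output compression can be enabled with the same configure-hadoop bootstrap action shown on the previous slide:

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapreduce.map.output.compress=true,-m,mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec,-m,mapreduce.output.fileoutputformat.compress=true,-m,mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec"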

Page 29: Deep Dive: Amazon Elastic MapReduce

Choosing the right compression

• For time-sensitive jobs, faster compression is a better choice

• For large amounts of data, use space-efficient compression

• For combined workloads, use gzip

Algorithm        Splittable?   Compression ratio   Compress + decompress speed
Gzip (DEFLATE)   No            High                Medium
bzip2            Yes           Very high           Slow
LZO              Yes           Low                 Fast
Snappy           No            Low                 Very fast

Page 30: Deep Dive: Amazon Elastic MapReduce

Cost saving tips for Amazon EMR

• Use S3 as your persistent data store; query it using Presto, Hive, Spark, etc.

• Only pay for compute when you need it

• Use Amazon EC2 Spot Instances to save > 80%

• Use Amazon EC2 Reserved Instances for steady workloads

• Use Amazon CloudWatch alerts to notify you if a cluster is underutilized, then shut it down. E.g. 0 mappers running for > N hours
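
As a hedged sketch (the cluster ID, SNS topic ARN, period, and threshold are placeholders), a CloudWatch alarm on the EMR IsIdle metric can flag a cluster that has sat idle for several hours:

aws cloudwatch put-metric-alarm \
  --alarm-name emr-cluster-idle \
  --namespace AWS/ElasticMapReduce \
  --metric-name IsIdle \
  --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
  --statistic Average \
  --period 3600 \
  --evaluation-periods 4 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts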

Page 31: Deep Dive: Amazon Elastic MapReduce

OPTIMIZING DATA WAREHOUSING COSTS WITH S3 AND EMR (July 9, 2015)

Jason Timmes, AVP of Software Development; Nate Sammons, Principal Architect

Page 32: Deep Dive: Amazon Elastic MapReduce


We make the world’s capital markets move faster, more efficient, and more transparent.

• Public company in the S&P 500

• Develop and run markets globally in all asset classes

• We provide technology, trading, intelligence, and listing services

• Intense operational focus on efficiency and competitiveness

We provide the infrastructure, tools, and strategic insight to help our customers navigate the complexity of global capital markets and realize their capital ambitions.

Get to know us: we have uniquely transformed our business from predominantly a U.S. equities exchange to a global provider of corporate, trading, technology, and information solutions.

Page 33: Deep Dive: Amazon Elastic MapReduce


• Leading index provider with 41,000+ indexes across asset classes and geographies

• Over 10,000 corporate clients in 60 countries

• Our technology powers over 70 marketplaces, regulators, CSDs, and clearinghouses in over 50 countries

• 100+ data product offerings supporting 2.5+ million investment professionals and users in 98 countries

• 26 markets, 3 clearing houses, 5 central securities depositories

• Lists more than 3,500 companies in 35 countries, representing more than $8.8 trillion in total market value

IGNITE YOUR AMBITION

Page 34: Deep Dive: Amazon Elastic MapReduce

CURRENT STATE

• Replaced on-premises data warehouse with Amazon Redshift in 2014

• On-premises warehouse held 1 year of data

• Amazon Redshift migration yielded 57% cost savings over the legacy solution

• Business now wants more than 1 year of data in the warehouse

• Currently have Jan. 2014 to present in Amazon Redshift (18 dw1.8xl nodes)

Page 35: Deep Dive: Amazon Elastic MapReduce

THE PROBLEM

• Historical year archives are queried much less frequently

• Amazon Redshift is at a very competitive price point, but for rarely needed data it’s still expensive

• Need a solution that stores historical data once (cheaply), and can use elastic compute directly on that data only as needed

Page 36: Deep Dive: Amazon Elastic MapReduce

REQUIREMENTS

• Decouple storage and compute resources

• All data is encrypted in flight and at rest

• SQL interface to all archived data

• Parallel ingest into Amazon Redshift and the new (historical years) warehouse

• Transparent usage-based billing to internal departments (clients)

• Isolate workloads between clients (no resource contention)

• Ideally, deliver a platform that can enable different kinds of compute paradigms over the same data (SQL, stream processing, etc.)

Page 37: Deep Dive: Amazon Elastic MapReduce

ARCHITECTURE DIAGRAM


Page 38: Deep Dive: Amazon Elastic MapReduce

SEPARATE STORAGE & COMPUTE RESOURCES

• Single source of truth: highly durable encrypted files in S3

– Avoids read hotspots; very high concurrent read capability

• Multiple compute clusters isolate workloads for users

– Each client runs their own EMR clusters in their own Amazon VPCs

– No contention on compute resources, or even IP address ranges

– Scale compute up/down as needed (and manage costs)

• Cost allocation using multiple AWS accounts (1 per internal budget)

– S3 and the Hive metastore are the only shared infrastructure costs; extremely cheap

– Consolidated billing makes cost allocations easy and fully transparent

• Run multiple query layers and experiment with new projects

– Spark, Drill, etc.

Page 39: Deep Dive: Amazon Elastic MapReduce

DATA SECURITY & ENCRYPTION

Current state:

• Nasdaq KMS for encryption keys

• Key hierarchy uses MySQL, rooted in an HSM cluster

• Using the S3 “Encryption Materials Provider” interface

Future state:

• Working with InfoSec to evaluate AWS KMS

• Nasdaq KMS keys rooted in AWS KMS

• Move the MySQL DB to Amazon Aurora (encrypted with AWS KMS)

• AWS KMS support in AWS services is growing

Page 40: Deep Dive: Amazon Elastic MapReduce

ENCRYPTED DATA ACCESS MONITORING/CONTROL


Page 41: Deep Dive: Amazon Elastic MapReduce

S3 FILE FORMAT: PARQUET

• Parquet file format: http://parquet.apache.org

• Self-describing columnar file format

• Supports nested structures (Dremel “record shredding” algorithm)

• Emerging standard data format for Hadoop

– Supported by: Presto, Spark, Drill, Hive, Impala, etc.

Page 42: Deep Dive: Amazon Elastic MapReduce

PARQUET VS. ORC

• Evaluated Parquet and ORC (competing open columnar formats)

• ORC encrypted performance is currently a problem

– 15x slower vs. unencrypted (94% slower)

– 8 CPUs on 2 nodes: ~900 MB/sec unencrypted vs. ~60 MB/sec encrypted

• Encrypted Parquet is ~27% slower vs. unencrypted

– Parquet: ~100 MB/sec from S3 per CPU core (encrypted)

• DATE field support in Parquet is not ready yet

– DATE + Parquet not supported in Presto at this time

– Currently writing DATEs as INTs (2015-06-22 => 20150622)

Page 43: Deep Dive: Amazon Elastic MapReduce

PARQUET DATA CONVERSION & MANAGEMENT

• Generic schema management system

– Supports multiple versions of a named table schema

• Data conversion API

– Simple Java API to read/write records in Parquet

– Encodes schema version and data transformations in metadata

– Automatically converts DATE to INT for now

– Migrate to a new native DATE schema version when available

• Parquet doesn’t really handle “ALTER TABLE…” migrations

– Schema changes require building new Parquet files (via EMR MapReduce jobs)

• CSV-to-Parquet conversion tools

– Given a JSON schema definition and CSV, produce Parquet files

– Writes files into a Hive-compliant directory structure

Page 44: Deep Dive: Amazon Elastic MapReduce

SQL QUERY LAYER: PRESTO

• A distributed SQL engine developed by Facebook: https://prestodb.io

– Used by Facebook, Netflix, Airbnb, and others

• Reads schema definitions from Hive and data files from S3 or HDFS

• Recent enterprise support and contributions from Teradata: http://www.teradata.com/Presto

• Data encryption support added by Nasdaq

– Working on contributing this back to Presto through GitHub

• Uses the Hive metastore to impose table definitions on S3 files
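
As a hedged sketch of a query against that layer (the table and column are hypothetical, and this assumes the Presto CLI is installed as presto-cli with a Hive catalog configured), note the INT-encoded date predicate from the convention described on the Parquet vs. ORC slide:

presto-cli --catalog hive --schema default \
  --execute "SELECT count(*) FROM orders WHERE trade_date = 20150622"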


Page 45: Deep Dive: Amazon Elastic MapReduce

STREAM PROCESSING LAYER: SPARK

• One of our clients wants to test stream processing on order data

• Spark supports Parquet

• Transparent file decryption prior to Spark data processing layer

• Needs to be tested, but should “just work”

• …we’re hoping to support Hadoop projects today and in the future


Page 46: Deep Dive: Amazon Elastic MapReduce

DATA INGEST

• Daily ingest into Amazon Redshift will also write data to S3

• Averaging ~6 billion rows/day currently (1.9 TB/day uncompressed)

• Build Parquet files directly from the CSV files loaded into Amazon Redshift

• Limit Amazon Redshift to a 1-year rolling window of data

– Effectively caps Amazon Redshift costs

• Future enhancement:

– Use Presto as a unified SQL access layer to Amazon Redshift and EMR/S3

Page 47: Deep Dive: Amazon Elastic MapReduce

New Features

Page 48: Deep Dive: Amazon Elastic MapReduce

Amazon EMR is now HIPAA-eligible

Page 49: Deep Dive: Amazon Elastic MapReduce

Planned for next month: EMR 4.0

• Directly configure Hadoop application settings

• Hadoop 2.6, Spark 1.4.0, Hive 1.0, Pig 0.14

• Standard ports and paths (Apache Bigtop standards)

• Quickly configure and launch clusters for standard use cases

Page 50: Deep Dive: Amazon Elastic MapReduce

Planned for next month: EMR 4.0

• Use the configuration parameter to directly change default settings for Hadoop ecosystem applications instead of using the “configure-hadoop” bootstrap action
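
As a hedged sketch of what that might look like with the CLI (the release label, application list, classification, and property value are illustrative):

aws emr create-cluster \
  --name "emr-4-cluster" \
  --release-label emr-4.0.0 \
  --applications Name=Hive Name=Spark \
  --use-default-roles \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --configurations '[{"Classification":"core-site","Properties":{"io.file.buffer.size":"65536"}}]'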

Page 51: Deep Dive: Amazon Elastic MapReduce

NEW YORK

Rahul Pathak—Sr. Mgr. Amazon EMR (@rahulpathak)

Jason Timmes—AVP Software Development, Nasdaq