AWS re:Invent re:Cap - Data Analytics: Amazon EC2 C4 Instances + Amazon EBS - 김일호


December 10, 2014 | Korea

김 일호, Solutions Architect

BDT201 - Big Data and HPC State of the Union

BDT202 - HPC Now Means 'High Personal Computing'

BDT203 - From Zero to NoSQL Hero: Amazon DynamoDB Tutorial

BDT204 - Rendering a Seamless Satellite Map of the World with AWS and NASA Data

BDT205 - Your First Big Data Application on AWS

BDT206 - See How Amazon Redshift is Powering Business Intelligence in the Enterprise

BDT207 - Use Streaming Analytics to Exploit Perishable Insights

BDT208 - Finding High Performance in the Cloud for HPC

BDT209 - Intel’s Healthcare Cloud Solution Using Wearables for Parkinson’s Disease Research

BDT302 - Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR and Analyzing Results with Amazon Redshift

BDT303 - Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift

BDT305 - Lessons Learned and Best Practices for Running Hadoop on AWS

BDT306 - Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesis

BDT307 - Running NoSQL on Amazon EC2

BDT308 - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse

BDT308-JT - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse - Japanese Track

BDT309 - Delivering Results with Amazon Redshift, One Petabyte at a Time

BDT309-JT - Delivering Results with Amazon Redshift, One Petabyte at a Time - Japanese Track

BDT310 - Big Data Architectural Patterns and Best Practices on AWS

BDT311 - MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads

BDT312 - Using the Cloud to Scale from a Database to a Data Platform

BDT401 - Big Data Orchestra - Harmony within Data Analysis Tools

BDT402 - Performance Profiling in Production: Analyzing Web Requests at Scale Using Amazon Elastic MapReduce and Storm

BDT403 - Netflix's Next Generation Big Data Platform

Big data pipeline on AWS:

Collect: Amazon Kinesis, AWS Direct Connect, AWS Import/Export

Store: Amazon S3, Amazon Glacier, Amazon DynamoDB

Process & Analyze: Amazon Redshift, Amazon EMR, Amazon EC2

Automate: AWS Data Pipeline

Example flow: sources (Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools) feed Amazon EMR; results load into Amazon Redshift and are consumed by visualization tools, business intelligence tools, and GIS tools, with AWS Data Pipeline automating the flow.

Demo pipeline: Log4J → Amazon Kinesis → EMR-Kinesis connector → Hive with Amazon S3 → Amazon Redshift (parallel COPY from Amazon S3), with Amazon Kinesis processing state checkpointed along the way.

Launch a 3-instance Hadoop 2.4 cluster with Hive installed (instance type m3.xlarge; substitute YOUR-AWS-REGION, YOUR-AWS-SSH-KEY, and YOUR-BUCKET-NAME).
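The cluster-creation command itself did not survive the transcript; a minimal sketch of an equivalent call, assuming the late-2014 AWS CLI for EMR (the AMI version, cluster name, and log prefix are assumptions):

# launch a 3-node Hadoop 2.4 / Hive cluster on EMR
aws emr create-cluster \
    --region YOUR-AWS-REGION \
    --name "BDT205-demo" \
    --ami-version 3.3 \
    --applications Name=Hive \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
    --log-uri s3://YOUR-BUCKET-NAME/logs/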

Create an Amazon Kinesis stream to hold incoming data:

aws kinesis create-stream \
    --stream-name AccessLogStream \
    --shard-count 2

Create an Amazon Redshift cluster to receive the processed data (substitute CHOOSE-A-REDSHIFT-PASSWORD, YOUR-IAM-ACCESS-KEY, and YOUR-IAM-SECRET-KEY).
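The create-cluster call was stripped down to its placeholders; a sketch with an assumed identifier, node type, and master username (only the password placeholder comes from the slide):

# create a single-node Redshift cluster
aws redshift create-cluster \
    --cluster-identifier demo \
    --db-name demo \
    --node-type dw2.large \
    --cluster-type single-node \
    --master-username master \
    --master-user-password CHOOSE-A-REDSHIFT-PASSWORD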

Log4J: start the producer pushing access-log events into the stream, then connect to the cluster (substitute YOUR-AWS-SSH-KEY, YOUR-EMR-MASTER-PRIVATE-DNS, and YOUR-EMR-HOSTNAME).
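A sketch of the login step, assuming the standard EMR login user:

# log in to the master node ("hadoop" is the EMR default user)
ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-HOSTNAME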

Start Hive, then set your credentials and region for the session (substitute YOUR-IAM-ACCESS-KEY, YOUR-IAM-SECRET-KEY, and YOUR-AWS-REGION):

hive

Create a Hive table that reads directly from the Kinesis stream:

hive>
      STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
      TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");
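The body of the CREATE TABLE was lost in transcription; a sketch of a complete statement, assuming an Apache access-log schema parsed with Hive's RegexSerDe (the table name, columns, and regex are assumptions; the storage handler and stream property are from the slide):

CREATE TABLE apache_log (
    host STRING,
    identity STRING,
    user_name STRING,
    request_time STRING,
    request STRING,
    status STRING,
    size STRING,
    referrer STRING,
    agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    -- one capture group per column of a combined-format access log
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES ("kinesis.stream.name" = "AccessLogStream");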

-- return the first row in the stream
hive>

-- return the count of all items in the stream
hive>

-- return the count of all rows with a given host
hive>
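Sketches of the three elided queries, in order, against the apache_log table assumed above (the host value is hypothetical):

SELECT * FROM apache_log LIMIT 1;

SELECT COUNT(1) FROM apache_log;

SELECT COUNT(1) FROM apache_log WHERE host = 'SOME-HOST';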

Demo pipeline, current stage: Log4J → Amazon Kinesis → EMR-Kinesis connector.

Monitor the job from the Hadoop web UIs, reached over a local SSH tunnel to the master node:

http://127.0.0.1:19026/cluster (YARN ResourceManager)

http://127.0.0.1:19101 (HDFS NameNode)
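A sketch of the tunnel that makes those URLs resolve locally, assuming the EMR AMI 3.x default web ports (9026 for the ResourceManager, 9101 for the NameNode):

ssh -i YOUR-AWS-SSH-KEY -N \
    -L 19026:YOUR-EMR-MASTER-PRIVATE-DNS:9026 \
    -L 19101:YOUR-EMR-MASTER-PRIVATE-DNS:9101 \
    hadoop@YOUR-EMR-HOSTNAME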

Write the results to Amazon S3 under YOUR-S3-BUCKET/emroutput:

hive>

-- set up Hive's "dynamic partitioning",
-- which splits output files when writing to Amazon S3
hive>
hive>

-- compress output files on Amazon S3 using Gzip
hive>
hive>
hive>
hive>

-- convert the Apache log timestamp to a UNIX timestamp and
-- split files in Amazon S3 by the hour in the log lines
hive>
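Read together, a sketch of the statements those prompts held, assuming the apache_log table above (the output table name and column choices are assumptions; the SET properties are standard Hive):

-- external results table on Amazon S3, one partition per hour
CREATE EXTERNAL TABLE apache_log_s3 (
    host STRING,
    request STRING,
    status STRING)
PARTITIONED BY (hour INT)
LOCATION 's3://YOUR-S3-BUCKET/emroutput';

-- dynamic partitioning: split output files by partition value
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- gzip-compress the files written to Amazon S3
SET hive.exec.compress.output = true;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;

-- parse the Apache timestamp to a UNIX timestamp and partition by hour
INSERT OVERWRITE TABLE apache_log_s3 PARTITION (hour)
SELECT host, request, status,
       hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
FROM apache_log;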

Demo pipeline, current stage: Log4J → Amazon Kinesis → EMR-Kinesis connector → Hive with Amazon S3.

Verify the partitioned, compressed output files under YOUR-S3-BUCKET/emroutput.

Connect to Amazon Redshift:

# using the PostgreSQL CLI (substitute YOUR-REDSHIFT-ENDPOINT)

Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Redshift support:

• Aginity Workbench for Amazon Redshift

• SQL Workbench/J
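A sketch of the CLI connection mentioned above (the port, user, and database name are assumptions; Redshift speaks the PostgreSQL wire protocol):

psql -h YOUR-REDSHIFT-ENDPOINT -p 5439 -U master demo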

Create a table for the logs and load it with a parallel COPY from Amazon S3 (substitute YOUR-S3-BUCKET, YOUR-IAM-ACCESS-KEY, and YOUR-IAM-SECRET-KEY).
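A sketch of the load, assuming a target table named accesslog and the tab-delimited, gzipped output from the Hive step (the table name and layout are assumptions; COPY ... CREDENTIALS ... GZIP is standard Redshift syntax):

COPY accesslog
FROM 's3://YOUR-S3-BUCKET/emroutput'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS-KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t' GZIP;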

-- show all requests from a given IP address

-- count all requests on a given day

-- show all requests referred from other sites
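Sketches of the three queries above, in order, against the assumed accesslog table (column names and literal values are hypothetical):

SELECT * FROM accesslog WHERE host = 'SOME-IP-ADDRESS';

SELECT COUNT(1) FROM accesslog WHERE request_date = '2014-11-08';

SELECT * FROM accesslog WHERE referrer <> '-' AND referrer NOT LIKE '%YOUR-SITE%';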

Demo pipeline, current stage: Log4J → Amazon Kinesis → EMR-Kinesis connector → Hive with Amazon S3 → Amazon Redshift (parallel COPY from Amazon S3).

Bonus: checkpointed iterations over the stream.

hive>
hive>
hive>
hive>
hive>

-- create an external table on Amazon S3 to hold query results,
-- partitioned (files split on Amazon S3) by iteration
hive> (LOCATION under YOUR-S3-BUCKET)

-- set up a first iteration;
-- create the OS-ERROR_COUNT result (404 error codes) under dynamic partition 0

-- set up a second iteration over the data in the Kinesis stream;
-- create the OS-ERROR_COUNT result under dynamic partition 1 -
-- if the file is empty, the previous iteration read all remaining stream data
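A sketch of what those prompts likely held, using the EMR-Kinesis connector's checkpoint properties as documented for EMR AMI 3.x (the DynamoDB table, logical name, and result-table layout are assumptions):

-- checkpoint state lives in a DynamoDB table
SET kinesis.checkpoint.enabled = true;
SET kinesis.checkpoint.metastore.table.name = MyEMRKinesisTable;
SET kinesis.checkpoint.metastore.hash.key.name = HashKey;
SET kinesis.checkpoint.metastore.range.key.name = RangeKey;
SET kinesis.checkpoint.logical.name = AccessLogAnalysis;

-- results table on Amazon S3, one partition per iteration
CREATE EXTERNAL TABLE error_count (agent STRING, error_count INT)
PARTITIONED BY (iteration INT)
LOCATION 's3://YOUR-S3-BUCKET/error_count';

-- first iteration: 404s read so far land in partition 0
SET kinesis.checkpoint.iteration.no = 0;
INSERT OVERWRITE TABLE error_count PARTITION (iteration = 0)
SELECT agent, COUNT(1) FROM apache_log WHERE status = '404' GROUP BY agent;

-- second iteration: only records that arrived since the last checkpoint
SET kinesis.checkpoint.iteration.no = 1;
INSERT OVERWRITE TABLE error_count PARTITION (iteration = 1)
SELECT agent, COUNT(1) FROM apache_log WHERE status = '404' GROUP BY agent;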

Demo pipeline, current stage: Amazon Kinesis processing state (Log4J → Amazon Kinesis → EMR-Kinesis connector → Hive with Amazon S3 → Amazon Redshift, parallel COPY from Amazon S3).

Download and inspect one of the gzipped result files (substitute YOUR-S3-BUCKET and YOUR-PREFIX.gz).
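A sketch using the AWS CLI (the exact key under the bucket is whatever the Hive step produced):

# copy one result file locally and view it
aws s3 cp s3://YOUR-S3-BUCKET/YOUR-PREFIX.gz .
zcat YOUR-PREFIX.gz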

DataXu

DataXu Records

tx_id: "AFTfN0uAWZ"
exchange: "APPNEXUS"
request_id: "bb656107-3bf7-47a7-8548-8229563e9dc9"
…
adslot: { slot_id: "2686449714718898993", uuid: "9d2403f1-fc6c-4d38-b6b1-839fe4b42455", price_micro_cpm: 661385, currency: "USD", seat_id: "12-914", campaign_id: "C0513n7", creative_id: "R53a537" }
time_stamp: 1415393474434
serviced_by_host: "cr02.us-east-01"

Confirmation Record

69.120.26.172 - - [08/Nov/2014:21:59:54 -0500] "GET /rs?id=fc6f2106175a43df8ae4f3b7e6fa8c37&t=marketing&cbust=1415502000191662 HTTP/1.1" 302 - "http://ads-by.madadsmedia.com/tags/25628/10217/iframe/728x90.html" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)" "wfivefivec=c876d00e-1831-4eba-b78d-cd99188e951a" "OWW=-"

Fraud Record

[Diagram: DataXu pipeline on AWS. Producers (CDN, real-time bidding, retargeting platform) → Aggregator (Amazon Kinesis) → Continuous Processing (real-time apps, KCL apps, Archiver with Event Replay) → Storage (Amazon S3, Amazon Redshift) → Analytics (Qubole, reporting)]

Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting

https://github.com/awslabs/kinesis-log4j-appender

Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting

Amazon Kinesis storage is replicated across Availability Zones.

• Durable, highly consistent storage replicates data across three data centers (Availability Zones)
• Aggregate and archive to S3
• Millions of sources producing 100s of terabytes per hour
• Front end handles authentication and authorization
• Ordered stream of events supports multiple readers
• Real-time dashboards and alarms
• Machine learning algorithms or sliding-window analytics
• Aggregate analysis in Hadoop or a data warehouse

Inexpensive: $0.028 per million puts

[Chart: 1 KB messages per second (0 to 1,200,000) vs. number of shards (0 to 1,100)]

TCO for an average of 1M events/second:

• with 50:1 packing and 10:1 compression: $6,351/month
• raw: $28,610/month

Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting

Amazon Kinesis

[Diagram: fault-tolerant consumption of a shard (Shard-i) holding sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23. A lease table (Shard ID, Lock, Seq num) records which worker (Host A or Host B) owns Shard-i and the last checkpointed sequence number; a second table (Shard ID, Last Archived) tracks archive progress. When the owning host fails mid-shard, the other host takes the lock and resumes from the checkpoint, e.g. re-reading {Event 10, …} and continuing through 14, 17, 18, 21, 23.]


Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting

• Unordered processing
  – Randomize the partition key to distribute events over many shards, and use multiple workers
• Exact-order processing
  – Control the partition key to ensure events are grouped onto the same shard and read by the same worker
• Need both? Get a global sequence number (see the sketch after the table below)

[Diagram: the producer obtains a global sequence number, then routes each event by its metadata to an unordered stream, a campaign-centric stream, or a fraud-inspection stream]

Id   Event          Stream – partition key
1    confirmation   Campaign-centric stream – UUID
2    fraud          Unordered stream; fraud-inspection stream – session id
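The routing above comes down to how the partition key is chosen on each put; a sketch with the AWS CLI (stream names and payloads are illustrative; on CLI v1 the --data string is base64-encoded automatically):

# unordered: a random partition key spreads events across many shards
aws kinesis put-record --stream-name UnorderedStream \
    --partition-key "$(uuidgen)" --data '{"event":"fraud"}'

# exact order per campaign: a stable key pins related events to one shard,
# so a single worker reads them in sequence
aws kinesis put-record --stream-name CampaignCentricStream \
    --partition-key "CAMPAIGN-UUID" --data '{"event":"confirmation"}'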

Sending: HTTP POST, AWS SDK, LOG4J, Flume, Fluentd

Reading: Get* APIs, Apache Storm, Amazon Elastic MapReduce

[Diagram: the Archiver persists the stream to Amazon S3, and Amazon EMR plays the archive back for reprocessing]

Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting

http://bit.ly/aws-bdt205

General Purpose: M1, M3 (and T2)

Compute Optimized: C1, CC2, C3, C4

Memory Optimized: M2, CR1, R3

Storage Optimized: HI1, HS1, I2

GPU: CG1, G2

Micro: T1, T2

[Timeline: growth of the Amazon EC2 instance family, 2006 to December 2014]

2006: m1.small
2007: adds m1.large, m1.xlarge
2008: adds c1.medium, c1.xlarge
2009: adds m2.2xlarge, m2.4xlarge
2010: adds m2.xlarge, t1.micro, cg1.4xlarge, cc1.4xlarge
2011: adds cc2.8xlarge
2012-2013: adds m1.medium, m3.xlarge, m3.2xlarge, hi1.4xlarge, hs1.8xlarge, cr1.8xlarge, c3.large, c3.xlarge, c3.2xlarge, c3.4xlarge, c3.8xlarge
December 2014, new: g2.2xlarge, m3.medium, m3.large, i2.large, i2.xlarge, i2.4xlarge, i2.8xlarge, r3.large, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge, t2.micro, t2.small, t2.medium
Introducing now: c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge

The next generation of Amazon EC2 compute-optimized instances:

• Based on Intel Xeon E5-2666 v3 (Haswell) processors
• 2.9 GHz base clock, peaking at 3.5 GHz with Turbo Boost

Ideal for running tier-1 applications, gaming and web servers, transcoding, and high performance computing workloads.

EBS-optimized by default, at no additional cost!

Instance Name   vCPU Count   RAM        Network Performance
c4.large        2            3.75 GiB   Moderate
c4.xlarge       4            7.5 GiB    Moderate
c4.2xlarge      8            15 GiB     High
c4.4xlarge      16           30 GiB     High
c4.8xlarge      36           60 GiB     10 Gbps

Preliminary specifications; may change prior to release.


Increases to the performance and capacity of General Purpose (SSD) and Provisioned IOPS (SSD) volumes:

EBS Volume Type                     Capacity               IOPS                          Throughput
Amazon EBS General Purpose (SSD)    16 TB (up from 1 TB)   10,000 IOPS (up from 3,000)   160 MBps *
Amazon EBS Provisioned IOPS (SSD)   16 TB (up from 1 TB)   20,000 IOPS (up from 4,000)   320 MBps *

* When attached to EBS-optimized instances
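These limits are what you request at volume-creation time; a sketch with the AWS CLI (the Availability Zone placeholder is illustrative; gp2 and io1 are the API names for the two SSD volume types, and --size is in GiB, so 16 TB = 16384):

# General Purpose (SSD) volume at the new 16 TB maximum
aws ec2 create-volume --volume-type gp2 --size 16384 \
    --availability-zone YOUR-AZ

# Provisioned IOPS (SSD) volume at the new 20,000 IOPS maximum
aws ec2 create-volume --volume-type io1 --iops 20000 --size 16384 \
    --availability-zone YOUR-AZ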
