BDT303: Data Science with Elastic MapReduce - AWS re:Invent 2012

Post on 05-Dec-2014

DESCRIPTION

In this talk, we dive into the Netflix Data Science & Engineering architecture. Not just the what, but also the why. Some key topics include the big data technologies we leverage (Cassandra, Hadoop, Pig + Python, and Hive), our use of Amazon S3 as our central data hub, our use of multiple persistent Amazon Elastic MapReduce (EMR) clusters, how we leverage the elasticity of AWS, our data science as a service approach, how we make our hybrid AWS / data center setup work well, and more.

TRANSCRIPT

What is Netflix’s data warehouse?

a) Cassandra

b) Teradata

c) Hive

d) S3

DSE Platform

S3

Chukwa

Aegisthus

Sting

What is Netflix’s data warehouse?

a) Cassandra

b) Teradata

c) Hive

d) S3

S3

99.999999999% durability
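
To put the 99.999999999% figure on the S3 slides in concrete terms, here is a small arithmetic sketch. It assumes the figure is S3's designed annual durability per object, which the slide itself does not spell out, and it treats object losses as independent. The function names are illustrative only.

```python
from fractions import Fraction

# Assumption: 99.999999999% is the designed annual durability of one object.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)
ANNUAL_LOSS_PROBABILITY = 1 - DURABILITY  # chance of losing a given object in a year


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year, assuming independent losses."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


if __name__ == "__main__":
    stored_objects = 10_000_000  # hypothetical object count, not from the talk
    print(float(expected_annual_losses(stored_objects)))
    print(int(years_per_expected_loss(stored_objects)))
```

Exact fractions are used instead of floats so that the eleven nines are not blurred by rounding.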

High SLA

Query

HDFS ?

“Data Science as a Service”

• Execution Service / Genie

• Event Service

• Metadata Service

[Diagram, built up over several slides: Query Cluster Job, High SLA Cluster Job, and Super SLA Cluster Job, each shown with its own cluster (Query, High SLA, Super SLA), and S3 alongside them.]
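
The slides above pair each job type with a cluster of the same name, with S3 next to all of them. A minimal sketch of that pairing follows. It is not Genie's API or Netflix's implementation: `Job`, `CLUSTERS`, `route`, the tier keys, and the requirement that job data live under `s3://` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Cluster names are taken from the slides; the tier keys are hypothetical.
CLUSTERS = {
    "query": "Query",
    "high_sla": "High SLA",
    "super_sla": "Super SLA",
}


@dataclass(frozen=True)
class Job:
    name: str
    tier: str        # one of the keys in CLUSTERS
    input_uri: str   # assumed to live in S3, the shared data hub
    output_uri: str


def route(job: Job) -> str:
    """Return the name of the cluster that should run the job."""
    if job.tier not in CLUSTERS:
        raise ValueError(f"unknown tier: {job.tier!r}")
    # Assumption: clusters share no local state, so all data must be in S3.
    for uri in (job.input_uri, job.output_uri):
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an s3:// URI, got {uri!r}")
    return CLUSTERS[job.tier]


if __name__ == "__main__":
    adhoc = Job("adhoc-report", "query",
                "s3://example-bucket/in", "s3://example-bucket/out")
    print(route(adhoc))
```

Keeping inputs and outputs in S3 rather than on a cluster's own HDFS is what would let a job be sent to any cluster in this sketch, or to a newly added one.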

Questions?

http://jobs.netflix.com

kurtbrown@netflix.com
