TRANSCRIPT
AWS Gov Cloud Summit II
Analyzing Big Data with AWS
Peter Sirota, General Manager, Amazon Elastic MapReduce @petersirota
• Computer-generated data
  – Application server logs (web sites, games)
  – Sensor data (weather, water, smart grids)
  – Images/videos (traffic, security cameras)
• Human-generated data
  – Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)
  – Blogs/Reviews/Emails/Pictures
• Social graphs – Facebook, LinkedIn, contacts
Why is Big Data Hard (and Getting Harder)?
• Data Volume
  – Unconstrained growth
  – Current systems don’t scale
• Data Structure
  – Need to consolidate data from multiple data sources in multiple formats across multiple businesses
• Changing Data Requirements
  – Faster response times on fresher data
  – Sampling is not good enough
  – Increasing complexity of analytics
  – Users demand inexpensive experimentation
Innovation #1: Apache Hadoop
• The MapReduce computational paradigm
• Open source, scalable, fault-tolerant, distributed system
• Hadoop lowers the cost of developing a distributed system for data processing
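To make the MapReduce paradigm concrete, here is a minimal Hadoop Streaming style word count in Python. This is a generic sketch rather than code from the talk; the script name and the map/reduce mode argument are illustrative.

#!/usr/bin/env python
"""wordcount.py - a minimal Hadoop Streaming word count (illustrative sketch)."""
import sys

def run_mapper():
    # Map: emit one (word, 1) pair per word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word.lower())

def run_reducer():
    # Reduce: stdin arrives sorted by key, so counts for a word are contiguous.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word and current_word is not None:
            print("%s\t%d" % (current_word, count))
            count = 0
        current_word = word
        count += int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

if __name__ == "__main__":
    run_reducer() if "reduce" in sys.argv[1:] else run_mapper()

With Hadoop Streaming, scripts like these are typically supplied via the -mapper and -reducer options and run over -input/-output paths in HDFS or S3.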
Innovation #2: Amazon Elastic Compute Cloud (EC2)
• “Provides resizable compute capacity in the cloud.”
• Amazon EC2 lowers the cost of operating a distributed system for data processing
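As a sketch of what “resizable compute capacity” looks like programmatically, the following uses the boto3 EC2 API (not shown in the talk) to launch and later terminate a batch of instances; the AMI ID, instance type, and region are placeholders.

import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")  # region is a placeholder

# Launch 10 virtual servers on demand (AMI ID and instance type are placeholders).
instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",
    InstanceType="m5.large",
    MinCount=10,
    MaxCount=10,
)

# ... run the distributed workload ...

# Release the capacity when finished; you pay only for the time the instances ran.
for instance in instances:
    instance.terminate()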
Elastic MapReduce applications
• Targeted advertising / Clickstream analysis
• Security: anti-virus, fraud detection, image recognition
• Pattern matching / Recommendations
• Data warehousing / BI
• Bioinformatics (genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resizing JPEGs, video encoding)
• Web indexing
Clickstream Analysis (Razorfish)
• Big Box Retailer came to Razorfish with:
  – 3.5 billion records
  – 71 million unique cookies
  – 1.7 million targeted ads required per day
• Problem: improve Return on Ad Spend (ROAS)
Clickstream Analysis (Razorfish)
[Diagram: a targeted ad served to a user who recently purchased a sports movie and is searching for video games; 1.7 million such ads per day]
Clickstream Analysis (Razorfish)
• Lots of experimentation, but the final design: a 100-node on-demand Elastic MapReduce cluster running Hadoop
• Processing time dropped from 2+ days to 8 hours (with lots more data)
Etsy
• World’s largest handmade marketplace
  – 8.9 million items
  – 1 billion page views per month
  – $320MM 2010 GMS
• Easy to ‘backfill’ and run experiments: just boot up a cluster with 100, 500, or 1,000 nodes
[Diagram: production DB snapshots and web event logs flow through ETL Step 1 and ETL Step 2, fanning out into multiple jobs]
Recommendations: The Taste Test (http://www.etsy.com/tastetest)
Yelp’s Business Generates a Lot of Data
• 400 GB of logs per day (~12 terabytes per month)
1) Load log file data for six months of user search history into Amazon S3

   Search ID   Search Text   Final Selection
   12423451    westen        Westin
   14235235    wisten        Westin
   54332232    westenn       Westin
2) Spin up a 200-node Hadoop cluster of virtual servers in the cloud with Amazon EMR
3) The 200 nodes simultaneously analyze this data looking for common misspellings; this takes a few hours
4) New common misspellings and suggestions are loaded back into Amazon S3
5) When the job is done, the cluster is shut down; Yelp only pays for the time they used
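A minimal sketch of steps 2 through 5 using the legacy boto (2.x) EMR API: start a 200-node cluster, run one streaming step over the search logs, and let EMR shut the cluster down when the step completes. The bucket names, script paths, and instance types are hypothetical.

from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()  # reads AWS credentials from the environment / boto config

# One streaming step: mapper/reducer scripts and log data live in S3 (paths hypothetical).
step = StreamingStep(
    name="Find common misspellings",
    mapper="s3n://example-bucket/scripts/misspelling_mapper.py",
    reducer="s3n://example-bucket/scripts/misspelling_reducer.py",
    input="s3n://example-bucket/search-logs/",
    output="s3n://example-bucket/spelling-suggestions/",
)

# Spin up a 200-node cluster; with keep_alive=False it terminates after the step,
# so you only pay for the hours the job actually ran.
jobflow_id = conn.run_jobflow(
    name="search-spelling-analysis",
    log_uri="s3n://example-bucket/emr-logs/",
    steps=[step],
    num_instances=200,
    master_instance_type="m1.large",
    slave_instance_type="m1.large",
    keep_alive=False,
)
print(jobflow_id)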
Each of their 80 developers can do this whenever they have a big data problem to analyze; 250 clusters are spun up and down every week.
Data size
• Global reach
• Native app for almost every smartphone, SMS, web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data
Stack
• Application stack: Scala/Liftweb (API machines, WWW machines, batch jobs); Scala application code; Mongo/Postgres/flat files (databases, logs)
• Data stack: database dumps and log files land in Amazon S3 (via mongoexport, postgres dump, and Flume); Hadoop / Elastic MapReduce runs the MapReduce jobs; Hive/Ruby/Mahout drive analytics and the dashboard
Computing venue-to-venue similarity
• Spin up a 40-node cluster
• Submit Ruby streaming job
– Invert User x Venue matrix
– Grab Co-occurrences
– Compute similarity
• Spin down cluster
• Load data to app server
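A toy, in-memory Python sketch of the computation in this list (the production job was a Ruby streaming job on the cluster); the check-in data is made up, and the cosine-style normalization is an assumption, since the talk does not specify the similarity measure.

from collections import defaultdict
from itertools import combinations
from math import sqrt

# Check-ins as (user, venue) pairs -- toy data.
checkins = [("u1", "coffee"), ("u1", "papaya"), ("u2", "coffee"),
            ("u2", "papaya"), ("u2", "gelato"), ("u3", "coffee")]

# 1) Invert the User x Venue matrix: user -> set of venues visited.
venues_by_user = defaultdict(set)
for user, venue in checkins:
    venues_by_user[user].add(venue)

# 2) Grab co-occurrences: count users who checked in at both venues.
cooc = defaultdict(int)    # (venue_a, venue_b) -> co-occurrence count
visits = defaultdict(int)  # venue -> number of distinct users
for venues in venues_by_user.values():
    for v in venues:
        visits[v] += 1
    for a, b in combinations(sorted(venues), 2):
        cooc[(a, b)] += 1

# 3) Compute similarity (cosine-style normalization -- an assumption).
similarity = {pair: count / sqrt(visits[pair[0]] * visits[pair[1]])
              for pair, count in cooc.items()}

for (a, b), score in sorted(similarity.items(), key=lambda kv: -kv[1]):
    print(f"{a} ~ {b}: {score:.2f}")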
Who is checking in?
[Charts: check-in share by gender (female vs. male) and check-in distribution by age, roughly 0 to 80]
When do people go to a place?
[Charts: check-in patterns for Gorilla Coffee, Gray's Papaya, and Amorino across Thursday, Friday, Saturday, and Sunday]
Why are people checking in?
• Explore their city, discover new places
• Find friends, meet up
• Save with local deals
• Get insider tips on venues
• Personal analytics, diary
• Follow brands and celebrities
• Earn points, badges, gamification of life
• The list grows…
RDBMS vs. MapReduce/Hadoop
• RDBMS
  – Predefined schema
  – Strategic data placement for query tuning
  – Exploits indexes for fast retrieval
  – SQL only
  – Doesn’t scale linearly
• MapReduce/Hadoop
  – No schema required
  – Random data placement
  – Fast scan of the entire dataset
  – Uniform query performance
  – Scales linearly for reads and writes
  – Supports many languages, including SQL
• Complementary technologies
Elastic Data Warehouse
• Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)
• Reduce costs by increasing server utilization
• Improve performance during high-usage periods
[Diagram: the data warehouse runs at steady state, expands to 25 instances for batch processing, then shrinks back to 9 instances at steady state]
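One way to script the expand/shrink pattern above; this sketch uses the current boto3 EMR API (which postdates the talk), and the cluster ID is a placeholder. Note that gracefully shrinking a CORE instance group requires a reasonably recent EMR release; elastic capacity is often added and removed as a TASK group instead.

import boto3

emr = boto3.client("emr", region_name="us-east-1")
CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder for the long-running warehouse cluster

def resize_core_group(cluster_id, target_count):
    """Resize the cluster's CORE instance group to target_count instances."""
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    core = next(g for g in groups if g["InstanceGroupType"] == "CORE")
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": target_count}],
    )

resize_core_group(CLUSTER_ID, 25)  # expand for overnight batch processing
# ... batch jobs run ...
resize_core_group(CLUSTER_ID, 9)   # shrink back to steady state for daytime queries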
Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption.

Scenario #1: cost without Spot
• Job flow duration: 14 hours
• 4 instances * 14 hrs * $0.50 = $28

Scenario #2: cost with Spot
• Job flow duration: 7 hours
• 4 On-Demand instances * 7 hrs * $0.50 = $14
• 5 Spot instances * 7 hrs * $0.25 = $8.75
• Total = $22.75

Time savings: 50%; cost savings: ~19%

Other EMR + Spot use cases:
• Run the entire cluster on Spot for the biggest cost savings
• Reduce the cost of application testing
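The comparison above is simple arithmetic; a small helper reproduces both scenarios using the per-hour prices from the slide.

def jobflow_cost(hours, on_demand=0, spot=0, od_price=0.50, spot_price=0.25):
    """Total cost of a job flow mixing On-Demand and Spot instances."""
    return hours * (on_demand * od_price + spot * spot_price)

print(jobflow_cost(14, on_demand=4))          # Scenario #1: 28.0
print(jobflow_cost(7, on_demand=4, spot=5))   # Scenario #2: 22.75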
Big Data Ecosystem and Tools
We have a rapidly growing ecosystem:
• Business Intelligence – MicroStrategy, Pentaho
• Analytics – Datameer, Karmasphere, Quest
• Open source – Ganglia, SQuirreL SQL