TRANSCRIPT
AWS Gov Cloud Summit II
Analyzing Big Data with AWS
Peter Sirota, General Manager, Amazon Elastic MapReduce @petersirota
• Computer-generated data
  – Application server logs (web sites, games)
  – Sensor data (weather, water, smart grids)
  – Images/videos (traffic, security cameras)
• Human-generated data
  – Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)
  – Blogs/Reviews/Emails/Pictures
• Social graphs – Facebook, LinkedIn, contacts
Why is Big Data Hard (and Getting Harder)?
• Data Volume
  – Unconstrained growth
  – Current systems don’t scale
• Data Structure
  – Need to consolidate data from multiple data sources in multiple formats across multiple businesses
• Changing Data Requirements
  – Faster response times on fresher data
  – Sampling is not good enough
  – Increasing complexity of analytics
  – Users demand inexpensive experimentation
Innovation #1: Apache Hadoop
• The MapReduce computational paradigm
• Open source, scalable, fault-tolerant, distributed system
• Hadoop lowers the cost of developing a distributed system for data processing
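To make the MapReduce paradigm concrete, here is a minimal Hadoop Streaming style word count in Python. This is a generic sketch rather than code from the talk; the script name and the map/reduce mode argument are illustrative.

#!/usr/bin/env python
"""wordcount.py - a minimal Hadoop Streaming word count (illustrative sketch)."""
import sys

def run_mapper():
    # Map: emit one (word, 1) pair per word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word.lower())

def run_reducer():
    # Reduce: stdin arrives sorted by key, so counts for a word are contiguous.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word and current_word is not None:
            print("%s\t%d" % (current_word, count))
            count = 0
        current_word = word
        count += int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

if __name__ == "__main__":
    run_reducer() if "reduce" in sys.argv[1:] else run_mapper()

With Hadoop Streaming, scripts like these are typically supplied via the -mapper and -reducer options and run over -input/-output paths in HDFS or S3.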
Innovation #2: Amazon Elastic Compute Cloud (EC2)
• “Provides resizable compute capacity in the cloud.”
• Amazon EC2 lowers the cost of operating a distributed system for data processing
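As a sketch of what “resizable compute capacity” looks like programmatically, the following uses the boto3 EC2 API (not shown in the talk) to launch and later terminate a batch of instances; the AMI ID, instance type, and region are placeholders.

import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")  # region is a placeholder

# Launch 10 virtual servers on demand (AMI ID and instance type are placeholders).
instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",
    InstanceType="m5.large",
    MinCount=10,
    MaxCount=10,
)

# ... run the distributed workload ...

# Release the capacity when finished; you pay only for the time the instances ran.
for instance in instances:
    instance.terminate()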
Elastic MapReduce applications
• Targeted advertising / Clickstream analysis
• Security: anti-virus, fraud detection, image recognition
• Pattern matching / Recommendations
• Data warehousing / BI
• Bioinformatics (genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resizing JPEGs, video encoding)
• Web indexing
Clickstream Analysis (Razorfish)
• Big Box Retailer came to Razorfish with:
  – 3.5 billion records
  – 71 million unique cookies
  – 1.7 million targeted ads required per day
• Problem: improve Return on Ad Spend (ROAS)
Clickstream Analysis (Razorfish)
[Diagram: a targeted ad served to a user who recently purchased a sports movie and is searching for video games; 1.7 million such ads per day]
Clickstream Analysis (Razorfish)
• Lots of experimentation, but the final design: a 100-node on-demand Elastic MapReduce cluster running Hadoop
• Processing time dropped from 2+ days to 8 hours (with lots more data)
Etsy
• World’s largest handmade marketplace
  – 8.9 million items
  – 1 billion page views per month
  – $320MM 2010 GMS
• Easy to ‘backfill’ and run experiments: just boot up a cluster with 100, 500, or 1,000 nodes
[Diagram: production DB snapshots and web event logs flow through ETL Step 1 and ETL Step 2, fanning out into multiple jobs]
Recommendations: The Taste Test (http://www.etsy.com/tastetest)
Yelp’s Business Generates a Lot of Data
• 400 GB of logs per day (~12 terabytes per month)
1) Load log file data for six months of user search history into Amazon S3

   Search ID   Search Text   Final Selection
   12423451    westen        Westin
   14235235    wisten        Westin
   54332232    westenn       Westin
2) Spin up a 200-node Hadoop cluster of virtual servers in the cloud with Amazon EMR
3) The 200 nodes simultaneously analyze this data looking for common misspellings; this takes a few hours
4) New common misspellings and suggestions are loaded back into Amazon S3
5) When the job is done, the cluster is shut down; Yelp only pays for the time they used
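A minimal sketch of steps 2 through 5 using the legacy boto (2.x) EMR API: start a 200-node cluster, run one streaming step over the search logs, and let EMR shut the cluster down when the step completes. The bucket names, script paths, and instance types are hypothetical.

from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()  # reads AWS credentials from the environment / boto config

# One streaming step: mapper/reducer scripts and log data live in S3 (paths hypothetical).
step = StreamingStep(
    name="Find common misspellings",
    mapper="s3n://example-bucket/scripts/misspelling_mapper.py",
    reducer="s3n://example-bucket/scripts/misspelling_reducer.py",
    input="s3n://example-bucket/search-logs/",
    output="s3n://example-bucket/spelling-suggestions/",
)

# Spin up a 200-node cluster; with keep_alive=False it terminates after the step,
# so you only pay for the hours the job actually ran.
jobflow_id = conn.run_jobflow(
    name="search-spelling-analysis",
    log_uri="s3n://example-bucket/emr-logs/",
    steps=[step],
    num_instances=200,
    master_instance_type="m1.large",
    slave_instance_type="m1.large",
    keep_alive=False,
)
print(jobflow_id)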
Each of their 80 developers can do this whenever they have a big data problem to analyze; 250 clusters are spun up and down every week.
Data size
• Global reach
• Native app for almost every smartphone, SMS, web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data
Stack
• Application stack: Scala/Liftweb (API machines, WWW machines, batch jobs); Scala application code; Mongo/Postgres/flat files (databases, logs)
• Data stack: database dumps and log files land in Amazon S3 (via mongoexport, postgres dump, and Flume); Hadoop / Elastic MapReduce runs the MapReduce jobs; Hive/Ruby/Mahout drive analytics and the dashboard
Computing venue-to-venue similarity
• Spin up a 40-node cluster
• Submit Ruby streaming job
– Invert User x Venue matrix
– Grab Co-occurrences
– Compute similarity
• Spin down cluster
• Load data to app server
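A toy, in-memory Python sketch of the computation in this list (the production job was a Ruby streaming job on the cluster); the check-in data is made up, and the cosine-style normalization is an assumption, since the talk does not specify the similarity measure.

from collections import defaultdict
from itertools import combinations
from math import sqrt

# Check-ins as (user, venue) pairs -- toy data.
checkins = [("u1", "coffee"), ("u1", "papaya"), ("u2", "coffee"),
            ("u2", "papaya"), ("u2", "gelato"), ("u3", "coffee")]

# 1) Invert the User x Venue matrix: user -> set of venues visited.
venues_by_user = defaultdict(set)
for user, venue in checkins:
    venues_by_user[user].add(venue)

# 2) Grab co-occurrences: count users who checked in at both venues.
cooc = defaultdict(int)    # (venue_a, venue_b) -> co-occurrence count
visits = defaultdict(int)  # venue -> number of distinct users
for venues in venues_by_user.values():
    for v in venues:
        visits[v] += 1
    for a, b in combinations(sorted(venues), 2):
        cooc[(a, b)] += 1

# 3) Compute similarity (cosine-style normalization -- an assumption).
similarity = {pair: count / sqrt(visits[pair[0]] * visits[pair[1]])
              for pair, count in cooc.items()}

for (a, b), score in sorted(similarity.items(), key=lambda kv: -kv[1]):
    print(f"{a} ~ {b}: {score:.2f}")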
Who is checking in?
[Charts: check-in share by gender (female vs. male) and check-in distribution by age, roughly 0 to 80]
When do people go to a place?
[Charts: check-in patterns for Gorilla Coffee, Gray's Papaya, and Amorino across Thursday, Friday, Saturday, and Sunday]
Why are people checking in?
• Explore their city, discover new places
• Find friends, meet up
• Save with local deals
• Get insider tips on venues
• Personal analytics, diary
• Follow brands and celebrities
• Earn points, badges, gamification of life
• The list grows…
RDBMS vs. MapReduce/Hadoop
• RDBMS
  – Predefined schema
  – Strategic data placement for query tuning
  – Exploits indexes for fast retrieval
  – SQL only
  – Doesn’t scale linearly
• MapReduce/Hadoop
  – No schema required
  – Random data placement
  – Fast scan of the entire dataset
  – Uniform query performance
  – Scales linearly for reads and writes
  – Supports many languages, including SQL
• Complementary technologies
Elastic Data Warehouse
• Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)
• Reduce costs by increasing server utilization
• Improve performance during high-usage periods
[Diagram: the data warehouse runs at steady state, expands to 25 instances for batch processing, then shrinks back to 9 instances at steady state]
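One way to script the expand/shrink pattern above; this sketch uses the current boto3 EMR API (which postdates the talk), and the cluster ID is a placeholder. Note that gracefully shrinking a CORE instance group requires a reasonably recent EMR release; elastic capacity is often added and removed as a TASK group instead.

import boto3

emr = boto3.client("emr", region_name="us-east-1")
CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder for the long-running warehouse cluster

def resize_core_group(cluster_id, target_count):
    """Resize the cluster's CORE instance group to target_count instances."""
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    core = next(g for g in groups if g["InstanceGroupType"] == "CORE")
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": target_count}],
    )

resize_core_group(CLUSTER_ID, 25)  # expand for overnight batch processing
# ... batch jobs run ...
resize_core_group(CLUSTER_ID, 9)   # shrink back to steady state for daytime queries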
Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption.

Scenario #1: cost without Spot
• Job flow duration: 14 hours
• 4 instances * 14 hrs * $0.50 = $28

Scenario #2: cost with Spot
• Job flow duration: 7 hours
• 4 On-Demand instances * 7 hrs * $0.50 = $14
• 5 Spot instances * 7 hrs * $0.25 = $8.75
• Total = $22.75

Time savings: 50%; cost savings: ~19%

Other EMR + Spot use cases:
• Run the entire cluster on Spot for the biggest cost savings
• Reduce the cost of application testing
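The comparison above is simple arithmetic; a small helper reproduces both scenarios using the per-hour prices from the slide.

def jobflow_cost(hours, on_demand=0, spot=0, od_price=0.50, spot_price=0.25):
    """Total cost of a job flow mixing On-Demand and Spot instances."""
    return hours * (on_demand * od_price + spot * spot_price)

print(jobflow_cost(14, on_demand=4))          # Scenario #1: 28.0
print(jobflow_cost(7, on_demand=4, spot=5))   # Scenario #2: 22.75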
Big Data Ecosystem and Tools
We have a rapidly growing ecosystem:
• Business Intelligence – MicroStrategy, Pentaho
• Analytics – Datameer, Karmasphere, Quest
• Open source – Ganglia, SQuirreL SQL