November 12, 2014 | Las Vegas, NV
Ben Butler, Sr. Solutions Marketing Mgr., Big Data and HPC
Big data on AWS
Big data customer success stories
HPC on AWS
HPC Customer Presentation: Honda
AWS resources to get started
Big data and HPC track review: where to go next
Generation
Collection and storage
Analytics and computation
Collaboration and sharing
• IT/Application server logs
• Websites/Mobile apps/Ads
• Sensor data/IoT
• Social media, user content
[Figure: data volumes growing from GB and TB toward PB, EB, and ZB]
[Figure: the data lifecycle (generation; collection and storage; analytics and computation; collaboration and sharing), with generation, collection, and storage now lower cost and higher throughput, while the downstream stages remain highly constrained]
What is Big Data?
Technologies and techniques for working productively with data, at any scale: how to collect, store, organize, analyze, and share it.
Big Data, accelerated:
Generation
Collection and storage
Analytics and computation
Collaboration and sharing
Big Data on the AWS Cloud
Collect/Ingest: Amazon Kinesis, AWS Import/Export, AWS Direct Connect
Store: Amazon S3, Amazon Glacier, Amazon DynamoDB
Analyze: Amazon Elastic MapReduce, Amazon EC2, Amazon Kinesis, Amazon Redshift
Share: Amazon S3, Amazon Redshift, AWS Data Pipeline
Amazon Kinesis
Real-time processing
High throughput; elastic
Easy to use
Integrates with Amazon EMR, S3, Redshift, and DynamoDB
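A minimal sketch of what producing into Kinesis looks like with boto3; the region, stream name, and event fields below are hypothetical, not from the deck.

```python
# Hypothetical sketch: write one event to an Amazon Kinesis stream with boto3.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "1234", "action": "page_view", "page": "/home"}

# PutRecord routes the record to a shard by hashing the partition key,
# which is how Kinesis spreads high-throughput streams across shards.
kinesis.put_record(
    StreamName="clickstream",          # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```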
Amazon S3
Store anything
Object storage
Scalable
99.999999999% durability
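Storing an object is a single call; in this sketch the file, bucket, and key names are hypothetical.

```python
# Hypothetical sketch: store an object in Amazon S3 with boto3.
import boto3

s3 = boto3.client("s3")

# Uploads a local log file; S3 stores objects redundantly, which is the
# basis of the 99.999999999% durability figure above.
s3.upload_file("app.log", "my-analytics-bucket", "logs/2014/11/12/app.log")
```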
Amazon DynamoDB
NoSQL database
Seamless scalability
Zero admin
Single-digit millisecond latency
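A minimal write/read pair with boto3 illustrates the zero-admin point; the table name and key schema here are made up.

```python
# Hypothetical sketch: a zero-admin write/read pair against DynamoDB.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("UserEvents")  # hypothetical table and key schema

# Writes scale seamlessly; there are no servers or storage to manage.
table.put_item(Item={"user_id": "1234", "ts": 1415750400, "action": "login"})

# Key-value reads return with single-digit-millisecond latency at any scale.
item = table.get_item(Key={"user_id": "1234", "ts": 1415750400})["Item"]
print(item)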
Amazon Redshift
Relational data warehouse
Massively parallel
Petabyte scale
Fully managed
$1,000/TB/year
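Because Redshift speaks the PostgreSQL wire protocol, a stock Python driver works as the client; this sketch assumes a hypothetical cluster endpoint, table, and credentials.

```python
# Hypothetical sketch: bulk-load S3 data into Redshift and run a query.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="REDACTED",
)
cur = conn.cursor()

# COPY loads files from S3 in parallel across all compute nodes,
# which is where the "massively parallel" claim shows up in practice.
cur.execute("""
    COPY events
    FROM 's3://my-analytics-bucket/logs/'
    CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    GZIP DELIMITER '\t';
""")
cur.execute("SELECT action, COUNT(*) FROM events GROUP BY action;")
print(cur.fetchall())
conn.commit()
```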
Amazon Elastic MapReduce
Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
Scales to thousands of nodes
Extending the corporate data center with an elastic data center on AWS:
1. Application data and logs for analysis are pushed to Amazon S3.
2. An Amazon Elastic MapReduce name node is started to control the analysis.
3. Elastic MapReduce starts the Hadoop cluster.
4. Hundreds to thousands of nodes can be added as needed.
5. The cluster is disposed of when the job completes.
6. Results of the analysis are pulled back into your systems.
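A minimal boto3 sketch of the six steps above: launch an EMR cluster, run one Hadoop step over data in S3, and let EMR tear the cluster down when the job completes. The job name, bucket paths, instance types, and count are hypothetical.

```python
# Hypothetical sketch of the EMR workflow above with boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="log-analysis",
    AmiVersion="3.3.0",                       # a 2014-era EMR release
    Instances={
        "MasterInstanceType": "m3.xlarge",    # the name node controlling analysis
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 100,                 # scale to hundreds or thousands
        "KeepJobFlowAliveWhenNoSteps": False, # dispose of cluster when job completes
    },
    Steps=[{
        "Name": "aggregate-logs",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "s3://my-analytics-bucket/jobs/aggregate.jar",
            "Args": ["s3://my-analytics-bucket/logs/",
                     "s3://my-analytics-bucket/results/"],
        },
    }],
)
print(response["JobFlowId"])  # results land back in S3 for your systems to pull
```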
Databricks sets new large-scale sort record with AWS
● Databricks, founded by the creators of Apache Spark
● Why AWS? EC2 gives fast access to large compute, SSDs, and a 10 Gb/s network
● Agility
Mobile / Cable / Telecom
Oil and Gas
Industrial Manufacturing
Retail / Consumer
Entertainment
Hospitality
Life Sciences
Scientific Exploration
Financial Services
Publishing / Media / Advertising
Online Media / Social Networks
Gaming
Sling uses AWS to Store and Analyze Terabytes of Data

“By using AWS, we can make decisions about new features and offers very quickly and very easily.”
Dmitry Dimov, Director, Online Services, Sling Media

• Needed to leverage terabytes of usage data to generate user insights and innovate to capture market share
• Using AWS made it possible for Sling to offer a value-add product to its partners
• Stored terabytes of analytics data
• Enabled near real-time ad hoc analytics
• Capacity to scale the database immediately
“By using Amazon Redshift, we can process petabytes of data from thousands of marketing campaigns simultaneously while reducing operating expenses by 75%.”
Zhong Hong, VP, Infrastructure and Operations, VivaKi
NDN Uses AWS to Serve 600 Million Videos to Worldwide Users

“Using AWS has enabled us to build a solid platform that has scaled quickly while becoming a source of profit for our customers.”
Eric Orme, NDN COO and CTO

• NDN, a global media exchange for publishers and content creators, enables 146 million users a month to see videos online
• Ingested and stored more than 100,000 video titles per month and served 600 million content plays a month
• Uses Amazon Kinesis to analyze over a billion user-generated events and page loads per day
Financial Times Uses AWS to Reduce Infrastructure Costs by 80%

“When our analysts first started to do queries on Amazon Redshift, they thought it was broken because it was working so fast.”
John O’Donovan, CTO, Financial Times

• Needed a way to increase the speed, performance, and flexibility of data analysis at a low cost
• Using AWS enabled the FT to run queries 98% faster than previously, helping the FT make business decisions quickly
• Easier to track and analyze trends
• Reduced infrastructure costs by 80% over the traditional data center model
NTT DOCOMO Delivers Voice Recognition Services to Over 60 Million Customers by Using AWS

“I cannot imagine NTT DOCOMO without the AWS Cloud.”
Minoru Etoh, Senior VP, NTT DOCOMO

• NTT DOCOMO, Inc. is the predominant mobile phone operator in Japan
• DOCOMO launched a popular voice recognition service and experienced large traffic spikes in its mobile network that impacted performance
• DOCOMO decided to migrate its whole environment to AWS last June
• The company built a voice recognition architecture able to scale easily to handle spikes in traffic and serve over 60 million customers
Kellogg Uses AWS to Save $900K Over 5 Years Versus On-premises Infrastructure

“Using AWS saves us $900,000 in infrastructure costs alone, and lets us run dozens of simulations a day so we can reduce trade spend. It’s a win-win.”
Stover McIlwain, Senior Director, IT Infrastructure Engineering

• Needed a better way to track and model promotional costs (“trade spend”) to improve the bottom line, and needed to run more than one trade-spend simulation per day
• By using SAP HANA on AWS, Kellogg estimates it will save $900,000 over 5 years versus traditional on-premises infrastructure alternatives
• The company can also run dozens of trade-spend simulations each day, and has decreased deployment time by 30x
Baylor College of Medicine Uses AWS to Accelerate Analysis and Discovery

“We are able to power ultra-large-scale clinical studies that require computational infrastructure in a secure and compliant environment at a scale not previously possible.”
Omar Serang, Chief Cloud Officer, DNAnexus

• Stores more than 430 TB of genomic result data
• Analyzes the genome sequences of more than 14,000 individuals, 5 times faster than with the previous infrastructure
• Enables more than 200 scientists worldwide to share tools and data quickly
HG Data uses AWS to process billions of documents for BI monthly

“We used Amazon EMR to make running Hadoop clusters easy, and now we can de-dupe 10+ billion documents.”
Victor Moreira, CTO, HG Data

[Architecture: a Hadoop document crawler and a Java document crawler on EC2 pull documents from the Internet; packaging on EC2 lands them in Amazon S3 and a MongoDB cluster on EC2; Hadoop ETL and analytics, an ElasticSearch cluster on EC2, Hadoop analytics, and Java/Python analytics feed MySQL on RDS, which backs the HG API and HG WebApp serving direct clients, enterprise partners, and end users.]
Take a typical big computation task that an average cluster is too small for (or that simply takes too long to complete). Optimization of algorithms can give some leverage and complete the task in hand, and applying a large cluster can sometimes be overkill and too expensive. AWS instance clusters can be balanced to the job in hand: not too large, nor too small, with multiple clusters running at the same time.
Why AWS for HPC?
Low cost with flexible pricing
Efficient clusters
Unlimited infrastructure
Faster time to results
Concurrent clusters on demand
Increased collaboration
Popular HPC workloads on AWS
Transcoding and Encoding
Monte Carlo Simulations
Computational Chemistry
Government and Educational Research
Modeling and Simulation
Genome Processing
Scalability on AWS: scale using elastic capacity
Time +00h: <10 cores
Time +24h: >1,500 cores
Time +72h: <10 cores
Time +120h: >600 cores
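One way to ride a demand curve like the one above is to resize a running EMR cluster's core instance group; in this sketch the cluster's instance-group ID and node count are hypothetical.

```python
# Hypothetical sketch: scale elastic capacity by resizing an EMR instance group.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.modify_instance_groups(
    InstanceGroups=[{
        "InstanceGroupId": "ig-EXAMPLE12345",  # core group of the running cluster
        "InstanceCount": 188,                  # ~1,500 cores at 8 vCPUs per node
    }]
)
```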
Schrödinger and Cycle Computing: computational chemistry

A simulation by Mark Thompson of the University of Southern California to see which of 205,000 organic compounds could be used in photovoltaic cells for solar panel material. An estimated 264 years of computation was completed in 18 hours.

• 156,314-core cluster across 8 regions
• 1.21 petaflops (Rpeak)
• $33,000 total, or 16¢ per molecule
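The headline numbers hold up to back-of-the-envelope arithmetic, using only the slide's own figures:

```python
# Sanity check of the slide's figures.
speedup = 264 * 365 * 24 / 18   # 264 years of compute in 18 wall-clock hours
print(round(speedup))           # ~128,480x, plausible for a 156,314-core
                                # cluster that is not at peak the whole run
print(33_000 / 205_000)         # ~$0.161 per molecule, i.e. the quoted 16 cents
```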
Cost Benefits of HPC in the Cloud
AWS: pay-as-you-go model; use only what you need; multiple pricing models
On-premises: capital expense model; high upfront capital cost; high cost of ongoing support
Many pricing models to support different workloads:

Free Tier: get started on AWS with free usage and no commitment. For POCs and getting started.
On-Demand: pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs.
Reserved: make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.
Spot: bid for unused capacity, charged at a Spot price that fluctuates based on supply and demand. For time-insensitive or transient workloads (a sketch of a Spot request follows this list).
Dedicated: launch instances within Amazon VPC that run on hardware dedicated to a single customer. For highly sensitive or compliance-related workloads.
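A minimal sketch of bidding for Spot capacity with boto3; the AMI ID, bid price, and instance type are made up for illustration.

```python
# Hypothetical sketch: request Spot capacity with boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.10",                  # maximum bid per instance-hour
    InstanceCount=10,
    LaunchSpecification={
        "ImageId": "ami-12345678",
        "InstanceType": "c3.8xlarge",
    },
)
# Requests are fulfilled while the fluctuating Spot price stays under the bid.
for req in response["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])
```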
When to consider running HPC workloads on AWS

New ideas: new HPC project, proof of concept, new application features, training models, benchmarking algorithms.
Improvement: remove the queue, hardware refresh cycle, reduce costs, collaboration on results, increase innovation speed, reduce time to results.
Transforming drive design to store the world’s data: the world’s largest F500 cloud run

• Workload: new drive head designs; data is encrypted and routed to AWS, and results are returned
• Jobs are submitted to orchestrate HPC clusters over VPC, with EBS storage attached
• Ran 1 million drive head designs = 70.75 core-years of computation
• 90x throughput: ran in 8 hours, not 30 days
• 3 days from idea to running
• Cluster of 70,908 cores on Spot Instances, 729 TFLOPS
• c3 and r3 instances with Intel E5-2670 v2
• Cost: $5,594 using Spot Instances
Honda’s products: new motorcycle products, automobiles, power products, ASIMO, HondaJet, UNI-CUB, MC-β, FCX, and the Honda Smart Home System (HSHS).

“Dreams are the source of our courage and energy to meet every challenge without fear of failure.”
Honda operates worldwide (figures as of March 31, 2014, for the fiscal year April 2013 to March 2014): North America, South America, Europe, China, Asia/Oceania, and Japan.

We had individual HPC resources at every R&D operation: motorcycle, power products, fundamental research, aircraft engine, automobile, and others.

We consolidated HPC resources into the Honda data center, for overall optimization and globalization across Europe, Japan, North America, Asia/Oceania, China, and South America.
• Use for a certain period: parallel transient clusters, trial use, need for a lot of cores, high memory
• Lead time: no complicated procedures or screening; no need to worry about the availability of resources
• Agility: use the AWS API to start EC2 instances quickly, and stop them anytime with pay-as-you-go (see the sketch after this list)
• Service: choose from several EC2 instance types (including the new types)
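A minimal sketch of that agility point: start instances through the API when a job arrives and terminate them when it ends, paying only for the hours used. The AMI ID, region, and counts are hypothetical.

```python
# Hypothetical sketch: start EC2 instances quickly, stop paying when done.
import boto3

ec2 = boto3.client("ec2", region_name="ap-northeast-1")

# Launch high-memory compute nodes on demand (made-up AMI ID)...
run = ec2.run_instances(
    ImageId="ami-12345678",
    InstanceType="r3.8xlarge",
    MinCount=1,
    MaxCount=32,
)
instance_ids = [i["InstanceId"] for i in run["Instances"]]

# ...and stop paying the moment the run finishes (pay-as-you-go).
ec2.terminate_instances(InstanceIds=instance_ids)
```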
EC2 Spot Instances: a cluster manager drives computing nodes (Spot or On-Demand) with data volumes attached.
[Figure: instance usage, plotted as instance number over time]
Anyway, use cloud → accumulate knowledge and plan the next step → suggest improvements to providers → providers release new services → use cloud again.
Solution Architects
Professional Services
Premium Support
AWS Partner Network (APN)
AWS Architectures
Reference architecture
diagrams
aws.amazon.com/architecture
Big Data Case Studies
Learn from other AWS customers
aws.amazon.com/solutions/case-studies/big-data
AWS Partner Network – Big Data Competency
Partner with an AWS Big Data expert
AWS Marketplace
AWS Online Software Store
aws.amazon.com/marketplace
Shop the big data and HPC categories
AWS Public Data Sets
Free access to big data sets
aws.amazon.com/publicdatasets
AWS Big Data and HPC Test Drives
APN Partner-provided labs
aws.amazon.com/testdrive
Learn from AWS big data experts
Learn how to use Apache Storm and Amazon Kinesis to process streaming real-time data
blogs.aws.amazon.com/bigdata
aws.amazon.com/training
Big Data Technology Fundamentals
Online Training
Big Data on AWS
Instructor-Led Training
Visit the Big Data Kiosk at the AWS Booth in the Expo Room
http://bit.ly/awsevals
Learn more about Big Data
and HPC on AWS:
aws.amazon.com/big-data
aws.amazon.com/hpc
Thank you!