AWS Summit 2013 Tel Aviv Oct 16 – Tel Aviv, Israel
Guy Ernest
Solutions Architecture, Amazon Web Services
Data Warehouse on AWS
DATAWAREHOUSE
ERP
ANALYST CRM
DB
OLTP
OLTP
OLTP
OLAP
OLTP                       OLAP
Transactional Processing   Analytical Processing
Transactional context      Global context
Latency                    Throughput
Indexed access             Full table scans
Random IO                  Sequential IO
Disk seek times            Disk transfer rate
OLTP
OLAP
DATAWAREHOUSE ANALYST
BUSINESS INTELLIGENCE: REPORTS, DASHBOARDS, …
PRODUCTION OFFLOAD: DIFFERENT DATA STRUCTURE, USING ETLs, …
BIG ENTERPRISES
VERY EXPENSIVE (ROI)
DIFFICULT TO MAINTAIN
NOT SCALABLE
SME
WAY TOO EXPENSIVE!
Jeff Bezos
Data Sources
Queries
Value
+ ELASTIC CAPACITY + NO CAPEX + PAY FOR WHAT YOU USE + DISPOSE ON DEMAND
= NO CONSTRAINTS
COLLECT STORE ANALYZE SHARE
ACCELERATION
AMAZON REDSHIFT
DWH that scales to petabytes and…
… WAY LESS EXPENSIVE
… WAY FASTER
…WAY SIMPLER
AMAZON REDSHIFT RUNNING ON OPTIMIZED HARDWARE
HS1.8XL: 128 GB RAM, 16 Cores, 16 TB Compressed Data, 2 GB/sec Disk Scan
HS1.XL: 16 GB RAM, 2 Cores, 2 TB Compressed Data
Extra Large Node (HS1.XL)
  Single Node (2 TB)
  Cluster 2-32 Nodes (4 TB – 64 TB)
Eight Extra Large Node (HS1.8XL)
  Cluster 2-100 Nodes (32 TB – 1.6 PB)
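The cluster ranges above follow from multiplying the node count by the compressed storage per node. A minimal sketch of that arithmetic, using only the figures on the slide; the names `NODE_TB` and `cluster_capacity_tb` are hypothetical and exist only for illustration:

```python
# Compressed storage per node in TB, as quoted on the slide.
NODE_TB = {"HS1.XL": 2, "HS1.8XL": 16}

def cluster_capacity_tb(node_type: str, nodes: int) -> int:
    """Total compressed capacity in TB for a cluster (hypothetical helper)."""
    return NODE_TB[node_type] * nodes

# Smallest and largest multi-node cluster for each node type.
for node_type, (smallest, largest) in {"HS1.XL": (2, 32), "HS1.8XL": (2, 100)}.items():
    print(node_type,
          cluster_capacity_tb(node_type, smallest), "TB to",
          cluster_capacity_tb(node_type, largest), "TB")
```

The largest HS1.8XL cluster comes out at 1,600 TB, which is the 1.6 PB quoted on the slide.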
10 GigE (HPC)
Ingestion
Backup
Restoration
JDBC/ODBC
…WAY SIMPLER
LOADING DATA
• Parallel Loading
• Data sorted and distributed automatically
• Linear Growth
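One way to picture parallel loading: when the input is split into several files, each parallel worker in the cluster can take a share at the same time. The sketch below is only a toy model under stated assumptions. `assign_files`, the file names, and the worker count of 16 are all hypothetical and are not a Redshift API or a documented figure.

```python
from collections import defaultdict

def assign_files(files, workers):
    """Round-robin files across parallel workers (toy model, not a Redshift API)."""
    shares = defaultdict(list)
    for i, name in enumerate(files):
        shares[i % workers].append(name)
    return dict(shares)

files = [f"part-{i:03d}.gz" for i in range(32)]   # hypothetical file names
shares = assign_files(files, workers=16)          # assumed worker count
busy_with_4_files = len(assign_files(files[:4], workers=16))

print("workers busy with 32 files:", len(shares))
print("workers busy with 4 files:", busy_with_4_files)
```

With fewer files than workers, part of the cluster sits idle in this model, which is the intuition behind splitting the input before loading.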
DATA SNAPSHOTS
• Automatic and Incremental snapshots in Amazon S3
• Configurable Retention Period
• Manual Snapshots
• “Streaming” Restore
REPLICATION IN CLUSTER +
AUTOMATIC SNAPSHOT IN AMAZON S3 +
MONITORING OF CLUSTER NODES
AUTOMATIC RESIZING
Read-only mode while resizing
New cluster is created in the background
Parallel node-to-node data copy
Only charged for a single cluster
Automatic DNS based endpoint cut-over
Deletion of source cluster
CREATE A DATAWAREHOUSE IN MINUTES
…WAY FASTER
MEMORY CAPACITY AND CPU PERFORMANCE DOUBLE EVERY 2 YEARS
DISK PERFORMANCE DOUBLES EVERY 10 YEARS
Progress is not evenly distributed
                 1980              Today       Change
Cost per TB      14,000,000$/TB    30$/TB      450,000 ÷
Disk capacity    100MB             3TB         30,000 X
Transfer rate    4MB/s             200MB/s     50 X
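The ratios on this slide can be rechecked from the raw figures, and the same figures show why full scans have become relatively more expensive: capacity grew far faster than transfer rate. A small sketch using decimal units (1 TB = 1,000,000 MB); the variable names are illustrative only:

```python
MB_PER_TB = 1_000_000  # decimal units, an assumption for this sketch

then = {"cost_per_tb": 14_000_000, "capacity_mb": 100, "rate_mb_s": 4}
now = {"cost_per_tb": 30, "capacity_mb": 3 * MB_PER_TB, "rate_mb_s": 200}

capacity_gain = now["capacity_mb"] / then["capacity_mb"]
rate_gain = now["rate_mb_s"] / then["rate_mb_s"]
cost_drop = then["cost_per_tb"] / now["cost_per_tb"]  # the slide rounds this to 450,000

# Seconds needed to read an entire disk sequentially.
scan_then = then["capacity_mb"] / then["rate_mb_s"]
scan_now = now["capacity_mb"] / now["rate_mb_s"]

print(f"capacity x{capacity_gain:,.0f}, transfer rate x{rate_gain:,.0f}, cost ÷{cost_drop:,.0f}")
print(f"full disk scan: {scan_then:.0f} s in 1980, {scan_now / 3600:.1f} h today")
```

Reading a full disk took seconds in 1980 and takes hours today, so reducing the bytes read matters more than raw disk size.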
I/O IS THE MAIN FACTOR FOR PERFORMANCE
• COLUMNAR STORAGE
• COMPRESSION PER COLUMN
• ZONE MAPS
• OPTIMIZED HARDWARE
• LARGE DATA BLOCK SIZE
Id   Age  State
123  20   CA
345  25   WA
678  40   FL
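Using the three sample rows above, the sketch below illustrates two of the listed ideas in plain Python: storing values column by column, and keeping a min/max "zone map" per block so that a scan can skip blocks that cannot match. It is a toy model only. `zone_map`, `blocks_to_read`, and the block size of 2 are hypothetical and do not describe Redshift's internal format.

```python
rows = [(123, 20, "CA"), (345, 25, "WA"), (678, 40, "FL")]

# Row storage keeps each record together; column storage keeps each field together.
columns = {name: [r[i] for r in rows] for i, name in enumerate(("id", "age", "state"))}

def zone_map(values, block_size):
    """Min/max per block, so a scan can skip blocks that cannot match."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b)) for b in blocks]

def blocks_to_read(zmap, lo, hi):
    """Indexes of blocks whose [min, max] range overlaps [lo, hi]."""
    return [i for i, (bmin, bmax) in enumerate(zmap) if bmax >= lo and bmin <= hi]

age_zones = zone_map(columns["age"], block_size=2)  # tiny blocks, for illustration
needed = blocks_to_read(age_zones, 30, 50)          # query: age BETWEEN 30 AND 50
print(columns["age"], age_zones, needed)
```

A query on Age touches only the age column, and the zone map lets it skip the first block entirely.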
TEST:
2 BILLION RECORDS
6 REPRESENTATIVE QUERIES
AMAZON REDSHIFT 2xHS1.8XL
Vs.
32 NODES, 4.2TB RAM, 1.6PB
12x - 150x FASTER
30 MINUTES
12 SECONDS
…WAY LESS EXPENSIVE
2x HS1.8XL 3.65$ / HOUR
32 000$ / YEAR
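The yearly figure is the hourly rate multiplied by the hours in a year. A quick check, assuming the 3.65$ rate covers the whole two-node cluster running around the clock; `yearly_cost` is a hypothetical helper:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def yearly_cost(hourly_rate: float) -> float:
    """Cost of running continuously for one year at the given hourly rate."""
    return hourly_rate * HOURS_PER_YEAR

print(round(yearly_cost(3.65)))
```

The result is just under the 32 000$ shown on the slide, which appears to be rounded.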
                      Instance HS1.XL per hour   Hourly Price per TB   Yearly Price per TB
On-Demand             0.850 $                    0.425 $               3 723 $
1 Year Reservation    0.500 $                    0.250 $               2 190 $
3 Years Reservation   0.228 $                    0.114 $               999 $
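The per-TB columns can be derived from the hourly instance price, given the 2 TB of compressed storage per HS1.XL node quoted earlier in the deck. A minimal sketch; `per_tb` is a hypothetical helper name:

```python
HOURS_PER_YEAR = 24 * 365
TB_PER_NODE = 2  # HS1.XL compressed capacity, from the hardware slide

hourly = {"On-Demand": 0.850, "1 Year Reservation": 0.500, "3 Years Reservation": 0.228}

def per_tb(hourly_price: float):
    """Return (hourly price per TB, yearly price per TB)."""
    hourly_tb = hourly_price / TB_PER_NODE
    return hourly_tb, hourly_tb * HOURS_PER_YEAR

for plan, price in hourly.items():
    hour_tb, year_tb = per_tb(price)
    print(f"{plan}: {hour_tb:.3f} $/TB/hour, {year_tb:.0f} $/TB/year")
```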
Intel Confidential
Intel Analytics on AWS
Assaf Araki
October, 2013
Agenda
• Advanced Analytics @ Intel
• Enterprise on the Cloud
• Use Case
Advanced Analytics
• Vision: Make analytics a competitive advantage for Intel
• Mission:
• Solve strategic high value business line problems
• Leverage analytics to grow Intel revenue
• About the team:
• ~100 employees - corporate ownership of advanced analytics
• Big data and Machine Learning are key focus areas
• Skills: Software Engineering / Decision Science / Business Acumen
• Value driven – ROI>$10M and/or key corporate problem as defined by VPs
• Part of the Israel Academy Computational research center
Intel AA Team
Big Data Analytics Platform
• Highly scalable, hybrid platform to support a range of business use cases
MPP High Speed Data Loader
Rich advanced analytics and real-time, in-database data mining capabilities
Heterogeneous data, batch oriented on advanced analytics
Prediction Module
AA Overview
Why Cloud?
• Known reasons
– Reduce cost
– Universal access
– Scale fast
• Additional reasons
– Flexible & Agile platform: no need for the engineering team to certify each tool
– Development accelerator: the R&D team can start developing while engineering teams implement the platform on premise
Enterprise On the Cloud
Use Case
• Characteristics:
– CPU behavior data
– Size: 30TB of data per month
– Type: Structured data
– Processing:
• Create aggregation facts and enable ad hoc analysis
• Create ML solutions
• Current Status:
– Data is sampled and processed on SMP RDBMS
– Takes almost 24 hours to process the entire data set
• Problem Statement
– Limited ability to analyze all data
Platforms
• On premise
– HBase – Hadoop platform exists
• No HBase
– MPP DB – Exists with Machine Learning capabilities
• Lower-cost platform would need to be evaluated and purchased
• Cloud
– HBase - EMR
– MPP DB - AWS Redshift
Go for POC on the Cloud
Evaluation Criteria
• Capabilities
– Create statistics calculations
• Cost of HW per TB
– Replication
– Compression
• Performance
– Load, transformation, querying
• Scalability
• Ability to execute
Preliminary Results
• Dataset example
– 34GB compressed data divided into files
– ~1,500,000,000 records
– 24B compressed, 240B per record (~15 columns)
• Performance & Scalability - 8 x 1XL nodes
– Load time – 2 hours for 32 files (5 hours for 4 files)
– Table size – 202GB (compression rate ~1.5:1)
– SQL aggregation statements
• 38K records – 6 minutes
• 14M records – 7 minutes
• 66M records – 11 minutes (22 minutes on 4 x 1XL)
• 939M records – 34 minutes (77 minutes on 4 x 1XL)
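The two queries that were timed on both cluster sizes give a rough view of how performance scaled when going from 4 to 8 nodes. A short sketch using only the timings reported above; `speedup` is a hypothetical helper, and two data points are too few to generalize from:

```python
# (records, minutes on 8 x 1XL, minutes on 4 x 1XL), as reported on the slide
runs = [(66_000_000, 11, 22), (939_000_000, 34, 77)]

def speedup(minutes_small_cluster: float, minutes_large_cluster: float) -> float:
    """How many times faster the larger cluster finished."""
    return minutes_small_cluster / minutes_large_cluster

for records, t8, t4 in runs:
    print(f"{records:,} records: {speedup(t4, t8):.2f}x faster on 8 nodes, "
          f"{records / (t8 * 60):,.0f} records/s")
```

Doubling the node count roughly halved the query time in both cases, which is consistent with close-to-linear scaling for these queries.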
Capabilities and Cost
• No current ability to write code (Java/C++/Python/R)
– Implement statistics and algorithms in SQL
• Compression is not straightforward
• Cost is sensitive to the actual compression ratio
– 2.6:1 is the break-even point
• 8XL vs. High Storage instance (16 cores 48TB)
• 3 years with 100% utilization
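As an illustration of what "implement statistics in SQL" can look like, the sketch below computes a per-group mean and population variance with plain aggregates, using the identity Var(x) = E[x²] − E[x]². It runs against an in-memory SQLite database as a local stand-in, not against Redshift, and the table name, column names, and sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in, not Redshift
conn.execute("CREATE TABLE samples (cpu_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1, 2.0), (1, 4.0), (1, 6.0), (2, 10.0), (2, 10.0)],  # made-up sample data
)

# Mean and population variance from plain aggregates only.
query = """
SELECT cpu_id,
       COUNT(*)                                     AS n,
       AVG(value)                                   AS mean,
       AVG(value * value) - AVG(value) * AVG(value) AS variance
FROM samples
GROUP BY cpu_id
ORDER BY cpu_id
"""
stats = conn.execute(query).fetchall()
for cpu_id, n, mean, variance in stats:
    print(cpu_id, n, mean, round(variance, 3))
```

The same pattern extends to other statistics that can be written as sums and counts, although this one-pass variance formula can lose precision on large values.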
Thank You!
USE CASE
AMAZON ELASTIC MAPREDUCE
AMAZON DYNAMODB
AMAZON EC2
AWS STORAGE GATEWAY
AMAZON S3
DATA CENTER
AMAZON RDS
AMAZON REDSHIFT
UPLOAD TO AMAZON S3
AWS IMPORT/EXPORT
AWS DIRECT CONNECT
DATA
INTEGRATION
INTEGRATION
SYSTEMS
2 million
15 million
MEMBER REGISTRATIONS
2011 2012 2013
1,500,000+ NEW MEMBERS EACH MONTH
1,200,000,000+ SOCIAL CONNECTIONS IMPORTED
Data Analyst
Raw Data
Get Data
Join via Facebook
Add a Skill Page
Invite Friends
Web Servers Amazon S3 User Action Trace Events
EMR Hive Scripts Process Content
• Process log files with regular expressions to parse out the info we need.
• Process cookies into useful, searchable data such as Session, UserId, and API security token.
• Filter out surplus info such as internal Varnish logging.
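The three steps above can be sketched in a few lines. The talk describes Hive scripts on EMR; the version below uses plain Python as a stand-in, and the log line layout, the `varnish-internal` source tag, and the cookie name `ApiToken` are hypothetical, since the real log format is not shown in the slides.

```python
import re
from http.cookies import SimpleCookie

# Hypothetical log line layout: timestamp, source, "METHOD path", "cookie header".
LINE_RE = re.compile(
    r'^(?P<ts>\S+) (?P<source>\S+) "(?P<method>[A-Z]+) (?P<path>\S+)" "(?P<cookie>[^"]*)"$'
)

def parse_line(line):
    """Return a dict of useful fields, or None for lines that should be dropped."""
    m = LINE_RE.match(line)
    if m is None or m.group("source") == "varnish-internal":
        return None  # unparseable line or surplus internal logging
    cookies = SimpleCookie()
    cookies.load(m.group("cookie"))
    record = {"ts": m.group("ts"), "method": m.group("method"), "path": m.group("path")}
    for key in ("Session", "UserId", "ApiToken"):
        record[key] = cookies[key].value if key in cookies else None
    return record

sample = ('2013-10-16T09:00:00Z web "GET /skills/python" '
          '"Session=abc123; UserId=42; ApiToken=t0k3n"')
internal = '2013-10-16T09:00:01Z varnish-internal "GET /ping" ""'
print(parse_line(sample))
print(parse_line(internal))
```

Each surviving record is a flat set of fields, which is the shape that loads cleanly into a warehouse table.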
Amazon S3
Aggregated Data
Raw Events
Internal Web
Excel Tableau
Amazon Redshift
ELASTIC DATA WAREHOUSE
Monthly Reports on a new cluster
Redshift
Reporting and BI
EMR
S3
DynamoDB Redshift
OLTP Web Apps
Reporting and BI
RDBMS Redshift
OLTP ERP
Reporting & BI
+
JDBC/ODBC
Amazon Redshift
DATAWAREHOUSE BY AWS
Pay per use, no CAPEX
Low cost for high performance
Open, integrates with existing BI tools
Simple to use and scalable
Speed and Agility
Frequent Experiments
Low Cost of Failure
More Innovation
Fewer Experiments
High Cost of Failure
Less Innovation
“On Premise”
Thank you very much