AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

AWS Summit 2013 Tel Aviv Oct 16 – Tel Aviv, Israel

Guy Ernest

Solutions Architecture, Amazon Web Services

Data Warehouse on AWS

DATAWAREHOUSE

ERP

ANALYST CRM

DB


OLTP

OLTP

OLTP

OLAP

OLTP                        OLAP
Transactional processing    Analytical processing
Transactional context       Global context
Latency                     Throughput
Indexed access              Full table scans
Random IO                   Sequential IO
Disk seek times             Disk transfer rate

DATAWAREHOUSE ANALYST

BUSINESS INTELLIGENCE: REPORTS, DASHBOARDS, …

PRODUCTION OFFLOAD: DIFFERENT DATA STRUCTURE, USING ETLs, …

BIG ENTERPRISES

VERY EXPENSIVE (ROI)

DIFFICULT TO MAINTAIN

NOT SCALABLE

SME

WAY TOO EXPENSIVE!

Jeff Bezos

Data Sources

Queries

Value

+ ELASTIC CAPACITY + NO CAPEX + PAY FOR WHAT YOU USE + DISPOSE ON DEMAND

= NO CONSTRAINTS

COLLECT STORE ANALYZE SHARE

ACCELERATION

AMAZON REDSHIFT

Data warehouse that scales to petabytes and…

… WAY LESS EXPENSIVE

… WAY FASTER

… WAY SIMPLER

AMAZON REDSHIFT RUNNING ON OPTIMIZED HARDWARE

HS1.8XL: 128 GB RAM, 16 Cores, 16 TB Compressed Data, 2 GB/sec Disk Scan

HS1.XL: 16 GB RAM, 2 Cores, 2 TB Compressed Data

Extra Large Node (HS1.XL): Single Node (2 TB), or Cluster of 2-32 Nodes (4 TB – 64 TB)

Eight Extra Large Node (HS1.8XL): Cluster of 2-100 Nodes (32 TB – 1.6 PB)
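
The cluster sizes above follow directly from the per-node capacity. A minimal Python sketch of that arithmetic, using only the figures on the slide; the function and dictionary names are illustrative and not part of any AWS API:

```python
import math

# Compressed capacity per node and allowed multi-node cluster sizes,
# as quoted on the slide.
NODE_TYPES = {
    "HS1.XL": {"tb_per_node": 2, "min_nodes": 2, "max_nodes": 32},
    "HS1.8XL": {"tb_per_node": 16, "min_nodes": 2, "max_nodes": 100},
}


def cluster_capacity_tb(node_type, node_count):
    """Total compressed capacity (TB) of a multi-node cluster."""
    spec = NODE_TYPES[node_type]
    if not spec["min_nodes"] <= node_count <= spec["max_nodes"]:
        raise ValueError(f"{node_type} clusters need {spec['min_nodes']}-{spec['max_nodes']} nodes")
    return spec["tb_per_node"] * node_count


def smallest_cluster(node_type, data_tb):
    """Smallest multi-node cluster that can hold data_tb of compressed data."""
    spec = NODE_TYPES[node_type]
    needed = max(spec["min_nodes"], math.ceil(data_tb / spec["tb_per_node"]))
    if needed > spec["max_nodes"]:
        raise ValueError(f"{data_tb} TB does not fit in a {node_type} cluster")
    return needed


print(cluster_capacity_tb("HS1.XL", 32))    # upper bound of the HS1.XL range
print(cluster_capacity_tb("HS1.8XL", 100))  # 1,600 TB, i.e. 1.6 PB
print(smallest_cluster("HS1.8XL", 30))      # nodes needed for 30 TB compressed
```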

10 GigE (HPC)

Ingestion Backup

Restoration

JDBC/ODBC

…WAY SIMPLER

LOADING DATA

• Parallel loading

• Data sorted and distributed automatically

• Linear growth
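
Loading into Amazon Redshift is normally done with the SQL `COPY` command, which reads every Amazon S3 object sharing a key prefix, so splitting the input into several files lets the load run in parallel. The sketch below only builds such a statement as a string. The table name, bucket, and role ARN are hypothetical placeholders, and the delimiter and compression options are assumptions about the input files:

```python
def build_copy_statement(table, s3_prefix, iam_role, delimiter="|", gzip=True):
    """Build a Redshift COPY statement that loads all S3 objects under a prefix."""
    options = [f"DELIMITER '{delimiter}'"]
    if gzip:
        options.append("GZIP")  # input files are assumed to be gzip-compressed
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        + " ".join(options)
        + ";"
    )


# Hypothetical names, for illustration only.
statement = build_copy_statement(
    table="events",
    s3_prefix="s3://example-bucket/events/part-",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

The statement would then be sent to the cluster over the usual JDBC/ODBC connection.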

DATA SNAPSHOTS

• Automatic and incremental snapshots in Amazon S3

• Configurable retention period

• Manual snapshots

• “Streaming” restore

REPLICATION IN CLUSTER +

AUTOMATIC SNAPSHOT IN AMAZON S3 +

MONITORING OF CLUSTER NODES

AUTOMATIC RESIZING

Read-only mode while resizing

New cluster is created in the background

Parallel node-to-node data copy

Only charged for a single cluster

Automatic DNS based endpoint cut-over

Deletion of source cluster

CREATE A DATAWAREHOUSE IN MINUTES

…WAY FASTER

MEMORY CAPACITY AND CPU PERFORMANCE DOUBLE EVERY 2 YEARS

DISK PERFORMANCE DOUBLES EVERY 10 YEARS

Progress is not evenly distributed

                  1980               Today        Change
Cost              $14,000,000/TB     $30/TB       ÷ 450,000
Capacity          100 MB             3 TB         × 30,000
Transfer rate     4 MB/s             200 MB/s     × 50
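
A short Python sketch of what these figures imply: because capacity grew far faster than transfer rate, reading a whole disk sequentially takes much longer than it used to. Only the numbers from the slide are used ("today" is the slide's 2013 disk); the variable names are illustrative:

```python
MB = 10**6
TB = 10**12

disk_1980 = {"capacity_bytes": 100 * MB, "bytes_per_second": 4 * MB, "usd_per_tb": 14_000_000}
disk_today = {"capacity_bytes": 3 * TB, "bytes_per_second": 200 * MB, "usd_per_tb": 30}


def full_scan_seconds(disk):
    """Time to read the entire disk sequentially at its transfer rate."""
    return disk["capacity_bytes"] / disk["bytes_per_second"]


capacity_growth = disk_today["capacity_bytes"] / disk_1980["capacity_bytes"]
transfer_growth = disk_today["bytes_per_second"] / disk_1980["bytes_per_second"]
cost_reduction = disk_1980["usd_per_tb"] / disk_today["usd_per_tb"]

print(f"capacity x{capacity_growth:,.0f}, transfer rate x{transfer_growth:,.0f}")
print(f"cost per TB divided by about {cost_reduction:,.0f}")
print(f"full scan, 1980 disk: {full_scan_seconds(disk_1980):,.0f} s")
print(f"full scan, today's disk: {full_scan_seconds(disk_today) / 3600:.1f} h")
```

A full scan goes from 25 seconds to more than four hours, which is why reducing the bytes read per query matters so much.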

I/O IS THE MAIN FACTOR FOR PERFORMANCE

• COLUMNAR STORAGE

• COMPRESSION PER COLUMN

• ZONE MAPS

• OPTIMIZED HARDWARE

• LARGE DATA BLOCK SIZE
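
To make the zone map idea concrete: each block of a column keeps its minimum and maximum value, so a scan can skip blocks that cannot contain the value it is looking for. The following is a minimal sketch in plain Python, not Redshift's actual on-disk format; the tiny block size and the sample ages are made up for illustration:

```python
def build_zone_map(values, block_size):
    """Split a column into blocks and record (min, max, block) for each."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block), block) for block in blocks]


def scan_equal(zone_map, target):
    """Return matching values and how many blocks actually had to be read."""
    matches, blocks_read = [], 0
    for low, high, block in zone_map:
        if low <= target <= high:  # only read blocks whose range can contain target
            blocks_read += 1
            matches.extend(value for value in block if value == target)
    return matches, blocks_read


# Hypothetical sorted "age" column, two values per block.
ages = [20, 20, 25, 25, 40, 40, 41, 45]
zone_map = build_zone_map(ages, block_size=2)
matches, blocks_read = scan_equal(zone_map, 25)
print(matches, f"read {blocks_read} of {len(zone_map)} blocks")
```

Zone maps are most effective when the column is sorted, since each block then covers a narrow range of values.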

Id     Age    State
123    20     CA
345    25     WA
678    40     FL
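
Using the three rows from the slide, the sketch below contrasts row-oriented and column-oriented layouts and shows one simple per-column compression scheme, run-length encoding. It is an illustration in plain Python under the assumption that a query needs only one column; it does not claim to reproduce the encodings Redshift uses:

```python
# Row-oriented layout: each record is stored together.
rows = [(123, 20, "CA"), (345, 25, "WA"), (678, 40, "FL")]
column_names = ("id", "age", "state")


def to_columns(rows, names):
    """Column-oriented layout: all values of one column are stored together."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Compress consecutive repeats into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


columns = to_columns(rows, column_names)

# An aggregate over "age" touches only that column, not id or state.
average_age = sum(columns["age"]) / len(columns["age"])
print(columns["age"], round(average_age, 1))

# Values within one column are similar, so they compress well.
print(run_length_encode(["CA", "CA", "CA", "WA", "WA", "FL"]))
```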

TEST:

2 BILLION RECORDS

6 REPRESENTATIVE QUERIES

AMAZON REDSHIFT 2xHS1.8XL

Vs.

32 NODES, 4.2TB RAM, 1.6PB

12x - 150x FASTER

30 MINUTES

12 SECONDS

…WAY LESS EXPENSIVE

2x HS1.8XL: $3.65 / HOUR

$32,000 / YEAR

HS1.XL instance        Per hour    Hourly price per TB    Yearly price per TB
On-Demand              $0.850      $0.425                 $3,723
1-Year Reservation     $0.500      $0.250                 $2,190
3-Year Reservation     $0.228      $0.114                 $999
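
The per-TB columns can be reproduced from the hourly instance price, assuming 2 TB of compressed data per HS1.XL node and 8,760 hours in a year. This is a sketch of the arithmetic only, with illustrative names, using the prices shown on the slide:

```python
HOURS_PER_YEAR = 24 * 365   # 8,760
TB_PER_HS1_XL_NODE = 2      # compressed capacity per HS1.XL node, from the slide

hourly_node_price_usd = {
    "On-Demand": 0.850,
    "1-Year Reservation": 0.500,
    "3-Year Reservation": 0.228,
}


def price_per_tb(hourly_node_price):
    """Return (hourly, yearly) price per TB for one HS1.XL node."""
    hourly_per_tb = hourly_node_price / TB_PER_HS1_XL_NODE
    return hourly_per_tb, hourly_per_tb * HOURS_PER_YEAR


for plan, price in hourly_node_price_usd.items():
    hourly, yearly = price_per_tb(price)
    print(f"{plan}: ${hourly:.3f} per TB-hour, ${yearly:,.0f} per TB-year")

# The two-node HS1.8XL test cluster: $3.65 per hour for a full year.
print(f"${3.65 * HOURS_PER_YEAR:,.0f} per year")
```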

Intel Confidential

Intel Analytics on AWS

Assaf Araki

October, 2013


Agenda

• Advanced Analytics @ Intel

• Enterprise on the Cloud

• Use Case


Advanced Analytics

• Vision: Make analytics a competitive advantage for Intel

• Mission:

• Solve strategic high value business line problems

• Leverage analytics to grow Intel revenue

• About the team:

• ~100 employees - corporate ownership of advanced analytics

• Big data and Machine Learning are key focus areas

• Skills: Software Engineering / Decision Science / Business Acumen

• Value driven – ROI>$10M and/or key corporate problem as defined by VPs

• Part of the Israel Academy Computational research center

Intel AA Team


Big Data Analytics Platform

• Highly scalable, hybrid platform to support a range of business use cases

MPP High Speed Data Loader

Rich advanced analytics and real-time, in-database data mining capabilities

Heterogeneous data, batch oriented on advanced analytics

Prediction Module

AA Overview


Why Cloud?

• Known reasons

– Reduce cost

– Universal access

– Scale fast

• Additional reasons

– Flexible & agile platform: no need for the engineering team to certify each tool

– Development accelerator: the R&D team can start developing while the engineering teams implement the platform on premise

Enterprise On the Cloud


Use Case

• Characteristics:

– CPU behavior data

– Size: 30TB of data per month

– Type: Structured data

– Processing:

• Create aggregated facts and enable ad hoc analysis

• Create ML solutions

• Current Status:

– Data is sampled and processed on SMP RDBMS

– Processing the entire data set takes almost 24 hours

• Problem Statement

– Limited ability to analyze all data


Platforms

• On premise

– HBase: Hadoop platform exists

• No HBase

– MPP DB: exists, with Machine Learning capabilities

• Lower-cost platform to be evaluated and purchased

• Cloud

– HBase: Amazon EMR

– MPP DB: Amazon Redshift


Go for POC on the Cloud


Evaluation Criteria

• Capabilities

– Create statistics calculations

• Cost of HW per TB

– Replication

– Compression

• Performance

– Load, transformation, querying

• Scalability

• Ability to execute



Preliminary Results

• Dataset example

– 34 GB of compressed data, divided into files

– ~1,500,000,000 records

– 24 B compressed, 240 B per record (~15 columns)

• Performance & Scalability - 8 x 1XL nodes

– Load time: 2 hours for 32 files (5 hours for 4 files)

– Table size: 202 GB (compression rate ~1.5:1)

– SQL aggregation statements

• 38K records: 6 minutes

• 14M records: 7 minutes

• 66M records: 11 minutes (on 4 x 1XL: 22 minutes)

• 939M records: 34 minutes (on 4 x 1XL: 77 minutes)
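
The scalability claim can be checked from the timings above. A small Python sketch that computes the speedup from doubling the cluster (4 to 8 nodes) and from splitting the load into more files; it uses only the numbers on the slide and the names are illustrative:

```python
# (minutes on 8 x 1XL, minutes on 4 x 1XL), taken from the slide.
query_minutes = {
    "66M records": (11, 22),
    "939M records": (34, 77),
}

# Number of input files -> load time in hours, taken from the slide.
load_hours = {32: 2, 4: 5}


def speedup(slow, fast):
    """How many times faster the 'fast' run is than the 'slow' run."""
    return slow / fast


query_speedups = {
    name: speedup(on_4_nodes, on_8_nodes)
    for name, (on_8_nodes, on_4_nodes) in query_minutes.items()
}
load_speedup = speedup(load_hours[4], load_hours[32])

for name, factor in query_speedups.items():
    print(f"{name}: {factor:.2f}x faster on 8 nodes than on 4")
print(f"load: {load_speedup:.1f}x faster with 32 files than with 4")
```

Doubling the node count roughly halves the query time for the two larger result sets, which is consistent with near-linear scaling for these queries.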


Capabilities and Cost

• No current ability to write code (Java/C++/Python/R)

– Implement statistics and algorithms in SQL

• Compression is not straightforward

• Cost is sensitive to the actual compression ratio

– 2.6:1 is the break-even point

• 8XL vs. High Storage instance (16 cores 48TB)

• 3 years with 100% utilization
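
As an illustration of the first point, implementing statistics in SQL when no procedural code can run inside the database, the sketch below computes a per-group count, mean, and population standard deviation using only aggregate functions. SQLite (from the Python standard library) stands in for the warehouse here, and the table, columns, and sample values are hypothetical:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (cpu_id INTEGER, metric REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1, 2.0), (1, 4.0), (1, 6.0), (2, 10.0), (2, 10.0)],  # made-up data
)

# Population variance expressed with plain aggregates: E[x^2] - (E[x])^2.
# This form is simple but can lose precision on large, tightly clustered values.
STATS_SQL = """
SELECT cpu_id,
       COUNT(*)                                         AS n,
       AVG(metric)                                      AS mean,
       AVG(metric * metric) - AVG(metric) * AVG(metric) AS variance
FROM samples
GROUP BY cpu_id
ORDER BY cpu_id
"""

stats = {
    cpu_id: (n, mean, math.sqrt(max(variance, 0.0)))  # clamp tiny negative rounding errors
    for cpu_id, n, mean, variance in conn.execute(STATS_SQL)
}

for cpu_id, (n, mean, stddev) in stats.items():
    print(f"cpu {cpu_id}: n={n}, mean={mean:.2f}, stddev={stddev:.3f}")
```

The square root is taken on the client side only to keep the SQL portable; a database with a built-in standard deviation aggregate could do the whole calculation in the query.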


Thank You!

USE CASE

AMAZON ELASTIC MAPREDUCE

AMAZON DYNAMODB

AMAZON EC2

AWS STORAGE GATEWAY

AMAZON S3

DATA CENTER

AMAZON RDS

AMAZON REDSHIFT

UPLOAD TO AMAZON S3

AWS IMPORT/EXPORT

AWS DIRECT CONNECT

DATA INTEGRATION

INTEGRATION SYSTEMS

MEMBER REGISTRATIONS (chart over 2011, 2012, 2013; labelled values: 2 million and 15 million)

1,500,000+ NEW MEMBERS EACH MONTH

1,200,000,000+ SOCIAL CONNECTIONS IMPORTED

Data Analyst

Raw Data

Get Data

Join via Facebook

Add a Skill Page

Invite Friends

Web Servers Amazon S3 User Action Trace Events

EMR Hive Scripts Process Content

• Parse log files with regular expressions to extract the information we need.

• Process cookies into useful, searchable data such as Session, UserId, and API security token.

• Filter out surplus information such as internal Varnish logging.
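
The three processing steps above can be sketched as follows. The original pipeline ran as Hive scripts on EMR; Python is swapped in here purely for illustration, and the log line layout, the cookie names, and the way Varnish lines are recognised are all assumptions:

```python
import re
from http.cookies import SimpleCookie

# Assumed log layout: client, [timestamp], "METHOD path", "cookie string".
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>[^"\s]+)" "(?P<cookies>[^"]*)"'
)
WANTED_COOKIES = ("Session", "UserId", "ApiToken")  # hypothetical cookie names


def parse_line(line):
    """Turn one raw log line into a dict of searchable fields, or None to drop it."""
    if "varnish" in line.lower():      # filter internal Varnish logging (assumed marker)
        return None
    match = LOG_PATTERN.match(line)
    if match is None:                  # line does not follow the expected layout
        return None
    event = match.groupdict()
    cookies = SimpleCookie()
    cookies.load(event.pop("cookies"))  # split "k=v; k2=v2" into individual cookies
    for name in WANTED_COOKIES:
        event[name] = cookies[name].value if name in cookies else None
    return event


raw_lines = [
    '10.0.0.1 [16/Oct/2013:10:00:00 +0000] "GET /skills/add" '
    '"Session=abc123; UserId=42; ApiToken=tok789"',
    'varnish-internal [16/Oct/2013:10:00:01 +0000] "GET /health" ""',
]
events = [event for event in map(parse_line, raw_lines) if event is not None]
print(events)
```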

Amazon S3

Aggregated Data

Raw Events

Internal Web

Excel Tableau

Amazon Redshift

ELASTIC DATA WAREHOUSE

Monthly Reports on a new cluster

Diagram: S3, EMR, Redshift, Reporting and BI

Diagram: OLTP Web Apps, DynamoDB, Redshift, Reporting and BI

Diagram: OLTP ERP, RDBMS, Redshift, Reporting & BI

JDBC/ODBC

Amazon Redshift

DATAWAREHOUSE BY AWS

Pay per use, no CAPEX

Low cost for high performance

Open, and integrates with existing BI tools

Simple to use and scalable

Speed and Agility

Frequent Experiments

Low Cost of Failure

More Innovation

Fewer Experiments

High Cost of Failure

Less Innovation

“On Premise”

Thank you very much
