(ism303) migrating your enterprise data warehouse to amazon redshift

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. John Loughlin, AWS Solutions Architect Kishore Raja, Boingo Wireless, VP Strategy Ajit Zadgaonkar, Edmunds.com Executive Director, Engineering Operations October 2015 ISM303 Migrating Your Enterprise Data Warehouse to Amazon Redshift


Page 1: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

John Loughlin, AWS Solutions Architect

Kishore Raja, Boingo Wireless, VP Strategy

Ajit Zadgaonkar, Edmunds.com Executive Director, Engineering Operations

October 2015

ISM303

Migrating Your Enterprise Data Warehouse to Amazon Redshift

Page 2: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Relational data warehouse

Massively parallel; Petabyte scale

Fully managed

HDD and SSD Platforms

$1,000/TB/Year; starts at $0.25/hour

Amazon Redshift

a lot faster

a lot simpler

a lot cheaper

Page 3: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Amazon Redshift works with your analysis tools

JDBC/ODBC

Amazon Redshift

Page 4: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Data loading options

• Parallel upload to Amazon S3

• AWS Direct Connect

• AWS Import/Export

• Amazon Kinesis

• Systems integrators

Data Integration Systems Integrators

Page 5: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Amazon Redshift architecture

Leader Node

Simple SQL end point

Stores metadata

Optimizes query plan

Coordinates query execution

Compute Nodes

Local columnar storage

Parallel/distributed execution of all queries, loads, backups, restores, resizes

Start at $0.25/hour, grow to 2 PB (compressed)

DC1: SSD; scale from 160 GB to 326 TB

DS2: HDD; scale from 2 TB to 2 PB

10 GigE

(HPC)

Ingestion

Backup

Restore

JDBC/ODBC

Page 6: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Amazon Redshift is priced to analyze all your data

DS2 (HDD), DW1.XL single node

Term                  Price per hour    Effective annual price per TB (compressed)
On-Demand             $0.850            $3,725
1 Year Reservation    $0.500            $2,190
3 Year Reservation    $0.228            $999

DC1 (SSD), DW2.L single node

Term                  Price per hour    Effective annual price per TB (compressed)
On-Demand             $0.250            $13,690
1 Year Reservation    $0.161            $8,795
3 Year Reservation    $0.100            $5,500

Pricing is simple

Number of nodes x price/hour

No charge for leader node

No upfront costs

Pay as you go
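
The pricing rule on this slide can be checked with a little arithmetic. The sketch below assumes 8,760 hours per year and 2 TB of compressed storage per DS2 (DW1.XL) node, as suggested by the "DS2: HDD; scale from 2 TB" line on the architecture slide. The function names are illustrative and not part of any AWS tool.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, cluster runs all year


def annual_cluster_cost(nodes: int, price_per_hour: float) -> float:
    """Number of nodes x price/hour, extended over a full year."""
    return nodes * price_per_hour * HOURS_PER_YEAR


def annual_price_per_tb(price_per_hour: float, tb_per_node: float) -> float:
    """Effective annual price per TB of compressed data for a single node."""
    return annual_cluster_cost(1, price_per_hour) / tb_per_node


# DS2 hourly prices from the slide; 2 TB per node is an assumption.
for label, hourly in [
    ("On-Demand", 0.850),
    ("1 Year Reservation", 0.500),
    ("3 Year Reservation", 0.228),
]:
    print(f"{label}: ${annual_price_per_tb(hourly, 2.0):,.0f} per TB per year")
```

The DS2 results land within a few dollars of the slide's rounded figures; the leader node is excluded because the slide states it is not charged.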

Page 7: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Common migration patterns

• Data from a variety of relational online transaction processing (OLTP) systems: the structure lends itself to SQL schemas

• Data from logs, devices, sensors, …: the data is less structured

Page 8: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Structured data loading

• Data is often being loaded into another warehouse from an existing ETL process

• Temptation is to “lift and shift” workload

• Resist temptation; instead consider:

• What do I really want to do?

• What do I need?

Page 9: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Ingesting less-structured data

• Some data does not lend itself to a relational schema

• Common pattern is to use Amazon EMR to:

• Impose structure

• Import into Amazon Redshift

• Other solutions are often home-grown scripting applications
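
As a rough illustration of what "imposing structure" means, here is a minimal home-grown script in plain Python rather than Amazon EMR. The log format, the column list, and the pipe delimiter are all assumptions made for the example, not anything shown in the talk.

```python
import re
from typing import Optional

# Hypothetical log format: "<timestamp> <device id> <key>=<value> ..."
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<device>\S+)\s+(?P<rest>.*)$")
COLUMNS = ["status", "bytes"]  # assumed target columns after ts and device


def to_row(line: str) -> Optional[str]:
    """Turn one loosely structured log line into a pipe-delimited row.

    Returns None when the line does not match the assumed format.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Keep only key=value tokens; anything else in the tail is ignored.
    pairs = dict(p.split("=", 1) for p in match["rest"].split() if "=" in p)
    fields = [match["ts"], match["device"]] + [pairs.get(c, "") for c in COLUMNS]
    return "|".join(fields)


print(to_row("2015-10-07T10:00:00Z ap-17 status=200 bytes=512 extra=x"))
```

Rows produced this way have a fixed column order, so they can be written to delimited files and loaded like any other structured input.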

Page 10: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Loading data

• Load to an empty Amazon Redshift database

• Load changes captured in the source system to Amazon Redshift

Page 11: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Truncate and load

This is by far the easiest option:

• Move the data to Amazon S3

• Multi-part upload

• Import/export service

• AWS Direct Connect

• COPY the data into Amazon Redshift, a table at a time
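
The truncate-and-load steps might be scripted along these lines. This is a sketch that only builds the SQL text: the one-prefix-per-table S3 layout, the gzip and pipe-delimiter options, and the IAM role placeholder are assumptions, and table names are interpolated without quoting, so it suits trusted names only.

```python
from typing import List


def truncate_and_load(tables: List[str], s3_prefix: str, iam_role: str) -> List[str]:
    """Build TRUNCATE + COPY statements, one table at a time."""
    statements = []
    for table in tables:
        # Empty the target first so the load starts from a clean table.
        statements.append(f"TRUNCATE TABLE {table};")
        # Assumed layout: all files for a table live under <prefix>/<table>/.
        statements.append(
            f"COPY {table} FROM '{s3_prefix}/{table}/' "
            f"IAM_ROLE '{iam_role}' GZIP DELIMITER '|';"
        )
    return statements


for stmt in truncate_and_load(
    ["customers", "orders"],
    "s3://example-bucket/export",  # hypothetical bucket
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",  # placeholder ARN
):
    print(stmt)
```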

Page 12: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Load changes

• Identify changes in source systems

• Move data to Amazon S3

• Load changes:

• ‘Upsert process’

• Partner ETL tools

Page 13: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Partner ETL

• Amazon Redshift is supported by a variety of ETL vendors

• Many simplify the process of data loading

• A variety of vendors offer a free trial of their products, allowing you to evaluate and choose the one that suits your needs

• Visit http://aws.amazon.com/redshift/partners

Page 14: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Upsert

• The goal is to insert new rows into and update changed rows in Amazon Redshift

• Load data into a temporary staging table

• Join the staging table with production and delete the common rows

• Copy the new data into the production table

• See Updating and Inserting New Data in the Amazon Redshift Database Developer Guide
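
The upsert steps above can be sketched as a sequence of SQL statements. The helper below only assembles the text; the single-column key, the `_staging` suffix, the COPY options, and the IAM role placeholder are assumptions, and the target name is expected to be unqualified and trusted.

```python
from typing import List


def build_upsert(target: str, key: str, s3_path: str, iam_role: str) -> List[str]:
    """Staging-table upsert: load, delete matching rows, insert, clean up."""
    staging = f"{target}_staging"  # hypothetical naming convention
    return [
        # 1. Temporary staging table with the same columns as production.
        f"CREATE TEMP TABLE {staging} (LIKE {target});",
        f"COPY {staging} FROM '{s3_path}' IAM_ROLE '{iam_role}' GZIP DELIMITER '|';",
        # 2. Delete and insert inside one transaction so readers never see a gap.
        "BEGIN TRANSACTION;",
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};",
        f"INSERT INTO {target} SELECT * FROM {staging};",
        "END TRANSACTION;",
        f"DROP TABLE {staging};",
    ]


for stmt in build_upsert("orders", "order_id", "s3://example-bucket/changes/", "arn:aws:iam::123456789012:role/ExampleRedshiftRole"):
    print(stmt)
```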

Page 15: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

COPY command

• Set COMPUPDATE to ON when running on an empty table

• Use the COPY command

• Each slice can load one file at a time

• Partition input files so all slices can load in parallel

• Use a manifest file
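
A COPY statement that follows these bullets could be assembled as below. This is a sketch: the gzip and pipe-delimiter options and the IAM role placeholder are assumptions, and `build_copy` is a made-up helper name.

```python
def build_copy(table: str, manifest_url: str, iam_role: str, empty_table: bool) -> str:
    """Build a COPY that reads a manifest; turn on COMPUPDATE for empty tables."""
    options = ["MANIFEST", "GZIP", "DELIMITER '|'"]
    if empty_table:
        # Let the first load into an empty table choose column encodings.
        options.append("COMPUPDATE ON")
    return (
        f"COPY {table} FROM '{manifest_url}' IAM_ROLE '{iam_role}' "
        + " ".join(options)
        + ";"
    )


print(build_copy(
    "orders",
    "s3://example-bucket/orders.manifest",  # hypothetical manifest location
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",  # placeholder ARN
    empty_table=True,
))
```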

Page 16: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Use multiple input files to maximize throughput

• Use the COPY command

• Each slice can load one file at a time

• A single input file means only one slice is ingesting data

• Instead of 100 MB/s, you’re getting only 6.25 MB/s

Page 17: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Use multiple input files to maximize throughput

• Use the COPY command

• You need at least as many input files as you have slices

• With 16 input files, all slices are working so you maximize throughput

• Get 100 MB/s per node; scale linearly as you add nodes
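
The 6.25 MB/s and 100 MB/s figures follow from dividing a node's throughput evenly across its slices. A minimal model, assuming one node with 16 slices, 100 MB/s per node, and one file per busy slice (the even split is a simplification):

```python
def load_throughput_mb_s(input_files: int, slices: int = 16, node_mb_s: float = 100.0) -> float:
    """Estimated load rate when each slice ingests at most one file at a time."""
    if input_files < 0 or slices <= 0:
        raise ValueError("input_files must be >= 0 and slices must be > 0")
    busy_slices = min(input_files, slices)  # extra files wait their turn
    return busy_slices * (node_mb_s / slices)


print(load_throughput_mb_s(1))   # one file keeps a single slice busy
print(load_throughput_mb_s(16))  # sixteen files keep every slice busy
```

With one file the model gives 100 / 16 = 6.25 MB/s; with 16 files it gives the full 100 MB/s quoted on the slide.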

Page 18: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Primary keys and manifest files

• Amazon Redshift doesn’t enforce primary key constraints:

• If you load data multiple times, Amazon Redshift won’t complain

• If you declare primary keys in your data definition language (DDL), the optimizer expects the data to be unique

• Use manifest files to control exactly what is loaded and how to respond if input files are missing:

• Define a JSON manifest on Amazon S3

• Ensures that the cluster loads exactly what you want
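
A JSON manifest of the kind described here can be generated with the standard library. The sketch uses the `entries` / `url` / `mandatory` layout of a COPY manifest; the bucket and file names are invented for the example.

```python
import json
from typing import Iterable


def build_manifest(urls: Iterable[str], mandatory: bool = True) -> str:
    """Return manifest JSON listing exactly the files COPY should load.

    With mandatory set to True, a missing file makes the load fail
    instead of being skipped silently.
    """
    entries = [{"url": url, "mandatory": mandatory} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


print(build_manifest([
    "s3://example-bucket/orders/part-0000.gz",  # hypothetical object keys
    "s3://example-bucket/orders/part-0001.gz",
]))
```

The resulting text is uploaded to Amazon S3 and referenced from a COPY command that uses the MANIFEST option.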

Page 19: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Kishore Raja

VP, Strategy

Boingo Wireless

October 7, 2015 | Las Vegas, NV

TCO and ROI for Migrating from Enterprise Database to Amazon Redshift

ISM303

Page 20: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Agenda

- Data Architecture

- Success Criteria

- Solutions Evaluated

- Additional Benefits

- Big Data Agility

- Summary

Page 21: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Boingo: Reaching 1 Billion Consumers Annually

Media: Largest ad network, engaging mobile audiences via Wi-Fi

Wi-Fi: Largest operator of airport wireless networks in the world

DAS: Largest operator of independent indoor cellular networks in the U.S.

Broadband: Largest provider of wireless high-speed Internet & TV for the military

90M+ ad engagements/year

100 operator partners

100+ countries

6 continents

1 million+ hotspots

Nearly 2,000 commercial locations

19 DAS locations

100+ worldwide

Page 22: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Boingo on AWS

[Diagram: AWS services in use at Boingo, grouped under the categories Storage and Content Delivery, Compute and Networking, Database, Admin and Security, Deployment, and App Services. Services shown: S3, EBS, Glacier, CloudFront, Amazon EC2, AMI, Elastic IP, VPC, VPN connection, gateway(s), Route 53, route table, ELB, Auto Scaling, ENI, Lambda, RDS, ElastiCache, MySQL DB, data warehouse, Oracle 11g(r2), CloudWatch, Trusted Advisor, IAM, CloudTrail, MFA token, Elastic Beanstalk, CloudFormation, OpsWorks, SQS.]

Page 23: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Data Architecture

[Diagram: three-stage pipeline]

1. ETL: SAP Data Services (inputs: Eng data, S3, flat files)

2. Data Storage: Database, Oracle RDS 11g(r2)

3. Reporting: Front end visualization (Business Objects)

Page 24: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Issues

• Data is growing, which is making OLAP slow

• Inefficient Row based approach (mostly)

• Standard Oracle compression

• Mediocre IOPS

• Single DB server (no concurrency)

• Not enough memory (64GB)

• Administration

– Partitioning

– DB patches, updates, OS patches, updates

– Maintenance (backup, snapshots, replication)

– Recovery failure etc.

• Expensive (license, hardware, support etc.)

Page 25: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Success Criteria

What do we need?

• Memory (at least 256GB)

• Parallel Processing

• Plenty of IOPS

• Less Administration

• Low TCO

Growth rate:

• Currently at 15TB

• 2-3TB average growth per year

Nice to have

• Ingest any data type/store

• Realtime Streaming analysis

• Massive Parallel Processing

• Scale (up or down)

• Integrate any (& every) database

• Multiple levels of Security

• Smart Alerts and Monitoring

• Cost Effective

• Lesser (or zero) CAPEX

• Keep up with industry security/compliance

• Automated audit reporting

Page 26: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Solutions

Exadata

Page 27: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

AWS Data Solutions

RDS: Oracle, SQL Server, PostgreSQL, MySQL, Aurora (MySQL compatible)

NoSQL: Small and large scale non-RDS; schemaless

In Memory: Using open source memcached/Redis; works on any database

Data Warehouse (Redshift): Datawarehouse; petabyte scale; massive parallel processing

Fully Managed, No CAPEX, Highly secure, Scalable

• DAT202: Understanding Database Options on AWS (Wednesday, Oct 7, 11:00 AM - 12:00 PM, San Polo 3501B)

• DAT302 - Relational Database Management Systems in the Cloud: Deploying SQL Server on AWS (Thursday, Oct 8, 5:30 PM - 6:30 PM, San Polo 3501B)

• DAT303: Oracle on AWS and Amazon RDS: Secure, Fast and Scalable (Friday, Oct 9 9:00-10AM, Delfino 4102)

Page 28: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Redshift TCO

EaaS

Eng. Data

S3

Flat files

Redshift

Datawarehouse

Front end Visualization

(Business Objects)

1. ETL 2. Data storage 3. BI reports

- Cluster of 50 DB servers

- 100 CPU cores

- 8TB SSD storage

- 750GB Memory

- Self organizing Cluster(s)

- 160GB increments

Annual Cost: $48,500

Annual Cost: ~ $6,500

Annual Cost: ~ $55,000

Database installs, patches, OS installs,

patches, backup, replication, server

maintenance, scaling, security etc.

Managed Service

Page 29: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

TCO Comparison

[Bar chart: TCO estimates. Exadata: $400,000; SAP HANA: $300,000; Redshift: $55,000.]

Page 30: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Performance Results

[Chart: latency in seconds, existing system vs. Redshift, for query performance (1 year of data) and data load performance (1 million records). Plotted values: 7,200; 2,700; 15; 15.]

[Chart: ETL annual cost, existing system vs. Redshift. Plotted values: 55,000; 6,500.]

Page 31: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Migration and Ease Of Use

Database installs, patches, OS installs,

patches, backup, replication, server

maintenance, scaling, security etc.

Administration and Support

[Chart: migration time in months, other systems vs. Redshift. Plotted values: 2 and 4.]

Page 32: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

TCO

Estimated Cluster

- Cluster of 50 DB servers

- 100 CPU cores

- 8TB SSD storage

- 750GB Memory

- Self organizing Cluster(s)

- 160GB increments

Actual Cluster

$48,500

$12,000

Savings:

• 40% for up to 1 year term

• 60% for up to 3 year term

Options:

• No upfront 20% *

• Partial upfront 41% - 73%

• All upfront 42% - 76%

Cancellation:

• Full refund within 7 days *

• Prorated refund within 30 days *

• Prorated refund within 90 days

Talend ($6500)

* For 1 year term RI

Python Scripts ($0)

Elasticity Reserved Instances ETL

- ISM208 - The Science of Saving with AWS Reserved Instances (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Delfino 4105)

Page 33: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

3. Subnets

Additional Benefits

1. Access Control

• “Deny All” DB cluster

• Firewall rules

• IAM management

2. VPC

• BYOIP

• Ingress access

• Extend to corporate data center

Cloud

• MFA

• Encryption

• Transit : SSL with TLS v1.2

• Storage : Encryption at rest

• Further isolation inside VPC

• IAM management

• SEC302 - IAM Best Practices to Live By (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Palazzo K)

• NET201 - Creating Your Virtual Data Center: VPC Fundamentals and Connectivity Options (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Titian 2201B)

• ARC403 - From One to Many: Evolving VPC Design (Wednesday, Oct 7, 2:45 PM - 3:45 PM, Palazzo N)

Page 34: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Advanced Encryption

[Diagram: AES 256-bit encryption; key hierarchy labels: Database Key, Cluster Master Key, Customer Master Key, HSM (data center).]

Page 35: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Monitoring and Alerts

Intrusion Detection

• DDoS

• MiTM

• IP Spoofing

• Packet Sniffing

• Port Monitoring

Service

• DVO303 - Scaling Infrastructure Operations with AWS Service Catalog, AWS Config, and AWS CloudTrail (Friday, Oct 9, 9:00 AM - 10:00 AM, Lido 3001B)

• ARC302 - Running Lean Architectures: How to Optimize for Cost Efficiency (Friday, Oct 9, 9:00 AM - 10:00 AM, Palazzo K)

Page 36: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Big Data Agility

Production Datawarehouse

- Cluster of 50 DB servers

- 100 CPU cores

- 8TB SSD storage

- 750GB Memory

- Self organizing Cluster(s)

- 160GB increments

Backup

QA Cluster

Predictive Analysis/Adhoc Cluster

Performance Cluster

< 30mins

< $5/hour

< $5/hour

< $5/hour

DAT311 - Large-Scale Genomic Analysis with Amazon Redshift (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Lando 4306)

DAT308 - How Yahoo! Analyzes Billions of Events a Day on Amazon Redshift (Thursday, Oct 8, 4:15 PM - 5:15 PM, Palazzo C)

BDT401 - Amazon Redshift Deep Dive: Tuning and Best Practices (Thursday, Oct 8, 2:45 PM - 3:45 PM, Marcello 4506)

Page 37: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Summary

• (Very) Cost Efficient

• (Highly) Secure (Enterprise grade Encryption)

• Managed service (Administration)

• Quick(er) Migration time

• 167+ security and compliance features

• Proven to work (NASDAQ, NASA, Financial Times, Pinterest etc.)

• Faster with better performance

• Future proof (Ecosystem, security, new services etc.)

• 2+ years on AWS

• Ease of use

ROI

Page 38: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Related Sessions

• DAT311 - Large-Scale Genomic Analysis with Amazon Redshift (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Lando 4306)

• DAT308 - How Yahoo! Analyzes Billions of Events a Day on Amazon Redshift (Thursday, Oct 8, 4:15 PM - 5:15 PM, Palazzo C)

• BDT401 - Amazon Redshift Deep Dive: Tuning and Best Practices (Thursday, Oct 8, 2:45 PM - 3:45 PM, Marcello 4506)

• DAT202 - Understanding Database Options on AWS (Wednesday, Oct 7, 11:00 AM - 12:00 PM, San Polo 3501B)

• DAT302 - Relational Database Management Systems in the Cloud: Deploying SQL Server on AWS (Thursday, Oct 8, 5:30 PM - 6:30 PM, San Polo 3501B)

• DAT303 - Oracle on AWS and Amazon RDS: Secure, Fast and Scalable (Friday, Oct 9, 9:00 AM - 10:00 AM, Delfino 4102)

• SEC302 - IAM Best Practices to Live By (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Palazzo K)

• NET201 - Creating Your Virtual Data Center: VPC Fundamentals and Connectivity Options (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Titian 2201B)

• ARC403 - From One to Many: Evolving VPC Design (Wednesday, Oct 7, 2:45 PM - 3:45 PM, Palazzo N)

• DVO303 - Scaling Infrastructure Operations with AWS Service Catalog, AWS Config, and AWS CloudTrail (Friday, Oct 9, 9:00 AM - 10:00 AM, Lido 3001B)

• ISM208 - The Science of Saving with AWS Reserved Instances (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Delfino 4105)

• ARC302 - Running Lean Architectures: How to Optimize for Cost Efficiency (Friday, Oct 9, 9:00 AM - 10:00 AM, Palazzo K)

[Side labels grouping the sessions: Redshift, Databases, Infrastructure, Cost]

Page 39: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Ajit Zadgaonkar, Executive Director

October 2015

Migration to Amazon Redshift

Edmunds.com

Page 40: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

18 MILLION Monthly Visitors

Page 41: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

59% OF CAR BUYERS INFLUENCED BY EDMUNDS.COM

*R. L. Polk & Co.

Page 42: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift
Page 43: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Edmunds.com

• 18M unique visitors a month

• 200M+ page views a month

• Over 10k dealer partners

• 14k+ API users

• Over 6M automotive inventory

• Over 1M content pages

• Lots and lots of data

• Continuously growing data

• 24x7 real-time BI

• DWH in Amazon Redshift

• 32-node cluster

Page 44: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

From unsustainable, painful operations to:

• Efficient, cost-effective cluster

• Squeak-free operations

• Happy customers

• Cost reduction (new system costs 1/5 of the old one)

Improvement

Page 45: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Challenges

• Painfully slow queries

• High system resource utilization

• Slow data loading

• Timeouts!

• …all in all, we were running into HUGE PROBLEMS

Page 46: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Lessons learned

• Know the system, the strengths, and the limitations

• Understand the end-to-end usage scenario

• Design the processes following Best Practices

• Invest in real-time monitoring

• Lift and shift may not be the best choice

• Let Enterprise Support and TAMs be your partners

• Monitor, monitor, and trend

Page 47: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

The System, the infrastructure

• Syntactical differences (e.g., PostgreSQL 7 vs. PostgreSQL 8)

• Architectural choices (e.g., columnar database)

• Transaction processing

• Historical data analysis, business intelligence

• Node type, cluster size

• Shared infrastructure vs. dedicated throughput

• The larger the cluster, the bigger the resizing effort

Page 48: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Make the up-front investment: Design

• Select the right sort key

• Timestamp, range filtering on column name, joins

• Compound sort key, interleaved sort key

• Measure query performance, system load, and vacuum

• Ensuring tables have a sort key alone helped us gain 20% performance

• Over 50% of our tables did not have a sort key

• Ensuring that the right sort key is assigned is the path to winning
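
To make the sort key choices concrete, the sketch below builds a CREATE TABLE statement with either a compound or an interleaved sort key and refuses to produce a table without one. The table, the column names, and the helper itself are hypothetical.

```python
from typing import Dict, List


def create_table_ddl(
    table: str,
    columns: Dict[str, str],
    sort_columns: List[str],
    interleaved: bool = False,
) -> str:
    """Build CREATE TABLE text that always declares a sort key."""
    if not sort_columns:
        raise ValueError("every table should declare at least one sort column")
    missing = [c for c in sort_columns if c not in columns]
    if missing:
        raise ValueError(f"sort columns not in table: {missing}")
    column_sql = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    style = "INTERLEAVED" if interleaved else "COMPOUND"
    sort_sql = ", ".join(sort_columns)
    return f"CREATE TABLE {table} (\n  {column_sql}\n)\n{style} SORTKEY ({sort_sql});"


print(create_table_ddl(
    "page_views",  # hypothetical table
    {"view_ts": "TIMESTAMP", "dealer_id": "INTEGER", "url": "VARCHAR(2048)"},
    ["view_ts", "dealer_id"],
))
```

Putting the timestamp first suits the range filtering mentioned above; whether compound or interleaved fits better depends on the query mix and should be measured.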

Page 49: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Make the upfront investment: Use cases

• Select the right distribution style

• Locate data faster

• Uniform load

• Less data movement

• A good distribution style ensures a healthy system

• Many of our tables did not have the right distribution style
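
One way to judge whether a distribution style gives a uniform load is to compare the fullest slice with the average slice. The ratio below is a simple illustrative metric, not one defined in the talk; the per-slice row counts would come from your own monitoring.

```python
from statistics import mean
from typing import Sequence


def skew_ratio(rows_per_slice: Sequence[int]) -> float:
    """Rows on the fullest slice divided by the average rows per slice.

    1.0 means perfectly even; larger values mean one slice does more work.
    """
    if not rows_per_slice:
        raise ValueError("need at least one slice")
    average = mean(rows_per_slice)
    if average == 0:
        raise ValueError("table is empty")
    return max(rows_per_slice) / average


print(skew_ratio([100, 100, 100, 100]))  # evenly distributed
print(skew_ratio([400, 0, 0, 0]))        # all rows on one slice
```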

Page 50: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Queries

• SELECT * is the #1 performance killer

• Use WHERE clause on the primary sort column

• Watch out for queries that create “temporary tables”

• Long-running queries might impact downstream services

• Define constraints
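
The first two bullets lend themselves to an automated check. The reviewer below is a deliberately rough, regex-based sketch (it does not parse SQL and will miss subqueries and aliases); the function name and warning texts are made up for the example.

```python
import re
from typing import List


def review_query(sql: str, sort_column: str) -> List[str]:
    """Flag SELECT * and queries that never filter on the primary sort column."""
    warnings = []
    if re.search(r"\bselect\s+\*", sql, re.IGNORECASE):
        warnings.append("avoid SELECT *; name only the columns you need")
    # Everything after the first WHERE keyword, if there is one.
    where = re.search(r"\bwhere\b(.*)", sql, re.IGNORECASE | re.DOTALL)
    column_pattern = rf"\b{re.escape(sort_column)}\b"
    if where is None or not re.search(column_pattern, where.group(1), re.IGNORECASE):
        warnings.append(f"no WHERE filter on sort column {sort_column}")
    return warnings


print(review_query("SELECT * FROM visits", "visit_ts"))
print(review_query(
    "SELECT page, visit_ts FROM visits WHERE visit_ts >= '2015-10-01'",
    "visit_ts",
))
```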

Page 51: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

VACUUM

• Run VACUUM frequently

• Run right after loading data

• Monitor vacuum time
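
Vacuum time is easy to record if the call is wrapped in a timer. In this sketch `execute` is a hypothetical callable that sends one statement to the cluster, for example a thin wrapper around a database cursor with autocommit enabled, since VACUUM cannot run inside a transaction block.

```python
import time
from typing import Callable


def timed_vacuum(execute: Callable[[str], None], table: str) -> float:
    """Run VACUUM on one table and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    execute(f"VACUUM {table};")
    return time.monotonic() - start


# Stand-in for a real connection so the example runs anywhere.
sent = []
elapsed = timed_vacuum(sent.append, "page_views")
print(sent[0], f"took {elapsed:.3f}s")
```

Calling this right after each load and storing the returned duration gives the trend line the slide asks for.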

Page 52: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Data loading

• Load data in sort key order

• Load using multiple files (1 MB to 1 GB)

• #files: Multiples of slices in cluster

• Use compression

• Use single COPY command

• S3 is your best friend
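
Several of these bullets can be combined in one preparation step: split the input into a number of compressed files that is a multiple of the slice count. The sketch below uses only the standard library; the `part-NNNN.gz` naming is an assumption, and it does not enforce the 1 MB to 1 GB size guidance.

```python
import gzip
import os
from typing import Iterable, List


def write_split_files(
    rows: Iterable[str], out_dir: str, slices: int, files_per_slice: int = 1
) -> List[str]:
    """Spread rows round-robin over slices * files_per_slice gzip files."""
    if slices <= 0 or files_per_slice <= 0:
        raise ValueError("slices and files_per_slice must be positive")
    n_files = slices * files_per_slice  # always a multiple of the slice count
    paths = [os.path.join(out_dir, f"part-{i:04d}.gz") for i in range(n_files)]
    handles = [gzip.open(path, "wt", encoding="utf-8") for path in paths]
    try:
        for i, row in enumerate(rows):
            # Rows that arrive in sort key order stay ordered within each file.
            handles[i % n_files].write(row + "\n")
    finally:
        for handle in handles:
            handle.close()
    return paths
```

The files can then be uploaded to Amazon S3 under one prefix and loaded with a single COPY command, as the slide recommends.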

Page 53: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

A closer look

• Each node is split into slices

• One slice per core

• Each slice is allocated memory, CPU, and disk space

• Each slice processes a piece of the workload in parallel

Page 54: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Monitoring commit queue

Page 55: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Monitoring commit time

Page 56: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Monitoring

• Console/Amazon CloudWatch monitoring

• CPU, memory, processes

• Data distribution across slices

• Space used per table

• WLM query count, queue wait time, execution time

• Commit stats, top time-consuming queries
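
Space used per table and data distribution can be pulled from the cluster itself. The query below assumes the SVV_TABLE_INFO system view and its `size`, `tbl_rows`, `skew_rows`, and `unsorted` columns; this view is not mentioned in the talk, so check it against the documentation for your cluster before relying on it.

```python
# Assumption: SVV_TABLE_INFO reports size in 1 MB blocks, row skew, and unsorted percentage.
TABLE_HEALTH_SQL = """
SELECT "table", size AS size_in_1mb_blocks, tbl_rows, skew_rows, unsorted
FROM svv_table_info
ORDER BY size DESC
LIMIT {limit};
"""


def table_health_query(limit: int = 20) -> str:
    """Return SQL listing the largest tables with their skew and unsorted share."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return TABLE_HEALTH_SQL.format(limit=int(limit)).strip()


print(table_health_query(10))
```

Running it on a schedule and charting the results is one way to follow the "monitor, monitor, and trend" advice from the lessons-learned slide.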

Page 57: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

In closing

• Amazon Redshift is a great data warehousing platform

• Parting advice: Make investment in Best Practices

• Check out Redshift Utils

Page 58: (ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift

Thank you!