(dat201) introduction to amazon redshift

50
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Pavan Pothukuchi, Amazon Redshift Nam Nguyen, RetailMeNot October 2015 DAT201 Introduction to Amazon Redshift

Upload: amazon-web-services

Post on 12-Jan-2017

2.401 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: (DAT201) Introduction to Amazon Redshift

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Pavan Pothukuchi, Amazon Redshift

Nam Nguyen, RetailMeNot

October 2015

DAT201

Introduction to Amazon Redshift

Page 2: (DAT201) Introduction to Amazon Redshift

What to expect from the session

• Amazon Redshift – What and Why

• Benefits

• Use cases

• Amazon Redshift at RetailMeNot

• Q&A

Page 3: (DAT201) Introduction to Amazon Redshift

AnalyzeStore

Import/Export

Direct Connect

Collect

Amazon Kinesis

Amazon

Glacier

S3

DynamoDB

Amazon Aurora

AWS big data portfolio

Data Pipeline

CloudSearch

EMR EC2

Amazon

RedshiftMachine

Learning

Page 4: (DAT201) Introduction to Amazon Redshift

Relational data warehouse

Massively parallel; Petabyte scale

Fully managed

HDD and SSD Platforms

$1,000/TB/Year; starts at $0.25/hour

Amazon

Redshift

a lot faster

a lot simpler

a lot cheaper

Page 5: (DAT201) Introduction to Amazon Redshift

The legacy view of data warehousing ...

Global 2,000 companies

Sell to central IT

Multi-year commitment

Multi-year deployments

Multi-million dollar deals

Page 6: (DAT201) Introduction to Amazon Redshift

… Leads to dark data

This is a narrow view

Small companies also have big data

(mobile, social, gaming, adtech, IoT)

Long cycles, high costs, administrative

complexity all stifle innovation

0

200

400

600

800

1000

1200

Enterprise Data Data in Warehouse

Page 7: (DAT201) Introduction to Amazon Redshift

The Amazon Redshift view of data warehousing

10x cheaper

Easy to provision

Higher DBA productivity

10x faster

No programming

Easily leverage BI tools,

Hadoop, Machine Learning,

Streaming

Analysis in-line with process

flows

Pay as you go, grow as you

need

Managed availability & DR

Enterprise Big Data SaaS

Page 8: (DAT201) Introduction to Amazon Redshift

Selected Amazon Redshift customers

Page 9: (DAT201) Introduction to Amazon Redshift

Amazon Redshift architecture

Leader Node

Simple SQL end point

Stores metadata

Optimizes query plan

Coordinates query execution

Compute Nodes

Local columnar storage

Parallel/distributed execution of all queries, loads,

backups, restores, resizes

Start at just $0.25/hour, grow to 2 PB (compressed)

DC1: SSD; scale from 160 GB to 326 TB

DS2: HDD; scale from 2 TB to 2 PB

Ingestion/Backup

Backup

Restore

JDBC/ODBC

10 GigE

(HPC)

Page 10: (DAT201) Introduction to Amazon Redshift

Benefit #1: Amazon Redshift is fast

Dramatically less I/O

Column storage

Data compression

Zone maps

Direct-attached storage

Large data block sizes

analyze compression listing;

Table | Column | Encoding

---------+----------------+----------

listing | listid | delta

listing | sellerid | delta32k

listing | eventid | delta32k

listing | dateid | bytedict

listing | numtickets | bytedict

listing | priceperticket | delta32k

listing | totalprice | mostly32

listing | listtime | raw

10 | 13 | 14 | 26 |…

… | 100 | 245 | 324

375 | 393 | 417…

… 512 | 549 | 623

637 | 712 | 809 …

… | 834 | 921 | 959

10

324

375

623

637

959

Page 11: (DAT201) Introduction to Amazon Redshift

SELECT COUNT(*) FROM LOGS WHERE DATE = ‘09-JUNE-2013’

MIN: 01-JUNE-2013

MAX: 20-JUNE-2013

MIN: 08-JUNE-2013

MAX: 30-JUNE-2013

MIN: 12-JUNE-2013

MAX: 20-JUNE-2013

MIN: 02-JUNE-2013

MAX: 25-JUNE-2013

Unsorted Table

MIN: 01-JUNE-2013

MAX: 06-JUNE-2013

MIN: 07-JUNE-2013

MAX: 12-JUNE-2013

MIN: 13-JUNE-2013

MAX: 18-JUNE-2013

MIN: 19-JUNE-2013

MAX: 24-JUNE-2013

Sorted By Date

Benefit #1: Amazon Redshift is fastSort Keys and Zone Maps

Page 12: (DAT201) Introduction to Amazon Redshift

Benefit #1: Amazon Redshift is fast

Parallel and Distributed

Query

Load

Export

Backup

Restore

Resize

Page 13: (DAT201) Introduction to Amazon Redshift

ID Name

1 John Smith

2 Jane Jones

3 Peter Black

4 Pat Partridge

5 Sarah Cyan

6 Brian Snail

1 John Smith

4 Pat Partridge

2 Jane Jones

5 Sarah Cyan

3 Peter Black

6 Brian Snail

Benefit #1: Amazon Redshift is fast

Distribution Keys

Page 14: (DAT201) Introduction to Amazon Redshift

Benefit #1: Amazon Redshift is fast

H/W optimized for I/O intensive workloads, 4GB/sec/node

Enhanced networking, over 1M packets/sec/node

Choice of storage type, instance size

Regular cadence of auto-patched improvements

Example: Our new Dense Storage (HDD) instance type

Improved memory 2x, compute 2x, disk throughput 1.5x

Cost: same as our prior generation !

Page 15: (DAT201) Introduction to Amazon Redshift

Benefit #2: Amazon Redshift is inexpensive

DS2 (HDD)Price Per Hour for

DW1.XL Single Node

Effective Annual

Price per TB compressed

On-Demand $ 0.850 $ 3,725

1 Year Reservation $ 0.500 $ 2,190

3 Year Reservation $ 0.228 $ 999

DC1 (SSD)Price Per Hour for

DW2.L Single Node

Effective Annual

Price per TB compressed

On-Demand $ 0.250 $ 13,690

1 Year Reservation $ 0.161 $ 8,795

3 Year Reservation $ 0.100 $ 5,500

Pricing is simple

Number of nodes x price/hour

No charge for leader node

No up front costs

Pay as you go

Page 16: (DAT201) Introduction to Amazon Redshift

Benefit #3: Amazon Redshift is fully managed

Continuous/incremental backups

Multiple copies within cluster

Continuous and incremental backups

to S3

Continuous and incremental backups

across regions

Streaming restore

Amazon S3

Amazon S3

Region 1

Region 2

Page 17: (DAT201) Introduction to Amazon Redshift

Benefit #3: Amazon Redshift is fully managed

Amazon S3

Amazon S3

Region 1

Region 2

Fault tolerance

Disk failures

Node failures

Network failures

Availability Zone/Region level disasters

Page 18: (DAT201) Introduction to Amazon Redshift

Benefit #4: Security is built-in

• Load encrypted from S3

• SSL to secure data in transit

• ECDHE perfect forward security

• Amazon VPC for network isolation

• Encryption to secure data at rest

• All blocks on disks & in Amazon S3 encrypted

• Block key, Cluster key, Master key (AES-256)

• On-premises HSM & AWS CloudHSM support

• Audit logging and AWS CloudTrail integration

• SOC 1/2/3, PCI-DSS, FedRAMP, BAA

10 GigE

(HPC)

Ingestion

Backup

Restore

Customer VPC

Internal

VPC

JDBC/ODBC

Page 19: (DAT201) Introduction to Amazon Redshift

Benefit #5: We innovate quickly

Well over 100 new features added since launch

Release every two weeks

Automatic patching

Service Launch (2/14)

PDX (4/2)

Temp Credentials (4/11)

DUB (4/25)

SOC1/2/3 (5/8)

Unload Encrypted Files

NRT (6/5)

JDBC Fetch Size (6/27)

Unload logs (7/5)

SHA1 Builtin (7/15)

4 byte UTF-8 (7/18)

Sharing snapshots (7/18)

Statement Timeout (7/22)

Timezone, Epoch, Autoformat (7/25)

WLM Timeout/Wildcards (8/1)

CRC32 Builtin, CSV, Restore Progress (8/9)

Resource Level IAM (8/9)

PCI (8/22)

UTF-8 Substitution (8/29)

JSON, Regex, Cursors (9/10)

Split_part, Audit tables (10/3)

SIN/SYD (10/8)

HSM Support (11/11)

Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit

Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS

Alerts, Cross Region Backup (11/13)

Distributed Tables, Single Node Cursor Support, Maximum Connections to 500

(12/13)

EIP Support for VPC Clusters (12/28)

New query monitoring system tables and diststyle all (1/13)

Redshift on DW2 (SSD) Nodes (1/23)

Compression for COPY from SSH, Fetch size support for single node clusters, new

system tables with commit stats, row_number(), strotol() and query

termination (2/13)

Resize progress indicator & Cluster Version (3/21)

Regex_Substr, COPY from JSON (3/25)

50 slots, COPY from EMR, ECDHE ciphers (4/22)

3 new regex features, Unload to single file, FedRAMP(5/6)

Rename Cluster (6/2)

Copy from multiple regions, percentile_cont, percentile_disc (6/30)

Free Trial (7/1)

pg_last_unload_count (9/15)

AES-128 S3 encryption (9/29)

UTF-16 support (9/29)

Page 20: (DAT201) Introduction to Amazon Redshift

Benefit #6: Amazon Redshift is powerful

• Approximate functions

• User defined functions

• Machine Learning

• Data Science

Amazon ML

Page 21: (DAT201) Introduction to Amazon Redshift

Benefit #7: Amazon Redshift has a large ecosystem

Data Integration Systems IntegratorsBusiness Intelligence

Page 22: (DAT201) Introduction to Amazon Redshift

Benefit #8: Service oriented architecture

DynamoDB

EMR

S3

EC2/SSH

RDS/Aurora

Amazon

Redshift

Amazon Kinesis

Machine

Learning

Data Pipeline

CloudSearch

Mobile

Analytics

Page 23: (DAT201) Introduction to Amazon Redshift

Use cases

Page 24: (DAT201) Introduction to Amazon Redshift

Analyzing Twitter Firehose

Page 25: (DAT201) Introduction to Amazon Redshift

Amazon

Redshift

Starts at

$0.25/hour

EC2

Starts at

$0.02/hour

S3

$0.030/GB-Mo

Amazon Glacier

$0.010/GB-Mo

Amazon Kinesis

$0.015/shard 1MB/s in; 2MB/out

$0.028/million puts

Analyzing Twitter Firehose

Page 26: (DAT201) Introduction to Amazon Redshift

500MM tweets/day = ~ 5,800 tweets/sec

2k/tweet is ~12MB/sec (~1TB/day)

$0.015/hour per shard, $0.028/million PUTS

Amazon Kinesis cost is $0.765/hour

Amazon Redshift cost is $0.850/hour (for a 2TB node)

S3 cost is $1.28/hour (no compression)

Total: $2.895/hour

Data warehouses

can be

inexpensive

and

powerful

Page 27: (DAT201) Introduction to Amazon Redshift

Use only the services you need

Scale only the services you need

Pay for what you use

~40% discount with 1 year commitment

~70% discounts with 3 year commitment

Data warehouses

can be

inexpensive

and

powerful

Page 28: (DAT201) Introduction to Amazon Redshift

Amazon.com – Weblog analysis

Web log analysis for Amazon.com

1PB+ workload, 2TB/day, growing 67% YoY

Largest table: 400 TB

Want to understand customer behavior

Solution

Legacy DW—query across 1 week/hr.

Hadoop—query across 1 month/hr.

Page 29: (DAT201) Introduction to Amazon Redshift

Query 15 months of data (1PB) in 14 minutes

Load 5B rows in 10 minutes

21B rows joined with 10B rows – 3 days (Hive) to 2 hours

Load pipeline: 90 hours (Oracle) to 8 hours

64 clusters

800 total nodes

13PB provisioned storage

2 DBAs

Data warehouses

can be

fast

and

simple

Page 30: (DAT201) Introduction to Amazon Redshift

Petabytes of data generated

by many cell phone towers

Hard to scale, expensive

Needed a secure scalable

system that can work with on

premises

NTT Docomo – Mobile usage analysis

Data

Source

ET

Direct

Connect

Client

Forwarder

LoaderState

Management

SandboxRedshift

S3

Page 31: (DAT201) Introduction to Amazon Redshift

High speed redundant direct connect lines

Load billions of rows in minutes

All data in private VPC

All data encrypted with private on-premises hardware keys

Encryption of data, transport, backups, partial spills

Audit of all SQL actions

Audit of all configuration changes

The cloud

can be made

more secure than

on premises

Page 32: (DAT201) Introduction to Amazon Redshift

Sushiro – Real-time streaming from IoT & analysis

Page 33: (DAT201) Introduction to Amazon Redshift

Sushiro – Real-time streaming & analysisReal-time data ingested by Amazon Kinesis is analyzed in Amazon Redshift

380 stores stream live data from

Sushi plates

Inventory information combined

with consumption information

near real-time

Forecast demand by store,

minimize food waste, and

improve efficiencies

Amazon

Page 34: (DAT201) Introduction to Amazon Redshift

Big data does not mean batch

Can be streamed in

Can be processed in near real time

Can be used to respond quickly to requests

You can mix and match

On premises and cloud

Custom development and managed services

Infrastructure with managed scaling, security

Data warehouses

can support

real-time data

Page 35: (DAT201) Introduction to Amazon Redshift

In sum…

Amazon Redshift: Spend time with your data, not your database

Page 36: (DAT201) Introduction to Amazon Redshift
Page 37: (DAT201) Introduction to Amazon Redshift

Europe: 67.3M

Greater China: 27.5M

Middle East & Africa: 81.7M

Asia-Pacific: 81.7M

Latin America: 43.4M

Page 38: (DAT201) Introduction to Amazon Redshift

Our Data

Page 39: (DAT201) Introduction to Amazon Redshift

Our data

100s of TBs in Data Warehouses

2012 2013 2014 2015

>100% Year over Year Data Growth

Page 40: (DAT201) Introduction to Amazon Redshift

The legacy

Vertica Reporting

Content Presentation

Source DBs

3rd Party Data

Log Data

A B

Testing

Page 41: (DAT201) Introduction to Amazon Redshift

Pain points

Fire Fights

Query Traffic Jams

Processing Windows

Scaling

Page 42: (DAT201) Introduction to Amazon Redshift

Adopting cloud strategies

Amazon Redshift Instances

Reporting

Content Presentation

A B

Testing

Source DBs

3rd Party Data

Log Data

Page 43: (DAT201) Introduction to Amazon Redshift

On-demand breakdown

Only when needed

Ephemeral Processing

Up during business hours

Always Up

Page 44: (DAT201) Introduction to Amazon Redshift

Benefits to the data team

Processing Windows

Fire Fights

Scaling Number of

Clusters

Scaling the Size of

Clusters

Page 45: (DAT201) Introduction to Amazon Redshift

DOH!

Reserved Instances

Automated vs. Manual Backups

Automated Cluster Shut Down

Sort/Distribution Keys

For Joins

Page 46: (DAT201) Introduction to Amazon Redshift

Benefits to the business

50% Reduced time on administration

$0 Licensing50% cost reduction for instances

100% Growth of Internal Customers

Page 47: (DAT201) Introduction to Amazon Redshift

Q&A

Page 48: (DAT201) Introduction to Amazon Redshift

Thank you!

Page 49: (DAT201) Introduction to Amazon Redshift

Remember to complete

your evaluations!

Page 50: (DAT201) Introduction to Amazon Redshift

Related SessionsHear from other customers discussing their Amazon Redshift use cases:

• DAT308—How Yahoo! Analyzes Billions of Events with Amazon Redshift (Yahoo)

• ISM303—Migrating Your Enterprise Data Warehouse to Amazon Redshift (Boingo Wireless

and Edmunds)

• ARC303—Pure Play Video OTT: A Microservices Architecture in the Cloud (Verizon)

• ARC305—Self-Service Cloud Services: How J&J Is Managing AWS at Scale for Enterprise

Workloads

• BDT306—The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with

AWS

• DAT311—Large-Scale Genomic Analysis with Amazon Redshift (Human Longevity)

• BDT314—Running a Big Data and Analytics Application on Amazon EMR and Amazon

Redshift with a Focus on Security (Nasdaq)

• BDT316—Offloading ETL to Amazon Elastic MapReduce (Amgen)

• BDT401—Amazon Redshift Deep Dive (TripAdvisor)