AWS Webcast - Amazon Redshift Best Practices for Data Loading and Query Performance

Amazon Redshift Best Practices Part 1 April 2013 Vidhya Srinivasan & David Pearson


DESCRIPTION

Loading very large data sets can take a long time and consume a lot of computing resources. How data is loaded can also affect query performance. We will discuss best practices for loading data efficiently using COPY commands, bulk inserts, and staging tables. We will also cover the key design decisions that heavily influence overall query performance. These design choices also have a significant effect on storage requirements, which in turn affects query performance: a smaller storage footprint reduces the number of I/O operations and minimizes the memory required to process queries.

TRANSCRIPT

Page 1: AWS Webcast - Amazon Redshift Best Practices for Data Loading and Query Performance

Amazon Redshift Best Practices –

Part 1

April 2013

Vidhya Srinivasan & David Pearson

Page 2:

Agenda

• Introduction

• Redshift cluster architecture

• Best practices for:

Data loading

Key selection

Querying

WLM

• Q&A

Page 3:

AWS Database Services

• Amazon DynamoDB: fast, predictable, highly scalable NoSQL data store

• Amazon RDS: managed relational database service for MySQL, Oracle, and SQL Server

• Amazon ElastiCache: in-memory caching service

• Amazon Redshift: fast, powerful, fully managed, petabyte-scale data warehouse service

[Diagram: AWS database services within the AWS platform stack (AWS Global Infrastructure; Compute, Storage, Networking; Database; Application Services; Deployment & Administration); scalable, high-performance application storage in the cloud]

Page 4:

Objectives: design and build a petabyte-scale data warehouse service

Amazon Redshift

• A whole lot simpler

• A lot cheaper

• A lot faster

Page 5:

Redshift Dramatically Reduces I/O

• Direct-attached storage

• Large data block sizes

• Columnar storage

• Data compression

• Zone maps

Id   Age  State
123  20   CA
345  25   WA
678  40   FL

[Illustration: the same table laid out as row storage vs. column storage]

Page 6:

Redshift Runs on Optimized Hardware

• Optimized for I/O intensive workloads

• HS1.8XL available on Amazon EC2

• Runs in HPC - fast network

• High disk density

HS1.8XL: 128GB RAM, 16 Cores, 24 Spindles, 16TB Storage, 2GB/sec scan rate

HS1.XL: 16GB RAM, 2 Cores, 3 Spindles, 2TB Storage

Clusters scale to 1.6 PB of storage

Page 7:

Data volume

[Chart: data generated grows much faster than data available for analysis; the gap is driven by cost + effort]

Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Page 8:

Redshift is Priced to Analyze All Your Data

$0.85 per hour for on-demand (2TB)

$999 per TB per year (3-yr reservation)

Page 9:

Amazon Redshift Architecture

• Leader node: SQL endpoint (JDBC/ODBC); Postgres based; stores metadata; communicates with the client; compiles queries; coordinates query execution

• Compute nodes: local, columnar storage; execute queries in parallel across slices; connected over a 10 GigE (HPC) network; ingestion, backup, and restore go through Amazon S3

• Everything is mirrored

Page 10:

Ingestion – Best Practices

• Goal: with 1 leader node and n compute nodes, leverage all the compute nodes and minimize overhead

• Best practices: the preferred method is COPY from S3, which loads data in sorted order through the compute nodes; use a single COPY command but split the data into multiple files; we strongly recommend gzipping large datasets

• If you must ingest through SQL: use multi-row inserts and avoid large numbers of singleton insert/update/delete operations

• To copy from another table: CREATE TABLE AS or INSERT INTO ... SELECT

insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);

copy time from 's3://mybucket/data/timerows.gz'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip delimiter '|';

Page 11:

Ingestion – Best Practices (Cont’d)

• Verifying load data files: for US East, S3 provides eventual consistency

• Verify files are in S3 by listing object keys

• Query Redshift after the load. This query returns entries for loading the tables in the TICKIT database:

select query, trim(filename), curtime, status
from stl_load_commits
where filename like '%tickit%' order by query;

query |           btrim           |          curtime           | status
------+---------------------------+----------------------------+-------
22475 | tickit/allusers_pipe.txt  | 2013-02-08 20:58:23.274186 | 1
22478 | tickit/venue_pipe.txt     | 2013-02-08 20:58:25.070604 | 1
22480 | tickit/category_pipe.txt  | 2013-02-08 20:58:27.333472 | 1
22482 | tickit/date2008_pipe.txt  | 2013-02-08 20:58:28.608305 | 1
22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489  | 1
22487 | tickit/listings_pipe.txt  | 2013-02-08 20:58:37.632939 | 1
22593 | tickit/allusers_pipe.txt  | 2013-02-08 21:04:08.400491 | 1
22596 | tickit/venue_pipe.txt     | 2013-02-08 21:04:10.056055 | 1
22598 | tickit/category_pipe.txt  | 2013-02-08 21:04:11.465049 | 1
22600 | tickit/date2008_pipe.txt  | 2013-02-08 21:04:12.461502 | 1
22603 | tickit/allevents_pipe.txt | 2013-02-08 21:04:14.785124 | 1

Page 12:

Ingestion – Best Practices (Cont’d)

• Redshift does not currently support an upsert statement. Use a staging table to perform an upsert by joining the staging table with the target: update first, then insert

• Redshift does not currently enforce primary key constraints; if you COPY the same data twice, it will be duplicated

• Increase the memory available to a COPY or VACUUM by increasing wlm_query_slot_count: set wlm_query_slot_count to 3;

• Run the ANALYZE command whenever you've made a non-trivial number of changes to your data, to ensure your table statistics are current

• The Amazon Redshift system table STL_LOAD_ERRORS can be helpful in troubleshooting data load issues: it records the errors that occurred during specific loads. Adjust the COPY MAXERROR option as needed

• Check the character set: UTF-8 characters up to 3 bytes long are supported

• View the console for errors
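The staging-table upsert described above (update, then insert) can be sketched as follows; the table and column names (sales, sales_staging, id, amount) are illustrative, not from the deck:

```sql
-- Hedged sketch: merge a hypothetical staging table into its target,
-- updating matching rows first and then inserting the new ones.
begin transaction;

-- Update rows that already exist in the target
update sales
set amount = s.amount
from sales_staging s
where sales.id = s.id;

-- Insert rows that are not yet in the target
insert into sales
select s.*
from sales_staging s
left join sales t on s.id = t.id
where t.id is null;

end transaction;

drop table sales_staging;
```

Wrapping both statements in one transaction keeps readers from seeing a half-applied merge.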

Page 13:

Console

Page 14:

Choose a Sort key

• Goal: skip over data blocks to minimize I/O

• Best practice: sort on columns used in range or equality predicates (WHERE clause)

If you access recent data frequently, sort on the TIMESTAMP column
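A hedged sketch of this advice as DDL; the table and column names are illustrative:

```sql
-- Sorting on the timestamp column lets zone maps skip blocks whose
-- min/max values fall outside a queried time range.
create table events (
  event_id   bigint,
  event_time timestamp,
  payload    varchar(256)
)
sortkey (event_time);

-- A range predicate on the sort column can then skip most blocks:
-- select count(*) from events where event_time > '2013-03-01';
```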

Page 15:

Choose a Distribution Key

• Goal: distribute data evenly across nodes; minimize data movement among nodes (co-located joins and co-located aggregates)

• Best practice: consider using the join key as the distribution key (JOIN clause); if there are multiple joins, use the foreign key of the largest dimension as the distribution key; consider using the GROUP BY column as the distribution key (GROUP BY clause)

• Avoid using a column that appears as an equality filter as your distribution key

• For de-normalized tables with no aggregates, do not specify a distribution key; Redshift will use round robin distribution
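Distributing two tables on their shared join key can be sketched like this; the table and column names are illustrative:

```sql
-- Hedged sketch: both tables are distributed on the join key, so matching
-- rows land on the same slice and the join needs no data movement.
create table orders (
  order_id    bigint,
  customer_id bigint distkey,
  total       decimal(12,2)
);

create table customers (
  customer_id bigint distkey,
  name        varchar(64)
);

-- This join is co-located:
-- select c.name, sum(o.total)
-- from orders o join customers c on o.customer_id = c.customer_id
-- group by c.name;
```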

Page 16:

Distribution Key – Verify Data Skew

Check the data distribution

select slice, col, num_values, minvalue, maxvalue
from svv_diskusage
where name='users' and col=0
order by slice, col;

slice | col | num_values | minvalue | maxvalue
------+-----+------------+----------+---------
    0 |   0 |      12496 |        4 |    49987
    1 |   0 |      12498 |        1 |    49988
    2 |   0 |      12497 |        2 |    49989
    3 |   0 |      12499 |        3 |    49990

Page 17:

Example

-- Total Produce sold in Washington in January 2013

SELECT SUM(S.Price * S.Quantity)
FROM SALES S
JOIN CATEGORY C ON C.ProductId = S.ProductId
JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
WHERE C.CategoryId = 'Produce' AND F.State = 'WA'
AND S.Date BETWEEN '1/1/2013' AND '1/31/2013';

Sort key (S) = Date
Dist key (S) = ProductID
Dist key (C) = ProductID
Dist key (F) = FranchiseID

Page 18:

Query Performance – Best Practices

• Encode date and time using the TIMESTAMP data type instead of CHAR

• Specify constraints: Redshift does not enforce constraints (primary key, foreign key, unique values), but the optimizer uses them; loading processes and/or applications need to be aware of this

• Specify a redundant predicate on the sort column:

SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > '1/1/2013'
AND tab2.timestamp > '1/1/2013';

• WLM settings
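The "specify constraints" advice above can be expressed at table creation time; a hedged sketch with illustrative table and column names:

```sql
-- Hedged sketch: these constraints are NOT enforced by Redshift, but the
-- optimizer uses them for planning. The load process must itself guarantee
-- that customer_id is unique, or queries may return unexpected results.
create table customers (
  customer_id bigint primary key,
  email       varchar(128) unique,
  region_id   int references regions(region_id)
);
```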

Page 19:

Workload Manager

• Allows you to manage and adjust query concurrency

• WLM allows you to: increase query concurrency up to 15; define user groups and query groups; segregate short- and long-running queries; help improve performance of individual queries

• Be aware: the query workload is distributed to every compute node, so increasing concurrency may not always help due to resource contention (CPU, memory, and I/O)

Total throughput may increase by letting one query complete first and having other queries wait

Page 20:

Workload Manager

• Default: 1 queue with a concurrency of 5

• Define up to 8 queues with a total concurrency of 15

• Redshift also maintains an internal superuser queue
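Queries can be routed to a WLM queue by query group, and a single query can temporarily claim extra slots in its queue; a hedged sketch (the 'etl' group name and sales table are assumptions):

```sql
-- Route subsequent queries in this session to the queue configured
-- for the (hypothetical) 'etl' query group
set query_group to 'etl';

-- Claim 3 of the queue's slots for a memory-hungry maintenance operation
set wlm_query_slot_count to 3;
vacuum sales;

-- Revert to defaults
reset wlm_query_slot_count;
reset query_group;
```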

Page 21:

Summary

• Avoid large numbers of singleton DML statements if possible

• Use COPY for uploading large datasets

• Choose Sort and Distribution keys with care

• Encode date and time with the TIMESTAMP data type

• Experiment with WLM settings

Page 22:

More Information

Best Practices for Designing Tables http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html

Best Practices for Data Loading http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html

View the Redshift Developer Guide at:

http://aws.amazon.com/documentation/redshift/

Page 23:

Questions?