Building Your Data Warehouse with Redshift
TRANSCRIPT
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon Redshift Deep Dive
Ran Tessler, AWS Solutions Architect
Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via
Amazon S3; load from
Amazon DynamoDB or SSH
• Two hardware platforms
– Optimized for data processing
– Dense Storage: HDD; scale from 2TB to 2PB
– Dense Compute: SSD; scale from 160GB to 326TB
[Architecture diagram: JDBC/ODBC clients connect to the leader node; compute nodes interconnect over 10 GigE (HPC); ingestion, backup, and restore flow through Amazon S3]
• Massively Parallel Processing (MPP)
– Nodes are split into independent slices
– Each slice has a single virtual core, dedicated RAM
and storage
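As a sketch of this layout, the slice-to-node mapping of a running cluster can be inspected from the STV_SLICES system table (assuming a connected Redshift session):

```sql
-- List how slices are spread across compute nodes.
-- Each row maps one slice to the node that owns it.
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;
```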
Amazon Redshift Architecture
[Diagram: a compute node divided into two slices; each slice owns a virtual core, 7.5 GiB of RAM, and a share of the local disk. JDBC/ODBC traffic reaches the leader node; the 10 GigE (HPC) interconnect carries ingestion, backup, and restore traffic]
Amazon Redshift dramatically reduces I/O
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
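The recommendations above can be applied when (re)creating the table. A hedged sketch, assuming the LISTING table from the Redshift sample TICKIT schema and illustrative column types:

```sql
-- Recreate the table with the encodings suggested by ANALYZE COMPRESSION.
CREATE TABLE listing_compressed (
    listid         INTEGER       ENCODE delta,
    sellerid       INTEGER       ENCODE delta32k,
    eventid        INTEGER       ENCODE delta32k,
    dateid         SMALLINT      ENCODE bytedict,
    numtickets     SMALLINT      ENCODE bytedict,
    priceperticket DECIMAL(8,2)  ENCODE delta32k,
    totalprice     DECIMAL(8,2)  ENCODE mostly32,
    listtime       TIMESTAMP     ENCODE raw
);
```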
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Direct-attached storage
• Large data block sizes
• Tracks the minimum and
maximum value for each block
• Skip over blocks that don’t
contain the data needed for a
given query
• Minimize unnecessary I/O
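As an illustrative sketch (assuming a hypothetical SALES table sorted on saletime), a range-restricted predicate is where zone maps pay off: blocks whose stored min/max fall outside the range are never read.

```sql
-- With saletime as the sort key, zone maps let Redshift skip
-- every block whose [min, max] saletime range misses December 2008.
SELECT SUM(qtysold)
FROM sales
WHERE saletime >= '2008-12-01'
  AND saletime <  '2009-01-01';
```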
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
• Use direct-attached storage
to maximize throughput
• Hardware optimized for high
performance data
processing
• Large block sizes to make the
most of each read
• Amazon Redshift manages
durability for you
Data Modeling
Data Distribution
• Data is allocated to slices based on
distribution style
– DISTSTYLE EVEN – Round Robin
– DISTSTYLE KEY – based on the distribution key
hash value
– DISTSTYLE ALL - Replicated to slice 0 on all
nodes
• Query performance considerations
– Uneven distribution harms query performance
– Data redistribution at query time is expensive
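The three styles above are declared at table creation. A minimal sketch with hypothetical table and column names:

```sql
-- DISTSTYLE EVEN: round-robin distribution across slices.
CREATE TABLE events_even (eventid INT, name VARCHAR(200))
  DISTSTYLE EVEN;

-- DISTSTYLE KEY: rows placed by the hash of the distribution key.
CREATE TABLE events_key (eventid INT, name VARCHAR(200))
  DISTSTYLE KEY DISTKEY (eventid);

-- DISTSTYLE ALL: a full copy of the table on slice 0 of every node.
CREATE TABLE dim_date_all (dateid SMALLINT, caldate DATE)
  DISTSTYLE ALL;
```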
[Diagram: two compute nodes with two slices each; a skewed distribution places 5M, 2M, 1M, and 4M records on slices 1–4, illustrating uneven distribution]
Suboptimal Distribution
Default (No Distribution Key, Round-Robin Order)
[Diagram: three compute nodes, slices 1–6. ORDERS and ITEMS rows are scattered round-robin, so Order 1, Item 1.1, Item 1.2, Order 2, Item 2.1, Item 2.2, Order 3, and Item 3.1 land on different slices]
Order 1: Dave Smith, Total $195
Item 1.1: Order 1, Kindle Fire HD 7”, $159
Item 1.2: Order 1, Kindle Fire Case, $36
Optimal Distribution
Customised (ORDERS.ORDER_ID DISTKEY, ITEMS.ORDER_ID DISTKEY)
[Diagram: three compute nodes, slices 1–6. Each order and its items hash to the same slice: Order 1 is co-located with Item 1.1 and Item 1.2, Order 2 with Item 2.1 and Item 2.2, and Order 3 with Item 3.1]
Order 1: Dave Smith, Total $195
Item 1.1: Order 1, Kindle Fire HD 7”, $159
Item 1.2: Order 1, Kindle Fire Case, $36
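A hedged sketch of the co-located layout above, with hypothetical column lists:

```sql
-- Distribute both tables on order_id so each order row and its
-- item rows hash to the same slice, avoiding redistribution on join.
CREATE TABLE orders (
    order_id BIGINT DISTKEY,
    customer VARCHAR(100),
    total    DECIMAL(10,2)
);

CREATE TABLE items (
    item_id  BIGINT,
    order_id BIGINT DISTKEY,
    product  VARCHAR(200),
    price    DECIMAL(10,2)
);

-- This join runs slice-local (DS_DIST_NONE in the EXPLAIN plan).
SELECT o.customer, SUM(i.price)
FROM orders o JOIN items i ON o.order_id = i.order_id
GROUP BY o.customer;
```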
Sorting Table Data
• Sort Keys ≠ Index
– Data is initially written by INSERT/COPY order
– VACUUM sorts the rows and reclaims stale
storage
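A minimal sketch of the two points above, assuming a hypothetical SALES table sorted by sale time:

```sql
-- Declare a compound sort key; rows are stored in COPY/INSERT order
-- until VACUUM physically re-sorts them.
CREATE TABLE sales_sorted (
    saleid   BIGINT,
    saletime TIMESTAMP,
    qtysold  SMALLINT
)
COMPOUND SORTKEY (saletime, saleid);

-- Re-sort rows and reclaim space left by deleted and updated rows.
VACUUM FULL sales_sorted;
```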
Compound Sort Keys Illustrated
Records in Redshift are stored in blocks.
For this illustration, let’s assume that four records fill a block
Records with a given cust_id are all in one block
However, records with a given prod_id are spread across four blocks
With the data sorted by (cust_id, prod_id), each block of four records holds a single cust_id:
cust_id 1 → block 1: [1,1] [1,2] [1,3] [1,4]
cust_id 2 → block 2: [2,1] [2,2] [2,3] [2,4]
cust_id 3 → block 3: [3,1] [3,2] [3,3] [3,4]
cust_id 4 → block 4: [4,1] [4,2] [4,3] [4,4]
([cust_id, prod_id] pairs; other columns omitted)
Interleaved Sort Keys Illustrated
Records with a given
cust_id are spread across
two blocks
Records with a given
prod_id are also spread
across two blocks
Data is sorted in equal
measures for both keys
With the data sorted in equal measure on both keys, each block holds two cust_ids and two prod_ids:
block 1: [1,1] [1,2] [2,1] [2,2]
block 2: [1,3] [1,4] [2,3] [2,4]
block 3: [3,1] [3,2] [4,1] [4,2]
block 4: [3,3] [3,4] [4,3] [4,4]
([cust_id, prod_id] pairs; other columns omitted)
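A sketch of declaring the interleaved layout above, on a hypothetical table:

```sql
-- Give equal weight to both predicate columns; neither key
-- dominates the physical sort order.
CREATE TABLE purchases (
    cust_id INT,
    prod_id INT,
    amount  DECIMAL(10,2)
)
INTERLEAVED SORTKEY (cust_id, prod_id);
```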
Query Optimization
Query Performance
• A good choice of distribution and sort keys speeds query
performance more than any other factor
• Redshift uses a cost-based query optimizer
– Good statistics are VITAL to ensure good performance
– Table constraints, while not enforced, are used to optimize queries
• Run the ANALYZE command to update statistics:
ANALYZE lineitem;
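Declaring (unenforced) constraints gives the planner extra information. A hedged sketch on a hypothetical TPC-H-style LINEITEM table:

```sql
-- Redshift does not enforce these constraints, but the optimizer
-- uses them to choose better plans; keeping them accurate is on you.
ALTER TABLE lineitem ADD PRIMARY KEY (l_orderkey, l_linenumber);
ALTER TABLE lineitem ADD FOREIGN KEY (l_orderkey)
  REFERENCES orders (o_orderkey);

-- Refresh planner statistics after significant data changes.
ANALYZE lineitem;
```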
Query Analysis
• EXPLAIN command followed by the query:
EXPLAIN
select avg(datediff(day, listtime, saletime)) as avgwait
from sales, listing
where sales.listid = listing.listid;
QUERY PLAN
XN Aggregate (cost=6350.30..6350.31 rows=1 width=16)
-> XN Hash Join DS_DIST_NONE (cost=47.08..6340.89 rows=3766 width=16)
Hash Cond: ("outer".listid = "inner".listid)
-> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=12)
-> XN Hash (cost=37.66..37.66 rows=3766 width=12)
-> XN Seq Scan on sales (cost=0.00..37.66 rows=3766 width=12)
• From the EXPLAIN plan you can tell:
– The query execution steps
– Which operation is performed in each step
– Which table is used in each step
– How much data needs to be processed in each step
Query Analysis
• Access the STL_EXPLAIN table for executed queries:
select query, nodeid, parentid, substring(plannode from 1 for 30),
substring(info from 1 for 20) from stl_explain
where query=10 order by 1,2;
query | nodeid | parentid | substring | substring
------+--------+----------+---------------------+---------------------
10 | 1 | 0 | XN Aggregate (cost=6350.30... |
10 | 2 | 1 | -> XN Merge Join DS_DIST_NO | Merge Cond: ("outer"
10 | 3 | 2 | -> XN Seq Scan on lis |
10 | 4 | 2 | -> XN Seq Scan on sal |
• SVL_QUERY_SUMMARY and SVL_QUERY_REPORT for finer details
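As a sketch, step-level metrics for a finished query can be pulled from SVL_QUERY_SUMMARY (reusing query ID 10 from the example above):

```sql
-- Per-step row counts and byte volumes for an executed query.
SELECT stm, seg, step, label, rows, bytes
FROM svl_query_summary
WHERE query = 10
ORDER BY stm, seg, step;
```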
Query Analysis
• Explain plans and performance metrics are also available via
the console:
Query Analysis
• Explain Plan Visualization is now also available
Amazon Redshift
Spend time with your data, not your database….
Thank You