Getting Started with Amazon Redshift
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ian Meyers, Principal Solution Architect, AWS
July 7th, 2016
Getting Started with
Amazon Redshift
[Slide: the AWS analytics portfolio, from Collect to Store to Analyze. Collect: Amazon Kinesis Streams, Amazon Kinesis Firehose, AWS Import/Export, AWS Direct Connect, AWS Database Migration Service. Store: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Aurora. Analyze: Amazon Redshift, Amazon EMR, Amazon EC2, Amazon Machine Learning, Amazon Elasticsearch Service, Amazon CloudSearch, Amazon QuickSight. Plus AWS Data Pipeline and Amazon CloudWatch.]
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper
The Amazon Redshift view of data warehousing
Enterprise: 10x cheaper, easy to provision, higher DBA productivity
Big data: 10x faster, no programming, easily leverage BI tools, Hadoop, machine learning, streaming
SaaS: analysis inline with process flows, pay as you go and grow as you need, managed availability and disaster recovery
Selected Amazon Redshift customers
Amazon Redshift architecture
Leader node
Simple SQL endpoint
Stores metadata
Optimizes query plan
Coordinates query execution
Compute nodes
Local columnar storage
Parallel/distributed execution of all queries, loads,
backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS2: HDD; scale from 2 TB to 2 PB
[Diagram: JDBC/ODBC client access through the leader node, 10 GigE (HPC) interconnect between compute nodes, ingestion/backup/restore path]
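Each compute node is divided into slices that work on queries and loads in parallel. To see that layout on a running cluster, the STV_SLICES system view maps slices to nodes; a minimal sketch:

-- One row per slice, showing which compute node owns it
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;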
Benefit #1: Amazon Redshift is fast
Parallel and distributed
Query
Load
Export
Backup
Restore
Resize
Benefit #1: Amazon Redshift is fast
Dense Storage DS2 (HDD) instance type
Improved memory 2x, compute 2x, disk throughput 1.5x
Cost: Same as our prior generation DS1!
Performance improvement: 50%
Enhanced I/O and commit improvements (Jan ’16)
Reduce amount of time to commit data
Performance improvement: 35%
Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD), price per hour for a DW1.XL single node and effective annual price per TB (compressed):
  On-demand            $0.850/hr    $3,725/TB/year
  1-year reservation   $0.500/hr    $2,190/TB/year
  3-year reservation   $0.228/hr    $999/TB/year

DC1 (SSD), price per hour for a DW2.L single node and effective annual price per TB (compressed):
  On-demand            $0.250/hr    $13,690/TB/year
  1-year reservation   $0.161/hr    $8,795/TB/year
  3-year reservation   $0.100/hr    $5,500/TB/year
Pricing is simple
Number of nodes x price/hour
No charge for leader node
No upfront costs
Pay as you go
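As a worked example using the DS2 on-demand figures above: a single node at $0.850/hour costs about $0.850 x 8,760 hours, roughly $7,446 per year; spread over its roughly 2 TB of compressed storage, that is about $3,723 per TB per year, which is where the effective annual price in the table comes from.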
Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups
to Amazon S3
Continuous and incremental backups
across regions
Streaming restore
[Diagram: backups stream to Amazon S3 in the cluster's region (Region 1) and are replicated to Amazon S3 in a second region (Region 2)]
Benefit #3: Amazon Redshift is fully managed
Fault tolerance
Disk failures
Node failures
Network failures
Availability Zone/region level disasters
Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
ECDHE perfect forward secrecy
• Amazon VPC for network isolation
• Encryption to secure data at rest
All blocks on disks and in S3 encrypted
Block key, cluster key, master key (AES-256)
On-premises HSM & AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
[Diagram: the cluster runs in an internal VPC alongside the customer VPC; JDBC/ODBC access, 10 GigE (HPC) interconnect, and ingestion, backup and restore paths]
Benefit #5: We innovate quickly
Well over 100 new features added since launch
Release every two weeks
Automatic patching
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
DUB (4/25)
SOC1/2/3 (5/8)
Unload Encrypted Files
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
SHA1 Builtin (7/15)
4 byte UTF-8 (7/18)
Sharing snapshots (7/18)
Statement Timeout (7/22)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress (8/9)
Resource Level IAM (8/9)
PCI (8/22)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit
Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS
Alerts, Cross Region Backup (11/13)
Distributed Tables, Single Node Cursor Support, Maximum Connections to 500
(12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, fetch size support for single-node clusters, new
system tables with commit stats, row_number(), strtol() and query
termination (2/13)
Resize progress indicator & Cluster Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50 slots, COPY from EMR, ECDHE ciphers (4/22)
3 new regex features, Unload to single file, FedRAMP (5/6)
Rename Cluster (6/2)
Copy from multiple regions, percentile_cont, percentile_disc (6/30)
Free Trial (7/1)
pg_last_unload_count (9/15)
AES-128 S3 encryption (9/29)
UTF-16 support (9/29)
Benefit #6: Amazon Redshift has a large ecosystem
Data integration, business intelligence, and systems integration partners
Getting started
Enter cluster details
Select node configuration
Select security settings and provision
Point-and-click resize
Resize while remaining online
Provision a new cluster in the background
Copy data in parallel from node to node
You are only charged for the source cluster
Data modeling
3 Important Details…
Column Encoding: applied automatically on the first data load; ensure the correct encoding is used, and periodically revisit encodings in case of change.
Data Distribution: even, key-based, or replicated distribution of data is available; focus on co-location of data to limit network transfer, and view network transfer information in the explain plan.
Data Sorting: compound (default) sort keys for predictable query patterns; interleaved sort keys for tables that can be queried in any way.
Unsorted table: block date ranges overlap (for example 01-JUNE to 20-JUNE, 08-JUNE to 30-JUNE, 12-JUNE to 20-JUNE, 02-JUNE to 25-JUNE-2013).
Sorted by date: block ranges are disjoint (01-JUNE to 06-JUNE, 07-JUNE to 12-JUNE, 13-JUNE to 18-JUNE, 19-JUNE to 24-JUNE-2013).
Columnar Encoding
Dramatically less I/O
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
Hardware optimized for I/O-intensive workloads, 4 GB/sec/node
Enhanced networking, over 1 million packets/sec/node
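Zone maps, the per-block min/max values mentioned above, can be inspected through Redshift's system tables; a minimal sketch (the table name 'logs' is a placeholder):

-- Per-block zone maps (min/max values) for one table
SELECT b.slice, b.col, b.blocknum, b.minvalue, b.maxvalue
FROM stv_blocklist b
JOIN stv_tbl_perm p
  ON b.tbl = p.id AND b.slice = p.slice
WHERE p.name = 'logs'
ORDER BY b.col, b.blocknum;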
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
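If you create the table yourself instead of letting the first COPY choose encodings automatically, the recommendations above can be applied explicitly in the DDL. A minimal sketch, assuming the TICKIT sample schema column types:

-- LISTING with the encodings suggested by ANALYZE COMPRESSION
CREATE TABLE listing (
    listid         INTEGER       ENCODE delta,
    sellerid       INTEGER       ENCODE delta32k,
    eventid        INTEGER       ENCODE delta32k,
    dateid         SMALLINT      ENCODE bytedict,
    numtickets     SMALLINT      ENCODE bytedict,
    priceperticket DECIMAL(8,2)  ENCODE delta32k,
    totalprice     DECIMAL(8,2)  ENCODE mostly32,
    listtime       TIMESTAMP     ENCODE raw
);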
[Zone map illustration: each block records the minimum and maximum values it contains, e.g. blocks spanning 10 to 324, 375 to 623, and 637 to 959]
Even: data is distributed evenly amongst all compute nodes in round-robin fashion
Key based: data is distributed to compute nodes on the basis of the provided distribution key column from a given record
All: all data is replicated onto each compute node

When to use which type of distribution?
Key based: large fact tables; large dimension tables
All: medium dimension tables (1K–2M rows)
Even: tables with no joins or group bys; small dimension tables (<1,000 rows)
Choosing a good distribution key
• High cardinality: the number of unique values in the distribution key is significantly larger than the number of slices in the cluster
• Low skew (uniform distribution): each unique value in the distribution key is associated with the same number of records in the table
• High entropy: the unique values in the distribution key vary greatly from each other; think GUIDs, not sequential IDs
• Frequently joined to other tables
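To illustrate, here is a minimal sketch using hypothetical sales and dimension tables (not from the deck): distribute the big joined tables on the shared join key and replicate small dimensions.

-- Large fact table: distribute on a high-cardinality, low-skew join column
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- Large dimension joined on the same key: matching rows land on the same slice
CREATE TABLE customer (
    customer_id BIGINT,
    name        VARCHAR(100)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- Small dimension (<1,000 rows): replicate onto every node
CREATE TABLE country (
    country_code CHAR(2),
    country_name VARCHAR(64)
)
DISTSTYLE ALL;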
SELECT COUNT(*) FROM LOGS WHERE MY_DATE = '09-JUNE-2013'
Unsorted table: every block's range (01-JUNE to 20-JUNE, 08-JUNE to 30-JUNE, 12-JUNE to 20-JUNE, 02-JUNE to 25-JUNE-2013) could contain 09-JUNE, so all blocks must be scanned.
Sorted by MY_DATE: only the block covering 07-JUNE to 12-JUNE-2013 can contain the date; the remaining blocks (01 to 06, 13 to 18, 19 to 24 JUNE) are skipped using their zone maps.
Types of Sort Keys
• Compound (default): good for known query patterns; can contain up to 400 columns
• Interleaved: good for unknown query patterns; can contain up to 8 columns; must be maintained during the vacuum phase
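A minimal sketch of both styles, using a hypothetical logs table:

-- Compound sort key: best when queries filter on a known leading column (e.g. date)
CREATE TABLE logs_compound (
    my_date DATE,
    user_id BIGINT,
    message VARCHAR(256)
)
COMPOUND SORTKEY (my_date, user_id);

-- Interleaved sort key: equal weight to each column for unpredictable filters
CREATE TABLE logs_interleaved (
    my_date DATE,
    user_id BIGINT,
    message VARCHAR(256)
)
INTERLEAVED SORTKEY (my_date, user_id);

-- Interleaved keys are maintained with VACUUM REINDEX after loads
VACUUM REINDEX logs_interleaved;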
Getting data in…
Data loading options - Files: flat files from the corporate data center are staged in Amazon S3 and loaded into Amazon Redshift.
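The load itself is the COPY command, which every compute node slice executes in parallel against the S3 prefix. A minimal sketch; the bucket, prefix, IAM role, and file format are placeholders:

-- Parallel load of gzip-compressed, pipe-delimited files from Amazon S3
COPY sales
FROM 's3://my-bucket/data/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
DELIMITER '|'
GZIP
REGION 'eu-west-1';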
Data loading options – ETL Tools: ETL tools extract from source databases in the corporate data center and load the results into Amazon Redshift.
Data loading options - Replication: AWS Database Migration Service replicates source databases from the corporate data center into Amazon Redshift.
Data loading options – Stream Loading: Amazon Kinesis Firehose stages streaming data in Amazon S3 and loads it into Amazon Redshift.
Getting data out…
JDBC/ODBC: Amazon Redshift works with your existing analysis tools.
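For bulk export back to Amazon S3, rather than fetching rows through a driver, UNLOAD writes query results in parallel from the compute nodes. A minimal sketch; bucket, prefix, and role are placeholders:

-- Parallel export of a query result to S3 as gzip-compressed, pipe-delimited files
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2016-01-01''')
TO 's3://my-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
DELIMITER '|'
GZIP;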
Monitor query performance
View explain plans
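Explain plans can also be pulled from any SQL client; the plan shows how rows move between nodes (for example, DS_DIST_NONE when joined tables already share a distribution key). A sketch using the hypothetical tables from the data modeling section:

-- Inspect the query plan, including the join distribution steps
EXPLAIN
SELECT c.name, SUM(s.amount) AS total
FROM sales s
JOIN customer c ON c.customer_id = s.customer_id
GROUP BY c.name;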
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Richard.Williams@MakerBot.com – Senior Engineer, Big Data
7th July 2016, London UK
Customer 360+
Dream Stack: Redshift, Matillion, Python & Tableau
3D Printing
Bot-Farm, Innovation Centre
www.thingiverse.com —> Community
www.makerbot.com/uses/for-educators
Enable – Prosthetics for Kids
MakerBot.com
• MakerBot, a subsidiary of Stratasys Ltd. (Nasdaq: SSYS), is
leading the next industrial revolution by setting the standards in
reliable and affordable desktop 3D-Printing
• Founded in 2009, MakerBot sells desktop 3D-Printers to innovative
and industry-leading customers worldwide, including engineers,
architects, designers, educators and consumers
• Has the largest installed base, and market share, of the desktop
3D-Printing industry
• Runs Thingiverse.com, the largest 3D-Printing Community
• Makes 3D printing easy and accessible for everyone
Thingiverse.com
The 50 Most Influential
Gadgets of All Time
Richard L Williams: ~20 years in Data Warehousing in HK & USA
• discovered unknown author Ralph Kimball
• used Cognos (shipped with VB 4.0) & RedBrick
• eCommerce, Retail, Insurance, Pharma
• Email/Lifecycle Marketing, Campaign Mgt, Actuarial
• Using AWS: 1800-Flowers, BMS, Janssen (J&J), MakerBot
Ecosystem – where’s the data?
Largest table ~130m rows
But most in 100k – 1m range
Tables Slowest to Load:
- Salesforce
- 100-200 columns “wide”
SQL tools:
- DbVisualizer
- SQL Workbench/J
- Aginity (Windows)
MS SQL-Svr on EC2
MySQL as RDS
Cloud apps
Internal web-sites
Desktop s/w
Firmware (on printer) s/w
Dream Stack
Redshift: addresses all the issues in DW; can even do unstructured data!
Matillion: works with Redshift, and fast; Informatica, SnapLogic, Talend do not work with MPP; Hadoop/EMR not necessary
Tableau: power to the users
Python: intuitive, data types, Boto3, libraries, widely used
So what..?
Personally: Career Transformative
- accurately predict effort and time
Manager: very happy
- Quickly build
- Quickly iterate
- “No Limits” –> Roadmap to the Vision
Company: becoming strategic
- Competitive Advantage
AWS Marketplace
Ease of Purchase
Reserved Instances
Demo - Master Class
Deep Copy (see the sketch after this list)
Deep Insert “Waves”
S3 “Trigger” files
Grants on Schemas to Groups
Groups are “roles”, add Users
Revoke on Schema [Public]
Matillion working Schema
Delta’s
Lookup’s
…
..
.
Scripts
Python + Boto(3)
ETL Matillion
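The Deep Copy and schema grant items above are standard Redshift patterns rather than anything MakerBot-specific; a generic, minimal sketch (all table, schema, group, and user names are placeholders):

-- Deep copy: rebuild a table fully sorted by copying it into a new table
CREATE TABLE orders_new (LIKE orders);
INSERT INTO orders_new SELECT * FROM orders;
DROP TABLE orders;
ALTER TABLE orders_new RENAME TO orders;

-- Grants on schemas to groups; groups act as roles that users are added to
CREATE GROUP analysts;
ALTER GROUP analysts ADD USER alice;
GRANT USAGE ON SCHEMA reporting TO GROUP analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO GROUP analysts;

-- Revoke on schema public so users only see what they are explicitly granted
REVOKE ALL ON SCHEMA public FROM PUBLIC;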
Future
I wish I could describe these in more detail, but they are the company's Competitive Advantage.
richard.williams@makerbot.com
Thank you!
aws.amazon.com/big-data
Please remember to rate this
session under My Agenda on
awssummit.london