Migration to Redshift from SQL Server

SQL Server to Redshift

Uploaded by: joeharris76

Posted on 15-Jan-2015

Page 1: Migration to Redshift from SQL Server

SQL Server to Redshift

Background

RealityMine provides digital behaviour analytics.

Our applications passively measure the activity of opt-in users on all digital platforms.

The analysis could be focused on:
• how to direct marketing
• how to direct product development
• questioning individuals who undertake certain behaviour patterns

Starting State

• SQL Server DW on in-house server
• SQL Server 2008 R2 Enterprise Edition
• Single 4 core (8 thread) i7 w/ 16GB RAM
• 2 × 960GB PCIe SSDs for DBs
• 1 × 240GB PCIe SSD for TempDb

SQL Server to Redshift - @joeharris76

Data Environment

• ~20 billion rows in active use
• Largest table is also the widest
• Volume is more than doubling annually
• Data is in many languages
• Starts as JSON, ends as Star Schema DW

Pain Points

• Biggest cost is SQL Server license
• Biggest bottleneck is single-threaded performance
• Hand tuning needed to push CPU / disks
• SSD reliability is not perfect
• SSD performance degrades over time

Why Redshift

• Vertica wanted £45k per terabyte
• 16 SQL Server Enterprise cores cost even more!
• Teradata, Netezza, etc. don't want <5TB sales
• SAP HANA not viable for this volume on AWS
• Infobright does not support incremental loads
• Hadoop/Impala slow & requires lots of learning

Data Processing Approach

• No ETL tool truly supports Redshift
  – Requirement to load from S3 is a killer
  – Tried SSIS, Pentaho, Talend and others
• You're stuck with ELT
  – Load data then transform as needed
  – Keep data as raw as possible from source

War of Encodings

The road to heaven goes through ÜÑÎÇØDÈ hell

Redshift: UTF-8 Only

• Redshift has zero tolerance for certain chars
  – NUL/0x00 => Treated as EOR (end of record), documented
  – DEL/0x7F => Treated as EOR, undocumented
  – 0xBFEFEF => UTF-8 spec "guaranteed non-char"
  – These must be removed before loading data
• Other control characters can be loaded by escaping
  – You cannot escape a single column, all or nothing
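
The removal step described above could be sketched in Python roughly as follows. This is a minimal illustration, not the author's actual process. It assumes the text has already been decoded to a Python string, and it assumes the "guaranteed non-char" on the slide refers to the Unicode noncharacter U+FFFF (UTF-8 bytes EF BF BF); U+FFFE is dropped too on the same reasoning. The name `strip_problem_chars` is hypothetical.

```python
# Code points to delete before loading.
# Assumption: the slide's "guaranteed non-char" means the Unicode
# noncharacter U+FFFF; U+FFFE is included for the same reason.
PROBLEM_CHARS = {
    0x00: None,    # NUL, treated as end of record
    0x7F: None,    # DEL, treated as end of record
    0xFFFE: None,  # Unicode noncharacter
    0xFFFF: None,  # Unicode noncharacter
}


def strip_problem_chars(text: str) -> str:
    """Return text with the problem characters removed.

    str.translate deletes every code point that maps to None and
    leaves all other characters, including non-ASCII ones, untouched.
    """
    return text.translate(PROBLEM_CHARS)


if __name__ == "__main__":
    dirty = "caf\u00e9\x00|bar\x7f|baz\uffff"
    print(strip_problem_chars(dirty))
```

Deleting rather than escaping keeps the rest of the file loadable without turning on escaping for every column.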

SQL Server: UTF-16LE Only

• NVARCHAR takes 2x as much space as a VARCHAR
• Makes functions consistent across ASCII & Unicode
  – N/VARCHAR(32) = 32 chars / Redshift = 32 bytes
• SQL Server tolerates anything in character columns
• Input and output is not sanitized against UTF-16 spec
  – Invalid or "guaranteed non-chars" are stored as is
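
The chars-versus-bytes difference matters when sizing Redshift columns, since a value that fits a 32-character column may need more than 32 bytes once encoded as UTF-8. A small Python sketch of the arithmetic follows. The function names are hypothetical, and Python's `len` counts code points, which matches SQL Server's character count only for characters outside the supplementary planes.

```python
def storage_sizes(text: str) -> dict:
    """Compare character count with encoded size in both encodings."""
    return {
        "chars": len(text),
        "utf16le_bytes": len(text.encode("utf-16-le")),  # SQL Server NVARCHAR
        "utf8_bytes": len(text.encode("utf-8")),         # Redshift VARCHAR
    }


def fits_redshift_varchar(text: str, declared_bytes: int) -> bool:
    """True if the UTF-8 encoding of text fits in VARCHAR(declared_bytes)."""
    return len(text.encode("utf-8")) <= declared_bytes


if __name__ == "__main__":
    print(storage_sizes("caf\u00e9"))
    # 32 accented characters need 64 bytes in UTF-8.
    print(fits_redshift_varchar("\u00e9" * 32, 32))
```

One practical consequence is that column lengths copied straight from the SQL Server schema may be too small for non-ASCII data.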

SQL Extract: The Hard Way

• BCP is the "standard" way to extract data
• Using BCP your process looks something like this:
  – Extract data as a huge UTF-16LE file using bcp
  – Convert to a new UTF-8 file using iconv
  – Remove or escape problem chars using sed
  – Compress the final file using gzip
  – All steps are heavily constrained by disk speed
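
For comparison, the convert, clean and compress steps after the bcp extract can be collapsed into a single streaming pass, which avoids writing the intermediate files. The Python sketch below stands in for the iconv, sed and gzip steps; it is not what the deck used. It assumes the bcp output is UTF-16LE without a byte order mark, and the function name and the set of dropped characters are assumptions.

```python
import gzip

# Characters deleted during conversion (assumed set: NUL, DEL and the
# Unicode noncharacters U+FFFE / U+FFFF).
DROP = {0x00: None, 0x7F: None, 0xFFFE: None, 0xFFFF: None}


def transcode_to_utf8_gzip(src_path, dst_path, chunk_chars=1 << 20):
    """Stream a UTF-16LE text file into a gzip-compressed UTF-8 file.

    newline="" on both sides keeps the original line endings untouched.
    """
    with open(src_path, "r", encoding="utf-16-le", newline="") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # decoded characters, not bytes
            if not chunk:
                break
            dst.write(chunk.translate(DROP))
```

Only one read of the large extract file and one write of the much smaller compressed file touch the disk in this arrangement.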

SQL Extract: The Easy Way

SQLCMD one-liner for extracts:

Set the cmd code page to UTF-8            chcp 65001 &
Interactive SQL terminal                  sqlcmd -E -Q
Prevent summary in output                 "SET NOCOUNT ON;
Select from the table / view              SELECT * FROM Db.Schema.Table;"
No column headers                         -h-1
Replace control characters with a space   -k1
Delimit output with 1 ASCII char          -s"|"
No padding in output                      -W
Output in Unicode                         -u
Pipe stdout to gzip                       | gzip > "C:\file.gz"

Data Encryption

• On SQL Server we use TDE
• Redshift offers AES encrypted data on disk
• Redshift can load client-side encrypted data
• Client-side encryption only applies while on S3
• "Small performance penalty" for using AES

Security

• S3 Access => Create bucket(s) just for Redshift staging
• Redshift admin => Use IAM, create automation user(s)
• Redshift database =>
  – Do not use admin; it's like SQL Server 'sa'
• Database objects =>
  – Must actively GRANT access to each object
  – Use groups to make management easier
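
Because access has to be granted object by object, it can help to generate the statements rather than type them. The Python sketch below builds group-based read grants using standard Redshift `CREATE GROUP`, `ALTER GROUP` and `GRANT` syntax. The function name and every group, user, schema and table name are hypothetical, and identifiers are assumed to be trusted since they are not quoted or escaped.

```python
def grant_read_statements(group, users, schema, tables):
    """Build SQL that gives a group read-only access to a set of tables.

    Assumption: all identifiers come from a trusted source.
    """
    statements = [f"CREATE GROUP {group};"]
    statements += [f"ALTER GROUP {group} ADD USER {user};" for user in users]
    # Without USAGE on the schema, table-level grants are not enough.
    statements.append(f"GRANT USAGE ON SCHEMA {schema} TO GROUP {group};")
    statements += [
        f"GRANT SELECT ON TABLE {schema}.{table} TO GROUP {group};"
        for table in tables
    ]
    return statements


if __name__ == "__main__":
    for sql in grant_read_statements(
        "report_readers", ["alice", "bob"], "dw", ["fact_events", "dim_user"]
    ):
        print(sql)
```

Granting to the group means a new user needs only one `ALTER GROUP` statement instead of a grant per object.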

Sizing your cluster

• Redshift is over-provisioned on storage
• Redshift is super efficient at compression
  – Compression not affected by the data model
• Redshift scale out is almost perfectly linear
  – 2 nodes is twice as fast as 1 node
• You'll be sizing your cluster for speed!

Performance

• Redshift speed depends on node count
  – A single node is not particularly fast
• Loading speed appears to be linked to S3 speed
  – You must use multiple files for bulk loads
• Query speed appears to be CPU constrained
  – Vacuum runs at 250 MB/s, queries at <20 MB/s
• Data modeling matters for complex query speed
  – Use a star schema & well chosen distribution key
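
The multiple-files point can be illustrated with a short Python sketch that splits one extract into several gzip parts sharing a common name prefix, then builds a `COPY` statement that loads everything under that prefix. It assumes every line is one complete record (no embedded newlines) and pipe-delimited output as in the extract above. The function names, file naming scheme, bucket and IAM role are all hypothetical, and uploading the parts to S3 is left out.

```python
import gzip


def split_into_gzip_parts(src_path, dst_prefix, parts):
    """Distribute the lines of src_path round-robin across gzip files."""
    paths = [f"{dst_prefix}.part{i:03d}.gz" for i in range(parts)]
    outputs = [gzip.open(p, "wt", encoding="utf-8", newline="") for p in paths]
    try:
        with open(src_path, "r", encoding="utf-8", newline="") as src:
            for n, line in enumerate(src):
                outputs[n % parts].write(line)
    finally:
        for out in outputs:
            out.close()
    return paths


def copy_statement(table, s3_prefix, iam_role):
    """Build a COPY that loads every object whose key starts with s3_prefix."""
    return (
        f"COPY {table} FROM '{s3_prefix}' IAM_ROLE '{iam_role}' "
        "GZIP DELIMITER '|';"
    )
```

Choosing a part count that is a multiple of the number of slices in the cluster lets each slice load in parallel.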

Data Modeling

2 main concepts to learn
• Distribution key
  – Where data is placed, which node & slice
  – Needs to be common across most tables
• Sort key
  – How data is ordered on disk within the slice
  – Good sort keys simplify expensive joins
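
A toy model may make the two concepts concrete. The Python sketch below places rows on slices by hashing the distribution key and then orders each slice by the sort key. It is purely illustrative: Redshift's real hash function and storage layout are not shown here, CRC32 is a stand-in, and all names are hypothetical.

```python
import zlib


def slice_for(key, slice_count):
    """Map a distribution-key value to a slice number (stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % slice_count


def distribute(rows, dist_key, sort_key, slice_count):
    """Place rows on slices by dist_key, then order each slice by sort_key."""
    slices = {n: [] for n in range(slice_count)}
    for row in rows:
        slices[slice_for(row[dist_key], slice_count)].append(row)
    for slice_rows in slices.values():
        slice_rows.sort(key=lambda r: r[sort_key])
    return slices


if __name__ == "__main__":
    facts = [{"user_id": u, "event_ts": t} for u, t in [(1, 30), (2, 10), (1, 20)]]
    users = [{"user_id": u, "name": n} for u, n in [(1, "a"), (2, "b")]]
    # Both tables share user_id as the distribution key, so matching rows
    # land on the same slice and a join on user_id needs no data movement.
    print(distribute(facts, "user_id", "event_ts", 4))
    print(distribute(users, "user_id", "user_id", 4))
```

This is why the key needs to be common across most tables: rows with the same key value always hash to the same slice, whichever table they come from.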

Database Maintenance

• Data loaded to non-empty tables is not sorted
• Data loaded to non-empty tables may invalidate their stats
• ANALYZE rebuilds the stats without making changes
• VACUUM re-sorts the physical data and rebuilds stats
  – Needed to get the best performance
  – Very similar to a REBUILD in SQL Server

Database Backups

• Redshift 'backups' are snapshots of the system
• Taken very quickly, much slower to restore
• Redshift automatically takes intra-day snapshots
• Manual snapshots can be run using AWS cmd line
• Snapshot storage is free up to size of cluster storage
• Snapshots must be restored to an identical cluster
• Snapshots cannot be restored to a running cluster

Code Changes

Code changes required so far
• ROW_NUMBER() missing in Redshift
• We gain LAG() and LEAD() which helps
• But very difficult to persist an order value
• DATETIMEOFFSET (e.g. timezone) not available
• DATETIMEs now split into 2 columns
• Work in progress…
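
The deck does not say which two columns the DATETIMEOFFSET values were split into. One plausible reading, sketched below in Python, is a UTC timestamp plus the original offset in minutes; both the column choice and the function name are assumptions made for illustration.

```python
from datetime import datetime, timezone


def split_datetimeoffset(value: str):
    """Split an ISO 8601 timestamp with offset into (utc_timestamp, offset_minutes).

    Assumption: the two target columns are a naive UTC timestamp and the
    source offset in minutes, so the local time can be rebuilt later.
    """
    dt = datetime.fromisoformat(value)
    offset = dt.utcoffset()
    if offset is None:
        raise ValueError(f"no UTC offset in {value!r}")
    offset_minutes = int(offset.total_seconds() // 60)
    utc_naive = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return utc_naive, offset_minutes


if __name__ == "__main__":
    print(split_datetimeoffset("2015-01-15T10:30:00+01:00"))
```

Storing UTC keeps range filters and sort keys consistent across time zones, while the offset column preserves what the source system recorded.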

That’s all folks!

Come Work With Me!

http://www.realitymine.com/careers/

• Currently trying to fill the following roles:
  – Business Intelligence Architect (Redshift!)
  – Business Intelligence Developer (Tableau!)
  – Test Engineer (Quality!)
  – Server Developer (C#!)
  – Mobile App Developer (Android! iOS!)
  – Project Manager
