cloudcamp chicago lightning talk "building warehousing systems on redshift" - tristan...
Redshift: Lessons Learned
Tristan Crockett – Software Engineer, Edgeflip
Basics
● Analytical database
● PostgreSQL with a column storage engine
● Automatic data compression
● No traditional indexes; specify a sort key (how are records in the table sorted?) and a distribution key (which node will house a record?)
My Work with Redshift
● Data warehouse for Facebook user feeds and related app data
● Inputs
– RDS (MySQL)
– DynamoDB
● Stats
– ~2 TB of compressed data
– Two main tables, ~5 billion and ~25 billion rows respectively
Advantages / Disadvantages
● Fast at copying data in from S3
● Fast at computing aggregate/analytical functions over an entire table
● Decent at intra-db operations (create table as select, insert into select)
● Most everything else is slow
● Without traditional indexes, table design isn't as flexible
Lessons/Tips
● Optimize load size (1 MB to 1 GB per file)
● Compress input
● Upsert when needed, and always vacuum
● Don't populate tables with 'CREATE TABLE AS' if you like compression (which you do)
● To avoid complicated joins, consider computing single-table aggregates and joining on the results
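The first two tips (load size and compressed input) can be sketched as a small pre-processing step that runs before files are uploaded to S3. This is only an illustration under assumptions: the helper name write_gzip_chunks, the pipe-delimited rows, and the tiny target_bytes value are all hypothetical, and the slide does not say whether its 1 MB to 1 GB range refers to compressed or uncompressed size.

```python
import gzip
import os
import tempfile


def write_gzip_chunks(lines, out_dir, target_bytes, prefix="part"):
    """Write text lines to gzip files in out_dir.

    A new file is started once roughly target_bytes of uncompressed
    text has been written to the current one. Returns the file paths.
    """
    paths = []
    out = None
    written = 0
    for line in lines:
        if out is None or written >= target_bytes:
            if out is not None:
                out.close()
            path = os.path.join(out_dir, f"{prefix}-{len(paths):05d}.gz")
            out = gzip.open(path, "wt", encoding="utf-8")
            paths.append(path)
            written = 0
        out.write(line)
        written += len(line.encode("utf-8"))
    if out is not None:
        out.close()
    return paths


# Demo with made-up rows and a deliberately tiny target size so that
# several files are produced; real loads would use a far larger target.
demo_rows = [f"{i}|user{i}\n" for i in range(1000)]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_files = write_gzip_chunks(demo_rows, tmp_dir, target_bytes=4096)
    print(len(demo_files), "gzip files written")
```

Splitting by uncompressed size keeps the logic simple; if the compressed size is what matters, the target would need to be tuned against the observed compression ratio.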
Upsert
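The example from this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of one common way to emulate an upsert without a native statement: load the new rows into a staging table, delete the matching rows from the target, then insert everything from staging. It runs against SQLite so that it is self-contained; the table and column names (users, users_staging, id, name) are hypothetical, and this is not necessarily the approach the talk showed.

```python
import sqlite3


def upsert_via_staging(conn, rows):
    """Merge (id, name) rows into the users table via a staging table."""
    conn.execute("CREATE TEMP TABLE users_staging (id INTEGER, name TEXT)")
    try:
        # 'with conn' commits on success and rolls back on error, so the
        # delete and insert are applied together or not at all.
        with conn:
            conn.executemany("INSERT INTO users_staging VALUES (?, ?)", rows)
            conn.execute(
                "DELETE FROM users WHERE id IN (SELECT id FROM users_staging)"
            )
            conn.execute("INSERT INTO users SELECT id, name FROM users_staging")
    finally:
        conn.execute("DROP TABLE users_staging")


demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE users (id INTEGER, name TEXT)")
demo.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
demo.commit()

upsert_via_staging(demo, [(2, "bobby"), (3, "cat")])
print(demo.execute("SELECT id, name FROM users ORDER BY id").fetchall())
# [(1, 'ann'), (2, 'bobby'), (3, 'cat')]
```

On Redshift the staging table would typically be filled with COPY from S3 rather than row inserts, and because deleted rows are only marked for deletion, a VACUUM afterwards reclaims the space and restores sort order, which is the "always vacuum" part of the tip.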
Keep an Eye on Compression
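The content of this slide is also missing from the transcript. The sketch below only generates SQL text; it does not connect to a cluster. It illustrates the tip about avoiding 'CREATE TABLE AS': create the table with explicit column encodings first, populate it with INSERT INTO ... SELECT, and then inspect the encodings. The helper names, the user_daily table, and the chosen encodings are hypothetical; ANALYZE COMPRESSION and the pg_table_def catalog view are Redshift features, but how they behave on a particular cluster version should be checked against the Redshift documentation.

```python
def create_then_insert_sql(target, column_defs, select_sql):
    """Return SQL that creates a table with explicit encodings, then fills it.

    Identifiers are interpolated as-is, so they must come from trusted code.
    """
    cols = ",\n  ".join(column_defs)
    return [
        f"CREATE TABLE {target} (\n  {cols}\n);",
        f"INSERT INTO {target}\n{select_sql};",
    ]


def compression_check_sql(table):
    """Return SQL for reviewing suggested and current column encodings."""
    return [
        f"ANALYZE COMPRESSION {table};",
        'SELECT "column", type, encoding FROM pg_table_def '
        f"WHERE tablename = '{table}';",
    ]


statements = create_then_insert_sql(
    "user_daily",  # hypothetical table
    ["user_id BIGINT ENCODE delta", "day DATE ENCODE runlength"],
    "SELECT user_id, day FROM events GROUP BY user_id, day",
) + compression_check_sql("user_daily")

for statement in statements:
    print(statement)
```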
Single-Table Aggregates
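The query from this slide is not in the transcript either. A minimal sketch of the idea, run against SQLite so it is self-contained: aggregate each large table on its own first, then join the two small result sets instead of joining the raw tables. The posts and likes tables and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (user_id INTEGER, post_id INTEGER);
    CREATE TABLE likes (user_id INTEGER, post_id INTEGER);
    INSERT INTO posts VALUES (1, 10), (1, 11), (2, 20);
    INSERT INTO likes VALUES (1, 10), (1, 10), (1, 11), (2, 20), (3, 30);
    """
)

# Each subquery scans exactly one table and reduces it to one row per user;
# the join then only has to match the already-aggregated rows.
QUERY = """
SELECT p.user_id, p.n_posts, l.n_likes
FROM (SELECT user_id, COUNT(*) AS n_posts FROM posts GROUP BY user_id) AS p
JOIN (SELECT user_id, COUNT(*) AS n_likes FROM likes GROUP BY user_id) AS l
  ON l.user_id = p.user_id
ORDER BY p.user_id
"""

result = conn.execute(QUERY).fetchall()
print(result)  # [(1, 2, 3), (2, 1, 1)]
```

This plays to the strength listed earlier (fast aggregates over an entire table) and keeps the join itself small.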