The Little Warehouse That Couldn’t Or: How We Learned to
Stop Worrying and Move to Spark
1
Yandu Oppacher (@yandu), Data Infrastructure
2
Shopify Stores
ETL Warehouse Reporting
August 2013
Tiller, Ruby, Vertica
3
Why we had to move
• Data volume
• Data/Query complexity
• Performance issues
4
Couple of false starts
5
Pig + Luigi
Pig + Oozie
Platfora
“Without coding or ETL, data warehousing, BI tools, or breaking a sweat.”
– platfora.com
6
Enter Spark
• Fast
• Nice development model
• Python
7
8
The Good Book
GMV: A Case Study
9
165,000+ ACTIVE SHOPIFY MERCHANTS
$8 BILLION+ CUMULATIVE GMV
Growing pains
• Joins
• Groupings
• General data skew
• Getting to know Python’s performance quirks
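One common way to tame skewed groupings is to "salt" hot keys and aggregate in two stages. The sketch below is plain Python rather than Spark code, and it is not taken from the talk: the function name `salted_sum`, the `num_salts` parameter, and the shop data are all hypothetical, chosen only to illustrate the idea.

```python
import random
from collections import defaultdict

def salted_sum(records, num_salts=4, seed=0):
    """Two-stage sum per key: partial sums per (key, salt), then a merge."""
    rng = random.Random(seed)
    # Stage 1: spread each key over num_salts buckets so that no single
    # bucket (think: one Spark partition) receives every row of a hot key.
    partial = defaultdict(int)
    for key, value in records:
        partial[(key, rng.randrange(num_salts))] += value
    # Stage 2: drop the salt and combine the much smaller partial results.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals), dict(partial)

# Hypothetical skewed input: one shop accounts for almost all rows.
records = [("big_shop", 1)] * 1000 + [("small_shop", 1)] * 10
totals, partial = salted_sum(records)
print(totals)
```

The totals match an unsalted sum, but the work for `big_shop` is split across several buckets in the first stage. The same pattern is typically applied to skewed joins by salting the large side and replicating the small side once per salt value.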
10
Starscream
11
• specialized joins
• resolvers
• range
• cassandra
• overby
• contracts
• incrementalized fact builds
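The slides do not show how Starscream implements any of these features. As a rough illustration of what an incrementalized fact build can look like in general, here is a minimal plain-Python sketch that processes only rows newer than a stored watermark. The function `incremental_build`, the row fields (`shop`, `day`, `amount`), and the sample data are hypothetical and are not Starscream's API.

```python
from datetime import date

def incremental_build(existing_facts, source_rows, watermark):
    """Fold rows newer than `watermark` into a copy of `existing_facts`.

    Returns the updated facts and the new watermark.
    """
    # Only rows past the watermark are touched; older rows are assumed
    # to be already reflected in existing_facts.
    new_rows = [r for r in source_rows if r["day"] > watermark]
    facts = dict(existing_facts)
    for row in new_rows:
        key = (row["shop"], row["day"])
        facts[key] = facts.get(key, 0) + row["amount"]
    new_watermark = max((r["day"] for r in new_rows), default=watermark)
    return facts, new_watermark

# Hypothetical source data arriving over two runs.
source = [
    {"shop": "a", "day": date(2014, 1, 1), "amount": 10},
    {"shop": "a", "day": date(2014, 1, 2), "amount": 5},
    {"shop": "b", "day": date(2014, 1, 2), "amount": 7},
]
facts, mark = incremental_build({}, source[:1], date.min)  # first run
facts, mark = incremental_build(facts, source, mark)       # second run
print(facts, mark)
```

The second run skips the row already processed in the first, so nothing is double counted. A simple watermark like this ignores late-arriving rows dated at or before the watermark; a real pipeline would need a policy for those.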
Our current stack
12
Kafka
OLTP
HDFS
Cassandra
Spark
Frontroom
Backroom
Redshift
Tableau
Thank you
13
Yandu Oppacher (@yandu), Data Infrastructure