Let’s Build a Service Oriented Data Pipeline!
June 2016
Yasha Podeswa
Software Developer | Hootsuite
Before: Oceanographer
Me!
Now: Software Developer at Hootsuite
Introduce a problem that requires a new data pipeline
Design it in a service oriented style
Build it on stage!
This Talk
Passive Aggressive Inc. just cancelled their subscription!
Desperate Dan in trouble!
The Problem
Want to Build a Tool Like This
What We’re Starting With
Things Users Did
Things Organizations Did
Crap
High Level Plan
JSON files -> Calculate stats about organizations -> DB
Extract
Transform
Load
Clean and organize data
Calculate stats per organization
Useful for lots of things!
Shouldn’t run until the job it depends on is done
Need a “Service” Communication and Orchestration Layer!
Let’s build it!
First App: Event Cleaning and Loading
Read logs from S3, clean and sort into different types of events, load into data warehouse
Vanilla Scala app
AWS Lambda
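A minimal sketch of the clean-and-sort step, written in Python for brevity (the talk's apps are Scala). The field names (`type`, `user_id`, `organization_id`) and the two output groups are assumptions for illustration, not the real log schema. Reading from S3 and loading into the data warehouse are left out.

```python
import json
from collections import defaultdict


def clean_and_sort(raw_lines):
    """Parse JSON log lines, drop unusable ones, and group the rest by kind."""
    sorted_events = defaultdict(list)
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # not valid JSON, so it cannot be cleaned
        if not isinstance(event, dict) or "type" not in event:
            continue  # assumption: every usable event is an object with a "type"
        # Assumption: an event with a user_id is something a user did;
        # otherwise an organization_id marks something an organization did.
        if "user_id" in event:
            sorted_events["user_events"].append(event)
        elif "organization_id" in event:
            sorted_events["organization_events"].append(event)
    return dict(sorted_events)
```

Each group would then be loaded into its own warehouse table, so later jobs can query one kind of event at a time.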
Second App: Organization Stat Calculation
Read cleaned/sorted events from data warehouse, calculate stats about organization, load stats to data warehouse
Vanilla Scala app
AWS Lambda
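A rough sketch of the per-organization calculation, again in Python rather than the talk's Scala. The two stats shown (event count and distinct active users) and the `organization_id` / `user_id` fields are assumptions picked for illustration; the slides do not say which stats are calculated.

```python
from collections import defaultdict


def organization_stats(user_events):
    """Return {organization_id: {"event_count": int, "active_users": int}}."""
    counts = defaultdict(int)
    users = defaultdict(set)
    for event in user_events:
        org = event.get("organization_id")
        if org is None:
            continue  # cannot attribute this event to an organization
        counts[org] += 1
        user = event.get("user_id")
        if user is not None:
            users[org].add(user)
    return {
        org: {"event_count": counts[org], "active_users": len(users[org])}
        for org in counts
    }
```

In the pipeline, the input would come from a query over the cleaned events and the result would be written back to the data warehouse.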
Third App: Airflow
Hook up the Lambda apps in a dependency graph
● Scheduling
● Retries
● Monitoring
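To make the dependency-graph idea concrete, here is a small plain-Python sketch of what an orchestrator does: run each job only after its upstream jobs succeed, and retry failures. This is not Airflow's API. `run_pipeline`, the task names, and the `print` used as a stand-in for monitoring are all hypothetical, and scheduling is not covered.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, upstream, max_attempts=3):
    """Run each task only after all of its upstream tasks have succeeded.

    tasks: {name: zero-argument callable}
    upstream: {name: set of task names that must finish first}
    Returns the task names in the order they completed.
    """
    completed = []
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(upstream).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                break
            except Exception as error:
                # Stand-in for monitoring: report each failed attempt.
                print(f"{name} failed on attempt {attempt}: {error}")
                if attempt == max_attempts:
                    raise  # give up, so downstream tasks never run
        completed.append(name)
    return completed
```

With `upstream = {"organization_stats": {"load_events"}}`, the stats job runs only once event loading has succeeded, which is the ordering guarantee the two Lambda apps need.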
Steal my code!
https://github.com/yashap/etl-load-events
https://github.com/yashap/etl-organization-stats
https://github.com/yashap/airflow
Questions?