Fuss Free ETL with Airflow

Fuss Free ETL, by Vijay Bhat: Tame your data pipelines with Airflow (@vijaysbhat, /in/vijaysbhat)

Upload: vijaysbhat

Post on 06-Jan-2017

TRANSCRIPT

Page 1: Fuss Free ETL with Airflow

Fuss Free ETL

Vijay Bhat

Tame your data pipelines with Airflow

@vijaysbhat /in/vijaysbhat

Page 2: Fuss Free ETL with Airflow

About Me

13 Years In The Industry

Mobile Financial Services Smart Meter Analytics

Data Science Applications

Forecasting

Fraud Detection

Recommendation Systems

Social Media

Growth Analytics

Page 3: Fuss Free ETL with Airflow

80%

Page 4: Fuss Free ETL with Airflow

What, When, How

Page 5: Fuss Free ETL with Airflow

What does your ETL process look like?

Page 6: Fuss Free ETL with Airflow

What do we do to get here?

Page 7: Fuss Free ETL with Airflow

❏ Automation
❏ Scheduling
❏ Version Control
❏ Redundancy
❏ Error Recovery
❏ Monitoring

Page 8: Fuss Free ETL with Airflow

Evolution

Page 9: Fuss Free ETL with Airflow

Introducing Airflow

● Open source ETL workflow engine
● Developed by Airbnb
● Inspired by Facebook’s Dataswarm
● Production ready
● Pipelines written in Python

Page 10: Fuss Free ETL with Airflow

Defining Pipelines

Page 11: Fuss Free ETL with Airflow

Pipeline Code Structure

from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

Define default arguments
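
To make the `retries` and `retry_delay` settings above concrete, here is a minimal plain-Python sketch of retry behaviour. It is not Airflow's implementation: `run_with_retries`, `flaky_task`, and the injected `sleep` parameter are hypothetical names used only for illustration.

```python
from datetime import timedelta
import time


def run_with_retries(task, retries, retry_delay, sleep=time.sleep):
    """Call task(); on failure wait retry_delay and try again, up to `retries` extra attempts."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:
                raise  # retries exhausted: the failure propagates
            sleep(retry_delay.total_seconds())


# Hypothetical task that fails on its first call and succeeds on the second.
calls = {'n': 0}


def flaky_task():
    calls['n'] += 1
    if calls['n'] == 1:
        raise RuntimeError('transient failure')
    return 'ok'


waits = []  # record the requested waits instead of actually sleeping
result, attempts = run_with_retries(
    flaky_task, retries=1, retry_delay=timedelta(minutes=5), sleep=waits.append)
print(result, attempts, waits)
```

With `retries` set to 1 and a five-minute `retry_delay`, one failure is absorbed after a single 300-second wait; a second consecutive failure would be reported as a task failure.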

Page 12: Fuss Free ETL with Airflow

Pipeline Code Structure

from airflow import DAG

# timedelta(1) is a one-day interval, so the DAG is scheduled daily
dag = DAG('tutorial', default_args=default_args,
          schedule_interval=timedelta(1))

Instantiate DAG
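
As a rough illustration of what `schedule_interval=timedelta(1)` means together with the `start_date` of `datetime(2015, 6, 1)` from the default arguments, the sketch below lists the first few daily schedule dates. `schedule_dates` is a hypothetical helper written for this example, not an Airflow function.

```python
from datetime import datetime, timedelta
from itertools import islice


def schedule_dates(start_date, interval):
    """Yield start_date, start_date + interval, start_date + 2 * interval, ..."""
    current = start_date
    while True:
        yield current
        current += interval


first_three = list(islice(schedule_dates(datetime(2015, 6, 1), timedelta(1)), 3))
for d in first_three:
    print(d.date())
```

Note that Airflow triggers the run for a schedule date only after that period has ended, so the run stamped 2015-06-01 starts shortly after midnight on 2015-06-02.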

Page 13: Fuss Free ETL with Airflow

Pipeline Code Structure

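from airflow.operators.bash_operator import BashOperator

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

# retries=3 overrides the default of 1 set in default_args
t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)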

Define tasks
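
The `sleep` task passes `retries=3` while the DAG-level defaults say `'retries': 1`. A minimal sketch of that precedence, assuming a simple "task argument wins over default" rule, is shown below. `effective_args` is a hypothetical helper for illustration and the dictionary is a trimmed copy of the defaults, not the full set.

```python
from datetime import timedelta

# Trimmed copy of the DAG-level defaults from the earlier slide.
default_args = {'owner': 'airflow', 'retries': 1, 'retry_delay': timedelta(minutes=5)}


def effective_args(defaults, **task_kwargs):
    """Start from the DAG-level defaults, then let explicit task arguments win."""
    merged = dict(defaults)
    merged.update(task_kwargs)
    return merged


print_date_args = effective_args(default_args, task_id='print_date')
sleep_args = effective_args(default_args, task_id='sleep', retries=3)
print(print_date_args['retries'], sleep_args['retries'])
```

Settings a task does not mention, such as `retry_delay` here, still come from the defaults.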

Page 14: Fuss Free ETL with Airflow

Pipeline Code Structure

t2.set_upstream(t1)

# This means that t2 will depend on t1
# running successfully to run
# It is equivalent to
# t1.set_downstream(t2)

# t3 is the task with task_id 'templated'; its definition is not shown on these slides
t3.set_upstream(t1)

# all of this is equivalent to
# dag.set_dependency('print_date', 'sleep')
# dag.set_dependency('print_date', 'templated')

Chain tasks
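
The upstream relationships above form a directed acyclic graph, and any valid execution order must run `print_date` before the other two tasks. The sketch below shows this with the standard library's `graphlib` (Python 3.9+); it illustrates dependency ordering only and is not how the Airflow scheduler is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key maps a task to the set of tasks that must succeed before it (its upstream tasks).
upstream = {
    'sleep': {'print_date'},      # t2.set_upstream(t1)
    'templated': {'print_date'},  # t3.set_upstream(t1)
}

# static_order() returns the tasks in an order that respects every dependency.
order = list(TopologicalSorter(upstream).static_order())
print(order)
```

`sleep` and `templated` do not depend on each other, so they may appear in either order and could run in parallel.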

Page 15: Fuss Free ETL with Airflow

Then we get this pipeline

t1 -> t2
t1 -> t3

Page 16: Fuss Free ETL with Airflow

Deployment Process

Develop -> Test -> PR Review -> Code Merge -> Prod Airflow Scheduler

Page 17: Fuss Free ETL with Airflow

Job Runs

Page 18: Fuss Free ETL with Airflow

Logs

Page 19: Fuss Free ETL with Airflow

Performance - Gantt Chart

Page 20: Fuss Free ETL with Airflow

Operators

● Action
○ PythonOperator
○ HiveOperator
○ ...
● Transfer
○ S3ToHiveTransfer
○ HiveToDruidTransfer
○ ...
● Sensor
○ HdfsSensor
○ HivePartitionSensor
○ ...
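
Of the three operator types, sensors are the least obvious: they succeed only once some external condition holds, checking it repeatedly until then. The sketch below shows that polling pattern in plain Python. It loosely mirrors the poke, poke interval, and timeout idea but is not Airflow code; `wait_until`, `partition_exists`, and the fake clock are hypothetical names for illustration.

```python
import time


def wait_until(poke, poke_interval=60, timeout=3600, sleep=time.sleep, clock=time.monotonic):
    """Call poke() every poke_interval seconds until it returns True or timeout seconds pass."""
    started = clock()
    while not poke():
        if clock() - started > timeout:
            raise TimeoutError('condition not met before timeout')
        sleep(poke_interval)
    return True


# Hypothetical condition: the awaited partition "lands" on the third check.
checks = {'n': 0}


def partition_exists():
    checks['n'] += 1
    return checks['n'] >= 3


# A fake clock and a fake sleep let the example finish instantly.
fake_now = {'t': 0.0}


def fake_sleep(seconds):
    fake_now['t'] += seconds


done = wait_until(partition_exists, poke_interval=60, timeout=600,
                  sleep=fake_sleep, clock=lambda: fake_now['t'])
print(done, checks['n'], fake_now['t'])
```

Placing a sensor upstream of an action task means the action starts only when its input is actually available, rather than at a guessed time of day.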

Page 21: Fuss Free ETL with Airflow

Useful Configuration Options

● depends_on_past
○ wait until this task's run for the previous day has succeeded
● wait_for_downstream
○ also wait until the tasks immediately downstream of the previous day's run have finished
● sla
○ send email alerts if the SLA is missed
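
The gating that `depends_on_past` describes can be sketched as a small decision function. This is a simplified illustration for a daily schedule, not Airflow's scheduler logic; `can_run`, the `history` mapping, and the state strings are assumptions made for the example.

```python
from datetime import date, timedelta


def can_run(run_date, history, depends_on_past, first_date, interval=timedelta(days=1)):
    """Decide whether the task instance for run_date may start.

    history maps each past run date to its final state, e.g. 'success' or 'failed'.
    """
    if not depends_on_past or run_date == first_date:
        return True  # no gating, or nothing earlier to wait for
    return history.get(run_date - interval) == 'success'


first = date(2015, 6, 1)
history = {date(2015, 6, 1): 'success', date(2015, 6, 2): 'failed'}

after_success = can_run(date(2015, 6, 2), history, True, first)   # previous day succeeded
after_failure = can_run(date(2015, 6, 3), history, True, first)   # previous day failed
ungated = can_run(date(2015, 6, 3), history, False, first)        # gating switched off
print(after_success, after_failure, ungated)
```

This setting suits tasks that build on their own previous output, such as incremental loads, where running a day out of order would produce wrong results.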

Page 22: Fuss Free ETL with Airflow

CLI Commands

● airflow [-h]
○ webserver
○ scheduler
○ test
○ run
○ backfill
○ ...

Page 23: Fuss Free ETL with Airflow

Example: Hacker News Sentiment Tracker

Page 24: Fuss Free ETL with Airflow

Hacker News Example: Data Flow

Pull Data

Call API

Hacker News

IBM Watson

Reporting DB

Caravel Dashboard
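
The data flow above can be sketched as three plain Python steps: pull stories, score them, and load the results into a reporting database. Everything here is a stand-in chosen to keep the example self-contained: `pull_stories` returns made-up rows instead of calling Hacker News, `score_sentiment` is a toy keyword counter rather than IBM Watson, and an in-memory SQLite database plays the reporting DB. The code used in the talk's demo is not shown in the slides, and the table name `story_sentiment` is an assumption.

```python
import sqlite3


# Stand-in for the "Pull Data" step; a real pipeline would fetch from Hacker News.
def pull_stories():
    return [
        {'id': 1, 'title': 'Great new release of an open source tool'},
        {'id': 2, 'title': 'Terrible outage takes down service'},
    ]


# Stand-in for the "Call API" step; this toy scorer is NOT how Watson works.
def score_sentiment(text):
    words = text.lower().split()
    return words.count('great') - words.count('terrible')


# Final step: write one row per story into the reporting database.
def load(conn, stories):
    conn.execute('CREATE TABLE IF NOT EXISTS story_sentiment '
                 '(id INTEGER PRIMARY KEY, title TEXT, score INTEGER)')
    conn.executemany(
        'INSERT OR REPLACE INTO story_sentiment VALUES (?, ?, ?)',
        [(s['id'], s['title'], score_sentiment(s['title'])) for s in stories])
    conn.commit()


conn = sqlite3.connect(':memory:')  # in-memory DB standing in for the reporting DB
load(conn, pull_stories())
rows = conn.execute('SELECT id, score FROM story_sentiment ORDER BY id').fetchall()
print(rows)
```

Using `INSERT OR REPLACE` keyed on the story id keeps the load step idempotent, so a retried or backfilled run overwrites rows instead of duplicating them. In an Airflow pipeline each of these functions would typically become its own task.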

Page 25: Fuss Free ETL with Airflow

Hacker News Example: Demo

Page 26: Fuss Free ETL with Airflow

pip install airflow!

Page 27: Fuss Free ETL with Airflow

Thank You.

@vijaysbhat /in/vijaysbhat [email protected]