Luigi Presentation at OSCON 2013

Erik Bernhardsson, [email protected]

Batch data processing in Python

Uploaded by erik-bernhardsson on 15-Jan-2015

DESCRIPTION

From OSCON 2013

TRANSCRIPT

Page 1: Luigi Presentation at OSCON 2013

Erik Bernhardsson, [email protected]

Batch data processing in Python

Page 2: Luigi Presentation at OSCON 2013

Focusing mostly on music discovery and large-scale machine learning
Previously managed the “Analytics team” in Stockholm

I’m at Spotify in NYC

Btw I’m Erik Bernhardsson

Page 3: Luigi Presentation at OSCON 2013

Background

Billions of log messages (several TBs) every day
Usage and backend stats, debug information
What we want to do:

A/B testing
Music recommendations
Monthly/daily/hourly reporting
Business metric dashboards

We experiment a lot – need quick development cycles

We crunch a lot of data

Why did we build Luigi?

Page 4: Luigi Presentation at OSCON 2013

Our second cluster (in 2009):

We like Hadoop

Page 5: Luigi Presentation at OSCON 2013

Long story short :)

Our fifth cluster

Page 6: Luigi Presentation at OSCON 2013

Running one job is easy

Lots of long-running processes with dependencies
Need monitoring
Handle failures
Go from experimentation to production easily

But what about running 1000s of jobs every day?

Page 7: Luigi Presentation at OSCON 2013

But also non-Hadoop stuff
Most things are Python Map/Reduce jobs
Also Pig, Hive
SCP files from one host to another
Train a machine learning model
Put data in Cassandra

Page 8: Luigi Presentation at OSCON 2013

In the pre-Luigi world

How not to do workflows

Page 9: Luigi Presentation at OSCON 2013

“Streams” is a list of (username, track, artist, timestamp) tuples

Example: Artist Toplist

Streams → Artist Aggregation → Top 10 → Database
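The whole flow can be sketched in plain Python (hypothetical sample data; in the real pipeline each step is a separate job over TSV files in HDFS):

```python
from collections import Counter

# Hypothetical sample of (username, track, artist, timestamp) stream tuples
streams = [
    ("alice", "Song A", "Artist X", 1363180800),
    ("bob",   "Song B", "Artist X", 1363180900),
    ("carol", "Song C", "Artist Y", 1363181000),
]

# Artist aggregation: count streams per artist
artist_counts = Counter(artist for _, _, artist, _ in streams)

# Top 10: the ten most-streamed artists, ready to load into a database
top10 = artist_counts.most_common(10)
```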

Page 10: Luigi Presentation at OSCON 2013

Pre-Luigi example of artist toplists

Don’t do this at home

Page 11: Luigi Presentation at OSCON 2013

OK, so chain the tasks

Page 12: Luigi Presentation at OSCON 2013

Cron nicer, yay!

Page 13: Luigi Presentation at OSCON 2013

That’s OK, but don’t leave broken data somewhere (btw, Luigi gives you atomic file operations locally and in HDFS)

Errors will occur
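The local flavor of those atomic file operations can be sketched with the stdlib (a simplified stand-in, not Luigi's actual implementation): write to a temporary file, then rename it into place, so a crash mid-write never leaves a half-written output that downstream tasks mistake for finished data.

```python
import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file in the target directory, then rename.
    # rename() is atomic on POSIX, so readers see either the old
    # file or the complete new one -- never a partial file.
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```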

Page 14: Luigi Presentation at OSCON 2013

The second step fails, you fix it, then you want to resume

Don’t run things twice

Page 15: Luigi Presentation at OSCON 2013

To use data flows as command line tools

Parametrize tasks

Page 16: Luigi Presentation at OSCON 2013

You want to run the dataflow for a set of similar inputs

Put tasks in loops

Page 17: Luigi Presentation at OSCON 2013

Plumbing sucks

Page 18: Luigi Presentation at OSCON 2013

Graph algorithms rock!

Plumbing sucks...

Page 19: Luigi Presentation at OSCON 2013

Who’s the world’s second most famous plumber?

Hint: he wears green

Page 20: Luigi Presentation at OSCON 2013

A Python framework for data flow definition and execution

Introducing Luigi

Page 21: Luigi Presentation at OSCON 2013

On steroids and PCP... with a toolbox of mainly Hadoop related stuff

Simple dependency definitions
Emphasis on Hadoop/HDFS integration
Atomic file operations
Data flow visualization
Command line integration

Main features

Luigi is “kind of like Makefile” in Python

Page 22: Luigi Presentation at OSCON 2013

Luigi Task

Page 23: Luigi Presentation at OSCON 2013

Luigi - Aggregate Artists

Page 24: Luigi Presentation at OSCON 2013

Luigi - Aggregate Artists
Run on the command line:
$ python dataflow.py AggregateArtists

DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running AggregateArtists()
INFO: [pid 74375] Done AggregateArtists()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time

Page 25: Luigi Presentation at OSCON 2013

Top 10 artists - Wrapped arbitrary Python code

Completing the top list

Page 26: Luigi Presentation at OSCON 2013

Basic functionality for exporting to Postgres. Cassandra support is in the works

Database support

Page 27: Luigi Presentation at OSCON 2013

Running it all...

DEBUG: Checking if ArtistToplistToDatabase() is complete
INFO: Scheduled ArtistToplistToDatabase()
DEBUG: Checking if Top10Artists() is complete
INFO: Scheduled Top10Artists()
DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 74811] Running AggregateArtists()
INFO: [pid 74811] Done AggregateArtists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 74811] Running Top10Artists()
INFO: [pid 74811] Done Top10Artists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74811] Running ArtistToplistToDatabase()
INFO: Done writing, importing at 2013-03-13 15:41:09.407138
INFO: [pid 74811] Done ArtistToplistToDatabase()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time

Page 28: Luigi Presentation at OSCON 2013

Imagine how cool this would be with real data...

The results

Page 29: Luigi Presentation at OSCON 2013
Page 30: Luigi Presentation at OSCON 2013

Tasks have implicit __init__

Task Parameters

Generates command line interface with typing and documentation

Class variables with some magic

$ python dataflow.py AggregateArtists --date 2013-03-05

Page 31: Luigi Presentation at OSCON 2013

Combined usage example

Task Parameters

Page 32: Luigi Presentation at OSCON 2013

Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar files
Running Hive and (soon) Pig queries
Inserting data sets into Postgres

Luigi comes with a toolbox of abstract Tasks for...

... how to run anything, really

Task templates and targets

Writing new ones is as easy as defining an interface and implementing run()

Page 33: Luigi Presentation at OSCON 2013

Built-in Hadoop Streaming Python framework

Hadoop MapReduce

Tiny interface – just implement mapper and reducer
Fetches error logs from Hadoop cluster and displays them to the user
Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map-side joins
Easy to send along Python modules that might not be installed on the cluster
Support for counters, secondary sort, combiners, distributed cache, etc.
Runs on CPython so you can use your favorite libs (numpy, pandas etc.)

Features
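What such a streaming job boils down to can be sketched without a cluster (a toy driver standing in for Hadoop's map-sort-reduce cycle; the TSV layout is the stream tuples from earlier):

```python
import itertools

def mapper(line):
    # Input line: username \t track \t artist \t timestamp
    _, _, artist, _ = line.rstrip("\n").split("\t")
    yield artist, 1

def reducer(artist, counts):
    # All counts for one artist arrive together; emit the total.
    yield artist, sum(counts)

def run_local(lines):
    # Stand-in for Hadoop's shuffle: sort mapper output by key,
    # then hand each key group to the reducer.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    result = []
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        result.extend(reducer(key, (count for _, count in group)))
    return result
```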

Page 34: Luigi Presentation at OSCON 2013

Built-in Hadoop Streaming Python framework

Hadoop MapReduce

Page 35: Luigi Presentation at OSCON 2013

More features

Page 36: Luigi Presentation at OSCON 2013

Luigi’s “visualiser”

Page 37: Luigi Presentation at OSCON 2013

Dive into any task

Page 38: Luigi Presentation at OSCON 2013

Basic multi-processing

Multiple workers
$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W08

Page 39: Luigi Presentation at OSCON 2013

Great for automated execution

Error notifications

Page 40: Luigi Presentation at OSCON 2013

Prevents two identical tasks from running simultaneously

Process Synchronization

[Diagram: Luigi worker 1 and Luigi worker 2 each hold a dependency graph of tasks (A, B, C, F), both coordinated by the Luigi central planner]
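A toy sketch of the idea (not Luigi's actual scheduler code): the central planner records which worker is running each task, and a second worker asking for the same task is refused until the first one finishes.

```python
class ToyCentralPlanner:
    """Hands each task to at most one worker at a time."""

    def __init__(self):
        self.running = {}   # task_id -> worker currently running it

    def try_acquire(self, task_id, worker):
        # First worker to ask gets the task; everyone else is
        # refused until the owner calls done().
        owner = self.running.setdefault(task_id, worker)
        return owner == worker

    def done(self, task_id):
        self.running.pop(task_id, None)
```

So if worker 1 and worker 2 both depend on task A, only one of them runs it; the other picks up a different pending task in the meantime.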

Page 41: Luigi Presentation at OSCON 2013

... what happens

Process Synchronization

[Diagram: both workers depend on task A; the central planner lets only one worker run it, while the other continues with a different pending task]


Page 44: Luigi Presentation at OSCON 2013

Large data flows(Screenshot from web interface)

Page 45: Luigi Presentation at OSCON 2013

Things Luigi is not

Page 46: Luigi Presentation at OSCON 2013

Yes, you can run Python Hadoop jobs in Luigi. But the main focus is workflow management.

Luigi is not trying to replace mrjob

Page 47: Luigi Presentation at OSCON 2013

You still need to figure out how each task runs

Luigi does not give you scalability

Page 48: Luigi Presentation at OSCON 2013

MapReduce/Pig/Hive/etc are wonderful tools for doing this and Luigi is more than happy to delegate to them.

Luigi does not help you transform the data

Page 49: Luigi Presentation at OSCON 2013

Although Oozie is kind of annoying

...but it’s sort of like Oozie

                 Oozie   Luigi
Only Hadoop      Yes!
Horrible XML     Yes!
Easy                     Yes!
Fun & powerful           Yes!

Page 50: Luigi Presentation at OSCON 2013

“Oozie example”<workflow-app xmlns='uri:oozie:workflow:0.1' name='processDir'>

<start to='getDirInfo' /> <!-- STEP ONE --> <action name='getDirInfo'> <!--writes 2 properties: dir.num-files: returns -1 if dir doesn't exist, otherwise returns # of files in dir dir.age: returns -1 if dir doesn't exist, otherwise returns age of dir in days --> <java> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <main-class>com.navteq.oozie.GetDirInfo</main-class> <arg>${inputDir}</arg> <capture-output /> </java> <ok to="makeIngestDecision" /> <error to="fail" /> </action>

<!-- STEP TWO --> <decision name="makeIngestDecision"> <switch> <!-- empty or doesn't exist --> <case to="end"> ${wf:actionData('getDirInfo')['dir.num-files'] lt 0 || (wf:actionData('getDirInfo')['dir.age'] lt 1 and wf:actionData('getDirInfo')['dir.num-files'] lt 24)} </case> <!-- # of files >= 24 --> <case to="ingest"> ${wf:actionData('getDirInfo')['dir.num-files'] gt 23 || wf:actionData('getDirInfo')['dir.age'] gt 6} </case> <default to="sendEmail"/> </switch> </decision>

<!--EMAIL--> <action name="sendEmail"> <java> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <main-class>com.navteq.oozie.StandaloneMailer</main-class> <arg>[email protected]</arg> <arg>[email protected]</arg> <arg>${inputDir}</arg> <arg>${wf:actionData('getDirInfo')['dir.num-files']}</arg> <arg>${wf:actionData('getDirInfo')['dir.age']}</arg> </java> <ok to="end" /> <error to="fail" /> </action>

<!--INGESTION --> <action name="ingest"> <java> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <prepare> <delete path="${outputDir}" /> </prepare> <configuration> <property> <name>mapred.reduce.tasks</name> <value>300</value> </property> </configuration> <main-class>com.navteq.probedata.drivers.ProbeIngest</main-class> <arg>-conf</arg> <arg>action.xml</arg> <arg>${inputDir}</arg> <arg>${outputDir}</arg> </java> <ok to="archive-data" /> <error to="ingest-fail" /> </action>

<!-- Archive Data --> <action name="archive-data"> <fs> <move source='${inputDir}' target='/probe/backup/${dirName}' /> <delete path = '${inputDir}' /> </fs> <ok to="end" /> <error to="ingest-fail" /> </action>

<kill name="ingest-fail"> <message>Ingestion failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill>

<kill name="fail"> <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name='end' /> </workflow-app>

Page 51: Luigi Presentation at OSCON 2013

Instead, focus on ridiculously little boilerplate code
General so you can build whatever on top of it
As well as a rapid experimentation cycle
Once things work, trivial to put in production

Luigi does not have 999 features

Page 52: Luigi Presentation at OSCON 2013

What we use Luigi for
Hadoop Streaming
Java Hadoop MapReduce
Hive
Pig
Train machine learning models
Import/export data to/from Postgres
Insert data into Cassandra
scp/rsync/ftp data files and reports
Dump and load databases

Others using it with Scala MapReduce and MRJob as well

Page 53: Luigi Presentation at OSCON 2013

Be one of the cool kids!

Page 54: Luigi Presentation at OSCON 2013

Originated at Spotify
Mainly built by me and Elias Freider
Based on many years of experience with data processing
Open source since September 2012

https://github.com/spotify/luigi

Luigi is open source

Page 55: Luigi Presentation at OSCON 2013

• Pig
• EC2
• Scalding
• Cassandra

Future plans!

Page 56: Luigi Presentation at OSCON 2013

For more information feel free to reach out at http://github.com/spotify/luigi

Thank you! Oh, and we’re hiring – http://spotify.com/jobs

Erik Bernhardsson, [email protected]