Luigi Presentation at OSCON 2013

Erik Bernhardsson, [email protected]

Batch data processing in Python

Uploaded by erik-bernhardsson on 15-Jan-2015

DESCRIPTION

From OSCON 2013

TRANSCRIPT

Page 1: Luigi Presentation at OSCON 2013

Erik Bernhardsson, [email protected]

Batch data processing in Python

Page 2: Luigi Presentation at OSCON 2013

Focusing mostly on music discovery and large-scale machine learning
Previously managed the “Analytics team” in Stockholm

I’m at Spotify in NYC

Btw I’m Erik Bernhardsson

Page 3: Luigi Presentation at OSCON 2013

Background

Billions of log messages (several TBs) every day
Usage and backend stats, debug information
What we want to do:

A/B testing
Music recommendations
Monthly/daily/hourly reporting
Business metric dashboards

We experiment a lot – need quick development cycles

We crunch a lot of data

Why did we build Luigi?

Page 4: Luigi Presentation at OSCON 2013

Our second cluster (in 2009):

We like Hadoop

Page 5: Luigi Presentation at OSCON 2013

Long story short :)

Our fifth cluster

Page 6: Luigi Presentation at OSCON 2013

Running one job is easy

Lots of long-running processes with dependencies
Need monitoring
Handle failures
Go from experimentation to production easily

But what about running 1000s of jobs every day?

Page 7: Luigi Presentation at OSCON 2013

But also non-Hadoop stuff
Most things are Python Map/Reduce jobs
Also Pig, Hive
SCP files from one host to another
Train a machine learning model
Put data in Cassandra

Page 8: Luigi Presentation at OSCON 2013

In the pre-Luigi world

How not to do workflows

Page 9: Luigi Presentation at OSCON 2013

“Streams” is a list of (username, track, artist, timestamp) tuples

Example: Artist Toplist

Streams → Artist Aggregation → Top 10 → Database
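The whole flow can be sketched in plain Python (hypothetical sample data; in the real pipeline each step is a separate job over TSV files in HDFS):

```python
from collections import Counter

# Hypothetical sample of (username, track, artist, timestamp) stream tuples
streams = [
    ("alice", "Song A", "Artist X", 1363180800),
    ("bob",   "Song B", "Artist X", 1363180900),
    ("carol", "Song C", "Artist Y", 1363181000),
]

# Artist aggregation: count streams per artist
artist_counts = Counter(artist for _, _, artist, _ in streams)

# Top 10: the ten most-streamed artists, ready to load into a database
top10 = artist_counts.most_common(10)
```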

Page 10: Luigi Presentation at OSCON 2013

Pre-Luigi example of artist toplists

Don’t do this at home

Page 11: Luigi Presentation at OSCON 2013

OK, so chain the tasks

Page 12: Luigi Presentation at OSCON 2013

Cron nicer, yay!

Page 13: Luigi Presentation at OSCON 2013

That’s OK, but don’t leave broken data somewhere (btw, Luigi gives you atomic file operations locally and in HDFS)

Errors will occur
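The local flavor of those atomic file operations can be sketched with the stdlib (a simplified stand-in, not Luigi's actual implementation): write to a temporary file, then rename it into place, so a crash mid-write never leaves a half-written output that downstream tasks mistake for finished data.

```python
import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file in the target directory, then rename.
    # rename() is atomic on POSIX, so readers see either the old
    # file or the complete new one -- never a partial file.
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```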

Page 14: Luigi Presentation at OSCON 2013

The second step fails, you fix it, then you want to resume

Don’t run things twice

Page 15: Luigi Presentation at OSCON 2013

To use data flows as command line tools

Parametrize tasks

Page 16: Luigi Presentation at OSCON 2013

You want to run the dataflow for a set of similar inputs

Put tasks in loops

Page 17: Luigi Presentation at OSCON 2013

Plumbing sucks

Page 18: Luigi Presentation at OSCON 2013

Graph algorithms rock!

Plumbing sucks...

Page 19: Luigi Presentation at OSCON 2013

Who’s the world’s second most famous plumber?

Hint: he wears green

Page 20: Luigi Presentation at OSCON 2013

A Python framework for data flow definition and execution

Introducing Luigi

Page 21: Luigi Presentation at OSCON 2013

On steroids and PCP... with a toolbox of mainly Hadoop related stuff

Simple dependency definitions
Emphasis on Hadoop/HDFS integration
Atomic file operations
Data flow visualization
Command line integration

Main features

Luigi is “kind of like Makefile” in Python

Page 22: Luigi Presentation at OSCON 2013

Luigi Task

Page 23: Luigi Presentation at OSCON 2013

Luigi - Aggregate Artists

Page 24: Luigi Presentation at OSCON 2013

Luigi - Aggregate Artists
Run on the command line:
$ python dataflow.py AggregateArtists

DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running AggregateArtists()
INFO: [pid 74375] Done AggregateArtists()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time

Page 25: Luigi Presentation at OSCON 2013

Top 10 artists - Wrapped arbitrary Python code

Completing the top list

Page 26: Luigi Presentation at OSCON 2013

Basic functionality for exporting to Postgres. Cassandra support is in the works

Database support

Page 27: Luigi Presentation at OSCON 2013

Running it all...

DEBUG: Checking if ArtistToplistToDatabase() is complete
INFO: Scheduled ArtistToplistToDatabase()
DEBUG: Checking if Top10Artists() is complete
INFO: Scheduled Top10Artists()
DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 74811] Running AggregateArtists()
INFO: [pid 74811] Done AggregateArtists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 74811] Running Top10Artists()
INFO: [pid 74811] Done Top10Artists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74811] Running ArtistToplistToDatabase()
INFO: Done writing, importing at 2013-03-13 15:41:09.407138
INFO: [pid 74811] Done ArtistToplistToDatabase()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time

Page 28: Luigi Presentation at OSCON 2013

Imagine how cool this would be with real data...

The results

Page 29: Luigi Presentation at OSCON 2013
Page 30: Luigi Presentation at OSCON 2013

Tasks have implicit __init__

Task Parameters

Generates command line interface with typing and documentation

Class variables with some magic

$ python dataflow.py AggregateArtists --date 2013-03-05

Page 31: Luigi Presentation at OSCON 2013

Combined usage example

Task Parameters

Page 32: Luigi Presentation at OSCON 2013

Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar files
Running Hive and (soon) Pig queries
Inserting data sets into Postgres

Luigi comes with a toolbox of abstract Tasks for...

... how to run anything, really

Task templates and targets

Writing new ones is as easy as defining an interface and implementing run()

Page 33: Luigi Presentation at OSCON 2013

Built-in Hadoop Streaming Python framework

Hadoop MapReduce

Tiny interface – just implement mapper and reducer
Fetches error logs from Hadoop cluster and displays them to the user
Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map-side joins
Easy to send along Python modules that might not be installed on the cluster
Support for counters, secondary sort, combiners, distributed cache, etc.
Runs on CPython so you can use your favorite libs (numpy, pandas etc.)

Features
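What such a streaming job boils down to can be sketched without a cluster (a toy driver standing in for Hadoop's map-sort-reduce cycle; the TSV layout is the stream tuples from earlier):

```python
import itertools

def mapper(line):
    # Input line: username \t track \t artist \t timestamp
    _, _, artist, _ = line.rstrip("\n").split("\t")
    yield artist, 1

def reducer(artist, counts):
    # All counts for one artist arrive together; emit the total.
    yield artist, sum(counts)

def run_local(lines):
    # Stand-in for Hadoop's shuffle: sort mapper output by key,
    # then hand each key group to the reducer.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    result = []
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        result.extend(reducer(key, (count for _, count in group)))
    return result
```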

Page 34: Luigi Presentation at OSCON 2013

Built-in Hadoop Streaming Python framework

Hadoop MapReduce

Page 35: Luigi Presentation at OSCON 2013

More features

Page 36: Luigi Presentation at OSCON 2013

Luigi’s “visualiser”

Page 37: Luigi Presentation at OSCON 2013

Dive into any task

Page 38: Luigi Presentation at OSCON 2013

Basic multi-processing

Multiple workers
$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W08

Page 39: Luigi Presentation at OSCON 2013

Great for automated execution

Error notifications

Page 40: Luigi Presentation at OSCON 2013

Prevents two identical tasks from running simultaneously

Process Synchronization

[Diagram: Luigi worker 1 and Luigi worker 2 each hold a dependency graph of tasks (A, B, C, F), both coordinated by the Luigi central planner]
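A toy sketch of the idea (not Luigi's actual scheduler code): the central planner records which worker is running each task, and a second worker asking for the same task is refused until the first one finishes.

```python
class ToyCentralPlanner:
    """Hands each task to at most one worker at a time."""

    def __init__(self):
        self.running = {}   # task_id -> worker currently running it

    def try_acquire(self, task_id, worker):
        # First worker to ask gets the task; everyone else is
        # refused until the owner calls done().
        owner = self.running.setdefault(task_id, worker)
        return owner == worker

    def done(self, task_id):
        self.running.pop(task_id, None)
```

So if worker 1 and worker 2 both depend on task A, only one of them runs it; the other picks up a different pending task in the meantime.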

Page 41: Luigi Presentation at OSCON 2013

... what happens

Process Synchronization

[Diagram: both workers depend on task A; the central planner lets only one worker run it, while the other continues with a different pending task]


Page 44: Luigi Presentation at OSCON 2013

Large data flows(Screenshot from web interface)

Page 45: Luigi Presentation at OSCON 2013

Things Luigi is not

Page 46: Luigi Presentation at OSCON 2013

Yes, you can run Python Hadoop jobs in Luigi. But the main focus is workflow management.

Luigi is not trying to replace mrjob

Page 47: Luigi Presentation at OSCON 2013

You still need to figure out how each task runs

Luigi does not give you scalability

Page 48: Luigi Presentation at OSCON 2013

MapReduce/Pig/Hive/etc are wonderful tools for doing this and Luigi is more than happy to delegate to them.

Luigi does not help you transform the data

Page 49: Luigi Presentation at OSCON 2013

Although Oozie is kind of annoying

...but it’s sort of like Oozie

                 Oozie   Luigi
Only Hadoop      Yes!
Horrible XML     Yes!
Easy                     Yes!
Fun & powerful           Yes!

Page 50: Luigi Presentation at OSCON 2013

“Oozie example”<workflow-app xmlns='uri:oozie:workflow:0.1' name='processDir'>

<start to='getDirInfo' /> <!-- STEP ONE --> <action name='getDirInfo'> <!--writes 2 properties: dir.num-files: returns -1 if dir doesn't exist, otherwise returns # of files in dir dir.age: returns -1 if dir doesn't exist, otherwise returns age of dir in days --> <java> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <main-class>com.navteq.oozie.GetDirInfo</main-class> <arg>${inputDir}</arg> <capture-output /> </java> <ok to="makeIngestDecision" /> <error to="fail" /> </action>

<!-- STEP TWO --> <decision name="makeIngestDecision"> <switch> <!-- empty or doesn't exist --> <case to="end"> ${wf:actionData('getDirInfo')['dir.num-files'] lt 0 || (wf:actionData('getDirInfo')['dir.age'] lt 1 and wf:actionData('getDirInfo')['dir.num-files'] lt 24)} </case> <!-- # of files >= 24 --> <case to="ingest"> ${wf:actionData('getDirInfo')['dir.num-files'] gt 23 || wf:actionData('getDirInfo')['dir.age'] gt 6} </case> <default to="sendEmail"/> </switch> </decision>

<!--EMAIL--> <action name="sendEmail"> <java> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <main-class>com.navteq.oozie.StandaloneMailer</main-class> <arg>[email protected]</arg> <arg>[email protected]</arg> <arg>${inputDir}</arg> <arg>${wf:actionData('getDirInfo')['dir.num-files']}</arg> <arg>${wf:actionData('getDirInfo')['dir.age']}</arg> </java> <ok to="end" /> <error to="fail" /> </action>

<!--INGESTION --> <action name="ingest"> <java> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <prepare> <delete path="${outputDir}" /> </prepare> <configuration> <property> <name>mapred.reduce.tasks</name> <value>300</value> </property> </configuration> <main-class>com.navteq.probedata.drivers.ProbeIngest</main-class> <arg>-conf</arg> <arg>action.xml</arg> <arg>${inputDir}</arg> <arg>${outputDir}</arg> </java> <ok to="archive-data" /> <error to="ingest-fail" /> </action>

<!-- Archive Data --> <action name="archive-data"> <fs> <move source='${inputDir}' target='/probe/backup/${dirName}' /> <delete path = '${inputDir}' /> </fs> <ok to="end" /> <error to="ingest-fail" /> </action>

<kill name="ingest-fail"> <message>Ingestion failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill>

<kill name="fail"> <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name='end' /> </workflow-app>

Page 51: Luigi Presentation at OSCON 2013

Instead, focus on ridiculously little boilerplate code
General so you can build whatever on top of it
As well as a rapid experimentation cycle
Once things work, trivial to put in production

Luigi does not have 999 features

Page 52: Luigi Presentation at OSCON 2013

What we use Luigi for
Hadoop Streaming
Java Hadoop MapReduce
Hive
Pig
Train machine learning models
Import/export data to/from Postgres
Insert data into Cassandra
scp/rsync/ftp data files and reports
Dump and load databases

Others using it with Scala MapReduce and MRJob as well

Page 53: Luigi Presentation at OSCON 2013

Be one of the cool kids!

Page 54: Luigi Presentation at OSCON 2013

Originated at Spotify
Mainly built by me and Elias Freider
Based on many years of experience with data processing
Open source since September 2012

https://github.com/spotify/luigi

Luigi is open source

Page 55: Luigi Presentation at OSCON 2013

• Pig
• EC2
• Scalding
• Cassandra

Future plans!

Page 56: Luigi Presentation at OSCON 2013

For more information feel free to reach out at http://github.com/spotify/luigi

Thank you! Oh, and we’re hiring – http://spotify.com/jobs

Erik Bernhardsson, [email protected]