Cascading for the Impatient

Paco Nathan, Concurrent, Inc.

@pacoid

[Figure: flow diagram. Nodes: Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count. M and R mark the map and reduce phases.]

Copyright © 2012, Concurrent, Inc.

Unstructured Data meets Enterprise Scale

why?

Cascading.org/

how?

• Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)

• Systems Integrator POV: system integration of heterogeneous data sources and compute platforms

• Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law

• Data Architect POV: a physical plan for large-scale data flow management

• Software Architect POV: a pattern language, similar to plumbing or circuit design

• App Developer POV: API bindings for Scala, Clojure, Python, Ruby, Java

• Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
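
For reference, the Data Scientist bullet leans on Amdahl's Law, which in its usual textbook form reads as below. The symbols are introduced here and do not appear in the slides: p is the fraction of the workflow that can run in parallel, n is the number of parallel workers, and S(n) is the resulting speedup.

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}
```

Read against a flow's DAG, the serial fraction (1 - p) corresponds to steps that cannot be spread across more nodes, which caps the speedup at 1 / (1 - p) however large the cluster gets.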

who?

Scala, Clojure, Python, Ruby, Java, etc. … envision whatever else runs in a JVM

where?

Nagios, etc.

(raw human intellect, unless…)

Domain expertise, business trade-offs, operating parameters, etc.

Apache Hadoop, in-memory local mode … envision GPUs, other frameworks, etc.

business process

API language

logical plan / optimize

physical plan

compute framework

monitors, notification

“assembler” code

1: copy

[Figure: Source → Sink, with a single map phase (M).]

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main {
  public static void main( String[] args ) {
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
  }
}

1 mapper, 0 reducers, 10 lines of code

ten lines of code for a file copy …seems like a lot.

wait!

same JAR, any scale…

Your Mom’s Laptop: MBs of data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes

Staging Cluster: GBs of data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours

Production Cluster: TBs of data; EMR + 50 HPC Instances; Ops monitors results; runtime: hours – days

MegaCorp Enterprise IT: PBs of data; 1000+ node cluster; EVP calls you when app fails; runtime: days+

2: word count

[Figure: Document Collection → Tokenize → GroupBy token → Count → Word Count. M and R mark the map and reduce phases.]

1 mapper, 1 reducer, 18 lines of code
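
To make the Tokenize → GroupBy → Count step concrete, here is a minimal plain-Java sketch of what the flow computes. It deliberately does not use the Cascading API. The class name, method name, and delimiter regex are assumptions made for illustration, and the sample lines are taken from the input shown under "results?".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // Assumed delimiters: spaces, square brackets, parentheses, commas, periods.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Tokenize each line, then group identical tokens and count them.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (!token.isEmpty()) {          // skip gaps left by adjacent delimiters
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "A rain shadow is a dry area on the lee back side of a mountainous area.",
            "A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.");
        // Print one "token<TAB>count" row per distinct token.
        wordCount(lines).forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

In the real flow the tokenizing happens in the map phase and the grouping and counting in the reduce phase, which is where the "1 mapper, 1 reducer" figure comes from.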

3: wc + scrub

[Figure: flow diagram. Nodes: Document Collection, Tokenize, Scrub token, GroupBy token, Count, Word Count. M and R mark the map and reduce phases.]

1 mapper, 1 reducer, 22+10 lines of code
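
The slides do not show what the Scrub step does to each token. The sketch below assumes it trims whitespace, lowercases, and drops empty tokens, which is at least consistent with the lowercase tokens ("australia", "women") in the results. It is plain Java rather than a Cascading function, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ScrubSketch {
    // Assumed scrub rule: trim surrounding whitespace and lowercase.
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // Apply the rule to every token and drop any that end up empty.
    static List<String> scrubAll(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String clean = scrub(token);
            if (!clean.isEmpty()) {
                out.add(clean);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scrubAll(Arrays.asList("Two", " Women", "", "Australia ")));
    }
}
```

Because scrubbing works on one token at a time, it can stay in the map phase, which fits the mapper and reducer counts being unchanged from the previous part.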

4: wc + scrub + stop words

[Figure: flow diagram. Nodes: Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count. M and R mark the map and reduce phases.]

1 mapper, 1 reducer, 28+10 lines of code
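
The diagram labels suggest the stop words are removed by a left join against the stop word list followed by a filter. A plain-Java sketch of that idea follows; it is not Cascading code, the stop words used are hypothetical, and the slides do not show the actual filter expression.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopWordSketch {
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Left join: every token survives the join; a token with no
            // match in the stop list gets null on the right-hand side.
            String stop = stopWords.contains(token) ? token : null;
            // Filter: keep only the rows whose right-hand side is null.
            if (stop == null) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "the")); // hypothetical list
        List<String> tokens = Arrays.asList("a", "rain", "shadow", "is", "a", "dry", "area");
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```

Holding the small stop list in memory is what makes a hash join a reasonable fit here: the large side streams past it without needing an extra reduce step.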

5: tf-idf

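[Figure: flow diagram. Shared front end: Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token. Word count branch: GroupBy token, Count, Word Count. TF branch: GroupBy doc_id, token; Count. D branch: Unique doc_id, Insert 1, SumBy doc_id. DF branch: Unique token, GroupBy token, Count. The branches are combined by HashJoin (RHS) and CoGroup (RHS), then ExprFunc tf-idf produces TF-IDF. M and R mark the map and reduce phases.]

11 mappers, 9 reducers, 65+10 lines of code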

6: tf-idf + tdd

[Figure: the TF-IDF flow diagram from part 5, with three additions: Assert, Failure Traps, and Checkpoint. M and R mark the map and reduce phases.]

12 mappers, 9 reducers, 76+14 lines of code
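
The idea behind pairing an assertion with a failure trap can be sketched in plain Java: rows that fail a check are diverted to a side output instead of failing the whole run. This is not the Cascading API. The validity rule (a doc_id of the form "doc" plus digits, with non-null text) and all names are assumptions for illustration; the sample rows mirror the input shown under "results?".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class TrapSketch {
    // Assumed rule: a valid doc_id is "doc" followed by one or more digits.
    static final Pattern VALID_DOC_ID = Pattern.compile("doc\\d+");

    // Returns the rows that pass the check; failing rows are appended to trap.
    static List<String[]> assertAndTrap(List<String[]> rows, List<String[]> trap) {
        List<String[]> passed = new ArrayList<>();
        for (String[] row : rows) {
            boolean ok = row[0] != null
                && VALID_DOC_ID.matcher(row[0]).matches()
                && row[1] != null;
            if (ok) {
                passed.add(row);
            } else {
                trap.add(row);   // diverted, so the rest of the flow keeps running
            }
        }
        return passed;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"doc05", "Two Women. Secrets. A Broken Land. [DVD Australia]"});
        rows.add(new String[] {"zoink", null});
        List<String[]> trap = new ArrayList<>();
        List<String[]> passed = assertAndTrap(rows, trap);
        System.out.println("passed=" + passed.size() + " trapped=" + trap.size());
    }
}
```

The trapped rows can then be inspected after the run, which is presumably the role of the separate trap output path in the deployment command.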

deployed…

elastic-mapreduce --create --name "TF-IDF" \
  --jar s3n://temp.cascading.org/impatient/part6.jar \
  --arg s3n://temp.cascading.org/impatient/rain.txt \
  --arg s3n://temp.cascading.org/impatient/out/wc \
  --arg s3n://temp.cascading.org/impatient/en.stop \
  --arg s3n://temp.cascading.org/impatient/out/tfidf \
  --arg s3n://temp.cascading.org/impatient/out/trap \
  --arg s3n://temp.cascading.org/impatient/out/check

results?

doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
zoink   null

doc_id  tf-idf  token
doc02   0.9163  air
doc05   0.9163  australia
doc05   0.9163  broken
doc04   0.9163  california's
doc04   0.9163  cause
doc02   0.9163  cloudcover
doc04   0.9163  death
doc04   0.9163  deserts
doc03   0.9163  downwind
…
doc02   0.9163  sinking
doc04   0.9163  such
doc04   0.9163  valley
doc05   0.9163  women
doc03   0.5108  land
doc05   0.5108  land
doc01   0.5108  lee
doc02   0.5108  lee
doc03   0.5108  leeward
doc04   0.5108  leeward
doc01   0.4463  area
doc02   0.2231  area
doc03   0.2231  area
doc01   0.2231  dry
doc02   0.2231  dry
doc03   0.2231  dry
doc02   0.2231  mountain
doc03   0.2231  mountain
doc04   0.2231  mountain
doc01   0.0000  rain
doc02   0.0000  rain
doc03   0.0000  rain
doc04   0.0000  rain
doc01   0.0000  shadow
doc02   0.0000  shadow
doc03   0.0000  shadow
doc04   0.0000  shadow
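
The slides do not state the scoring formula, but the scores above are consistent with tf × ln(D / (1 + df)), where tf is the token's count within a document, D is the number of documents (5 here), and df is the number of documents containing the token. The sketch below is a check under that assumption; the class and method names are hypothetical.

```java
import java.util.Locale;

final class TfIdfSketch {
    // Assumed formula, inferred from the table: tf * ln(D / (1 + df)).
    static double tfidf(int tf, int nDocs, int df) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    static String fmt(double value) {
        return String.format(Locale.ROOT, "%.4f", value);
    }

    public static void main(String[] args) {
        int nDocs = 5; // doc01 .. doc05
        System.out.println("air   (tf=1, df=1): " + fmt(tfidf(1, nDocs, 1))); // 0.9163
        System.out.println("land  (tf=1, df=2): " + fmt(tfidf(1, nDocs, 2))); // 0.5108
        System.out.println("area  (tf=2, df=3): " + fmt(tfidf(2, nDocs, 3))); // 0.4463
        System.out.println("area  (tf=1, df=3): " + fmt(tfidf(1, nDocs, 3))); // 0.2231
        System.out.println("rain  (tf=1, df=4): " + fmt(tfidf(1, nDocs, 4))); // 0.0000
    }
}
```

Under this reading, "rain" and "shadow" score zero because they appear in four of the five documents, and doc01 scores twice as high on "area" as doc02 and doc03 because the word occurs twice in its text.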

comparisons?

compare similar code in Scalding and Cascalog:

sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

based on: github.com/twitter/scalding/wiki

github.com/Quantisan/Impatient

based on: github.com/nathanmarz/cascalog/wiki

blog, code, wiki, gists, jars, list, DevOps products:

cascading.org/category/impatient/

github.org/Cascading/

conjars.org/

goo.gl/KQtUL

concurrentinc.com/

drill-down?
