Cascading for the Impatient
Paco Nathan
Concurrent, Inc.
@pacoid
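[Diagram: word count flow with stop words. Labels: Document Collection, Tokenize, Regex token, Scrub token, Stop Word List, HashJoin (Left, RHS), GroupBy token, Count, Word Count; map (M) and reduce (R) phases marked]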
Copyright @2012, Concurrent, Inc.
Unstructured Data meets Enterprise Scale
why?
cascading.org/
how?
• Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)
• Systems Integrator POV: system integration of heterogeneous data sources and compute platforms
• Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law
• Data Architect POV: a physical plan for large-scale data flow management
• Software Architect POV: a pattern language, similar to plumbing or circuit design
• App Developer POV: API bindings for Scala, Clojure, Python, Ruby, Java
• Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
who?
Scala, Clojure, Python, Ruby, Java, etc. …envision whatever else runs in a JVM
where?
Nagios, etc.
(raw human intellect, unless…)
Domain expertise, business trade-offs, operating parameters, etc.
Apache Hadoop, in-memory local mode …envision GPUs, other frameworks, etc.
business process
API language
logical plan / optimize
physical plan
compute framework
monitors, notification
“assembler” code
1: copy
[Diagram: Source, Sink; one map (M) phase]
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main
  {
  public static void main( String[] args )
    {
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }

1 mapper, 0 reducers, 10 lines code
ten lines of code for a file copy …seems like a lot.
wait!
same JAR, any scale…
Your Mom’s Laptop: MBs of data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
Staging Cluster: GBs of data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours
Production Cluster: TBs of data; EMR + 50 HPC Instances; Ops monitors results; runtime: hours – days
MegaCorp Enterprise IT: PBs of data; 1000+ node cluster; EVP calls you when app fails; runtime: days+
2: word count
[Diagram: word count flow. Labels: Document Collection, Tokenize, GroupBy token, Count, Word Count; map (M) and reduce (R) phases marked]
1 mapper, 1 reducer, 18 lines code
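To make the tokenize / group-by / count steps concrete, here is a minimal plain-Java sketch of what the flow computes. It is not the Cascading API: the class name, method name, and delimiter pattern are assumptions made for illustration, and the whole thing runs in memory on a single string.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WordCountSketch {
    // Assumed delimiter set for illustration: space, brackets, parentheses, comma, period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Tokenize the text, group identical tokens, and count each group.
    static Map<String, Long> wordCount(String text) {
        return Arrays.stream(text.split(DELIMITERS))
            .filter(token -> !token.isEmpty())   // adjacent delimiters yield empty strings
            .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount("A rain shadow is a dry area"));
    }
}
```

Note that "A" and "a" are counted as different tokens here, which is the gap the scrub step in part 3 addresses.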
3: wc + scrub
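[Diagram: word count flow with a scrub step. Labels: Document Collection, Tokenize, Scrub token, GroupBy token, Count, Word Count; map (M) and reduce (R) phases marked]
1 mapper, 1 reducer, 22+10 lines code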
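A scrub step normalizes each token before counting. The sketch below assumes that scrubbing means trimming whitespace, lower-casing, and dropping tokens that end up empty; that rule, like the class and method names, is an assumption for illustration rather than the deck's actual operation.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.stream.Collectors;

class ScrubSketch {
    // Assumed scrub rule: trim, lower-case, and emit nothing for empty tokens.
    static Optional<String> scrub(String token) {
        String cleaned = token.trim().toLowerCase(Locale.ROOT);
        return cleaned.isEmpty() ? Optional.empty() : Optional.of(cleaned);
    }

    // Apply the scrub rule to every token, keeping only the tokens that survive.
    static List<String> scrubAll(List<String> tokens) {
        return tokens.stream()
            .map(ScrubSketch::scrub)
            .flatMap(Optional::stream)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(scrubAll(List.of(" Rain", "", "SHADOW ")));
    }
}
```

In a flow, a custom operation like this is what the "+10" in the line count plausibly refers to: logic that lives outside the main pipe assembly.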
4: wc + scrub + stop words
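[Diagram: word count flow with scrub and stop words. Labels: Document Collection, Tokenize, Regex token, Scrub token, Stop Word List, HashJoin (Left, RHS), GroupBy token, Count, Word Count; map (M) and reduce (R) phases marked]
1 mapper, 1 reducer, 28+10 lines code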
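The diagram labels a left HashJoin with the stop word list on the right-hand side (RHS). One way to read that: every token is joined against the small stop list, and only tokens with no match are kept. The plain-Java sketch below mimics that logic in memory; it is not Cascading code, and the three stop words in `main` are placeholders, not the contents of the real stop list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class StopWordSketch {
    // Mimic a left join against the stop list, then keep only unmatched tokens.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Left join: the token always survives; the right-hand value is null when unmatched.
            String stop = stopWords.contains(token) ? token : null;
            if (stop == null) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("a", "is", "the");   // hypothetical stop list
        System.out.println(removeStopWords(List.of("a", "rain", "shadow", "is", "dry"), stopWords));
    }
}
```

Keeping the stop list on the right-hand side matters because it is the small input: it can be held in memory while the tokens stream past, which is consistent with the mapper and reducer counts staying at 1 and 1.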
5: tf-idf
[Diagram: tf-idf flow. Labels: Document Collection, Tokenize, Regex token, Scrub token, Stop Word List, HashJoin (Left, RHS), GroupBy token, Count, Word Count; TF branch (GroupBy doc_id, token; Count); D branch (Unique doc_id; Insert 1; SumBy doc_id); DF branch (Unique token; GroupBy token; Count); HashJoin (RHS), CoGroup (RHS), ExprFunc tf-idf, TF-IDF; multiple map (M) and reduce (R) phases marked]
11 mappers, 9 reducers, 65+10 lines code
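The TF, D, and DF branches in the diagram feed a single tf-idf expression. The sketch below assumes the form tf × ln(D / (1 + df)), where tf is a token's count within one document, D is the number of documents, and df is the number of documents containing the token. That form is an assumption, but with D = 5 it reproduces the scores on the results slide: ln(5/2) ≈ 0.9163, ln(5/3) ≈ 0.5108, ln(5/4) ≈ 0.2231, ln(5/5) = 0. Everything else (names, in-memory maps) is illustrative plain Java, not the Cascading API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TfIdfSketch {
    // Assumed scoring form: tf * ln(D / (1 + df)).
    static double tfIdf(long tf, long nDocs, long df) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    // docs maps doc_id -> scrubbed tokens; the result maps doc_id -> (token -> tf-idf).
    static Map<String, Map<String, Double>> score(Map<String, List<String>> docs) {
        long nDocs = docs.size();                            // D: number of unique doc_ids
        Map<String, Long> df = new HashMap<>();              // DF: documents per token
        Map<String, Map<String, Long>> tf = new TreeMap<>(); // TF: count per (doc_id, token)

        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            Map<String, Long> counts = new TreeMap<>();
            for (String token : doc.getValue()) {
                counts.merge(token, 1L, Long::sum);
            }
            tf.put(doc.getKey(), counts);
            for (String token : counts.keySet()) {
                df.merge(token, 1L, Long::sum);              // each document counts once per token
            }
        }

        Map<String, Map<String, Double>> result = new TreeMap<>();
        tf.forEach((docId, counts) -> {
            Map<String, Double> scores = new TreeMap<>();
            counts.forEach((token, count) -> scores.put(token, tfIdf(count, nDocs, df.get(token))));
            result.put(docId, scores);
        });
        return result;
    }

    public static void main(String[] args) {
        System.out.printf("%.4f%n", tfIdf(1, 5, 1));
        System.out.printf("%.4f%n", tfIdf(2, 5, 3));
    }
}
```

In the flow, each of the three counts is its own branch with its own grouping, which is why the mapper and reducer counts jump from 1 and 1 to 11 and 9 even though the arithmetic is a one-line expression.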
6: tf-idf + tdd
[Diagram: tf-idf flow with test-driven development additions. Same labels as the tf-idf flow in part 5, plus Assert, Failure Traps, and Checkpoint; multiple map (M) and reduce (R) phases marked]
12 mappers, 9 reducers, 76+14 lines code
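The Assert and Failure Traps labels suggest a pattern where records that fail a check are diverted to a side output instead of stopping the whole run. A minimal plain-Java sketch of that idea follows. The validity rule (a line must start with "doc", digits, then whitespace) is hypothetical, chosen so that a stray row like "zoink null" from the sample data would be diverted; none of this is the Cascading API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TrapSketch {
    // Hypothetical assertion: a valid line starts with "doc", then digits, then whitespace.
    static final Pattern VALID = Pattern.compile("doc\\d+\\s.*");

    // Holds the records that passed the assertion and the ones diverted to the trap.
    static final class Result {
        final List<String> passed = new ArrayList<>();
        final List<String> trapped = new ArrayList<>();
    }

    // Route each line to the main output or to the trap, without aborting on bad input.
    static Result route(List<String> lines) {
        Result result = new Result();
        for (String line : lines) {
            if (VALID.matcher(line).matches()) {
                result.passed.add(line);
            } else {
                result.trapped.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result result = route(List.of("doc01\tA rain shadow is a dry area", "zoink\tnull"));
        System.out.println("passed:  " + result.passed);
        System.out.println("trapped: " + result.trapped);
    }
}
```

The trapped list plays the role of the out/trap path in the deployment command: bad rows stay available for inspection while the rest of the data flows on.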
deployed…
elastic-mapreduce --create --name "TF-IDF" \
  --jar s3n://temp.cascading.org/impatient/part6.jar \
  --arg s3n://temp.cascading.org/impatient/rain.txt \
  --arg s3n://temp.cascading.org/impatient/out/wc \
  --arg s3n://temp.cascading.org/impatient/en.stop \
  --arg s3n://temp.cascading.org/impatient/out/tfidf \
  --arg s3n://temp.cascading.org/impatient/out/trap \
  --arg s3n://temp.cascading.org/impatient/out/check
results?
doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
zoink   null
doc_id  tf-idf  token
doc02   0.9163  air
doc05   0.9163  australia
doc05   0.9163  broken
doc04   0.9163  california's
doc04   0.9163  cause
doc02   0.9163  cloudcover
doc04   0.9163  death
doc04   0.9163  deserts
doc03   0.9163  downwind
…
doc02   0.9163  sinking
doc04   0.9163  such
doc04   0.9163  valley
doc05   0.9163  women
doc03   0.5108  land
doc05   0.5108  land
doc01   0.5108  lee
doc02   0.5108  lee
doc03   0.5108  leeward
doc04   0.5108  leeward
doc01   0.4463  area
doc02   0.2231  area
doc03   0.2231  area
doc01   0.2231  dry
doc02   0.2231  dry
doc03   0.2231  dry
doc02   0.2231  mountain
doc03   0.2231  mountain
doc04   0.2231  mountain
doc01   0.0000  rain
doc02   0.0000  rain
doc03   0.0000  rain
doc04   0.0000  rain
doc01   0.0000  shadow
doc02   0.0000  shadow
doc03   0.0000  shadow
doc04   0.0000  shadow
comparisons?
compare similar code in Scalding and Cascalog:
sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
based on: github.com/twitter/scalding/wiki
github.com/Quantisan/Impatient
based on: github.com/nathanmarz/cascalog/wiki
blog, code, wiki, gists, jars, list, DevOps products:
cascading.org/category/impatient/
github.org/Cascading/
conjars.org/
goo.gl/KQtUL
concurrentinc.com/
drill-down?