Hadoop User Group EU 2014
![Page 1: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/1.jpg)
A QUICK INTRODUCTION TO THE CASCADING ECOSYSTEM
Chris K Wensel | Hadoop Summit EU 2014
![Page 2: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/2.jpg)
WHO AM I?
• Lead developer of the Cascading open-source project
• Founder of Concurrent, Inc.
• Involved with Apache Hadoop since it was called Apache Nutch
• Systems Architect, not a Data Scientist
![Page 3: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/3.jpg)
For creating data oriented applications, frameworks, and languages [on Apache Hadoop]
Originally designed to hide complexity of Hadoop and prevent thinking in MapReduce
cascading.org
![Page 4: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/4.jpg)
SOME STATS
• Started in 2007
• 2.0 released June 2012
• 2.5 out now
• 3.0 WIP (if you look for it)
• Apache 2.0 Licensed
• Supports all Hadoop distros
![Page 5: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/5.jpg)
What’s it used for?
![Page 6: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/6.jpg)
• Cascading Java API
• Data normalization and cleansing of search and click-through logs for use by analytics tools
• Easy to operationalize heavy lifting of data
![Page 7: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/7.jpg)
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
![Page 8: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/8.jpg)
• Scalding (Scala)
• Machine learning (linear algebra) to improve
‣ User experience
‣ Ad quality (matching users and ad effectiveness)
• All revenue applications are running on Cascading/Scalding
• IPO
![Page 9: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/9.jpg)
• Estimate suicide risk from what people write online
• Cascading + Cassandra
• You can do more than optimize ad yields
• http://www.durkheimproject.org
![Page 10: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/10.jpg)
KEY PROJECTS
[Diagram: Lingual, Pattern, Scalding, and Cascalog shown alongside Cascading and Apache Hadoop]
![Page 11: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/11.jpg)
CASCADING
• Java API (alternative to Hadoop MapReduce)
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
[Architecture diagram: Enterprise Java and Scripting (Scala, Clojure, JRuby, Jython, Groovy) clients; Cascading components: Processing API, Integration API, Scheduler API, Process Planner, Scheduler; Apache Hadoop; Data Stores]
![Page 12: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/12.jpg)
SOME COMMON PATTERNS
• Functions
• Filters
• Joins
‣ Inner / Outer / Mixed
‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
‣ Secondary Sorting
‣ Unique (Distinct)
• Aggregations
‣ Count, Average, etc
‣ Rolling windows
[Diagram: topologies built from data, function, and filter nodes: Pipeline, Split, Join, Merge]
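To make the pattern names above concrete, here is a minimal sketch in plain Java collections and streams. It is not the Cascading API: the class `PatternsSketch` and its method names are hypothetical, and everything runs in memory on one machine. It only illustrates what a function, a filter, a grouping with a count aggregation, and an inner join each do to the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names come from Cascading.
public class PatternsSketch {

    // Function then filter: normalize every value, then drop the empty ones.
    public static List<String> cleanse(List<String> raw) {
        return raw.stream()
                .map(s -> s.trim().toLowerCase())   // function: one value in, one value out
                .filter(s -> !s.isEmpty())          // filter: keep or discard a value
                .collect(Collectors.toList());
    }

    // Grouping plus aggregation: group equal keys and count the size of each group.
    public static Map<String, Long> countByKey(List<String> keys) {
        return keys.stream()
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Inner join: emit "key:left:right" only for keys present on both sides.
    public static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
        List<String> joined = new ArrayList<>();
        // Iterate a sorted copy of the left side so the output order is deterministic.
        for (Map.Entry<String, String> entry : new TreeMap<>(left).entrySet()) {
            String match = right.get(entry.getKey());
            if (match != null) {
                joined.add(entry.getKey() + ":" + entry.getValue() + ":" + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> cleaned = cleanse(Arrays.asList(" Rain", "rain ", "", "Shadow"));
        System.out.println(cleaned);               // [rain, rain, shadow]
        System.out.println(countByKey(cleaned));   // {rain=2, shadow=1}

        Map<String, String> employees = new HashMap<>();
        employees.put("1", "alice");
        employees.put("2", "bob");
        Map<String, String> departments = new HashMap<>();
        departments.put("2", "sales");
        System.out.println(innerJoin(employees, departments)); // [2:bob:sales]
    }
}
```

An outer join would differ only in also emitting the keys that have no match on the other side.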
![Page 13: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/13.jpg)
word count – Cascading Java API
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
wcFlow.writeDOT( "wc.dot" ); // <<-- On Next Slide
wcFlow.complete(); // <<-- Runs jobs on Cluster
[Callouts on the slide mark the configuration, integration, processing, and scheduling parts of the code]
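As a rough picture of what the word count flow computes, the same split, group, and count steps can be sketched in plain Java with no Hadoop or Cascading involved. The class `LocalWordCount` is hypothetical. It reuses the delimiter regex from the slide, and it skips empty tokens, which is a choice of this sketch and not something the slide's flow is shown to do.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical in-memory stand-in for the flow on the slide; not Cascading code.
public class LocalWordCount {

    // Same delimiter character class as the RegexSplitGenerator on the slide.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    public static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Each" step: split one line of text into a stream of tokens.
            for (String token : DELIMITERS.split(line)) {
                if (token.isEmpty()) {
                    continue; // assumption of this sketch: ignore empty tokens
                }
                // "GroupBy" + "Every(Count)" steps: one running count per distinct token.
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(Arrays.asList(
                "rain shadow (dry area).",
                "a rain shadow, [rain]"));
        System.out.println(counts); // {a=1, area=1, dry=1, rain=3, shadow=2}
    }
}
```

On a cluster the grouping is what forces the map/reduce boundary; here a single sorted map plays that role.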
![Page 14: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/14.jpg)
wc.dot
[Flow graph, head to tail:]
[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
Each('token')[RegexSplitGenerator[decl:'token'][args:1]] (map)
GroupBy('wc')[by:['token']]
Every('wc')[Count[decl:'count']] (reduce)
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
[tail]
[Edges are labeled with the fields passed between nodes: [{2}:'doc_id', 'text'], [{1}:'token'], [{2}:'token', 'count']]
![Page 15: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/15.jpg)
A REAL WORLD APP
[Job graph: 75 steps, numbered [1/75] to [75/75]. Steps 55, 57–62, and 71–72 are map-only; all other steps are map+reduce]
1 App, 1 Flow, 75 Steps/MRJobs
green = map + reduce, purple = map, blue = join/merge, orange = map split
A graph of jobs, not operations!
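A graph of jobs has to run in dependency order: a step can start only after every step it reads from has finished. The sketch below shows one standard way to derive such an order, Kahn's topological sort. It is an illustration only, not how Cascading's planner or scheduler is implemented, and the class `StepOrderSketch` and the step names are made up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of ordering a job graph; not Cascading's scheduler.
public class StepOrderSketch {

    // dependsOn maps each step to the steps whose output it needs.
    public static List<String> order(Map<String, List<String>> dependsOn) {
        Map<String, Integer> unmet = new TreeMap<>();            // step -> unfinished dependencies
        Map<String, List<String>> dependents = new HashMap<>();  // step -> steps waiting on it

        // Walk a sorted copy so the result is deterministic.
        for (Map.Entry<String, List<String>> entry : new TreeMap<>(dependsOn).entrySet()) {
            String step = entry.getKey();
            unmet.putIfAbsent(step, 0);
            for (String dependency : entry.getValue()) {
                unmet.putIfAbsent(dependency, 0);
                unmet.merge(step, 1, Integer::sum);
                dependents.computeIfAbsent(dependency, k -> new ArrayList<>()).add(step);
            }
        }

        // Steps with no unmet dependencies are ready to run.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : unmet.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        List<String> ordered = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.poll();
            ordered.add(step);
            // Finishing a step may release the steps that were waiting on it.
            for (String next : dependents.getOrDefault(step, Collections.<String>emptyList())) {
                if (unmet.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (ordered.size() != unmet.size()) {
            throw new IllegalStateException("job graph contains a cycle");
        }
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("1-extract", Collections.<String>emptyList());
        graph.put("2-clean", Arrays.asList("1-extract"));
        graph.put("3-stats", Arrays.asList("1-extract"));
        graph.put("4-join", Arrays.asList("2-clean", "3-stats"));
        System.out.println(order(graph)); // [1-extract, 2-clean, 3-stats, 4-join]
    }
}
```

In this example the two middle steps do not depend on each other, so a real scheduler could run them at the same time; the sketch just lists them in sequence.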
![Page 16: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/16.jpg)
It’s not just for Java
![Page 17: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/17.jpg)
word count – Scalding (Scala)
// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) {
      text : String => text.split("""\s+""")
    }.
    groupBy('word) { group => group.size }.
    write(output)
}
![Page 18: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/18.jpg)
word count – Cascalog (Clojure)
; Paul Lam
; github.com/Quantisan/Impatient

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
![Page 19: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/19.jpg)
“FOR THE IMPATIENT” SERIES
• Step by step tutorials on Cascading on GitHub
• Community has ported them to Scalding and Cascalog
• http://docs.cascading.org/impatient/
![Page 20: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/20.jpg)
WHY CASCADING?
• Foundation of patterns and best practices for building Languages, Frameworks, and Applications
• Designed to abstract Hadoop away from the business logic
• Models other than MapReduce are on the way!
![Page 21: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/21.jpg)
LINGUAL
• ANSI Compatible SQL
• JDBC Driver
• Cascading Java API
• SQL Command Shell
• Catalog Manager Tool
• Data Provider API
[Architecture diagram: CLI / Shell and Enterprise Java clients; Lingual components: JDBC API, Lingual API, Provider API, Query Planner, Catalog; Cascading; Apache Hadoop; Data Stores]
![Page 22: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/22.jpg)
Cascading API
FlowDef flowDef = FlowDef.flowDef()
  .setName( "sqlflow" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
![Page 23: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/23.jpg)
JDBC driver
public void run() throws ClassNotFoundException, SQLException {
  Class.forName( "cascading.lingual.jdbc.Driver" );
  Connection connection =
    DriverManager.getConnection(
      "jdbc:lingual:local;schemas=src/main/resources/data/example" );
  Statement statement = connection.createStatement();

  ResultSet resultSet = statement.executeQuery(
    "select *\n"
      + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
      + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
      + "on e.\"EMPID\" = s.\"CUST_ID\"" );

  // do something

  resultSet.close();
  statement.close();
  connection.close();
}
![Page 24: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/24.jpg)
SHELL - !TABLES
![Page 25: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/25.jpg)
# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
![Page 26: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/26.jpg)
> summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92
![Page 27: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/27.jpg)
“But we use a custom data format”
![Page 28: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/28.jpg)
DATA PROVIDER API
• Any Cascading Tap and/or Scheme can be used from JDBC
• Use a “fat jar” on local disk or from a Maven repo
‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0
• The jar is dynamically loaded into the cluster
![Page 29: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/29.jpg)
[Diagram: the query SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ... runs as a series of jobs on Amazon Elastic MapReduce; file1, file2, and results are shown with Amazon S3 and Amazon Redshift]
![Page 30: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/30.jpg)
WHY LINGUAL
• Quickly migrate existing workloads from RDBMS to Hadoop
• Quickly extract data from Hadoop into applications
![Page 31: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/31.jpg)
PATTERN
• Predictive model scoring
• Java API and PMML parser
• Supports:
‣ (General) Regression
‣ Clustering
‣ Decision Trees
‣ Random Forest
‣ and ensembles of models
[Architecture diagram: Enterprise Java client; Pattern components: PMML Parser, Pattern API; Cascading; Apache Hadoop; Data Stores]
![Page 32: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/32.jpg)
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
![Page 33: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/33.jpg)
WHY PATTERN
• Standards compliance provides integration with many tools
• Models are independent of data and integration
• Only debugging Cascading, not an ensemble of applications
![Page 34: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/34.jpg)
CLOSING THE LOOP
[Diagram: a Desktop and a Cluster exchanging DATA and PMML. Labeled steps: import data, create models, export models, execute models, import results, test results. JDBC flows and PMML flows (Pattern) run as jobs on the cluster]
![Page 35: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/35.jpg)
DRIVEN
• Understand how your application maps onto your cluster
• Identify bottlenecks (data, code, or the system)
• Jump to the line of code implicated on a failure
• Plugin available via Maven repo
• Beta UI hosted online
http://cascading.io/driven/
![Page 36: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/36.jpg)
MANAGED WITH DRIVEN
![Page 37: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/37.jpg)
![Page 38: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/38.jpg)
CASCADING 3.0
• New query planner
‣ User definable Assertion and Transformation rules
‣ Sub-Graph Isomorphism Pattern Matching
‣ Cordella, L. P., Foggia, P., Sansone, C., & Vento, M. (2004). A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10), 1367–1372. doi:10.1109/TPAMI.2004.75
• Apache Tez support
• And likely other platforms
![Page 39: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/39.jpg)
THERE’S A BOOK!
Enterprise Data Workflows with Cascading
- Paco Nathan
O’Reilly, 2013 amazon.com/dp/1449358721
![Page 40: Hadoop User Group EU 2014](https://reader033.vdocuments.mx/reader033/viewer/2022042814/54c67f4c4a79597a728b4589/html5/thumbnails/40.jpg)
CONTACT
@cwensel | @cascading
www.cascading.org
www.concurrentinc.com