DRIVING INNOVATION THROUGH DATA
CASCADING 3 AND BEYOND
André Kelpe | Apache Big Data Europe | Budapest, September 28th 2015
SPEAKER
André Kelpe
Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven
http://concurrentinc.com / @concurrent
[email protected] / @fs111
INTRODUCTION
http://cascading.org
Apache licensed Java framework for writing data oriented applications
production ready, stable and battle proven
PHILOSOPHY
developer productivity
users focus on business problems, not distributed systems knowledge
predictable runtime behaviour
fail fast
stable user APIs
safe defaults with knobs for experts
batch workloads
testability & robustness
production quality applications rather than a collection of scripts
abstractions over interchangeable platforms
TERMINOLOGY
A SERIES OF PIPES
https://www.flickr.com/photos/theilr/4283377543/sizes/l
CASCADING TERMINOLOGY
• Taps are sources and sinks for data
• Schemes represent the format of the data
• Pipes connect Taps
● Tuples flow through Pipes
● Fields describe the Tuples
● Operations are executed on Tuples in TupleStreams
● Pipes can be merged, spliced, joined etc.
● Pipe-assemblies are reusable components
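The relationship between Fields, Tuples and Operations can be pictured with a small plain-Java toy model. This is only a sketch under stated assumptions: it does not use the Cascading `Fields` or `Tuple` classes, and the names `TupleStreamSketch`, `get` and `keepLarge` as well as the `user`/`bytes` fields are hypothetical. It shows field names describing tuple positions and a filter-style operation applied to a stream of tuples.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical toy model; these are not the Cascading Fields/Tuple classes.
class TupleStreamSketch {
    // Field names describe the positions of every tuple in the stream.
    static final List<String> FIELDS = Arrays.asList("user", "bytes");

    // Looks up a tuple value by field name and fails fast on unknown names.
    static Object get(List<Object> tuple, String field) {
        int position = FIELDS.indexOf(field);
        if (position < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return tuple.get(position);
    }

    // A filter-style operation: keeps tuples whose "bytes" value reaches the threshold.
    static List<List<Object>> keepLarge(List<List<Object>> tuples, long minBytes) {
        return tuples.stream()
                .filter(tuple -> (Long) get(tuple, "bytes") >= minBytes)
                .collect(Collectors.toList());
    }
}
```

Addressing values by field name rather than by position is what lets an operation be reused on any stream that carries the fields it expects.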
The FlowConnector uses the QueryPlanner to translate a FlowDef into a Flow that runs on a computational platform
Flows can be orchestrated via Cascade
Applications are Directed Acyclic Graphs (DAG)
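Because an application is a DAG, its steps can always be put into an order in which every step runs after the steps it depends on. The following plain-Java sketch illustrates that idea with Kahn's algorithm. It is not how the Cascading planner or Cascade is implemented; the class `DagOrderSketch`, the method `topologicalOrder` and the step names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of ordering the steps of a DAG; not Cascading code.
class DagOrderSketch {
    // edges maps each step to the steps that consume its output.
    static List<String> topologicalOrder(Map<String, List<String>> edges) {
        // Count the incoming edges of every step.
        Map<String, Integer> inDegree = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : edges.entrySet()) {
            inDegree.putIfAbsent(entry.getKey(), 0);
            for (String target : entry.getValue()) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }
        // Steps without incoming edges can run immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.remove();
            order.add(step);
            for (String target : edges.getOrDefault(step, Collections.emptyList())) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        // Fail fast: a cycle leaves some steps unscheduled.
        if (order.size() != inDegree.size()) {
            throw new IllegalArgumentException("graph contains a cycle");
        }
        return order;
    }
}
```

Rejecting a cyclic graph before anything runs matches the "fail fast" philosophy stated earlier in the talk.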
DAG
PLATFORMS
CASCADING PLATFORMS
local
change 1 line of code, recompile, done.
COMPILER ANALOGY
[Diagram. Compiler side: User Code, Translation, Optimisation, Assembly, CPU Architecture. Cascading side: FlowDef, QueryPlanner/RuleEngine, platforms MR, Tez, Flink, others…]
DAG
A DAG RUNNING ON A PLATFORM
REAL WORLD DAG
https://github.com/cchepelov/wcplus
https://driven.cascading.io/index.html#/apps/A7544E2B8E7C410397B4AE88F53326D1
CODE EXAMPLE
APIS
● Fluid - A Fluent API for Cascading
  − Targeted at application writers
  − https://github.com/Cascading/fluid
● "Raw" Cascading API
  − Targeted at library writers, code generators, integration layers
  − https://github.com/Cascading/cascading
COUNTING WORDS
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
...
COUNTING WORDS (CONT.)
// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter =
new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
...
COUNTING WORDS (CONT.)
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "word count" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.complete(); // ← runs the code
}
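For readers without a Hadoop cluster at hand, the logic of the pipe assembly above can be approximated in plain Java. This is a minimal sketch, not Cascading code: the class `WordCountSketch` and the method `countTokens` are hypothetical, and only the delimiter regex is taken from the slides. It splits each text line into tokens (the `Each` step), then groups equal tokens and counts each group (the `GroupBy` and `Every` with `Count` steps). Dropping empty tokens is an assumption of this sketch, not necessarily the behaviour of the assembly on the slides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java illustration; it does not use the Cascading API.
class WordCountSketch {
    // Same delimiter regex as the RegexSplitGenerator on the slide.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Splits every line into tokens and counts how often each token occurs.
    static Map<String, Long> countTokens(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // assumption of this sketch: skip empty tokens
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=2, again=1, rain=2, shadow=1}
        System.out.println(countTokens(Arrays.asList("a rain shadow", "rain (again), a")));
    }
}
```

The Cascading version expresses the same steps declaratively, which is what allows the planner to map them onto MapReduce, Tez or another platform.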
A FULL TOOLBOX
● Operations
  − Function
  − Filter
  − Regex/Scripts
  − Boolean operators
  − Count/Limit/Last/First
  − Scripts
  − Unique
  − Asserts
  − Min/Max
● Splices
  − GroupBy
  − CoGroup
  − HashJoin
  − Merge
● Joins
  − Left, right, outer, inner, mixed, custom
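To make the difference between join types concrete, here is a minimal plain-Java sketch of an inner versus a left join that uses an in-memory hash index on the right side. It does not use Cascading's `HashJoin` or `CoGroup`; the names `JoinSketch` and `hashJoin` are hypothetical, and rows are assumed to be (key, value) pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of inner and left joins; not Cascading code.
class JoinSketch {
    // Joins (key, value) rows on the key. The right side is indexed in memory.
    // With leftOuter = true, unmatched left rows are kept with a null right value.
    static List<List<String>> hashJoin(List<String[]> left, List<String[]> right, boolean leftOuter) {
        Map<String, List<String>> index = new HashMap<>();
        for (String[] row : right) {
            index.computeIfAbsent(row[0], key -> new ArrayList<>()).add(row[1]);
        }
        List<List<String>> joined = new ArrayList<>();
        for (String[] row : left) {
            List<String> matches = index.get(row[0]);
            if (matches == null) {
                if (leftOuter) {
                    joined.add(Arrays.asList(row[0], row[1], null));
                }
                continue;
            }
            for (String match : matches) {
                joined.add(Arrays.asList(row[0], row[1], match));
            }
        }
        return joined;
    }
}
```

With a left side of `("1", "alice"), ("2", "bob")` and a right side of `("1", "berlin")`, the inner join yields one row, while the left join also keeps the row for key `2` with a null in the last position.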
• data access: JDBC, HBase, elasticsearch, redshift, HDFS, S3, Cassandra, kinesis, accumulo …
• data formats: avro, parquet, ORC (+ACID), thrift, protobuf, CSV, TSV…
• integration points: Cascading Lingual (SQL), Apache Hive, M/R apps, custom
OUTLOOK TO CASCADING 3.1+
• improved serialization through strong typing
• Cascading on Apache Flink
• Cascading on Hazelcast
DON’T LIKE JAVA?
Clojure/logic programming
https://github.com/nathanmarz/cascalog
Clojure
https://github.com/Netflix/PigPen
Scala
https://github.com/twitter/scalding
QUESTIONS?
LINK COLLECTION
• http://www.cascading.org/
• https://github.com/Cascading/
• http://driven.io/
• http://concurrentinc.com
• https://groups.google.com/forum/#!forum/cascading-user
• http://docs.cascading.org/tutorials/etl-log/
• http://docs.cascading.org/cascading/3.0/userguide/html/