Writing Spark applications, the easy way
Pierre Borckmans
Data Science Meetup - Spark & Machine Learning - October 27th, 2016 - Brussels
The pivot...
Platform overview
Data pipeline overview
The journey... 3 paradigms for Spark application development
From hardcoded dataflows...

val subscribers = cdrs.map( x => ( x.A.toLong, x ) ).groupByKey

subscribers.mapValues( _.map( cdr => {
  for ( ( category, dimensions ) <- allDimensions ) yield (
    category,
    for ( dim <- dimensions ) yield {
      val fields = dim._1
      val values = dim._2
      if ( cdr.check( fields, values ) ) f( category )( cdr ) else f0( category )
    }
  )
} ).reduce( ( m1, m2 ) => {
  for ( ( category, l1 ) <- m1 ) yield {
    val l2 = m2( category )
    val d = l1.zip( l2 ).map( l => { g( category )( l._1, l._2 ) } )
    ( category, d )
  }
} ) )
...to fully interactive ones...
and back to code...
... with benefits!
Hardcoded dataflows
Dataflow Editor
Dataflow Editor: Showtime!
[Video: dataflow editor demo]
Data Modules
• self-contained units of the pipeline
• expressing dependencies on sources and other dms
• recycling the dataflow engine
• DSL to declare dataflows
• unit test DSL to test flow and individual transformations
• sbt plugin to handle all devops related tasks
• automatic orchestration through Airflow
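To make the "self-contained units expressing dependencies" idea concrete, here is a minimal sketch in plain Scala. All names (Node, Source, DataModule, upstream) are illustrative assumptions, not the actual API from the talk: a data module is just a value that names itself and lists what it depends on, so the dependency graph can be inspected programmatically.

```scala
// Hypothetical sketch: data modules as values that declare their
// dependencies on sources and on other data modules.
sealed trait Node { def name: String }
final case class Source(name: String) extends Node
final class DataModule(val name: String, val deps: List[Node]) extends Node

// Example graph: subscribers is built from raw CDRs; churn features
// depend on subscribers plus a CRM source.
val cdrs        = Source("raw_cdrs")
val subscribers = new DataModule("subscribers", List(cdrs))
val churn       = new DataModule("churn_features", List(subscribers, Source("crm")))

// Walk the transitive upstream closure of a node.
def upstream(n: Node): Set[String] = n match {
  case Source(s)     => Set(s)
  case m: DataModule => Set(m.name) ++ m.deps.flatMap(upstream)
}
```

Because dependencies are plain data, the same declaration can drive both documentation and scheduling.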
Dataflow DSL
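The slides do not show the DSL itself, so here is a hedged sketch of what a dataflow-declaration DSL in this spirit could look like: each transformation is a named step, and a flow is a composed value that can be inspected or tested before execution. Step, `~>`, and all step names are assumptions for illustration, operating on in-memory collections rather than Spark RDDs.

```scala
// Hypothetical sketch: a named, composable pipeline step.
case class Step[A, B](name: String, run: A => B) {
  // Chain two steps into a new named step.
  def ~>[C](next: Step[B, C]): Step[A, C] =
    Step(s"$name ~> ${next.name}", run andThen next.run)
}

// Parse "id,value" lines into (Long, String) pairs.
val parse = Step[List[String], List[(Long, String)]]("parse",
  _.map { line =>
    val parts = line.split(",")
    (parts(0).toLong, parts(1))
  })

// Group values by id.
val groupById = Step[List[(Long, String)], Map[Long, List[String]]]("groupById",
  _.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) })

// The flow is a value: its name documents its structure.
val flow = parse ~> groupById
```

The point of declaring flows this way is that the same description can be rendered in an editor, unit-tested, or handed to an execution engine.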
Dataflow Test DSL
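As a companion, a test DSL for individual transformations might read like the sketch below: feed a small in-memory input to one step and assert on its output, without running the full pipeline. TransformSpec, givenInput, and expectOutput are illustrative names, not the talk's actual API.

```scala
// Hypothetical sketch: given/expect-style testing of one transformation.
case class TransformSpec[A, B](name: String, transform: A => B) {
  def givenInput(in: A): Expectation[B] = Expectation(name, transform(in))
}

case class Expectation[B](name: String, actual: B) {
  def expectOutput(expected: B): Unit =
    assert(actual == expected, s"$name: got $actual, expected $expected")
}

// Testing a single transformation in isolation:
val dedupe = TransformSpec[List[Int], List[Int]]("dedupe", _.distinct)
dedupe.givenInput(List(1, 1, 2, 3, 3)).expectOutput(List(1, 2, 3))
```

Keeping each transformation testable on small literal inputs is what makes the "unit test DSL" bullet above practical.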
Automated Data Modules Orchestration
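Automatic orchestration falls out of the declared dependencies: a scheduler such as Airflow needs each module to run after everything it depends on, which is a topological sort of the dependency graph. A minimal sketch, with an assumed module-to-dependencies map (the module names are invented for illustration):

```scala
// Hypothetical sketch: derive an execution order from declared
// dependencies, as an Airflow DAG generator would.
// module -> modules it depends on
val deps: Map[String, List[String]] = Map(
  "raw_cdrs"    -> Nil,
  "subscribers" -> List("raw_cdrs"),
  "dimensions"  -> List("raw_cdrs"),
  "report"      -> List("subscribers", "dimensions")
)

def topoSort(deps: Map[String, List[String]]): List[String] = {
  var visited = Set.empty[String]
  var order   = List.empty[String] // built in reverse finish order
  def visit(m: String): Unit =
    if (!visited(m)) {
      visited += m
      deps.getOrElse(m, Nil).foreach(visit) // dependencies first
      order = m :: order
    }
  deps.keys.foreach(visit)
  order.reverse // every module appears after its dependencies
}
```

Generating the schedule from the same declarations that define the modules is what removes the manual devops step.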
Data Module Explorer: Showtime!
[Video: data module explorer demo]