ETL in Clojure
Dmitriy Morozov
Software engineer at Zoomdata.com
Functional programming junky
Occasional cyclist
@argc
Plan of attack
ETL at Zoomdata
Cascalog
Spark
Demo
Conclusion
Zoomdata is a modern BI application focused on allowing everyday business users to visually interact with and explore their data and discover insight out of that data.
Using SQL for ETL
Hive is slow, and so is Hive on Tez
SQL is horrible for doing anything complicated
Code is hard to maintain, reuse and test
Lessons learned
Why Clojure?
Functional!
Runs on JVM
Interactive development
Zero delta between prototype code and production code
Cascalog
Datalog DSL in Clojure
Built on top of Hadoop and Cascading
Query compiles to Hadoop MapReduce jobs
Supports local execution for prototyping
Great testing story
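A minimal sketch of what a locally executed Cascalog query can look like, assuming Cascalog 2.x is on the classpath (where the aggregators live in cascalog.logic.ops). The refill data, the name late-refills and the 30-day threshold are hypothetical and only serve the illustration; they are not taken from the talk.

```clojure
(require '[cascalog.api :refer [??<-]]
         '[cascalog.logic.ops :as c])

;; Hypothetical in-memory tuples: [patient days-since-previous-fill].
;; A plain Clojure vector of tuples can act as a generator for local runs.
(def sample-refills
  [["alice" 45]
   ["alice" 40]
   ["alice" 10]
   ["bob"    5]])

(defn late-refills
  "Counts, per patient, the refills whose gap exceeds max-gap days.
  ??<- executes the query and returns the result tuples as a seq."
  [refills max-gap]
  (??<- [?patient ?late]
        (refills ?patient ?gap-days)   ; generator: binds the logic variables
        (> ?gap-days max-gap)          ; filter: an ordinary Clojure function
        (c/count ?late)))              ; aggregator: grouped by ?patient

(late-refills sample-refills 30)
```

Because the query runs against an in-memory vector, the same function can be exercised at the REPL or in a unit test and later pointed at a Hadoop tap without changing the query itself.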
Datalog
Logic programming language
Syntactically is a subset of Prolog
It is often used as a query language for deductive databases.
Query statements can be stated in any order
Flow Visualisation / DOT
What are the alternatives?
Java API for
Flambo
Sparkling
Spark
Drug Persistence
Determining whether a patient is persistent or not based on whether she refilled the prescription in time.
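The persistence rule can be sketched in plain Clojure before it is moved into a Cascalog or Spark job. This is only an illustration under stated assumptions: the slides do not define "in time", so the sketch assumes each fill carries a :days-supply and that a refill is on time when it happens no more than grace-days after the previous supply runs out. The names persistent?, :date, :days-supply and grace-days are hypothetical.

```clojure
(import '(java.time LocalDate)
        '(java.time.temporal ChronoUnit))

(defn persistent?
  "True when every refill happened at most grace-days after the previous
  fill's supply ran out. fills is a seq of maps with :date (LocalDate)
  and :days-supply (integer). The rule itself is an assumption."
  [fills grace-days]
  (let [sorted (sort-by :date fills)]
    (every? (fn [[prev nxt]]
              (let [run-out (.plusDays ^LocalDate (:date prev)
                                       (long (:days-supply prev)))
                    ;; days between running out and the next fill;
                    ;; negative when the patient refilled early
                    gap     (.between ChronoUnit/DAYS run-out (:date nxt))]
                (<= gap grace-days)))
            ;; consecutive pairs of fills: [1st 2nd] [2nd 3rd] ...
            (partition 2 1 sorted))))

(def on-time-fills
  [{:date (LocalDate/parse "2015-01-01") :days-supply 30}
   {:date (LocalDate/parse "2015-02-02") :days-supply 30}])

(def late-fills
  [{:date (LocalDate/parse "2015-01-01") :days-supply 30}
   {:date (LocalDate/parse "2015-03-01") :days-supply 30}])

(persistent? on-time-fills 7) ; refill 2 days after run-out
(persistent? late-fills 7)    ; refill 29 days after run-out
```

Keeping the rule as a pure function means the same code can be tested at the REPL and then called from a distributed job, which is the "zero delta" between prototype and production code mentioned earlier.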
Things to check out
How Yieldbot does Data science in Clojure
Cascalog for the Impatient
Streaming MapReduce in Clojure
Sparkling
Flambo