onyx data processing the clojure way
TRANSCRIPT
Onyx - Data Processing
The Clojure WayBy Bahadir Cambel@bahadircambel
Raise your hands if you have used
- Cascalog
- Hadoop
- Spark
- Flink
- Samza
- Storm
- Sqoop
What’s good for ?
Realtime event stream processing
Continuous computation
Extract, transform, load (ETL)
Data transformation à la map-reduce
Data ingestion and storage medium transfer
Data cleaning
Data Structures + Simple Functions
- Sounds familiar ?
Hadoop
Cascading
https://github.com/Cascading/cascading.samples/blob/master/wordcount/src/java/wordcount/Main.java
Storm
https://github.com/nathanmarz/storm-starter/blob/master/src/clj/storm/starter/clj/word_count.clj
Spark
http://spark.apache.org/examples.html SCALA
JAVA dev having a productive day
Cascalog
http://cascalog.org/articles/getting_started.html
?-
<-
:>
Internet Scale MongoDB could save youu?!
Onyx
Workflow
- articulate the paths that data flows through the cluster at runtime
- DAG
Catalog
- Describe and configure workflow items
Function(s)
Flow Conditions
From -> To ( if predicate correct)
Flow conditions are used for isolating logic about whether or not segments should pass through different tasks in a workflow, exception handling and support a rich degree of composition with runtime parameterization.
Windows / Triggers
partitions a possible unbounded sequence of data into finite pieces, allowing aggregations to be specified
- Timer
- Segment
- Punctuation
- Watermark
- Percentile-Watermark
Life Cycles
allows you to hook in and execute arbitrary code at critical points during a task (kinda middleware)
:lifecycle/start-task?:lifecycle/before-task-start:lifecycle/before-batch:lifecycle/after-read-batch:lifecycle/after-batch:lifecycle/after-task-stop:lifecycle/after-ack-segment:lifecycle/after-retry-segment
Job
A job will be translated into multiple tasks. Peers will take care of these tasks.
If your number of tasks > available peers
A job won’t be complete ( Buy me a beer or 10)
Bulk functions
perform a fn more efficiently over a batch of segments rather than processing one segment at a time.
- Write to DB Onyx will ignore the output of your function and pass the same segments that you received downstream
Group by
“like” values are always routed to the same virtual peer
- Group by key
- Group by a fn
Specify in the catalog!
Fixed Windows
a data point will fall into exactly one instance of a window (often called an extent in the literature)Between t1=0 and t2=4 how many events have happened?
t1=5 t2=9, t1=10 t2=14
And so on..
Sliding Window
a slide value for how long to wait between spawning a new window extentBetween t1=0 and t2=14 how many events have happened?
t1=5 t2=19 ?
t1=10 t2=24 ?
Global Window
Session Window
dynamically resize their upper and lower bounds in reaction to incoming dataSessions capture a time span of activity for a specific key, such as a user ID. If no activity occurs within a timeout gap, the session closes. If an event occurs within the bounds of a session, the window size is fused with the new event, and the session is extended by its timeout gap either in the forward or backward direction
Aggregation:onyx.windowing.aggregation/conj
:onyx.windowing.aggregation/count
:onyx.windowing.aggregation/sum
:onyx.windowing.aggregation/min
:onyx.windowing.aggregation/max
:onyx.windowing.aggregation/average
Architecture
Peer
is a node in the cluster responsible for processing data
Virtual Peer
A Virtual Peer refers to a single peer process running on a single physical machine. A single Virtual Peer executes at most one task at a time.
ZooKeeper
Watches Peers
Aeron
Efficient reliable UDP unicast, UDP multicast, and IPC message transport
Messaging layer takes care of the direct peer to peer transfer of segment batches, acks, segment completion and segment retries to the relevant virtual peers.
The Log
Scheduling
If there is no master, how does scheduling work ?
Peers contend to work on tasks.
Types of Job Schedulers
- Greedy ( I need ALL!!!! Gimme all!!)
- Balanced Robin ( Fair play)
- Percentage ( Not so fair play)
Types of Task Schedulers
- Balanced
- Percentage
- Colocation (assigns them to the peers on a single physical machine, low latency, min network)
Tags
a set of machines in your cluster are privileged
Run some tasks at some specific machines
Declare a peer with capabilities
- Datomic
- Special Hardware (GPU, Memory)
- Network
Onyx Plugins
onyx-core-async
onyx-kafka
onyx-datomic
onyx-redis
onyx-sql
onyx-bookkeeper
onyx-seq
onyx-durable-queue
onyx-elasticsearch
Example - Let’s process some logs49556677821280438558577372995495836672945903576549425154
Check out the repository
https://github.com/bcambel/onyx-test
End users configuring what
- workflows should look like.
- Language agnostic
- Location agnostic
- Tolerant to machine generation
- Temporally agnostic ( should wait for a time to be realized)
If you are not enjoying your experience
There is something fundamentally wrong with the tool
Think about Apple’s smooth product experience.
Pain detected, thought through. (Pain -> Pleasure)
Tools
- Onyx ETL https://github.com/onyx-platform/onyx-etl
- Dashboard https://github.com/onyx-platform/onyx-dashboard
- Replica Cons https://github.com/onyx-platform/onyx-console-dashboard
- Ansible Playbook https://github.com/onyx-platform/ansible-onyx
- Metrics Suite https://github.com/onyx-platform/onyx-metrics
- Benchmark Suite https://github.com/onyx-platform/onyx-benchmark
- Jepsen https://github.com/onyx-platform/onyx-jepsen
Onyx Dashboard
https://github.com/onyx-platform/onyx-dashboard
Questions
- How does Onyx distributes reads (input tasks) ? Parallelization?? - evenly break up a database table into chunks which can be read by multiple
peers
- Segments realized. Tasks created. Peers get into action
Helpful Links
http://www.onyxplatform.org/
https://gitter.im/onyx-platform/onyx
https://clojurians.slack.com/messages/onyx
https://github.com/onyx-platform/onyx-examples
https://github.com/onyx-platform/learn-onyx
https://github.com/Yuppiechef/cqrs-server (Open source project using Onyx)
Shameless self plug
http://www.bahadir.io/
https://twitter.com/bahadircambel
https://www.strava.com/athletes/8258974
Thanks
Michael Drogalis - https://twitter.com/michaeldrogalis
Lucas Bradstreet - https://twitter.com/ghaz
Thanks for using Clojure!