Transcript
Page 1: Concurrent Stream Processing

Concurrent Stream Processing

Alex Miller - @puredanger
Revelytix - http://revelytix.com

Page 2: Concurrent Stream Processing

Contents
• Query execution - the problem
• Plan representation - plans in our program
• Processing components - building blocks
• Processing execution - executing plans

Page 3: Concurrent Stream Processing

Query Execution

Page 4: Concurrent Stream Processing

Relational Data & Queries

SELECT NAME
FROM PERSON
WHERE AGE > 20

NAME  AGE
Joe   30

Page 5: Concurrent Stream Processing

RDF
"Resource Description Framework" - a fine-grained graph representation of data

[Figure: graph with node http://data/Joe, an edge http://demo/age to 30, and an edge http://demo/name to "Joe"]

Subject          Predicate         Object
http://data/Joe  http://demo/age   30
http://data/Joe  http://demo/name  "Joe"

Page 6: Concurrent Stream Processing

SPARQL queries
SPARQL is a query language for RDF

PREFIX demo: <http://demo/>
SELECT ?name
WHERE {
  ?person demo:age ?age.
  ?person demo:name ?name.
  FILTER (?age > 20)
}

Callouts: a "triple pattern" (e.g. ?person demo:age ?age); natural join on ?person

Page 7: Concurrent Stream Processing

Relational-to-RDF
• W3C R2RML mappings define how to virtually map a relational db into RDF

[Figure: the PERSON table (NAME, AGE; Joe, 30) shown next to the RDF graph for http://data/Joe (http://demo/age 30, http://demo/name "Joe")]

PREFIX demo: <http://demo/>
SELECT ?name
WHERE {
  ?person demo:age ?age.
  ?person demo:name ?name.
  FILTER (?age > 20)
}

SELECT NAME
FROM PERSON
WHERE AGE > 20

Page 8: Concurrent Stream Processing

Enterprise federation
• Model domain at enterprise level
• Map into data sources
• Federate across the enterprise (and beyond)

[Figure: one enterprise-level SPARQL layer over three SPARQL endpoints and three SQL sources]

Page 9: Concurrent Stream Processing

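Query pipeline
• How does a query engine work?

[Figure: pipeline SQL -> Parse -> Plan -> Resolve -> Optimize -> Process -> Results! The stages pass an AST and then successive Plans between them; Metadata is an additional input.]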

Page 10: Concurrent Stream Processing

Trees!

[Figure: the same pipeline as on the previous slide, with the callout "Trees!"]

Page 11: Concurrent Stream Processing

Plan Representation

Page 12: Concurrent Stream Processing

SQL query plans

SELECT Name, DeptName
FROM Person, Dept
WHERE Person.DeptID = Dept.DeptID
  AND Age > 20

[Figure: plan tree. Leaves Person (Name, Age, DeptID) and Dept (DeptID, DeptName) feed join (DeptID), then filter (Age > 20), then project (Name, DeptName).]

Page 13: Concurrent Stream Processing

SPARQL query plans

SELECT ?Name
WHERE {
  ?Person :Name ?Name .
  ?Person :Age ?Age .
  FILTER (?Age > 20)
}

[Figure: plan tree. Leaves TP1 and TP2, the triple patterns { ?Person :Age ?Age } and { ?Person :Name ?Name }, feed join (?Person), then filter (?Age > 20), then project (?Name).]

Page 14: Concurrent Stream Processing

Common model
Streams of tuples flowing through a network of processing nodes

[Figure: a network of five nodes connected by tuple streams]
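The common model can be sketched in a few lines of plain Clojure. This is only an illustration: lazy sequences stand in for streams, and the data and the names filter-node and project-node are hypothetical, not part of the system described in the talk.

```clojure
;; Hypothetical tuples produced by a leaf node (a table scan).
(def person
  [{:name "Joe" :age 30}
   {:name "Ann" :age 18}])

;; Each node takes a stream of tuples and returns a new stream.
(defn filter-node [pred tuples] (filter pred tuples))
(defn project-node [ks tuples] (map #(select-keys % ks) tuples))

;; Tuples flow from the leaf through the filter to the projection.
(def result
  (->> person
       (filter-node #(> (:age %) 20))
       (project-node [:name])))
```

Here result holds only the tuple for Joe, reduced to its :name key.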

Page 15: Concurrent Stream Processing

What kind of nodes?
• Tuple generators (leaves)
  – In SQL: a table or view
  – In SPARQL: a triple pattern
• Combinations (multiple children)
  – Join
  – Union
• Transformations
  – Filter
  – Dup removal
  – Sort
  – Grouping
  – Project
  – Slice (limit / offset)
  – etc

Page 16: Concurrent Stream Processing

Representation
Tree data structure with nodes and attributes

[Figure: Java class diagram]
• PlanNode: childNodes
• TableNode: Table
• JoinNode: joinType, joinCriteria
• FilterNode: criteria
• ProjectNode: projectExpressions
• SliceNode: limit, offset

Page 17: Concurrent Stream Processing

s-expressions
Tree data structure with nodes and attributes

(* (+ 2 3) (- 6 5))
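A short sketch of why s-expressions suit this purpose: the expression on the slide, once quoted, is an ordinary list that can be inspected like any tree and then evaluated. The var name expr is an assumption for illustration.

```clojure
;; Quoting keeps the expression as data: a list holding a symbol and two sublists.
(def expr '(* (+ 2 3) (- 6 5)))

(def root     (first expr))   ; the operator symbol at the root of the tree
(def children (rest expr))    ; the two child subtrees
(def value    (eval expr))    ; evaluating the same tree as code
```

The same list is both a tree to traverse and a program to run, which is the property the plan representation relies on.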

Page 18: Concurrent Stream Processing

List representation
Tree data structure with nodes and attributes

(project+ [Name DeptName]
  (filter+ (> Age 20)
    (join+
      (table+ Empl [Name Age DeptID])
      (table+ Dept [DeptID DeptName]))))
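Because the plan is plain list data, standard sequence functions can walk it. The sketch below collects the table names from the plan on this slide; table-names is a hypothetical helper, not a function from the talk.

```clojure
(def plan
  '(project+ [Name DeptName]
     (filter+ (> Age 20)
       (join+
         (table+ Empl [Name Age DeptID])
         (table+ Dept [DeptID DeptName])))))

(defn table-names
  "Returns the names of all table+ leaves, in depth-first order."
  [plan]
  (->> (tree-seq seq? rest plan)                       ; visit every list in the tree
       (filter #(and (seq? %) (= 'table+ (first %))))  ; keep only table+ nodes
       (map second)))                                  ; the table name follows the operator
```

For the plan above, table-names yields the symbols Empl and Dept.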

Page 19: Concurrent Stream Processing

Query optimization
Example - pushing criteria down

(project+ [Name DeptName]
  (filter+ (> Age 20)
    (join+
      (project+ [Name Age DeptID]
        (bind+ [Age (- (now) Birth)]
          (table+ Empl [Name Birth DeptID])))
      (table+ Dept [DeptID DeptName]))))

Page 20: Concurrent Stream Processing

Query optimization
Example - rewritten

(project+ [Name DeptName]
  (join+
    (project+ [Name DeptID]
      (filter+ (> (- (now) Birth) 20)
        (table+ Empl [Name Birth DeptID])))
    (table+ Dept [DeptID DeptName])))
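A much simplified sketch of this kind of rewrite, assuming plans built only from table+, project+, filter+, and two-argument join+ nodes. It does not inline bind+ expressions as the slide's example does; it only moves a filter below a join when every column the predicate uses comes from one side. The helpers columns and push-filter-down are hypothetical.

```clojure
(require '[clojure.walk :as walk]
         '[clojure.set :as set])

(defn columns
  "Set of column symbols produced by a (simplified) plan node."
  [[op & args]]
  (case op
    table+   (set (second args))
    project+ (set (first args))
    filter+  (columns (second args))
    join+    (into (columns (first args)) (columns (second args)))
    #{}))

(defn push-filter-down
  "Rewrites (filter+ pred (join+ left right)) so the filter sits on the
   side of the join that supplies all of the predicate's columns."
  [plan]
  (walk/postwalk
    (fn [node]
      (if (and (seq? node)
               (= 'filter+ (first node))
               (seq? (nth node 2 nil))
               (= 'join+ (first (nth node 2))))
        (let [[_ pred [_ left right :as join]] node
              ;; columns of the join output that the predicate mentions
              used (set (filter (columns join) (flatten pred)))]
          (cond
            (set/subset? used (columns left))
            (list 'join+ (list 'filter+ pred left) right)

            (set/subset? used (columns right))
            (list 'join+ left (list 'filter+ pred right))

            :else node))
        node))
    plan))

(def plan
  '(project+ [Name DeptName]
     (filter+ (> Age 20)
       (join+
         (table+ Empl [Name Age DeptID])
         (table+ Dept [DeptID DeptName])))))

(def optimized (push-filter-down plan))
```

In optimized, the filter on Age is applied to the Empl table before the join rather than to the join's output.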

Page 21: Concurrent Stream Processing

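Hash join conversion

[Figure: a join+ node over a left tree and a right tree is rewritten into a let+ node. let+ binds hashes to first+ of preduce+ hash-tuples over the left tree, then applies mapcat tuple-matches over the right tree.]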

Page 22: Concurrent Stream Processing

Hash join conversion

(join+ _left _right)

(let+ [hashes (first+
                (preduce+ (hash-tuple join-vars {} #(merge-with concat %1 %2))
                          _left))]
  (mapcat (fn [tuple] (tuple-matches hashes join-vars tuple))
          _right))
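The same idea on ordinary Clojure collections, as a sketch only. The functions below reuse the names hash-tuples and tuple-matches from the slides, but their bodies are assumptions written for illustration, not the talk's implementation.

```clojure
(defn hash-tuples
  "Builds a map from join-key values to the left tuples that carry them."
  [join-vars tuples]
  (group-by #(select-keys % join-vars) tuples))

(defn tuple-matches
  "Merges one right tuple with every left tuple that shares its join key."
  [hashes join-vars tuple]
  (map #(merge % tuple)
       (get hashes (select-keys tuple join-vars))))

(defn hash-join
  "Hashes the left side once, then streams the right side against it."
  [join-vars left right]
  (let [hashes (hash-tuples join-vars left)]
    (mapcat #(tuple-matches hashes join-vars %) right)))

(def joined
  (hash-join [:person]
             [{:person "Joe" :age 30} {:person "Ann" :age 18}]
             [{:person "Joe" :name "Joe"}]))
```

Only Joe appears on both sides, so joined contains a single merged tuple.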

Page 23: Concurrent Stream Processing

Processing trees
• Compile abstract nodes into more concrete stream operations:
  – map+, mapcat+, filter+
  – first+, mux+
  – let+, let-stream+
  – pmap+, pmapcat+, pfilter+, preduce+
  – number+, reorder+, rechunk+
  – pmap-chunk+, preduce-chunk+

Page 24: Concurrent Stream Processing

Summary
• SPARQL and SQL query plans have essentially the same underlying algebra
• Model is a tree of nodes where tuples flow from leaves to the root
• A natural representation of this tree in Clojure is as a tree of s-expressions, just like our code
• We can manipulate this tree to provide
  – Optimizations
  – Differing levels of abstraction

Page 25: Concurrent Stream Processing

Processing Components

Page 26: Concurrent Stream Processing

Pipes
Pipes are streams of data

[Figure: Producer -> Pipe -> Consumer]

Producer side:
(enqueue pipe item)
(enqueue-all pipe items)
(close pipe)
(error pipe exception)

Consumer side:
(dequeue pipe item)
(dequeue-all pipe items)
(closed? pipe)
(error? pipe)

Page 32: Concurrent Stream Processing

Pipe callbacks

Events on the pipe trigger callbacks which are executed on the caller's thread

1. (add-callback pipe callback-fn)
2. (enqueue pipe "foo")
3. (callback-fn "foo") ;; during enqueue
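The three steps above can be sketched with a toy pipe. This is an assumption-laden illustration: make-pipe and the atom-based internals are hypothetical, and only add-callback and enqueue mirror names from the slides.

```clojure
(defn make-pipe
  "A toy pipe: a queue of items plus a list of registered callbacks."
  []
  {:queue     (atom clojure.lang.PersistentQueue/EMPTY)
   :callbacks (atom [])})

(defn add-callback [pipe callback-fn]
  (swap! (:callbacks pipe) conj callback-fn))

(defn enqueue [pipe item]
  (swap! (:queue pipe) conj item)
  ;; Callbacks run right here, on the thread that called enqueue.
  (doseq [cb @(:callbacks pipe)]
    (cb item)))

(def pipe (make-pipe))
(def seen (atom []))

;; 1. register a callback that records the item and the thread it ran on
(add-callback pipe
              (fn [item]
                (swap! seen conj [item (.getName (Thread/currentThread))])))

;; 2. enqueue; 3. the callback fires during this call
(enqueue pipe "foo")
```

After the enqueue, seen holds "foo" paired with the name of the enqueuing thread, showing that no other thread was involved.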

Page 33: Concurrent Stream Processing

Pipes
Pipes are thread-safe functional data structures

Page 35: Concurrent Stream Processing

Batched tuples
• To a pipe, data is just data. We actually pass data in batches through the pipe for efficiency.

[{:Name "Alex" :Eyes "Blue"}  {:Name "Jeff" :Eyes "Brown"}
 {:Name "Eric" :Eyes "Hazel"} {:Name "Joe"  :Eyes "Blue"}
 {:Name "Lisa" :Eyes "Blue"}  {:Name "Glen" :Eyes "Brown"}]
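One way to form such batches from a stream of tuples is Clojure's partition-all. The batch size of 4 below is an arbitrary assumption; the talk does not state the size used.

```clojure
(def tuples
  [{:Name "Alex" :Eyes "Blue"}  {:Name "Jeff" :Eyes "Brown"}
   {:Name "Eric" :Eyes "Hazel"} {:Name "Joe"  :Eyes "Blue"}
   {:Name "Lisa" :Eyes "Blue"}  {:Name "Glen" :Eyes "Brown"}])

;; Each batch would be enqueued on the pipe as a single item.
(def batches (partition-all 4 tuples))
```

Six tuples in batches of 4 give one full batch and one partial batch of 2; partition-all keeps the remainder instead of dropping it.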

Page 36: Concurrent Stream Processing

Pipe multiplexer
Compose multiple pipes into one

Page 37: Concurrent Stream Processing

Pipe tee
Send output to multiple destinations

Page 38: Concurrent Stream Processing

Nodes
• Nodes transform tuples from the input pipe and put the results on the output pipe.

[Figure: Input Pipe -> Node (fn) -> Output Pipe]

Node attributes:
• input-pipe
• output-pipe
• task-fn
• state
• concurrency
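A node with these five attributes might be modeled as a record. The sketch below is single-threaded and uses atoms holding persistent queues as stand-in pipes; step! and the task-fn signature (state, chunk) are assumptions, not the talk's API.

```clojure
(defrecord Node [input-pipe output-pipe task-fn state concurrency])

(defn step!
  "Moves one chunk from the input pipe through task-fn to the output pipe.
   Returns true if a chunk was processed, nil if the input was empty.
   Not safe for concurrent callers: peek and pop are separate operations."
  [{:keys [input-pipe output-pipe task-fn state]}]
  (when-let [chunk (peek @input-pipe)]
    (swap! input-pipe pop)
    (swap! output-pipe conj (task-fn @state chunk))
    true))

(def node
  (->Node (atom (conj clojure.lang.PersistentQueue/EMPTY [1 2 3]))  ; input-pipe
          (atom clojure.lang.PersistentQueue/EMPTY)                 ; output-pipe
          (fn [_state chunk] (mapv inc chunk))                      ; task-fn
          (atom nil)                                                ; state
          1))                                                       ; concurrency

(def processed? (step! node))
```

After one step the input pipe is empty and the output pipe holds the transformed chunk.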

Page 39: Concurrent Stream Processing

Processing Trees
• Tree of nodes and pipes

[Figure: a tree of fn nodes connected by pipes, with an arrow labeled "Data flow"]

Page 40: Concurrent Stream Processing

SPARQL query example

SELECT ?Name
WHERE {
  ?Person :Name ?Name .
  ?Person :Age ?Age .
  FILTER (?Age > 20)
}

[Figure: plan tree. Leaves TP1 and TP2, the triple patterns { ?Person :Age ?Age } and { ?Person :Name ?Name }, feed join (?Person), then filter (?Age > 20), then project (?Name).]

(project+ [?Name]
  (filter+ (> ?Age 20)
    (join+ [?Person]
      (triple+ [?Person :Name ?Name])
      (triple+ [?Person :Age ?Age]))))

Page 41: Concurrent Stream Processing

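Processing tree

[Figure: the same plan with the join expanded into a hash join. One triple pattern feeds preduce+ hash-tuples and first+, whose result let+ binds to hashes; the other triple pattern feeds mapcat tuple-matches; the output flows through filter (?Age > 20) and project (?Name). The triple patterns are TP1 { ?Person :Age ?Age } and TP2 { ?Person :Name ?Name }.]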

Page 42: Concurrent Stream Processing

Mapping to nodes
• An obvious mapping to nodes and pipes

[Figure: each plan operation (triple patterns, preduce+, first+, let+, filter+, project+) mapped to its own fn node, connected by pipes]

Page 43: Concurrent Stream Processing

Mapping to nodes
• Choosing between compilation and evaluation

[Figure: the same tree with the triple patterns, preduce+, first+, and let+ as fn nodes, and an eval step covering filter (?Age > 20) and project (?Name)]

Page 44: Concurrent Stream Processing

Compile vs eval
• We can evaluate our expressions
  – Directly on streams of Clojure data using Clojure
  – Indirectly via pipes and nodes (more on that next)
• Final step before processing makes decision
  – Plan nodes that combine data are real nodes
  – Plan nodes that allow parallelism (p*) are real nodes
  – Most other plan nodes can be merged into single eval
  – Many leaf nodes actually rolled up, sent to a database
  – Lots more work to do on where these splits occur

Page 45: Concurrent Stream Processing

Processing Execution

Page 46: Concurrent Stream Processing

Execution requirements
• Parallelism
  – Across plans
  – Across nodes in a plan
  – Within a parallelizable node in a plan
• Memory management
  – Allow arbitrary intermediate result sets without an OutOfMemoryError (OOME)
• Ops
  – Cancellation
  – Timeouts
  – Monitoring

Page 47: Concurrent Stream Processing

Event-driven processing
• Dedicated I/O thread pools stream data into plan

[Figure: a tree of fn nodes, with regions labeled "I/O threads" and "Compute threads"]

Page 48: Concurrent Stream Processing

Task creation
1. Callback fires when data added to input pipe
2. Callback takes the fn associated with the node and bundles it into a task
3. Task is scheduled with the compute thread pool

[Figure: a callback attached to a Node (fn)]
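The three steps above can be sketched with Java's ForkJoinPool through Clojure interop. The names compute-pool and make-callback are hypothetical, and the real system's task bundling is certainly more involved; this only shows a callback wrapping a node's fn into a task and submitting it.

```clojure
(import '(java.util.concurrent Callable ForkJoinPool))

;; Stand-in for the compute thread pool.
(def ^ForkJoinPool compute-pool (ForkJoinPool.))

(defn make-callback
  "Returns a callback for a node. When called with a new chunk, it bundles
   task-fn and the chunk into a task and submits it to the pool.
   The callback returns the task's Future."
  [^ForkJoinPool pool task-fn]
  (fn [chunk]
    (let [^Callable task (fn [] (task-fn chunk))]
      (.submit pool task))))

(def callback (make-callback compute-pool #(mapv inc %)))

;; Simulates data arriving on the input pipe; deref waits for the task.
(def result @(callback [1 2 3]))
```

The thread that fires the callback only schedules the work; the task function itself runs on a pool thread.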

Page 49: Concurrent Stream Processing

Fork/join vs Executors
• Fork/join thread pool vs classic Executors
  – Optimized for finer-grained tasks
  – Optimized for larger numbers of tasks
  – Optimized for more cores
  – Works well on tasks with dependencies
  – No contention on a single queue
  – Work stealing for load balancing

Page 50: Concurrent Stream Processing

Task execution

1. Pull next chunk from input pipe
2. Execute task function with access to node's state
3. Optionally, output one or more chunks to output pipe - this triggers the upstream callback
4. If data still available, schedule a new task, simulating a new callback on the current node

Page 51: Concurrent Stream Processing

Concurrency

• Delicate balance between Clojure refs and STM and Java concurrency primitives
• Clojure refs - managed by STM
  – Input pipe
  – Output pipe
  – Node state
• Java concurrency
  – Semaphore - "permits" to limit tasks per node
  – Per-node scheduling lock
• Key integration constraint
  – Clojure transactions can fail and retry!
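A sketch of how the two worlds might be kept apart, assuming refs holding persistent queues as pipes and a semaphore with 2 permits for the node. take-chunk! and run-task! are hypothetical names. The point illustrated is the integration constraint: the body of a transaction can run more than once, so the semaphore is acquired and released outside dosync.

```clojure
(import 'java.util.concurrent.Semaphore)

(def input   (ref (conj clojure.lang.PersistentQueue/EMPTY [1 2] [3 4])))
(def output  (ref clojure.lang.PersistentQueue/EMPTY))
(def permits (Semaphore. 2))   ; assumed limit: at most 2 tasks for this node

(defn take-chunk!
  "Atomically removes and returns the next chunk, or nil if the pipe is empty.
   The body touches only refs, so a transaction retry is harmless."
  [pipe]
  (dosync
    (let [chunk (peek @pipe)]
      (alter pipe pop)
      chunk)))

(defn run-task!
  "Runs task-fn on the next input chunk if a permit is available."
  [^Semaphore sem task-fn]
  ;; The permit is taken outside any transaction: inside dosync,
  ;; a retry could acquire it a second time.
  (when (.tryAcquire sem)
    (try
      (when-let [chunk (take-chunk! input)]
        (let [result (task-fn chunk)]
          (dosync (alter output conj result))
          result))
      (finally (.release sem)))))

(def first-result (run-task! permits #(mapv inc %)))
```

After one task, the first chunk has moved from input to output and both permits are available again.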

Page 52: Concurrent Stream Processing

Concurrency mechanisms

[Flowchart: per-node scheduling, with parts labeled process-input, run-task, and close-output]

Legend:
• Blue outline = Java lock
• all = under Java semaphore
• Green outline = Clojure transaction
• Blue shading = Clojure atom

Box labels recoverable from the chart (grouping is approximate):
• Acquire semaphore; dequeue input; input message is either Data or Close
• On Data: create task; invoke task; result message; enqueue data on output pipe; release 1 semaphore
• On Close: set closed = true
• Decisions: input closed? closes output? empty? closed && !closed_done?
• Closing: acquire all semaphores; run-task with nil msg; set closed_done = true; close output pipe; release all semaphores

Page 53: Concurrent Stream Processing

Memory management
• Pipes are all on the heap
• How do we avoid OutOfMemory?

Page 54: Concurrent Stream Processing

Buffered pipes
• When heap space is low, store pipe data on disk
• Data is serialized / deserialized to/from disk
• Memory-mapped files are used to improve I/O

[Figure: a tree of fn nodes in which one pipe's data is shown as bytes on disk]

Page 55: Concurrent Stream Processing

Memory monitoring
• JMX memory beans
  – To detect when memory is tight -> writing to disk
    • Use memory pool threshold notifications
  – To detect when memory is ok -> write to memory
    • Use polling (no notification on decrease)
• Composite pipes
  – Build a logical pipe out of many segments
  – As memory conditions go up and down, each segment is written to the fastest place. We never move data.
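A sketch of the polling half only, using the standard JMX memory bean through interop. It does not show threshold notifications. The function names and the 0.8 threshold are assumptions for illustration.

```clojure
(import '(java.lang.management ManagementFactory))

(defn heap-fraction-used
  "Fraction of the maximum heap currently in use, or nil if the JVM
   reports no maximum."
  []
  (let [usage     (.getHeapMemoryUsage (ManagementFactory/getMemoryMXBean))
        max-bytes (.getMax usage)]
    (when (pos? max-bytes)
      (/ (double (.getUsed usage)) max-bytes))))

(defn segment-destination
  "Picks where the next pipe segment would be written.
   The 0.8 default threshold is an arbitrary assumption."
  ([] (segment-destination 0.8))
  ([threshold]
   (let [used (heap-fraction-used)]
     (if (and used (> used threshold)) :disk :memory))))
```

A composite pipe could call segment-destination each time it starts a new segment, so that existing segments stay where they were written.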

Page 56: Concurrent Stream Processing

Cancellation
• Pool keeps track of which nodes belong to which plan
• All nodes check for cancellation during execution
• Cancellation can be caused by:
  – Error during execution
  – User intervention from admin UI
  – Timeout from query settings
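Checking for cancellation during execution can be sketched with a shared flag per plan. The atom-based flag and run-chunks are hypothetical; the talk does not show how its nodes perform the check.

```clojure
(defn run-chunks
  "Applies task-fn to each chunk, checking the plan's cancelled? flag
   before every chunk. Returns the results produced before cancellation."
  [cancelled? task-fn chunks]
  (reduce (fn [results chunk]
            (if @cancelled?
              (reduced results)              ; stop early, keep what we have
              (conj results (task-fn chunk))))
          []
          chunks))

(def cancelled? (atom false))

(def results
  (run-chunks cancelled?
              (fn [chunk]
                ;; Simulate a cancellation arriving mid-plan (e.g. a timeout).
                (when (= chunk [3 4]) (reset! cancelled? true))
                (mapv inc chunk))
              [[1 2] [3 4] [5 6]]))
```

The third chunk is never processed because the flag is set while the second one runs.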

Page 57: Concurrent Stream Processing

Summary
• Data flow architecture
  – Event-driven by arrival of data
  – Compute threads never block
  – Fork/join to handle scheduling of work
• Clojure as abstraction tool
  – Expression tree lets us express plans concisely
  – Also lets us manipulate them with tools in Clojure
  – Lines of code
    • Fork/join pool, nodes, pipes - 1200
    • Buffer, serialization, memory monitor - 970
    • Processor, compiler, eval - 1900
• Open source? Hmmmmmmmmmmm…….

Page 58: Concurrent Stream Processing

Thanks...
Alex Miller
@puredanger
Revelytix, Inc.

