automata-based dynamic data processing for clouds€¦ · automata data processing model if the...

27
Automata-based Dynamic Data Processing for Clouds Reginald Cushing, Marian Bubak, Adam Belloum, Cees de Laat System Network Engineering University of Amsterdam BigDataClouds 25 th August 2014 COMMIT/

Upload: others

Post on 22-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Automata-based Dynamic Data Processing for Clouds

Reginald Cushing, Marian Bubak,Adam Belloum, Cees de Laat

System Network EngineeringUniversity of Amsterdam

BigDataClouds25th August 2014

COMMIT/

Page 2: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Outline● Motivation

– Traditional workflow process model

● Automata data processing model– Transition Function– Transition Operations

● Implementation– Data Routing– Process Congestion Avoidance

● Example Data State Graphs/Workflows● Results

– State Transitions– Process Congestion Avoidance

● Conclusion/Future Work

Page 3: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

● How can we model data transformation during processing?– Workflows, etc model tasks and their respective

execution order.

– Provenance captured data lineage after processing.

● How can we model data processing agnostic from the underlying infrastructure?– Data processing schema

– Workflows often fit the resources.

● How can we have collaboration in data processing between Vms/Clouds/Groups?

Motivation - 1

Page 4: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Motivation – 2

● The standard black box process:– A → [p] → B (A and B are different entities)

● What if:– A → [p] → A' (both input and output are A but

in different states)

– This allows us to reason about data using state graphs.

– Data processing is described as a sequence of state transitions.

Page 5: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Motivation - 3

Page 6: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Motivation - 3

D:S0

D:S1 D:S2 D:S9

D:S3 D:S4 D:S5

D:S6 D:S7

D:S8

Page 7: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Automata Data Processing Model

● If the side effect of processing data object is changing its state then an automata can represent the possible state a data can be in.

● Specifically we use an NFA which is a 5 tuple:

– (Q, ∑, δ, q0, F)

– Q: finite set of states

– ∑: input alphabet

– δ: transition function

– F: set of final states

● Input alphabet are functions.● The transition function is the process (black box) transforms data from

state to state

Page 8: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Transition Function

Instead of an input symbol, the transition function takes anotherfunction σ as input. σ takes data and state as input and produces transformed data and new state. σ accepts multiple states and Outputs multiple states thus it can perform AND/OR operations onStates.

Page 9: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Transition Operations

● State transition logic follows boolean algebra.● Functions can output multiple states.● Basic state combinations are AND, OR.

Page 10: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Distributed Implementation

AMQP● Each VM/Node runs an identical copy of the stack.● Data processing is decentralized.● Communication adapters allow for P2P, MessageServer, File, HTTP, etc packet communication.

Page 11: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Processing Functions

● Each node can host a number of functions● Each function performs a state transitions

and data processing● Functions process data packets● Functions can optionally implement

split/run/merge functions– Split: splits data packets into many packets

– Run: processes the data packets

– Merge: merges the data packets

Page 12: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Data Packet Split/Merge

split()

run()

merge()

Function A

run()

Function A

run()

Function A

VM1

VM2

VM3

Page 13: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Data Routing

● Each node builds a state routing table

● Depending on the routes data can be partitioned amongst replica functions or replicated to different functions.

– Data replicated to S1:func1, S1:func2 but split between endpoint list of each.

State Tag Endpoint List

S3:func3

S1:func1

S1:func2

S4:func3

S5:func3

tcp://192.168.1.1:7711, tcp://192.168.1.2:7711, inproc://someid, amqp://rabbitmq

tcp://192.168.1.3:7711, inproc://someid, amqp://rabbitmq

tcp://192.168.1.5:7711, inproc://someid, amqp://rabbitmq

tcp://192.168.1.5:7711, inproc://someid, amqp://rabbitmq

tcp://192.168.1.5:7711, inproc://someid, amqp://rabbitmq

Page 14: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Processing Congestion Avoidance

● Data partitioning on heterogeneous performing resources is not straight forward. – Fine-grain partitioning increases parellisim, increases

overhead (communication, etc).

– Coarse grain partitioning decreases overhead, decreases parallelism.

– Too much data can crash a VM.

● Dynamically controlling granularity.– Successively increment data payload using PRTT

(Processing Round Trip Time).

– Reduce payload when degradation is detected.

Page 15: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Processing Efficiency

Foo()Bar()

ttimestamp

ttimestamp

Data record1

Data record2

Data recordn

texec

ttimestamp

ttimestamp

ack1

ack2

ackn

Efficiency = texec

/ (tnow

- ttimestamp

)

VM1VM2

Page 16: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Data Processing State Graph

The graph describing the allowable data transitions is also a data protocol header which allows data to be self contained and routable and processable.

Page 17: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

MRI - Workflow

Page 18: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

MRI – State Graph

Page 19: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Combined State Network

Page 20: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

State Transition Pipeline

● Creating a chain of state transitions on 11 Vms

● Each VM hosted 2 transitions 1 Forward, 1 Backward

Page 21: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Filtering Machine

Page 22: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

PCA – 1 Machine Shared Memory

● Functions on same node.● Communication through shared memory.● Stable Efficiency ~0.85● Little change in efficiency after ~150 Data Records.● Minimum comm overhead.

Page 23: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

PCA – 2 VMs P2P

● Functions on different nodes on same virtual network.● Communication P2P through ZMQ.● Stable Efficiency ~0.65

● Little change in efficiency after ~1550 Data Records.

Page 24: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

PCA – 2 LAN VMs AMQP Server

● Functions on different nodes. ● Communication through 3rd party message queue server.● Evidence of jitter.

● Efficiency increase but never reached plateau. ● Efficiency was degrading due to VM running out of memory.

Page 25: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

PCA – Local Cloud – EC2

● Functions on different nodes 1 on local private cloud and the other on EC2. ● Communication through 3rd party message queue server.

● Evidence of heavy jitter.● Strange result that we have two efficiency curves?

Page 26: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Conclusion/Future Work

● Data automata modeling provides– Better understanding of data transformations

– Doubles as routing table for distributed architectures

– Facilitates computing collaboration

● Data model can be considered as a schema. Data records can have an additional dimension, state, which results in a kind of 3D database.

Page 27: Automata-based Dynamic Data Processing for Clouds€¦ · Automata Data Processing Model If the side effect of processing data object is changing its state then an automata can represent

Questions?

● REFs– https://github.com/recap/pumpkin