Development of a Distributed Stream Processing System

Maycon Viana Bordin, Final Assignment
Instituto de Informática, Universidade Federal do Rio Grande do Sul
CMP157 – PDP 2013/2, Claudio Geyer

DESCRIPTION

Development of a Distributed Stream Processing System (DSPS) built with Node.js and ZeroMQ, demonstrated with a trending-topics application over a Twitter dataset.

TRANSCRIPT

Page 1: Development of a Distributed Stream Processing System

Development of a Distributed Stream Processing System

Maycon Viana Bordin Final Assignment

Instituto de Informática

Universidade Federal do Rio Grande do Sul

CMP157 – PDP 2013/2, Claudio Geyer

Page 2: Development of a Distributed Stream Processing System

What’s Stream Processing?

Page 3: Development of a Distributed Stream Processing System
Page 4: Development of a Distributed Stream Processing System

Stream Source: emits data continuously and sequentially

Page 5: Development of a Distributed Stream Processing System

Operators: count, join, filter, map


Page 6: Development of a Distributed Stream Processing System

Data streams


Page 7: Development of a Distributed Stream Processing System

Sink

Page 8: Development of a Distributed Stream Processing System

Data Stream

Page 9: Development of a Distributed Stream Processing System

Tuple -> (“word”, 55)


Page 10: Development of a Distributed Stream Processing System

Tuples are ordered by a timestamp or other attribute

Page 11: Development of a Distributed Stream Processing System

Data from the stream source may or may not be structured

Page 12: Development of a Distributed Stream Processing System

The amount of data is usually unbounded in size

Page 13: Development of a Distributed Stream Processing System

The input rate is variable and typically unpredictable

Page 14: Development of a Distributed Stream Processing System

Operators

Page 15-16: Development of a Distributed Stream Processing System

OP: receives one or more data streams and sends one or more data streams

Page 17: Development of a Distributed Stream Processing System

Operators Classification

Page 18-22: Development of a Distributed Stream Processing System

OPERATORS

Stateless (map, filter)

Stateful:

Non-Blocking (count, sum)

Blocking (join, freq. itemset)
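The classification above can be sketched in plain JavaScript. This is illustrative only, not the system's actual operator API: a stateless operator maps each tuple to zero or more outputs, while a stateful non-blocking operator keeps running state and can still emit per tuple.

```javascript
// Stateless operators: no state kept between tuples.
const filterOp = (pred) => (tuple) => (pred(tuple) ? [tuple] : []);
const mapOp = (fn) => (tuple) => [fn(tuple)];

// Stateful, non-blocking: keeps a running count per key and can emit
// an updated result for every input tuple, without waiting.
function countOp() {
  const counts = new Map();
  return ([key]) => {
    counts.set(key, (counts.get(key) || 0) + 1);
    return [[key, counts.get(key)]];
  };
}

const count = countOp();
count(["foo"]); // [["foo", 1]]
count(["foo"]); // [["foo", 2]]
```

A blocking operator such as a join, by contrast, cannot emit until it has seen all of its input, which is why windows are needed (next slides).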

Page 23: Development of a Distributed Stream Processing System

Blocking operators need all input in order to generate a result

Page 24: Development of a Distributed Stream Processing System

but that’s not possible since data streams are unbounded

Page 25: Development of a Distributed Stream Processing System

To solve this issue, tuples are grouped in windows

Page 26: Development of a Distributed Stream Processing System

window start (ws)

window end (we)

Range in time units or number of tuples

Page 27: Development of a Distributed Stream Processing System

old ws old we

new ws new we

advance
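The window mechanics above (a range plus an advance that slides ws and we forward) can be sketched with a count-based window. This is a minimal illustration of the semantics described on the slides, not the system's code:

```javascript
// Count-based sliding window: `range` tuples per window, the window
// start/end advance by `advance` tuples each time the window closes.
class SlidingWindow {
  constructor(range, advance) {
    this.range = range;
    this.advance = advance;
    this.buffer = [];
  }
  // Returns the window contents when the window closes, otherwise null.
  push(tuple) {
    this.buffer.push(tuple);
    if (this.buffer.length < this.range) return null;
    const win = this.buffer.slice();               // tuples in [ws, we)
    this.buffer = this.buffer.slice(this.advance); // move ws and we forward
    return win;
  }
}

const w = new SlidingWindow(3, 2);
[1, 2, 3, 4, 5].forEach((t) => w.push(t)); // windows: [1,2,3] then [3,4,5]
```

A time-based window works the same way, with the range and advance measured in time units instead of tuple counts.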

Page 28: Development of a Distributed Stream Processing System

Implementation Architecture

Page 29: Development of a Distributed Stream Processing System

[Diagram: a client submits, starts, and stops applications through the master; each slave runs worker threads and sends a heartbeat to the master]


Page 31: Development of a Distributed Stream Processing System

The heartbeat carries the status of each worker in the slave

Tuples processed

Throughput

Latency
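A heartbeat message carrying those per-worker metrics might look like the sketch below. The field names are assumptions for illustration; the slides only state what the heartbeat carries, not its wire format:

```javascript
// Hypothetical heartbeat payload builder (field names are assumptions).
function buildHeartbeat(slaveId, workers) {
  return {
    slaveId,
    timestamp: Date.now(),
    workers: workers.map((w) => ({
      workerId: w.id,
      tuplesProcessed: w.tuplesProcessed, // total tuples processed so far
      throughput: w.throughput,           // tuples per second
      latency: w.latency,                 // average latency in ms
    })),
  };
}

const hb = buildHeartbeat("slave-1", [
  { id: "w0", tuplesProcessed: 1200, throughput: 400, latency: 12.5 },
]);
```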

Page 32: Development of a Distributed Stream Processing System

Implementation Application

Page 33: Development of a Distributed Stream Processing System

Applications are composed as a DAG (Directed Acyclic Graph)

Page 34: Development of a Distributed Stream Processing System

To illustrate, let’s look at the graph of a Trending Topics application

Page 35: Development of a Distributed Stream Processing System

[Graph: stream → extract hashtags → countmin sketch → File Sink]

Page 36: Development of a Distributed Stream Processing System

stream: the source emits tweets in JSON format

Page 37: Development of a Distributed Stream Processing System

extract: extracts the text from the tweet and adds a timestamp to each tuple

Page 38: Development of a Distributed Stream Processing System

hashtags: extracts and emits each #hashtag in the tweet

Page 39: Development of a Distributed Stream Processing System

countmin sketch: constant time and space approximate frequent-item counting [Cormode and Muthukrishnan, 2005]
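A minimal count-min sketch can be written in a few lines of JavaScript. This follows the structure from Cormode and Muthukrishnan (2005): d rows of w counters, each row indexed by its own hash; the estimate is the minimum over rows, which may overestimate but never underestimates. The hash function here is a simple seeded FNV-style hash chosen for illustration, not the one used in the actual system:

```javascript
class CountMinSketch {
  constructor(width = 256, depth = 4) {
    this.width = width;
    this.depth = depth;
    this.table = Array.from({ length: depth }, () => new Array(width).fill(0));
  }
  // Seeded string hash mapped into [0, width).
  hash(item, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < item.length; i++) {
      h ^= item.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.width;
  }
  add(item) {
    for (let d = 0; d < this.depth; d++) this.table[d][this.hash(item, d)]++;
  }
  // Estimated frequency: min over the d counters touched by the item.
  estimate(item) {
    let min = Infinity;
    for (let d = 0; d < this.depth; d++)
      min = Math.min(min, this.table[d][this.hash(item, d)]);
    return min;
  }
}

const cms = new CountMinSketch();
["#a", "#a", "#b"].forEach((h) => cms.add(h));
```

The space is fixed by width and depth regardless of how many distinct hashtags arrive, which is what makes it suitable for unbounded streams.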

Page 40: Development of a Distributed Stream Processing System

Without a window, it will emit all top-k items each time a hashtag is received

Page 41: Development of a Distributed Stream Processing System

With a window, the number of tuples emitted is reduced, but the latency is increased

Page 42: Development of a Distributed Stream Processing System

The second step in building an application is to set the number of instances of each operator:

Page 43: Development of a Distributed Stream Processing System

[Diagram: stream → extract (×5) → countmin (×5) → File Sink]

Page 44: Development of a Distributed Stream Processing System

But the user has to choose how tuples are partitioned among the operator instances

Page 45: Development of a Distributed Stream Processing System

All-to-All Partitioning

Page 46-47: Development of a Distributed Stream Processing System

[Diagram: with all-to-all partitioning, every countmin instance receives every tuple emitted by the extract instances]

Page 48: Development of a Distributed Stream Processing System

Round-Robin Partitioning

Page 49-54: Development of a Distributed Stream Processing System

[Diagram: with round-robin partitioning, each tuple emitted by an extract instance goes to a single countmin instance, with the instances taking turns]

Page 55: Development of a Distributed Stream Processing System

Field Partitioning

Page 56-59: Development of a Distributed Stream Processing System

[Diagram: with field partitioning, every tuple with the same key field, e.g. (“foo”, 1), is routed to the same countmin instance]
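The three partitioning schemes can be sketched as functions that map a tuple to the downstream instance index(es) that should receive it. This is illustrative, not the system's code; field partitioning matters for stateful operators like countmin, since equal keys must land on the same instance:

```javascript
// All-to-all: every downstream instance receives every tuple.
const allToAll = (n) => () => Array.from({ length: n }, (_, i) => i);

// Round-robin: instances take turns receiving tuples.
const roundRobin = (n) => {
  let next = 0;
  return () => [next++ % n];
};

// Field partitioning: hash the key field, so tuples with equal keys
// always reach the same instance.
const byField = (n) => ([key]) => {
  let h = 0;
  for (let i = 0; i < key.length; i++) h = (h * 31 + key.charCodeAt(i)) | 0;
  return [Math.abs(h) % n];
};
```

With `byField`, the tuple (“foo”, 1) and any later (“foo”, k) hash to the same countmin instance, so that instance sees the complete count for “foo”.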

Page 60: Development of a Distributed Stream Processing System

The communication between operators is done with the pub/sub pattern

Page 61-63: Development of a Distributed Stream Processing System

[Diagram: stream → extract (×5) → countmin (×5) → File Sink, wired via pub/sub]

Each operator subscribes to all of its upstream operators, using its own ID as a filter, so it only receives tuples carrying its ID as a prefix.
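ZeroMQ SUB sockets do exactly this kind of prefix matching: `subscribe(prefix)` delivers only messages that begin with the given bytes. The dependency-free sketch below mimics that filtering logic in-process so the routing idea is visible without the library (the `countmin-2|` topic format is an assumption for illustration):

```javascript
// In-process stand-in for a ZeroMQ PUB socket with prefix-filtered SUBs.
class PubSocket {
  constructor() { this.subs = []; }
  subscribe(prefix, handler) { this.subs.push({ prefix, handler }); }
  send(message) {
    // Deliver only to subscribers whose prefix matches, as ZeroMQ does.
    for (const { prefix, handler } of this.subs)
      if (message.startsWith(prefix)) handler(message);
  }
}

const pub = new PubSocket();
const received = [];
pub.subscribe("countmin-2|", (m) => received.push(m));
pub.send("countmin-2|#foo"); // delivered: matches the subscribed prefix
pub.send("countmin-1|#bar"); // filtered out: different downstream ID
```

Prefix filtering on the subscriber side means the upstream operator can publish every tuple once, tagged with the target instance ID, and the transport drops everything a given instance did not subscribe to.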

Page 64: Development of a Distributed Stream Processing System

The last step is to take each operator instance from the graph and assign it to a node

Page 65: Development of a Distributed Stream Processing System

[Diagram: the operator instances distributed across node-0, node-1, and node-2]

Page 66: Development of a Distributed Stream Processing System

Currently the scheduler is static and only balances the number of operators per node
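A scheduler that only balances instance counts reduces to round-robin assignment over the nodes. A minimal sketch of that policy (names are illustrative, not the system's API):

```javascript
// Static scheduler: spread operator instances evenly across nodes,
// ignoring network locality between connected operators.
function schedule(instances, nodes) {
  const assignment = new Map(nodes.map((n) => [n, []]));
  instances.forEach((inst, i) => {
    assignment.get(nodes[i % nodes.length]).push(inst);
  });
  return assignment;
}

const plan = schedule(
  ["stream", "extract-0", "extract-1", "countmin-0", "countmin-1", "filesink"],
  ["node-0", "node-1", "node-2"]
); // each node receives 2 instances
```

Note the limitation the conclusions point at: this balances counts but may place two heavily-communicating operators on different nodes, paying network cost where a locality-aware scheduler would not.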

Page 67: Development of a Distributed Stream Processing System

Implementation Framework

Page 68: Development of a Distributed Stream Processing System

trending-topics.js
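The slide showed trending-topics.js, but the framework's actual API is not in this transcript. As a hypothetical sketch only (every function and option name here is an assumption), the DAG described earlier, with instance counts and partitioning per operator, might be declared like this:

```javascript
// Hypothetical topology declaration for the trending-topics application.
function buildTopology() {
  const topology = { operators: [], edges: [] };
  const op = (name, instances, partitioning) => {
    topology.operators.push({ name, instances, partitioning });
    return name;
  };
  const connect = (from, to) => topology.edges.push([from, to]);

  const stream = op("stream", 1, "round-robin");
  const extract = op("extract-hashtags", 5, "round-robin");
  const countmin = op("countmin-sketch", 5, "field"); // stateful: field partitioning
  const sink = op("file-sink", 1, "all-to-all");

  connect(stream, extract);
  connect(extract, countmin);
  connect(countmin, sink);
  return topology;
}
```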

Page 69: Development of a Distributed Stream Processing System

Tests Specification

Page 70: Development of a Distributed Stream Processing System

Application: Trending Topics, with a 40 GB dataset from Twitter

Page 71: Development of a Distributed Stream Processing System

Test Environment

GridRS - PUCRS

3 nodes

4 x 3.52 GHz (Intel Xeon)

2 GB RAM

Linux 2.6.32-5-amd64

Gigabit Ethernet

Page 72: Development of a Distributed Stream Processing System

Variables

Number of nodes

Number of operator instances

Window size

Metrics

Runtime

Latency: time for a tuple to traverse the graph

Throughput: no. of tuples processed per sec.

Loss of tuples

Methodology

5 runs per test.

Every 3s each operator sends its status with the no. of tuples processed.

The PerfMon sink collects a tuple every 100ms, and sends the average latency every 3s (then clears the collected tuples).

Page 73: Development of a Distributed Stream Processing System

Tests Number of Nodes

Page 74: Development of a Distributed Stream Processing System

[Chart: Runtime vs Latency. X axis: no. of nodes (1, 2, 3); Y axes: runtime (minutes) and latency (ms). Series: runtime (min), latency (ms)]

Page 75: Development of a Distributed Stream Processing System

[Chart: Runtime vs Stream Rate. X axis: no. of nodes (1, 2, 3); Y axes: runtime (minutes) and stream rate (MB/s). Series: runtime (min), stream rate (MB/s)]

Page 76: Development of a Distributed Stream Processing System

[Chart: Throughput. X axis: no. of nodes (1, 2, 3); Y axis: tuples per second (tps). Series: stream, extractor, countmin, filesink, perfmon]

Page 77: Development of a Distributed Stream Processing System

[Chart: Loss of Tuples. X axis: no. of nodes (1, 2, 3); Y axis: lost tuples. Series: stream, extractor, countmin, filesink, perfmon]

Page 78: Development of a Distributed Stream Processing System

[Chart: Throughput and Latency Over Time (nodes=3, instances=5, window=20). X axis: time (seconds); Y axes: throughput (tuples/s) and latency (ms). Series: stream, extractor, countmin, filesink, latency]

Page 79: Development of a Distributed Stream Processing System

Tests Window Size

Page 80: Development of a Distributed Stream Processing System

[Chart: Runtime vs Latency. X axis: window size (20, 80, 120, 200); Y axes: runtime (minutes) and latency (ms). Series: runtime (min), latency (ms)]

Page 81: Development of a Distributed Stream Processing System

Tests No. of Instances

Page 82: Development of a Distributed Stream Processing System

[Chart: Runtime vs Stream Rate. X axis: no. of instances (1, 5); Y axes: runtime (minutes) and stream rate (MB/s). Series: runtime (min), stream rate (MB/s)]

Page 83: Development of a Distributed Stream Processing System

Conclusions

Page 84: Development of a Distributed Stream Processing System

The system was able to process more data with the inclusion of more nodes

Page 85: Development of a Distributed Stream Processing System

On the other hand, distributing the load increased the latency

Page 86: Development of a Distributed Stream Processing System

The scheduler needs to be improved to reduce network communication

Page 87: Development of a Distributed Stream Processing System

The communication between workers in the same node has to happen through main memory

Page 88: Development of a Distributed Stream Processing System

References

Chakravarthy, Sharma. Stream Data Processing: A Quality of Service Perspective: Modeling, Scheduling, Load Shedding, and Complex Event Processing. Vol. 36. Springer, 2009.

Cormode, Graham, and S. Muthukrishnan. "An improved data stream summary: the count-min sketch and its applications." Journal of Algorithms 55.1 (2005): 58-75.

Gulisano, Vincenzo Massimiliano, Ricardo Jiménez-Peris, and Patrick Valduriez. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. PhD dissertation, 2012.

Source code @ github.com/mayconbordin/tempest