Development of a Distributed Stream Processing System

DESCRIPTION
Development of a Distributed Stream Processing System (DSPS) in Node.js and ZeroMQ, and a demonstration of a trending-topics application with a dataset from Twitter.

TRANSCRIPT
Development of a Distributed Stream Processing System
Maycon Viana Bordin, Final Assignment
Instituto de Informática
Universidade Federal do Rio Grande do Sul
CMP157 – PDP 2013/2, Claudio Geyer
What's Stream Processing?
Stream source: emits data continuously and sequentially.
Operators: count, join, filter, map.
Data streams flow from the source through operators to a sink.
A data stream is a sequence of tuples, e.g. ("word", 55).
Tuples are ordered by a timestamp or other attribute.
Data from the stream source may or may not be structured.
The amount of data is usually unbounded in size.
The input rate is variable and typically unpredictable.
Operators
An operator receives one or more data streams and sends one or more data streams.
Operators Classification
Stateless (map, filter)
Stateful:
  Non-Blocking (count, sum)
  Blocking (join, freq. itemset)
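The distinction can be sketched in plain JavaScript (the operator interface below is illustrative, not the system's actual API): a stateless map transforms each tuple independently, while a stateful non-blocking count must keep a counter between tuples and can emit an updated result after every input.

```javascript
// Stateless operator: output depends only on the current tuple.
const toUpper = (tuple) => [tuple[0].toUpperCase(), tuple[1]];

// Stateful non-blocking operator: keeps a running count per key
// and emits an updated result after every tuple.
function makeCounter() {
  const counts = new Map();
  return (tuple) => {
    const key = tuple[0];
    counts.set(key, (counts.get(key) || 0) + 1);
    return [key, counts.get(key)];
  };
}

const count = makeCounter();
console.log(toUpper(["foo", 1])); // ["FOO", 1]
console.log(count(["foo", 1]));   // ["foo", 1]
console.log(count(["foo", 1]));   // ["foo", 2]
```

A blocking operator such as join cannot be written this way: it would need to see the whole (unbounded) stream before producing output, which is what motivates windows below.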
Blocking operators need all of their input in order to generate a result, but that's not possible since data streams are unbounded.
To solve this issue, tuples are grouped into windows, delimited by a window start (ws) and a window end (we).
The range of a window is given in time units or in number of tuples.
As the window advances, the old [ws, we] interval slides forward to a new one.
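A count-based sliding window can be sketched as follows (a minimal illustration, not the system's implementation): `range` is the window size in tuples and `advance` is how far the window slides each step.

```javascript
// Count-based sliding window: `range` = window size in tuples,
// `advance` = how many tuples the window slides each step.
function makeWindow(range, advance, onWindow) {
  const buffer = [];
  let ws = 0; // window start (index of the first tuple in the window)
  let seen = 0;
  return (tuple) => {
    buffer.push(tuple);
    seen++;
    // window end (we) reached: emit the window, then slide by `advance`
    if (seen - ws >= range) {
      onWindow(buffer.slice());  // tuples currently in [ws, we)
      buffer.splice(0, advance); // drop tuples that left the window
      ws += advance;
    }
  };
}

const results = [];
const win = makeWindow(3, 2, (tuples) => results.push(tuples.length));
[1, 2, 3, 4, 5].forEach((t) => win(t));
console.log(results); // [3, 3] — windows emitted after tuples 3 and 5
```

With `advance` smaller than `range` consecutive windows overlap; with `advance` equal to `range` they tumble without overlap.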
Implementation Architecture
A master node coordinates a set of slaves; a client submits, starts, and stops applications through the master.
Each slave runs worker threads and periodically sends a heartbeat to the master.
The heartbeat carries the status of each worker in the slave: tuples processed, throughput, and latency.
Implementation Application
Applications are composed as a DAG (Directed Acyclic Graph).
To illustrate, let's look at the graph of a Trending Topics application:

stream -> extract -> hashtags -> countmin sketch -> File Sink

stream: the stream source emits tweets in JSON format.
extract: extracts the text from the tweet and adds a timestamp to each tuple.
hashtags: extracts and emits each #hashtag in the tweet.
countmin sketch: constant time and space approximate frequency counting [Cormode and Muthukrishnan, 2005]. Without a window, it will emit all top-k items each time a hashtag is received. With a window, the number of tuples emitted is reduced, but the latency is increased.
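A compact count-min sketch looks roughly like this (an illustrative sketch, not the system's implementation; a production version would use stronger pairwise-independent hash functions):

```javascript
// Count-min sketch: `depth` hash rows of `width` counters each.
// The frequency estimate for an item is the minimum over its
// counters; it can overestimate (collisions) but never underestimate.
class CountMinSketch {
  constructor(width, depth) {
    this.width = width;
    this.depth = depth;
    this.table = Array.from({ length: depth }, () => new Array(width).fill(0));
  }
  // Simple seeded FNV-style string hash, one seed per row.
  hash(item, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < item.length; i++) {
      h = Math.imul(h ^ item.charCodeAt(i), 16777619);
    }
    return (h >>> 0) % this.width;
  }
  add(item, count = 1) {
    for (let d = 0; d < this.depth; d++) {
      this.table[d][this.hash(item, d)] += count;
    }
  }
  estimate(item) {
    let min = Infinity;
    for (let d = 0; d < this.depth; d++) {
      min = Math.min(min, this.table[d][this.hash(item, d)]);
    }
    return min;
  }
}

const cms = new CountMinSketch(256, 4);
for (let i = 0; i < 10; i++) cms.add("#nodejs");
cms.add("#zeromq");
console.log(cms.estimate("#nodejs")); // at least 10
```

Space is fixed by `width` x `depth` regardless of how many distinct hashtags the stream contains, which is why it suits unbounded streams.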
The second step in building an application is to set the number of instances of each operator, e.g. 1 stream, 4 extract, 5 countmin, 1 File Sink.
But the user has to choose how tuples are going to be partitioned among the operator instances.
All-to-All Partitioning: every tuple is delivered to all instances of the downstream operator.

Round-Robin Partitioning: each tuple goes to the next instance in turn, spreading the load evenly.

Field Partitioning: tuples are routed by the value of a chosen field, so a tuple like ("foo", 1) always reaches the same instance.
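Round-robin and field partitioning can be sketched as small routing functions (illustrative only; function names here are not the framework's API):

```javascript
// Round-robin: each tuple goes to the next instance in turn.
function roundRobin(numInstances) {
  let next = 0;
  return () => next++ % numInstances;
}

// Field partitioning: hash a chosen field so that tuples with the
// same key always land on the same instance.
function byField(numInstances, field) {
  return (tuple) => {
    const key = String(tuple[field]);
    let h = 0;
    for (let i = 0; i < key.length; i++) {
      h = (Math.imul(h, 31) + key.charCodeAt(i)) >>> 0;
    }
    return h % numInstances;
  };
}

const rr = roundRobin(5);
console.log([rr(), rr(), rr(), rr(), rr(), rr()]); // [0, 1, 2, 3, 4, 0]

const fp = byField(5, 0);
// the same key always maps to the same instance
console.log(fp(["foo", 1]) === fp(["foo", 2])); // true
```

Field partitioning is what makes a stateful operator like countmin correct when parallelized: all occurrences of one hashtag reach the same instance, so its counts stay consistent.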
The communication between operators is done with the pub/sub pattern.
Each operator subscribes to all of its upstream operators, with its own ID as a filter, so it will only receive tuples carrying its ID as a prefix.
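ZeroMQ pub/sub sockets filter messages by prefix, so prepending the destination operator's ID to each message gives cheap routing. The filtering logic can be simulated in plain JavaScript (the real system would use zmq SUB sockets; the bus and message layout below are illustrative):

```javascript
// Each subscriber registers the ID prefix it wants; publish()
// delivers a message only to subscribers whose prefix matches,
// mimicking ZeroMQ's prefix-based subscription filtering.
function makeBus() {
  const subs = [];
  return {
    subscribe(prefix, handler) {
      subs.push({ prefix, handler });
    },
    publish(message) {
      for (const { prefix, handler } of subs) {
        if (message.startsWith(prefix)) handler(message);
      }
    },
  };
}

const bus = makeBus();
const received = [];
// operator instance "countmin-2" subscribes with its ID as the filter
bus.subscribe("countmin-2|", (msg) => received.push(msg));
bus.publish('countmin-2|{"hashtag":"#nodejs"}'); // delivered
bus.publish('countmin-1|{"hashtag":"#zeromq"}'); // filtered out
console.log(received.length); // 1
```

Because the filter is applied on the subscriber side by prefix match, the upstream operator just picks a destination ID (via its partitioner) and publishes once.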
The last step is to take each operator instance from the graph and assign it to a node (node-0, node-1, node-2).
Currently the scheduler is static and only balances the number of operators per node.
Implementation Framework
trending-topics.js
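The code of trending-topics.js did not survive transcription; the following is a hypothetical sketch of what composing the DAG might look like. All names here (`Topology`, `addOperator`, `connect`) are invented for illustration and are not the framework's real API.

```javascript
// Hypothetical DAG composition for Trending Topics; the Topology
// class below is a toy single-process stand-in for the framework.
class Topology {
  constructor() { this.ops = new Map(); this.edges = new Map(); }
  addOperator(id, fn, instances = 1) { // `instances` kept for illustration
    this.ops.set(id, fn);
    this.edges.set(id, []);
    return this;
  }
  connect(from, to) { this.edges.get(from).push(to); return this; }
  // Push one input through the graph, collecting sink outputs.
  run(sourceId, input, outputs = []) {
    const emit = (id, tuple) => {
      const downstream = this.edges.get(id);
      if (downstream.length === 0) outputs.push(tuple);
      for (const next of downstream) {
        for (const out of this.ops.get(next)(tuple)) emit(next, out);
      }
    };
    for (const t of this.ops.get(sourceId)(input)) emit(sourceId, t);
    return outputs;
  }
}

const topo = new Topology()
  .addOperator("stream", (json) => [JSON.parse(json)])
  .addOperator("extract", (t) => [{ text: t.text, ts: Date.now() }], 4)
  .addOperator("hashtags", (t) => t.text.match(/#\w+/g) || [], 4)
  .addOperator("filesink", (tag) => [tag]);
topo.connect("stream", "extract")
    .connect("extract", "hashtags")
    .connect("hashtags", "filesink");

console.log(topo.run("stream", '{"text":"hello #nodejs #zeromq"}'));
// ["#nodejs", "#zeromq"]
```

In the real system each operator instance would run as a worker on some node and the edges would be ZeroMQ pub/sub connections, but the declarative shape of the application is the same.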
Tests Specification
Application: Trending Topics, with a 40 GB dataset from Twitter.
Test Environment: GridRS - PUCRS; 3 nodes, each with 4 x 3.52 GHz (Intel Xeon), 2 GB RAM, Linux 2.6.32-5-amd64, Gigabit Ethernet.
Variables: number of nodes, number of operator instances, window size.
Metrics:
  Runtime
  Latency: time for a tuple to traverse the graph
  Throughput: no. of tuples processed per sec.
  Loss of tuples
Methodology: 5 runs per test. Every 3 s each operator sends its status with the no. of tuples processed. The PerfMon sink collects a tuple every 100 ms, and sends the average latency every 3 s (and cleans up the collected tuples).
Tests Number of Nodes
[Chart: Runtime vs Latency — runtime (minutes) and latency (ms) for 1, 2, and 3 nodes]
[Chart: Runtime vs Stream Rate — runtime (minutes) and stream rate (MB/s) for 1, 2, and 3 nodes]
[Chart: Throughput — tuples per second for 1, 2, and 3 nodes, per operator: stream, extractor, countmin, filesink, perfmon]
[Chart: Loss of Tuples — lost tuples for 1, 2, and 3 nodes, per operator: stream, extractor, countmin, filesink, perfmon]
[Chart: Throughput and Latency Over Time (nodes=3, instances=5, window=20) — throughput (tuples/s) per operator (stream, extractor, countmin, filesink) and latency (ms) over time (seconds)]
Tests Window Size
[Chart: Runtime vs Latency — runtime (minutes) and latency (ms) for window sizes 20, 80, 120, and 200]
Tests No. of Instances
[Chart: Runtime vs Stream Rate — runtime (minutes) and stream rate (MB/s) for 1 and 5 instances]
Conclusions
The system was able to process more data as more nodes were added.
On the other hand, distributing the load increased the latency.
The scheduler has to reduce the network communication.
The communication between workers on the same node has to happen through main memory.
References
Chakravarthy, Sharma. Stream Data Processing: A Quality of Service Perspective: Modeling, Scheduling, Load Shedding, and Complex Event Processing. Vol. 36. Springer, 2009.
Cormode, Graham, and S. Muthukrishnan. "An Improved Data Stream Summary: The Count-Min Sketch and Its Applications." Journal of Algorithms 55.1 (2005): 58-75.
Gulisano, Vincenzo Massimiliano, Ricardo Jiménez-Peris, and Patrick Valduriez. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. Diss. Informatica, 2012.
Source code @ github.com/mayconbordin/tempest