
- 1 -
EdgeWise: A Better Stream Processing Engine for the Edge
Xinwei Fu, Talha Ghaffar, James C. Davis, Dongyoon Lee
Department of Computer Science

- 2 -
[Figure: Things (sensors, actuators in, e.g., a city or a hospital) → Gateways → Cloud; edge stream processing runs on the gateways, central analytics in the cloud.]
Internet of Things (IoT)
• Things, Gateways and Cloud
Edge Stream Processing
• Gateways process continuous streams of data in a timely fashion.

- 3 -
Our Edge Model
[Figure: Things (sensors, actuators) → Gateways → Cloud, as before.]
Hardware
• Limited resources
• Well connected
Application
• Reasonably complex operations
• For example, FarmBeats [NSDI'17]

- 4 -
Edge Stream Processing Requirements
[Figure: Things (sensors, actuators) → Gateways → Cloud, as before.]
• Multiplexed - Limited resources
• Low Latency - Locality
• No Backpressure - latency and storage
• Scalable - millions of sensors

- 5 -
Dataflow Programming Model
[Figure: a topology of sources, operations, and sinks connected by data flows, running on the gateways.]
Topology - a Directed Acyclic Graph of operations
Deployment
• Describes # of instances for each operation
• Running on gateways

- 6 -
Runtime System
Stream Processing Engines (SPEs):
• Apache Storm
• Apache Flink
• Apache Heron
[Figure: a topology (operations 0-4, src to sinks) mapped onto the One-Worker-per-Operation Architecture: each operation gets its own queue (Q0-Q3) and worker thread (Worker 0-3).]
One-Worker-per-Operation Architecture (OWPOA)
• A queue and a worker thread per operation
• Pipelined manner
• Backpressure - costs latency and storage
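The OWPOA design can be sketched in a few lines. This is an illustrative toy using plain Python threads and queues, not Storm's actual API: each operation owns an input queue and a dedicated worker thread, and tuples flow through the topology in a pipelined manner.

```python
# Toy sketch of the One-Worker-per-Operation Architecture (OWPOA):
# one input queue + one dedicated worker thread per operation.
import queue
import threading

def make_worker(op, in_q, out_q):
    def run():
        while True:
            item = in_q.get()
            if item is None:            # shutdown marker
                out_q.put(None)
                break
            out_q.put(op(item))         # apply the operation, forward downstream
    return threading.Thread(target=run)

# src -> OP0 -> OP1 -> sink
q0, q1, sink = queue.Queue(), queue.Queue(), queue.Queue()
workers = [make_worker(lambda x: x * 2, q0, q1),    # OP0
           make_worker(lambda x: x + 1, q1, sink)]  # OP1
for w in workers:
    w.start()
for x in [1, 2, 3]:                     # the source emits a stream
    q0.put(x)
q0.put(None)
for w in workers:
    w.join()

results = list(iter(sink.get, None))
print(results)  # [3, 5, 7]
```

The point of the sketch: with more operations than cores, these per-operation threads contend for the CPU and the OS alone decides which runs next, which is exactly the inefficiency the following slides examine.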

- 7 -
Problem
Existing One-Worker-per-Operation-Architecture Stream Processing Engines are not suitable for the Edge Setting!
[Checklist: Scalable / Multiplexed / Latency / Backpressure]
OWPOA SPEs
• Cloud-class resources
• OS scheduler
Edge
• Limited resources
• # of workers > # of CPU cores
• Inefficiency in the OS scheduler
Low input rate → most queues are empty.
High input rate → most or all queues contain data → scheduling inefficiency → backpressure → latency.

- 8 -
Problem
Existing One-Worker-per-Operation-Architecture Stream Processing Engines are not suitable for the Edge Setting!
[Checklist: Scalable / Multiplexed / Latency / Backpressure]
[Figure: animation of OWPOA under a random OS scheduler on a single core - a timeline of queues Q0 → Q1 feeding operations OP0 and OP1, scheduled in arbitrary order.]

- 17 -
Problem
Existing One-Worker-per-Operation-Architecture Stream Processing Engines are not suitable for the Edge Setting!
[Checklist: Scalable / Multiplexed / Latency / Backpressure]
[Figure: end of the single-core animation - queues Q0 → Q1, operations OP0 and OP1, scheduled randomly over time.]
The OS scheduler doesn't have engine-level knowledge.

- 18 -
EdgeWise
Key Ideas:
• Inefficiency in the OS scheduler → an engine-level scheduler
• # of workers > # of CPU cores → a fixed-size worker pool
[Figure: the same topology (operations 0-4) under OWPOA (one queue and worker thread per operation) versus EdgeWise (queues Q0-Q3 drained by a scheduler feeding a fixed-size worker pool that runs operations 1-4).]
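The architectural change can be sketched as follows. This is an illustrative toy with hypothetical names, not the real EdgeWise API (the actual system is built on Apache Storm in Java): a fixed pool of workers, one per core, repeatedly asks an engine-level scheduler what to run next, instead of dedicating one OS thread to each operation.

```python
# Sketch of EdgeWise's structure: per-operation queues, but a FIXED
# pool of workers driven by an engine-level schedule() decision.
import queue
import threading

NUM_CORES = 4
queues = [queue.Queue() for _ in range(4)]   # one queue per operation
processed = []
lock = threading.Lock()

def schedule():
    """Engine-level decision point. Placeholder policy: first non-empty
    queue; EdgeWise plugs its congestion-aware policy in here."""
    for op_id, q in enumerate(queues):
        try:
            item = q.get_nowait()
        except queue.Empty:
            continue
        def task(op_id=op_id, item=item):
            with lock:                       # the "operation" itself
                processed.append((op_id, item))
        return task
    return None

def worker(stop):
    while not stop.is_set():
        task = schedule()
        if task is not None:
            task()

for q in queues:                             # enqueue 3 items per operation
    for x in range(3):
        q.put(x)

stop = threading.Event()
pool = [threading.Thread(target=worker, args=(stop,)) for _ in range(NUM_CORES)]
for t in pool:
    t.start()
while any(not q.empty() for q in queues):    # wait until drained
    pass
stop.set()
for t in pool:
    t.join()
print(len(processed))                        # 12 items processed by 4 workers
```

The key structural difference from the OWPOA sketch is that the thread count is fixed at the core count no matter how many operations the topology has; the `schedule()` hook is where the engine's knowledge enters.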

- 19 -
EdgeWise – Fixed-size Worker Pool
• # of workers = # of cores
• Supports an arbitrary topology on limited resources
• Reduces the overhead of contending for cores
[Checklist: Scalable / Multiplexed / Latency / Alleviate Backpressure]
[Figure: EdgeWise architecture: src → queues Q0-Q3 → scheduler → fixed-size worker pool → operations 1-4 → sink.]

- 20 -
EdgeWise – Engine-level Scheduler
A Lost Lesson: Operation Scheduling
• Profiling-guided, priority-based
• Multiple OPs share a single worker
Carney [VLDB'03]
• Min-Latency algorithm
• Higher static priority on latter OPs
Babcock [VLDB'04]
• Min-Memory algorithm
• Higher static priority on faster filters
We should regain the benefit of engine-level operation scheduling!
[Figure: EdgeWise architecture as before: queues, scheduler, fixed-size worker pool, operations.]

- 21 -
EdgeWise – Congestion-Aware Scheduler
• Profiling-free dynamic solution
• Balance queue sizes
• Choose the OP with the most pending data.
[Checklist: Scalable / Multiplexed / Latency / Alleviate Backpressure]
[Figure: animation of the congestion-aware scheduler on a single core: queues Q0 → Q1, operations OP0 and OP1; the scheduler always runs the operation with the longer queue.]
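The policy itself is tiny. A minimal sketch (illustrative code, not EdgeWise's actual implementation): at each scheduling decision, run the operation whose input queue holds the most pending data.

```python
# Congestion-aware scheduling decision: pick the operation with
# the most pending input data.

def pick_next(queues):
    """Return the index of the operation with the longest input queue."""
    return max(range(len(queues)), key=lambda i: len(queues[i]))

# Single core, two operations OP0 -> OP1 with queues Q0 -> Q1:
queues = [[1, 2, 3, 4], [7]]     # Q0 holds 4 pending items, Q1 holds 1
assert pick_next(queues) == 0    # OP0 is the most congested: run it first

# After OP0 consumes one item and emits into Q1, sizes rebalance:
queues = [[2, 3, 4], [7, 1]]
print(pick_next(queues))         # 0: Q0 (3 items) still beats Q1 (2)
```

Because the decision needs only current queue lengths, the policy is profiling-free and adapts dynamically as input rates change.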

- 27 -
Performance Analysis using Queueing Theory
Conclusion 1: Maximum end-to-end throughput depends on scheduling heavier operations proportionally more than lighter operations.
Conclusion 2: Data waits longer in the queues of heavier operations. The growth in wait time is non-linear.
By balancing queue sizes, EdgeWise achieves higher throughput and lower latency than the One-Worker-per-Operation Architecture.
See our paper for more details.
Novelty: To the best of our knowledge, we are the first to apply queueing theory to analyze the improved performance in the context of stream processing.

- 28 -
Evaluation
Impl: v1.1.0. Hardware: v3.
OWPOA Baseline Schedulers: Random, Min-Memory, Min-Latency
Experiment Setup: Focus on a single node with 4 cores → a pool of 4 worker threads.
Benchmarks: RIoTBench - a real-time IoT stream processing benchmark for Storm.
Metrics: Throughput & Latency
[Figure: the PRED topology (O1-O6): Source → SenMLParse → Linear Reg. / Decision Tree → Average → Error Est. → MQTTPublish.]
More in the Paper.

- 29 -
Throughput-Latency Performance
[Figure: PRED topology throughput-latency curves for Storm, WP+Random, WP+MinMem, WP+MinLat, and EdgeWise; latency (ms, 0-500) versus throughput (600-5400). EdgeWise sustains higher throughput at lower latency.]

- 30 -
Fine-Grained Latency Analysis
[Figure: PRED per-operation latency breakdown (O1-O6) in Storm and in EdgeWise, split into TQ queuing time, TS schedule time, and TC compute time, at L (throughput 400), M (1600), and H (2800) input rates.]
Conclusion 2: Data waits longer in the queues of heavier operations. The growth in wait time is non-linear.
This is not a zero-sum game!

- 31 -
Conclusion
• Study existing SPEs and discuss their limitations in the Edge
• EdgeWise
§ Fixed-size worker pool
§ Congestion-aware scheduler - the lost lesson of operation scheduling
• Performance analysis of the congestion-aware scheduler using Queueing Theory
• Up to 3x improvement in throughput while keeping latency low
Sometimes the answers in system design lie not in the future but in the past.

- 32 -
Backup Slides

- 33 -
Problem
Existing OWPOA SPEs are not suitable for the Edge Setting!
[Checklist: Scalable / Multiplexed / Latency / No Backpressure]
[Figure: a topology (operations 0-3).]
The # of instances of each operation can be assigned during the deployment.

- 34 -
Performance Analysis using Queueing Theory
Novelty: To the best of our knowledge, we are the first to apply queueing theory to analyze the improved performance in the context of stream processing.
Prior scheduling works in stream processing either provide no analysis or focus only on memory optimization.

- 35 -
Performance Analysis using Queueing Theory
Conclusion 1: Maximum end-to-end throughput depends on scheduling heavier operations proportionally more than lighter operations.
For each operation i: input rate λ_i, service rate μ_i, utilization ρ_i = λ_i / μ_i.
A scheduling weight f_i gives operation i an effective service rate of f_i · μ_i.
Stability constraint: λ_i < f_i · μ_i → f_i > λ_i / μ_i
→ Each operation's scheduling weight must scale with its input rate / service rate.
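A numeric illustration of the constraint, with made-up rates for a "light" and a "heavy" operation (symbols follow the slide: lam = input rate λ, mu = service rate μ, rho = utilization λ/μ):

```python
# Stability (lam_i < f_i * mu_i) forces each operation's scheduling
# weight f_i above its utilization rho_i, so the heavier operation
# must receive proportionally more of the CPU.
ops = {
    "light": {"lam": 100.0, "mu": 1000.0},  # cheap filter
    "heavy": {"lam": 100.0, "mu": 125.0},   # expensive analytic
}
for op in ops.values():
    op["rho"] = op["lam"] / op["mu"]        # minimum feasible weight

total = sum(op["rho"] for op in ops.values())
for name, op in ops.items():
    op["f"] = op["rho"] / total             # normalize the core's time
    print(name, op["rho"], round(op["f"], 3))
# light 0.1 0.111
# heavy 0.8 0.889
```

Both weights satisfy stability (f · μ is 111 and about 111.1, both above the input rate of 100), and the heavy operation ends up with roughly 8x the light one's share of the core, matching Conclusion 1.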

- 36 -
Performance Analysis using Queueing Theory
Conclusion 2: Data waits longer in the queues of heavier operations, and crucially the growth in wait time is non-linear.
End-to-end latency = the sum of per-operation latencies.
Per-operation latency = queueing time (waiting in the queue, modeled with an exponential service-time distribution) + service time.
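Under these assumptions the queueing time follows the standard M/M/1 waiting-time formula (a textbook model used here for intuition; see the paper for the exact analysis):

```latex
% Expected time a tuple of operation i waits in queue, with arrival
% rate \lambda_i and effective service rate f_i\mu_i:
W_i \;=\; \frac{\lambda_i}{\,f_i\mu_i\,(f_i\mu_i - \lambda_i)\,}
% W_i diverges non-linearly as \lambda_i \to f_i\mu_i: a heavier
% (more utilized) operation waits disproportionately longer.
```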

- 37 -
Evaluation
Benchmarks: RIoTBench - a real-time IoT stream processing benchmark for Storm.
Modification: a timer-based input generator.
Metrics: Throughput & Latency

- 38 -
Throughput-Latency Performance
[Figure: throughput-latency curves for the PRED, TRAIN, ETL, and STATS topologies, comparing Storm, WP+Random, WP+MinMem, WP+MinLat, and EdgeWise.]

- 39 -
Fine-Grained Throughput Analysis
In PRED, as the input rate (throughput) increases, the coefficient of variation (CV) of operation utilization grows in Storm, but it decreases in EdgeWise.
[Figure: coefficient of variation of capacity versus throughput (600-4200) for Storm and EdgeWise.]
Conclusion 1: Maximum end-to-end throughput depends on scheduling heavier operations proportionally more than lighter operations.

- 40 -
Data Consumption Policy
Sensitivity study on various consumption policies with the STATS topology.
[Figure: latency (ms) versus throughput for Storm and for EdgeWise with consumption policies "all", "half", "100", and "50".]

- 41 -
Performance on Distributed Edge
PRED: maximum throughput achieved with latency less than 100 ms.
[Figure: maximum throughput versus number of nodes (1, 2, 4, 8) for Storm and EdgeWise.]

- 42 -
Limitations
I/O-bound computation:
• The preferred idiom is outer I/O, inner compute
• Worker pool size could be tuned
• I/O could be done elsewhere, like Microsoft Bosque
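The "outer I/O, inner compute" idiom can be sketched as follows (structure only; names are illustrative, not the EdgeWise API). Blocking I/O stays at the edges of the topology, and the operations the engine schedules are pure compute, so a worker thread never parks a core on I/O.

```python
# "Outer I/O, inner compute": I/O at the source/sink, pure functions
# in between.

def source(readings):
    """Outer I/O: e.g. read sensor data over MQTT (stubbed with a list)."""
    yield from readings

def average(window):
    """Inner compute: a pure function, safe to run on the worker pool."""
    return sum(window) / len(window)

def sink(value):
    """Outer I/O: e.g. publish the result to the cloud (stubbed)."""
    print(value)

window = list(source([2.0, 4.0, 6.0]))
sink(average(window))  # prints 4.0
```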

- 43 -
Icon Credit
Icons from www.flaticon.com