
- 1 -
EdgeWise: A Better Stream Processing Engine for the Edge
Xinwei Fu, Talha Ghaffar, James C. Davis, Dongyoon Lee
Department of Computer Science

- 2 -
[Figure: Things (sensors, actuators in, e.g., a city or a hospital) → Gateways → Cloud; edge stream processing runs on the gateways, central analytics in the cloud.]
Internet of Things (IoT)
• Things, Gateways and Cloud
Edge Stream Processing
• Gateways process continuous streams of data in a timely fashion.

- 3 -
Our Edge Model
[Figure: Things (sensors, actuators) → Gateways → Cloud, as before.]
Hardware
• Limited resources
• Well connected
Application
• Reasonably complex operations
• For example, FarmBeats [NSDI'17]

- 4 -
Edge Stream Processing Requirements
[Figure: Things (sensors, actuators) → Gateways → Cloud, as before.]
• Multiplexed - Limited resources
• Low Latency - Locality
• No Backpressure - latency and storage
• Scalable - millions of sensors

- 5 -
Dataflow Programming Model
[Figure: a topology of sources, operations, and sinks connected by data flows, running on the gateways.]
Topology - a Directed Acyclic Graph of operations
Deployment
• Describes # of instances for each operation
• Running on gateways

- 6 -
Runtime System
Stream Processing Engines (SPEs):
• Apache Storm
• Apache Flink
• Apache Heron
[Figure: a topology (operations 0-4, src to sinks) mapped onto the One-Worker-per-Operation Architecture: each operation gets its own queue (Q0-Q3) and worker thread (Worker 0-3).]
One-Worker-per-Operation Architecture (OWPOA)
• A queue and a worker thread per operation
• Pipelined manner
• Backpressure - costs latency and storage
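The OWPOA design can be sketched in a few lines. This is an illustrative toy using plain Python threads and queues, not Storm's actual API: each operation owns an input queue and a dedicated worker thread, and tuples flow through the topology in a pipelined manner.

```python
# Toy sketch of the One-Worker-per-Operation Architecture (OWPOA):
# one input queue + one dedicated worker thread per operation.
import queue
import threading

def make_worker(op, in_q, out_q):
    def run():
        while True:
            item = in_q.get()
            if item is None:            # shutdown marker
                out_q.put(None)
                break
            out_q.put(op(item))         # apply the operation, forward downstream
    return threading.Thread(target=run)

# src -> OP0 -> OP1 -> sink
q0, q1, sink = queue.Queue(), queue.Queue(), queue.Queue()
workers = [make_worker(lambda x: x * 2, q0, q1),    # OP0
           make_worker(lambda x: x + 1, q1, sink)]  # OP1
for w in workers:
    w.start()
for x in [1, 2, 3]:                     # the source emits a stream
    q0.put(x)
q0.put(None)
for w in workers:
    w.join()

results = list(iter(sink.get, None))
print(results)  # [3, 5, 7]
```

The point of the sketch: with more operations than cores, these per-operation threads contend for the CPU and the OS alone decides which runs next, which is exactly the inefficiency the following slides examine.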

- 7 -
Problem
Existing One-Worker-per-Operation-Architecture Stream Processing Engines are not suitable for the Edge Setting!
[Checklist: Scalable / Multiplexed / Latency / Backpressure]
OWPOA SPEs
• Cloud-class resources
• OS scheduler
Edge
• Limited resources
• # of workers > # of CPU cores
• Inefficiency in the OS scheduler
Low input rate → most queues are empty.
High input rate → most or all queues contain data → scheduling inefficiency → backpressure → latency.

- 8 -
Problem
Existing One-Worker-per-Operation-Architecture Stream Processing Engines are not suitable for the Edge Setting!
[Checklist: Scalable / Multiplexed / Latency / Backpressure]
[Figure: animation of OWPOA under a random OS scheduler on a single core - a timeline of queues Q0 → Q1 feeding operations OP0 and OP1, scheduled in arbitrary order.]

- 17 -
Problem
Existing One-Worker-per-Operation-Architecture Stream Processing Engines are not suitable for the Edge Setting!
[Checklist: Scalable / Multiplexed / Latency / Backpressure]
[Figure: end of the single-core animation - queues Q0 → Q1, operations OP0 and OP1, scheduled randomly over time.]
The OS scheduler doesn't have engine-level knowledge.

- 18 -
EdgeWise
Key Ideas:
• Inefficiency in the OS scheduler → an engine-level scheduler
• # of workers > # of CPU cores → a fixed-size worker pool
[Figure: the same topology (operations 0-4) under OWPOA (one queue and worker thread per operation) versus EdgeWise (queues Q0-Q3 drained by a scheduler feeding a fixed-size worker pool that runs operations 1-4).]
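The architectural change can be sketched as follows. This is an illustrative toy with hypothetical names, not the real EdgeWise API (the actual system is built on Apache Storm in Java): a fixed pool of workers, one per core, repeatedly asks an engine-level scheduler what to run next, instead of dedicating one OS thread to each operation.

```python
# Sketch of EdgeWise's structure: per-operation queues, but a FIXED
# pool of workers driven by an engine-level schedule() decision.
import queue
import threading

NUM_CORES = 4
queues = [queue.Queue() for _ in range(4)]   # one queue per operation
processed = []
lock = threading.Lock()

def schedule():
    """Engine-level decision point. Placeholder policy: first non-empty
    queue; EdgeWise plugs its congestion-aware policy in here."""
    for op_id, q in enumerate(queues):
        try:
            item = q.get_nowait()
        except queue.Empty:
            continue
        def task(op_id=op_id, item=item):
            with lock:                       # the "operation" itself
                processed.append((op_id, item))
        return task
    return None

def worker(stop):
    while not stop.is_set():
        task = schedule()
        if task is not None:
            task()

for q in queues:                             # enqueue 3 items per operation
    for x in range(3):
        q.put(x)

stop = threading.Event()
pool = [threading.Thread(target=worker, args=(stop,)) for _ in range(NUM_CORES)]
for t in pool:
    t.start()
while any(not q.empty() for q in queues):    # wait until drained
    pass
stop.set()
for t in pool:
    t.join()
print(len(processed))                        # 12 items processed by 4 workers
```

The key structural difference from the OWPOA sketch is that the thread count is fixed at the core count no matter how many operations the topology has; the `schedule()` hook is where the engine's knowledge enters.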

- 19 -
EdgeWise – Fixed-size Worker Pool
• # of workers = # of cores
• Supports an arbitrary topology on limited resources
• Reduces the overhead of contending for cores
[Checklist: Scalable / Multiplexed / Latency / Alleviate Backpressure]
[Figure: EdgeWise architecture: src → queues Q0-Q3 → scheduler → fixed-size worker pool → operations 1-4 → sink.]

- 20 -
EdgeWise – Engine-level Scheduler
A Lost Lesson: Operation Scheduling
• Profiling-guided, priority-based
• Multiple OPs share a single worker
Carney [VLDB'03]
• Min-Latency algorithm
• Higher static priority on latter OPs
Babcock [VLDB'04]
• Min-Memory algorithm
• Higher static priority on faster filters
We should regain the benefit of engine-level operation scheduling!
[Figure: EdgeWise architecture as before: queues, scheduler, fixed-size worker pool, operations.]

- 21 -
EdgeWise – Congestion-Aware Scheduler
• Profiling-free dynamic solution
• Balance queue sizes
• Choose the OP with the most pending data.
[Checklist: Scalable / Multiplexed / Latency / Alleviate Backpressure]
[Figure: animation of the congestion-aware scheduler on a single core: queues Q0 → Q1, operations OP0 and OP1; the scheduler always runs the operation with the longer queue.]
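The policy itself is tiny. A minimal sketch (illustrative code, not EdgeWise's actual implementation): at each scheduling decision, run the operation whose input queue holds the most pending data.

```python
# Congestion-aware scheduling decision: pick the operation with
# the most pending input data.

def pick_next(queues):
    """Return the index of the operation with the longest input queue."""
    return max(range(len(queues)), key=lambda i: len(queues[i]))

# Single core, two operations OP0 -> OP1 with queues Q0 -> Q1:
queues = [[1, 2, 3, 4], [7]]     # Q0 holds 4 pending items, Q1 holds 1
assert pick_next(queues) == 0    # OP0 is the most congested: run it first

# After OP0 consumes one item and emits into Q1, sizes rebalance:
queues = [[2, 3, 4], [7, 1]]
print(pick_next(queues))         # 0: Q0 (3 items) still beats Q1 (2)
```

Because the decision needs only current queue lengths, the policy is profiling-free and adapts dynamically as input rates change.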

- 27 -
Performance Analysis using Queueing Theory
Conclusion 1: Maximum end-to-end throughput depends on scheduling heavier operations proportionally more than lighter operations.
Conclusion 2: Data waits longer in the queues of heavier operations. The growth in wait time is non-linear.
By balancing queue sizes, EdgeWise achieves higher throughput and lower latency than the One-Worker-per-Operation Architecture.
See our paper for more details.
Novelty: To the best of our knowledge, we are the first to apply queueing theory to analyze the improved performance in the context of stream processing.

- 28 -
Evaluation
Impl: v1.1.0. Hardware: v3.
OWPOA Baseline Schedulers: Random, Min-Memory, Min-Latency
Experiment Setup: Focus on a single node with 4 cores → a pool of 4 worker threads.
Benchmarks: RIoTBench - a real-time IoT stream processing benchmark for Storm.
Metrics: Throughput & Latency
[Figure: the PRED topology (O1-O6): Source → SenMLParse → Linear Reg. / Decision Tree → Average → Error Est. → MQTTPublish.]
More in the Paper.

- 29 -
Throughput-Latency Performance
[Figure: PRED topology throughput-latency curves for Storm, WP+Random, WP+MinMem, WP+MinLat, and EdgeWise; latency (ms, 0-500) versus throughput (600-5400). EdgeWise sustains higher throughput at lower latency.]

- 30 -
Fine-Grained Latency Analysis
[Figure: PRED per-operation latency breakdown (O1-O6) in Storm and in EdgeWise, split into TQ queuing time, TS schedule time, and TC compute time, at L (throughput 400), M (1600), and H (2800) input rates.]
Conclusion 2: Data waits longer in the queues of heavier operations. The growth in wait time is non-linear.
This is not a zero-sum game!

- 31 -
Conclusion
• Study existing SPEs and discuss their limitations in the Edge
• EdgeWise
§ Fixed-size worker pool
§ Congestion-aware scheduler - the lost lesson of operation scheduling
• Performance analysis of the congestion-aware scheduler using Queueing Theory
• Up to 3x improvement in throughput while keeping latency low
Sometimes the answers in system design lie not in the future but in the past.

- 32 -
Backup Slides

- 33 -
Problem
Existing OWPOA SPEs are not suitable for the Edge Setting!
[Checklist: Scalable / Multiplexed / Latency / No Backpressure]
[Figure: a topology (operations 0-3).]
The # of instances of each operation can be assigned during the deployment.

- 34 -
Performance Analysis using Queueing Theory
Novelty: To the best of our knowledge, we are the first to apply queueing theory to analyze the improved performance in the context of stream processing.
Prior scheduling works in stream processing either provide no analysis or focus only on memory optimization.

- 35 -
Performance Analysis using Queueing Theory
Conclusion 1: Maximum end-to-end throughput depends on scheduling heavier operations proportionally more than lighter operations.
For each operation i: input rate λ_i, service rate μ_i, utilization ρ_i = λ_i / μ_i.
A scheduling weight f_i gives operation i an effective service rate of f_i · μ_i.
Stability constraint: λ_i < f_i · μ_i → f_i > λ_i / μ_i
→ Each operation's scheduling weight must scale with its input rate / service rate.
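A numeric illustration of the constraint, with made-up rates for a "light" and a "heavy" operation (symbols follow the slide: lam = input rate λ, mu = service rate μ, rho = utilization λ/μ):

```python
# Stability (lam_i < f_i * mu_i) forces each operation's scheduling
# weight f_i above its utilization rho_i, so the heavier operation
# must receive proportionally more of the CPU.
ops = {
    "light": {"lam": 100.0, "mu": 1000.0},  # cheap filter
    "heavy": {"lam": 100.0, "mu": 125.0},   # expensive analytic
}
for op in ops.values():
    op["rho"] = op["lam"] / op["mu"]        # minimum feasible weight

total = sum(op["rho"] for op in ops.values())
for name, op in ops.items():
    op["f"] = op["rho"] / total             # normalize the core's time
    print(name, op["rho"], round(op["f"], 3))
# light 0.1 0.111
# heavy 0.8 0.889
```

Both weights satisfy stability (f · μ is 111 and about 111.1, both above the input rate of 100), and the heavy operation ends up with roughly 8x the light one's share of the core, matching Conclusion 1.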

- 36 -
Performance Analysis using Queueing Theory
Conclusion 2: Data waits longer in the queues of heavier operations, and crucially the growth in wait time is non-linear.
End-to-end latency = the sum of per-operation latencies.
Per-operation latency = queueing time (waiting in the queue, modeled with an exponential service-time distribution) + service time.
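Under these assumptions the queueing time follows the standard M/M/1 waiting-time formula (a textbook model used here for intuition; see the paper for the exact analysis):

```latex
% Expected time a tuple of operation i waits in queue, with arrival
% rate \lambda_i and effective service rate f_i\mu_i:
W_i \;=\; \frac{\lambda_i}{\,f_i\mu_i\,(f_i\mu_i - \lambda_i)\,}
% W_i diverges non-linearly as \lambda_i \to f_i\mu_i: a heavier
% (more utilized) operation waits disproportionately longer.
```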

- 37 -
Evaluation
Benchmarks: RIoTBench - a real-time IoT stream processing benchmark for Storm.
Modification: a timer-based input generator.
Metrics: Throughput & Latency

- 38 -
Throughput-Latency Performance
[Figure: throughput-latency curves for the PRED, TRAIN, ETL, and STATS topologies, comparing Storm, WP+Random, WP+MinMem, WP+MinLat, and EdgeWise.]

- 39 -
Fine-Grained Throughput Analysis
In PRED, as the input rate (throughput) increases, the coefficient of variation (CV) of operation utilization grows in Storm, but it decreases in EdgeWise.
[Figure: coefficient of variation of capacity versus throughput (600-4200) for Storm and EdgeWise.]
Conclusion 1: Maximum end-to-end throughput depends on scheduling heavier operations proportionally more than lighter operations.

- 40 -
Data Consumption Policy
Sensitivity study on various consumption policies with the STATS topology.
[Figure: latency (ms) versus throughput for Storm and for EdgeWise with consumption policies "all", "half", "100", and "50".]

- 41 -
Performance on Distributed Edge
PRED: maximum throughput achieved with latency less than 100 ms.
[Figure: maximum throughput versus number of nodes (1, 2, 4, 8) for Storm and EdgeWise.]

- 42 -
Limitations
I/O-bound computation:
• The preferred idiom is outer I/O, inner compute
• Worker pool size could be tuned
• I/O could be done elsewhere, like Microsoft Bosque
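The "outer I/O, inner compute" idiom can be sketched as follows (structure only; names are illustrative, not the EdgeWise API). Blocking I/O stays at the edges of the topology, and the operations the engine schedules are pure compute, so a worker thread never parks a core on I/O.

```python
# "Outer I/O, inner compute": I/O at the source/sink, pure functions
# in between.

def source(readings):
    """Outer I/O: e.g. read sensor data over MQTT (stubbed with a list)."""
    yield from readings

def average(window):
    """Inner compute: a pure function, safe to run on the worker pool."""
    return sum(window) / len(window)

def sink(value):
    """Outer I/O: e.g. publish the result to the cloud (stubbed)."""
    print(value)

window = list(source([2.0, 4.0, 6.0]))
sink(average(window))  # prints 4.0
```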

- 43 -
Icon Credit
Icons from www.flaticon.com