
Load Management and High Availability in Borealis
Magdalena Balazinska, Jeong-Hyon Hwang, and the Borealis team
MIT, Brown University, and Brandeis University

Borealis is a distributed stream processing system (DSPS) based on Aurora and Medusa.

Contract-Based Load Management

Goals:
• Manage load through collaborations between autonomous participants
• Ensure an acceptable allocation, where each node's load is below its threshold

Challenges: incentives, efficiency, and customization

Approach:
1 - Offline, participants negotiate and establish bilateral contracts that:
• Fix or tightly bound the price per unit of load (e.g., a contract specifying that A will pay C $p per unit of load, or bounding the price to a small range [p, p+e])
• Are private and customizable (e.g., performance and availability guarantees, SLAs)
2 - At runtime, load moves only between participants that have a contract. Movements are based on marginal costs:
• Each participant has a private convex cost function mapping offered load (msgs/sec) to total cost (delay, $)
• Load moves when it's cheaper to pay a partner than to process locally: task t moves from A to B if the unit marginal cost of t is above p at A and below p at B (see the sketch below)

Properties:
• Simple, efficient, and low overhead (provably small bounds)
• Provable incentives to participate in the mechanism
• Experimental result: a small number of contracts and small price ranges suffice to achieve an acceptable allocation

[Diagram: participants A, B, B', and C linked by contracts at price p, a bounded range [p, p+e], and 0.8p; a convex cost function plotting total cost (delay, $) against offered load (msgs/sec); the marginal cost MC(t) of an arbitrary load(t) evaluated at A and at B]
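The movement rule fits in a few lines of code. Below is a minimal sketch, assuming a quadratic (hence convex) cost function; `marginal_cost` and `should_move_task` are hypothetical names chosen for illustration, not Borealis APIs.

```python
# Minimal sketch of the contract-based movement rule, assuming a
# quadratic (hence convex) cost function; all names and constants here
# are illustrative, not part of Borealis.

def marginal_cost(load, alpha=0.01):
    """Unit marginal cost d(cost)/d(load) for cost(load) = alpha * load**2."""
    return 2 * alpha * load

def should_move_task(task_load, load_at_a, load_at_b, p):
    """Task t moves from A to B iff t's unit marginal cost is above the
    contract price p at A and below p at B (after B absorbs the task)."""
    mc_at_a = marginal_cost(load_at_a)               # unit MC of t at A
    mc_at_b = marginal_cost(load_at_b + task_load)   # unit MC of t at B
    return mc_at_a > p and mc_at_b < p

# Example: an overloaded A and a lightly loaded B with a contract at p = 1.0.
print(should_move_task(task_load=10, load_at_a=80, load_at_b=20, p=1.0))  # True
```

Because each cost function is convex, the marginal cost rises with load, so load keeps flowing from expensive to cheap nodes until no task satisfies the rule, which is where the allocation stabilizes.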

HA Semantics and Algorithms

Goal: Streaming applications can tolerate different types of failure recovery:
• Gap recovery: may lose tuples
• Rollback recovery: produces duplicates but does not lose tuples
• Precise recovery: takes over precisely from the point of failure

Challenges: operator and processing non-determinism. Operators fall into classes with different recovery semantics:
• Repeatable: Filter, Map, Join
• Convergent: BSort, Resample, Aggregate
• Deterministic: Union, operators with timeouts

Approaches (a primary B with a secondary replica B', between upstream node A and downstream node C):
• Upstream Backup (ACK, trim): lowest runtime overhead (see the sketch below)
• Active Standby (replay): shortest recovery time
• Passive Standby (ACK, trim, checkpoint): most suitable for precise recovery

[Diagrams: for each approach, primary B, secondary B', upstream A, and downstream C, with arrows for the ACK, trim, replay, and checkpoint messages]
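To make the upstream-backup tradeoff concrete, here is a minimal sketch of an upstream output log that is trimmed by downstream ACKs; the class and method names are assumptions for illustration, not the Borealis implementation.

```python
# Minimal sketch of upstream backup: the upstream node keeps sent tuples
# in an output log, trims the log only when downstream acknowledges them,
# and replays the log to a new replica after a downstream failure.

from collections import deque

class UpstreamOutputLog:
    def __init__(self):
        self.log = deque()       # (seq, tuple) pairs awaiting acknowledgment
        self.next_seq = 0

    def send(self, tup):
        """Record every output tuple before (logically) sending it."""
        self.log.append((self.next_seq, tup))
        self.next_seq += 1

    def ack(self, seq):
        """Downstream ACKs everything up to seq: trim the log prefix."""
        while self.log and self.log[0][0] <= seq:
            self.log.popleft()

    def replay(self):
        """On downstream failure, re-send all unacknowledged tuples."""
        return list(self.log)
```

Replay after a failure re-sends every unacknowledged tuple, so this scheme gives rollback recovery: no losses, but possible duplicates unless they are eliminated downstream. The only runtime cost is logging and occasional ACKs, which is why upstream backup has the lowest runtime overhead of the three approaches.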

Network Partitions

Goal: Handle network partitions in a distributed stream processing system

Challenges:
• Maximize availability
• Minimize reprocessing
• Maintain consistency

Approach: Favor availability, and use updates to achieve consistency:
• Use connection points to create replicas and stream versions
• Downstream nodes monitor upstream nodes, reconnect to an available upstream replica, and continue processing with minimal disruptions (see the sketch below)
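A minimal sketch of a downstream node's failover logic under this approach: monitor the current upstream replica and, if it becomes unreachable, reconnect to another replica of the same named stream, resume from the last position read, and bump the output stream version. The `is_reachable` and `subscribe` calls on the replica objects are illustrative assumptions, not the Borealis API.

```python
# Minimal sketch of downstream failover across upstream replicas of the
# same named stream; favors availability over waiting for the partition
# to heal. All names here are illustrative.

class DownstreamNode:
    def __init__(self, stream_name, replicas):
        self.stream_name = stream_name
        self.replicas = replicas     # upstream replicas publishing the stream
        self.current = replicas[0]
        self.position = 0            # last position read on the input stream
        self.output_version = 0

    def on_tuple(self, tup, position):
        self.position = position     # track progress for possible failover
        # ... process tup ...

    def check_upstream(self):
        if self.current.is_reachable():
            return
        # Favor availability: switch to any reachable replica of the stream.
        for replica in self.replicas:
            if replica is not self.current and replica.is_reachable():
                self.current = replica
                # Same stream name, different version; resume from the same
                # point so processing continues with minimal disruption.
                replica.subscribe(self.stream_name, start=self.position)
                self.output_version += 1  # new version of the output stream
                return
        # No replica reachable: keep processing available inputs and retry.
```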

Load Management Demonstration Setup

All nodes process a network monitoring query over real traces of connection summaries. The query counts the connections established by each IP over 60 sec and the number of distinct ports to which each IP connected (per-window logic sketched below):
• Connection information → Group by IP, count over 60 s → Filter > 100 → IPs that establish many connections
• Connection information → Group by IP, count distinct ports over 60 s → Filter > 10 → IPs that connect over many ports
• Group by IP prefix, sum over 60 s → Filter > 100 → clusters of IPs that establish many connections

1) Three nodes (A, B, C) with identical contracts at price p and an uneven initial load distribution reach an acceptable allocation
2) As node A becomes overloaded, it sheds load to B and then to C until the system reaches an acceptable allocation
3) Load increases at node B, causing system overload
4) Node D joins the system. Load flows from node B to C and from C to D until the system reaches an acceptable allocation

[Diagrams: nodes A, B, C, and later D, with contracts at prices p and 0.8p, annotated with the allocation at each step]
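A minimal sketch of one 60-second window of this monitoring query, over an in-memory batch of records. The thresholds (> 100 connections, > 10 distinct ports, > 100 per prefix) follow the poster; the function name, the (src_ip, dst_port) record layout, and the choice of /24-style prefix are illustrative assumptions.

```python
# Minimal sketch of one 60-second window of the demo's monitoring query.

from collections import defaultdict

def heavy_hitters(window):
    """window: list of (src_ip, dst_port) connection summaries."""
    conn_count = defaultdict(int)   # connections per source IP
    ports = defaultdict(set)        # distinct destination ports per source IP
    for src_ip, dst_port in window:
        conn_count[src_ip] += 1
        ports[src_ip].add(dst_port)
    many_conns = {ip for ip, n in conn_count.items() if n > 100}
    many_ports = {ip for ip, ps in ports.items() if len(ps) > 10}
    # Group by IP prefix (here, the first three octets) and sum the counts
    # to find clusters of IPs that establish many connections.
    prefix_sum = defaultdict(int)
    for ip, n in conn_count.items():
        prefix_sum[ip.rsplit('.', 1)[0]] += n
    clusters = {pfx for pfx, n in prefix_sum.items() if n > 100}
    return many_conns, many_ports, clusters
```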

High Availability Demonstration Setup

Identical queries traverse nodes that use different high availability approaches: Passive Standby, Active Standby, Upstream Backup, and Upstream Backup with Duplicate Elimination (sketched below). Each primary has a statically assigned secondary.

1) The four primaries, B0, C0, D0, and E0, run on one laptop
2) All other nodes run on the other laptop
3) We compare the runtime overhead of the approaches
4) We kill all primaries at the same time
5) We compare the recovery time and the effects on tuple delay and duplication

Results:
• Active standby has the highest runtime overhead
• Upstream backup has the highest overhead during recovery
• Passive standby adds the most end-to-end delay

[Diagram: node A and the primary/secondary pairs B0/B1, C0/C1, D0/D1, and E0/E1, one pair per approach. Plots: tuples received, end-to-end delay, and duplicate tuples over time, with the failure marked, for Passive Standby, Active Standby, Upstream Backup, and Upstream Backup without duplicates]
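Because upstream backup replays unacknowledged tuples after a failure, it can re-deliver tuples it already sent; the fourth configuration adds a downstream filter to remove them. A minimal sketch, assuming tuples carry monotonically increasing sequence numbers that survive replay; that tagging scheme is an assumption of this sketch, not necessarily the exact Borealis mechanism.

```python
# Minimal sketch of duplicate elimination after upstream-backup replay:
# drop any tuple whose sequence number was already delivered downstream.

class DuplicateEliminator:
    def __init__(self):
        self.last_seen = -1          # highest sequence number delivered so far

    def process(self, seq, tup):
        """Return the tuple if new, or None if it is a replayed duplicate."""
        if seq <= self.last_seen:
            return None              # duplicate from replay: drop it
        self.last_seen = seq
        return tup
```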

Network Partition Demonstration Setup

1) The initial query distribution crosses computer boundaries [diagram: nodes A, B, C, and replica R spread across Laptop 1 and Laptop 2]
2) We unplug the cable connecting the laptops
3) Node C detects that node B has become unreachable
4) Node C identifies node R as a reachable alternate replica: R's output stream has the same name as B's but a different version
5) Node C connects to node R and continues processing from the same point on the stream
6) Node C changes the version of its output stream
7) When the partition heals, node C remains connected to R and continues processing uninterrupted

Results:
• End-to-end tuple delay increases while C detects the network partition and re-connects to R
• No duplications and no losses after the network partition (checked below from the sequence numbers of received tuples)

[Plots: end-to-end tuple delay, and the sequence numbers of received tuples over time, distinguishing tuples received through B from tuples received through R]
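The "no duplications and no losses" claim can be read directly off the sequence numbers in the plot: the numbers received through R continue exactly where the numbers received through B stopped. A minimal sketch of that check, with an illustrative function name and hand-picked example values:

```python
# Minimal sketch of the demo's correctness check: combining the sequence
# numbers received through B (before the partition) and through R (after),
# every number in the range must appear exactly once.

def no_loss_no_duplication(seqs_via_b, seqs_via_r):
    received = sorted(seqs_via_b + seqs_via_r)
    expected = list(range(received[0], received[-1] + 1))
    return received == expected      # each sequence number exactly once

print(no_loss_no_duplication([0, 1, 2, 3], [4, 5, 6]))  # True
```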