
Scalable Distributed Stream Processing
Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Don Carney, Uğur Çetintemel, Ying Xing, and Stan Zdonik

Proceedings of the 2003 Conference on Innovative Data Systems Research

Outline

Introduction to stream-based applications
Description of Aurora
Architectures of Aurora* and Medusa
Scalable Communications Infrastructure
Load Management
High Availability
Conclusion

Introduction

In stream-based applications, data are pushed to a system that must evaluate queries in response to detected events.

Many stream-based applications are naturally distributed, running over computing devices with heterogeneous capabilities.

Aurora: A Centralized Stream Processor

System Model

Aurora: A Centralized Stream Processor

Query Model

Queries are built from a standard set of well-defined operators (boxes). Each operator accepts input streams (in arrows) and produces one or more output streams (out arrows).

Aurora: A Centralized Stream Processor

Query Model

A subset of the operators defined in Aurora: Filter, Union, WSort, Tumble, Map, XSection, Slide, Join, Resample.

A Sample Operation

Tumble takes an aggregate function and a set of group-by attributes as input. In this example, the aggregate function is the average of B and the group-by attribute is A.

Emitted tuples: (A = 1, Result = 2.5), (A = 2, Result = 3.0)
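To make this concrete, here is a minimal Python sketch of a tumble-style grouped aggregate. The input tuples are assumptions chosen so that the output matches the emitted tuples above; the function names are illustrative, not Aurora's API.

```python
from collections import defaultdict

def tumble(tuples, groupby, aggregate):
    """Group input tuples by `groupby`, apply `aggregate` to each group,
    and emit one result tuple per group (a simplified, non-windowed Tumble)."""
    groups = defaultdict(list)
    for t in tuples:
        groups[t[groupby]].append(t)
    for key, group in groups.items():
        yield {groupby: key, "Result": aggregate(group)}

# Assumed input tuples, chosen so the output matches the slide's example.
stream = [{"A": 1, "B": 2}, {"A": 1, "B": 3}, {"A": 2, "B": 3}]
avg_b = lambda group: sum(t["B"] for t in group) / len(group)

for out in tumble(stream, groupby="A", aggregate=avg_b):
    print(out)   # {'A': 1, 'Result': 2.5} then {'A': 2, 'Result': 3.0}
```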

Aurora: A Centralized Stream Processor

Run-time Operation

The heart of the system is the Scheduler, which determines which box to run next.

The Storage Manager is used when buffer space runs out.

The QoS Monitor drives the Scheduler in its decision making and also tells the Load Shedder when to discard tuples.
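The interaction between these components can be sketched roughly as follows. All class names, method names, and the scheduling and shedding policies are invented for illustration; Aurora's actual scheduler is considerably more sophisticated.

```python
import random
from collections import deque

class Box:
    """A toy operator with an input queue (queue storage is what Aurora's
    Storage Manager handles, spilling to disk when buffers run out)."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()

    def run(self):
        processed = len(self.queue)   # pretend we process every queued tuple
        self.queue.clear()
        return processed

def scheduler_step(boxes, qos_latency, overload_threshold=0.8):
    """One iteration of a QoS-driven loop: the Scheduler picks the box with
    the longest queue, and the Load Shedder drops tuples when the QoS
    Monitor reports overload (policies here are invented for illustration)."""
    box = max(boxes, key=lambda b: len(b.queue))   # Scheduler's choice
    box.run()
    if qos_latency() > overload_threshold:         # QoS Monitor's verdict
        victim = max(boxes, key=lambda b: len(b.queue))
        for _ in range(len(victim.queue) // 2):    # Load Shedder discards tuples
            victim.queue.popleft()

# Toy usage: two boxes with random backlogs and a fake QoS latency signal.
boxes = [Box("filter"), Box("join")]
for b in boxes:
    b.queue.extend(range(random.randint(1, 10)))
scheduler_step(boxes, qos_latency=lambda: random.random())
```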

Aurora: A Centralized Stream Processor

When load shedding alone does not resolve the overload, Aurora tries to reoptimize the network using standard query optimization techniques, e.g., exploiting operator commutativity.

Distributed System Architecture

Collaboration between distinct administrative domains is important: each participant contributes a modest amount of computing, communication, and storage resources, which also helps improve fault tolerance against attacks.

Many services inherently process data from different autonomous domains and compose them.

Distributed System Architecture

Intra-participant distribution: small scale, one administrative domain (Aurora*)

Inter-participant distribution: large scale, more than one administrative domain (Medusa)

Aurora*: Intra-participant Distribution

Aurora* consists of multiple single-node Aurora servers that belong to the same administrative domain.

Boxes can be placed on and executed at arbitrary nodes as deemed appropriate.

Aurora*: Intra-participant Distribution

When the network is first deployed, Aurora* will create a crude partitioning of boxes across the network of available nodes.

Each Aurora node will monitor its local operation, its workload and available resources. If a machine finds itself short of resources, it will consider offloading boxes to another appropriate Aurora node.

Medusa: Inter-participant Federated Operation

Medusa is a distributed infrastructure that provides service delivery among autonomous participants.

Participants range in scale from collections of stream processing nodes capable of running Aurora and providing part of the global service, to PCs or PDAs that allow user access to the system, to networks of sensors and their proxies that provide input streams.

Medusa: Inter-participant Federated Operation

Medusa is an agoric system, using economic principles to regulate participant collaborations and solve the hard problems concerning load management and sharing.

Medusa uses a market mechanism with an underlying currency ("dollars") that backs the contracts between participants.

The receiving participant pays the sender for a stream. The receiver performs query-processing services on the message stream that increases its value. The receiver can sell the resulting stream for a higher price than it paid and make money.
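A toy sketch of this pay-per-message idea, with made-up participants, prices, and a trivial "value-adding" step; none of this reflects Medusa's actual contract implementation.

```python
class Participant:
    """Toy model of a Medusa participant: it buys a stream, adds value by
    processing it, and resells the result (prices are purely illustrative)."""

    def __init__(self, name, balance=0.0):
        self.name = name
        self.balance = balance

    def buy(self, seller, tuples, price_per_tuple):
        cost = len(tuples) * price_per_tuple
        self.balance -= cost          # the receiver pays the sender
        seller.balance += cost
        return tuples

source = Participant("sensor-proxy")
processor = Participant("stream-processor", balance=10.0)
subscriber = Participant("subscriber", balance=10.0)

raw = processor.buy(source, tuples=[3, 1, 4, 1, 5], price_per_tuple=0.01)
enriched = [t * 10 for t in raw]      # query processing increases the stream's value
subscriber.buy(processor, enriched, price_per_tuple=0.02)   # resold at a higher price

print(round(processor.balance, 2))    # 10.05: the processor made money
```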

Medusa: Inter-participant Federated Operation

Some Medusa participants are purely stream sources, and are paid for their data, while other participants are strictly stream sinks, and must pay for these streams. Others act both as sources and sinks. They are assumed to operate as profit-making entities.

Scalable Communications Infrastructure

The infrastructure must:

Include a naming scheme for participants and query operators, and a method for discovering where any portion of a query plan is currently running and what operators are currently in place.

Route messages between participants and nodes.

Multiplex messages onto transport-layer streams between participants and nodes.

Enable stream processing to be distributed and moved across nodes.

Naming and Discovery

There is a single global namespace, and each participant has a unique global name. A new operator, schema, or stream is named within that namespace as (participant, entity-name).

The Catalog (either centralized or distributed) holds information about each entity.

Within a participant, the catalog contains definitions of operators, schemas, streams, queries, and contracts.
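A possible way to picture the (participant, entity-name) namespace and catalog lookup; the entry fields and values below are assumptions, not the real catalog schema.

```python
# Illustrative catalog keyed by (participant, entity-name); the entry fields
# and values are assumptions, not Medusa's actual catalog schema.
catalog = {
    ("medusa.example.org", "filter-alerts"): {
        "kind": "operator",
        "node": "node-3.medusa.example.org",   # where this operator currently runs
    },
    ("sensors.example.org", "temperature"): {
        "kind": "stream",
        "schema": ("timestamp", "sensor_id", "celsius"),
    },
}

def discover(participant, entity_name):
    """Look up an entity in the global namespace; raises KeyError if unknown."""
    return catalog[(participant, entity_name)]

print(discover("medusa.example.org", "filter-alerts")["node"])
```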

Routing

When a data source produces events, it labels them with a stream name and sends them to one of the nodes in the overlay network.

Upon receiving these events, the node consults the intra-participant catalog and forwards events to the appropriate locations.

Message Transport

All message streams are multiplexed onto a single TCP connection, and a message scheduler determines which message stream gets to use the connection at any given time.
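A simplified sketch of such multiplexing, with a round-robin message scheduler and a made-up framing format; the real transport policy is an assumption here.

```python
from collections import deque
from itertools import cycle

class StreamMultiplexer:
    """Multiplex several logical message streams onto one connection, with a
    round-robin scheduler deciding which stream sends next (the framing and
    the scheduling policy here are illustrative, not the actual wire format)."""

    def __init__(self, connection):
        self.connection = connection      # e.g. a single TCP socket
        self.queues = {}                  # stream name -> pending messages

    def send(self, stream_name, message):
        self.queues.setdefault(stream_name, deque()).append(message)

    def pump(self):
        # Round-robin over the streams until nothing is left to send.
        for name in cycle(list(self.queues)):
            if self.queues[name]:
                frame = f"{name}|{self.queues[name].popleft()}".encode()
                self.connection.sendall(frame)
            if not any(self.queues.values()):
                break

class FakeConnection:
    """Stand-in for a TCP socket so the example runs without networking."""
    def sendall(self, data):
        print(data)

mux = StreamMultiplexer(FakeConnection())
mux.send("alerts", "intrusion on link 7")
mux.send("temperature", "21.5")
mux.pump()
```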

Remote Definition

A participant instantiates and composes operators from a pre-defined set offered by another participant to mimic box sliding.

Load Management

On every node that runs a piece of the Aurora network, a query optimizer / load-sharing daemon runs periodically in the background.

Load Management

Box sliding: slide a box upstream when its selectivity is low; slide it downstream when its selectivity is greater than 1.
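A small sketch of this sliding heuristic; the function and its thresholds are illustrative only.

```python
def slide_direction(selectivity):
    """Heuristic from the slide: boxes that shrink the stream are worth sliding
    upstream, boxes that expand it downstream (thresholds are illustrative)."""
    if selectivity < 1.0:
        return "upstream"     # filter early and send fewer tuples over the link
    if selectivity > 1.0:
        return "downstream"   # expand the stream after the link, not before it
    return "stay"             # selectivity == 1: sliding does not change traffic

print(slide_direction(0.2))   # upstream
print(slide_direction(3.0))   # downstream
```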

Load Management

Box splitting

Load Management

Box Splitting example: Tumble(cnt, group by A)

Output of the unsplit box: (A = 1, result = 2), (A = 2, result = 3)

Splitting the box on the filter predicate B < 3:
Copy #1 (B < 3) emits (A = 1, result = 2), (A = 2, result = 2)
Copy #2 (B >= 3) emits (A = 2, result = 1)
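The same split can be reproduced in a few lines of Python; the input tuples are assumptions chosen to match the counts above.

```python
from collections import Counter

# Assumed input tuples, chosen to reproduce the counts in the example above.
stream = [
    {"A": 1, "B": 1}, {"A": 1, "B": 2},                      # group A=1: count 2
    {"A": 2, "B": 1}, {"A": 2, "B": 2}, {"A": 2, "B": 5},    # group A=2: count 3
]

def tumble_count(tuples):
    """Tumble(cnt, group by A): count the tuples in each group of A."""
    return Counter(t["A"] for t in tuples)

print(tumble_count(stream))                               # Counter({2: 3, 1: 2})

# Split the box using the filter predicate B < 3 and its complement.
copy1 = tumble_count([t for t in stream if t["B"] < 3])   # Counter({1: 2, 2: 2})
copy2 = tumble_count([t for t in stream if t["B"] >= 3])  # Counter({2: 1})

# The partial counts from the two copies combine to the unsplit result.
print(copy1 + copy2)                                      # Counter({2: 3, 1: 2})
```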


Load Management

Key repartitioning challenges:
Initiation of load sharing
Choosing what to offload
Choosing filter predicates for box splitting
Choosing what to split
Handling connection points

Ensuring Availability

Availability may suffer from:
Server and communication failures
Sustained congestion levels
Software failures

Each server can effectively act as a back-up for its downstream servers.

This enables a tradeoff between the recovery time and the volume of checkpoint messages required to provide safety.

Ensuring Availability

k-safe: a system is k-safe if the failure of any k servers does not result in any message losses.

An upstream backup server simply holds on to a tuple it has processed until its primary server tells it to discard the tuple.

Ensuring Availability

Flow messages: each data source creates and sends flow messages into the system. A box processes a flow message by first recording the sequence number of the earliest tuple that it currently depends on, and then passing the flow message forward.

When the flow message reaches a server boundary, these sequence values are recorded as checkpoints and are sent through a back channel to the upstream servers.

Ensuring Availability

Install an array of sequence numbers on each server, one for each upstream server. On each box's activation, the box records in this array the earliest tuples on which it depends.

The upstream servers can then query this array periodically and truncate their queues accordingly.
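A simplified sketch of this queue-truncation step, assuming tuples carry monotonically increasing sequence numbers; the class and its methods are invented for illustration, not the actual protocol.

```python
from collections import deque

class UpstreamBackup:
    """An upstream server keeps the tuples it has already sent until the
    downstream checkpoint says they are no longer needed (simplified sketch)."""

    def __init__(self):
        self.backup = deque()             # (sequence number, tuple), oldest first

    def send(self, seq, tup):
        self.backup.append((seq, tup))    # hold on to the tuple after sending
        return tup

    def truncate(self, checkpoint_seq):
        # The downstream server still depends on tuples >= checkpoint_seq,
        # so anything older can safely be discarded.
        while self.backup and self.backup[0][0] < checkpoint_seq:
            self.backup.popleft()

server = UpstreamBackup()
for seq in range(5):
    server.send(seq, {"seq": seq})
server.truncate(checkpoint_seq=3)         # downstream checkpointed at sequence 3
print([seq for seq, _ in server.backup])  # [3, 4] remain for possible replay
```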

Conclusion

Aurora* is a distributed version of Aurora, which assumes a common administrative domain.

Medusa is an infrastructure supporting federated operation of Aurora nodes across administrative boundaries.

Scalable communications infrastructure, adaptive load management, and high availability are discussed for both domains.