

Jörg Liebeherr, 2001

HyperCast -

Support of Many-To-Many Multicast Communications

Jörg Liebeherr, University of Virginia


Acknowledgements

• Collaborators:

– Bhupinder Sethi
– Tyler Beam
– Burton Filstrup
– Mike Nahas

– Dongwen Wang

– Konrad Lorincz

– Neelima Putreeva

• This work is supported in part by the National Science Foundation


Many-to-Many Multicast Applications

[Figure: application domains plotted by Group Size (x-axis: 10 to 1,000,000) and Number of Senders (y-axis: 1 to 1,000,000): Streaming, Content Distribution, Collaboration Tools, Games, Distributed Systems, Peer-to-Peer Applications]


Need for Multicasting?

• Maintaining a separate unicast connection to every other group member is not feasible

• The infrastructure or a service needs to support a “send to group” primitive


Problem with Multicasting

• Feedback Implosion: A node is overwhelmed with traffic or state

– One-to-many multicast with feedback (e.g., reliable multicast)
– Many-to-one multicast (Incast)

[Figure: receivers flood the sender with NAK messages]


Multicast support in the network infrastructure (IP Multicast)

• Reality Check (after 10 years of IP Multicast):

– Deployment has encountered severe scalability limitations in both the size and number of groups that can be supported

– IP Multicast is still plagued with concerns pertaining to scalability, network management, deployment and support for error, flow and congestion control


Host-based Multicasting

• Logical overlay resides on top of the Layer-3 network

• Data is transmitted between neighbors in the overlay

• No IP Multicast support needed

• Overlay topology should match the Layer-3 infrastructure


Host-based multicast approaches (all after 1998)

• Build an overlay mesh network and embed trees into the mesh:
– Narada (CMU)
– RMX/Gossamer (UCB)

• Build a shared tree:
– Yallcast/Yoid (NTT)
– Banana Tree Protocol (UMich)
– AMRoute (Telcordia, UMD – College Park)
– Overcast (MIT)

• Other: Gnutella


Our Approach

• Build the virtual overlay as a graph with known properties:
– N-dimensional (incomplete) hypercube
– Delaunay triangulation

• Advantages:
– Routing in the overlay is implicit
– Achieve good load-balancing
– Exploit symmetry

• Claim: Can improve scalability of multicast applications by orders of magnitude over existing solutions


Multicasting with Overlay Network

Approach:
• Organize group members into a virtual overlay network.

• Transmit data using trees embedded in the virtual network.


Introducing the Hypercube

• An n-dimensional hypercube has N = 2^n nodes, where
– Each node is labeled k_n k_{n-1} … k_1 (k_i = 0 or 1)
– Two nodes are connected by an edge if and only if their labels differ in exactly one position.

[Figure: hypercubes for n = 1 (nodes 0, 1), n = 2 (nodes 00, 01, 10, 11), n = 3 (nodes 000 … 111), and n = 4]
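As a concrete illustration (not part of the original slides), a node's hypercube neighbors can be generated by flipping one bit of its label at a time; a minimal Python sketch:

```python
def neighbors(label):
    """All labels differing from `label` in exactly one bit:
    the node's neighbors in the full hypercube."""
    return [label[:i] + ("1" if label[i] == "0" else "0") + label[i + 1:]
            for i in range(len(label))]

print(neighbors("010"))  # ['110', '000', '011']
```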


We Use a Gray Code for Ordering Nodes

• A Gray code, denoted by G(.), is defined by

– G(i) and G(i+1) differ in exactly one bit.

– G(2^n − 1) and G(0) differ in only one bit.

i               0   1   2   3   4   5   6   7
“normal”        000 001 010 011 100 101 110 111
Gray ordering   000 001 011 010 110 111 101 100
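The standard reflected binary Gray code, G(i) = i XOR (i >> 1), reproduces the ordering in the table above; a short sketch that also verifies the one-bit-difference property:

```python
def gray(i):
    # Reflected binary Gray code: G(i) = i XOR (i >> 1)
    return i ^ (i >> 1)

n = 3
codes = [format(gray(i), "0{}b".format(n)) for i in range(2 ** n)]
print(codes)  # ['000', '001', '011', '010', '110', '111', '101', '100']

# Cyclically adjacent codes differ in exactly one bit:
for i in range(2 ** n):
    diff = gray(i) ^ gray((i + 1) % (2 ** n))
    assert diff != 0 and diff & (diff - 1) == 0
```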


Nodes are added in a Gray order

[Figure: nodes join in the Gray order 000, 001, 011, 010, 110, 111, 101, 100]


Tree Embedding Algorithm

Input: G(i) := I = I_n … I_2 I_1, G(r) := R = R_n … R_2 R_1
Output: Parent of node I in the embedded tree rooted at R.

Procedure Parent(I, R)
If (G^-1(I) < G^-1(R))
Parent := I_n I_{n-1} … I_{k+1} (1 − I_k) I_{k-1} … I_2 I_1
with k = min { i : I_i ≠ R_i }
Else
Parent := I_n I_{n-1} … I_{k+1} (1 − I_k) I_{k-1} … I_2 I_1
with k = max { i : I_i ≠ R_i }
Endif
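A sketch of the Parent procedure in Python, with labels as integers and `gray_inv` as the inverse Gray mapping G^-1; it reproduces the embeddings shown on the following slides:

```python
def gray(i):
    return i ^ (i >> 1)

def gray_inv(g):
    """Inverse Gray mapping: the index i with gray(i) == g."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

def parent(I, R):
    """Parent of node I in the tree rooted at R (labels as integers)."""
    if I == R:
        return None  # the root has no parent
    diff = I ^ R
    if gray_inv(I) < gray_inv(R):
        k = diff & -diff                  # lowest-order differing bit
    else:
        k = 1 << (diff.bit_length() - 1)  # highest-order differing bit
    return I ^ k                          # flip that bit

# Embedding rooted at 000:
assert parent(0b010, 0b000) == 0b000
assert parent(0b111, 0b000) == 0b011
# Embedding rooted at 111:
assert parent(0b000, 0b111) == 0b001
```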


Tree Embedding

• Node 000 is root:

[Figure: the tree rooted at 000 embedded in the hypercube; level by level: 000; 001, 010; 110, 101, 011; 111]


Another Tree Embedding

• Node 111 is root:

[Figure: the tree rooted at 111 embedded in the hypercube; level by level: 111; 110, 101, 011; 010, 001; 000]


Performance Comparison (Part 1)

• Compare tree embeddings of:

– Shared Tree

– Hypercube


Comparison of Hypercubes with Shared Trees (Infocom ’98)

• T_l is a control tree with root l
• w_k(T_l): the number of children of node k in tree T_l
• v_k(T_l): the number of descendants in the sub-tree below node k in tree T_l
• p_k(T_l): the path length from node k in tree T_l to the root of the tree

Averaging over the N trees and over all nodes:

w_k := (1/N) Σ_l w_k(T_l),   w_avg := (1/N) Σ_k w_k,   w_max := max_k w_k

v_k := (1/N) Σ_l v_k(T_l),   v_avg := (1/N) Σ_k v_k,   v_max := max_k v_k

p_avg := (1/N) Σ_k p_k,   p_max := max_k p_k
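The per-tree quantities behind these averages can be computed directly; a sketch, with the tree represented as a child-to-parent map (a representation chosen here for illustration, not taken from the slides):

```python
def tree_metrics(parent_of):
    """w_k, v_k, p_k for a single tree T_l, given as a
    child -> parent map with the root mapped to None."""
    nodes = list(parent_of)
    w = dict.fromkeys(nodes, 0)   # number of children
    v = dict.fromkeys(nodes, 0)   # number of descendants
    p = {}                        # path length to the root
    for k in nodes:
        if parent_of[k] is not None:
            w[parent_of[k]] += 1
        a, depth = parent_of[k], 0
        while a is not None:      # every ancestor gains one descendant
            v[a] += 1
            depth += 1
            a = parent_of[a]
        p[k] = depth
    return w, v, p

# The embedding rooted at 000 from the earlier slide:
tree = {"000": None, "001": "000", "010": "000",
        "011": "001", "101": "001", "110": "010", "111": "011"}
w, v, p = tree_metrics(tree)
print(max(w.values()), max(v.values()), max(p.values()))  # 2 6 3
```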


Average number of descendants in a tree


HyperCast Protocol

• Goal: Organize the members of a multicast group into a logical hypercube.

• Design criteria for scalability:
– Soft-state (state is not permanent)
– Decentralized (every node is aware only of its neighbors in the cube)
– Must handle dynamic group membership efficiently


HyperCast Protocol

• The HyperCast protocol maintains a stable hypercube:

• Consistent: No two nodes have the same logical address

• Compact: The dimension of the hypercube is as small as possible

• Connected: Each node knows the physical address of each of its neighbors in the hypercube
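A toy check of the three properties, under the assumption (mine, for illustration only) that logical addresses are encoded as Gray-code indices 0..N−1 and that a global view is available; the real protocol is decentralized and each node checks only its own neighborhood:

```python
def is_stable(addresses, known_physical):
    """Toy global check of the three stability properties; assumes
    logical addresses are Gray-code indices 0..N-1 (an illustration
    choice, not the protocol's wire format)."""
    consistent = len(set(addresses)) == len(addresses)        # no duplicate logical address
    compact = set(addresses) == set(range(len(addresses)))    # smallest possible cube
    connected = all(a in known_physical for a in addresses)   # physical addresses known
    return consistent and compact and connected

assert is_stable([0, 1, 2, 3], {0: "h0", 1: "h1", 2: "h2", 3: "h3"})
assert not is_stable([0, 1, 3], {0: "h0", 1: "h1", 3: "h3"})  # hole at 2: not compact
```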


HyperCast Protocol

1. A new node joins

2. A node fails


The node with the highest logical address (HRoot) sends out beacons. The beacon message is received by all nodes. Other nodes use the beacon to determine if they are missing neighbors.

[Figure: hypercube with nodes 000, 001, 010, 011, 110; the HRoot 110 sends a beacon]

Each node sends ping messages to its neighbors periodically.

[Figure: nodes exchange periodic pings along hypercube edges]


A node that wants to join the hypercube will send a beacon message announcing its presence.

[Figure: a new node beacons to the hypercube 000, 001, 010, 011, 110]

The HRoot receives the beacon and responds with a ping, containing the new logical address.

[Figure: HRoot 110 sends “ping (as 111)” to the new node]


The joining node takes on the logical address given to it and adds 110 as its neighbor.

[Figure: the new node becomes 111, with 110 as its neighbor]

The new node responds with a ping.

[Figure: 111 pings 110]


The new node is now the new HRoot, and so it will beacon.

[Figure: 111 beacons to all nodes]

Upon receiving the beacon, some nodes realize that they have a new neighbor and ping in response.

[Figure: a neighboring node pings the new node 111]


The new node responds to a ping from each new neighbor.

[Figure: 111 exchanges pings with its new neighbors]

The join operation is now complete, and the hypercube is once again in a stable state.

[Figure: stable hypercube with nodes 000, 001, 010, 011, 110, 111]
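The end state of the join can be sketched as follows: in a compact cube of N members, the joining node receives logical address G(N), and its neighbors are the occupied labels one bit away. This is a simplified, centralized sketch of what the beacon/ping exchange achieves, not the protocol itself:

```python
def gray(i):
    return i ^ (i >> 1)

def join(cube):
    """In a compact cube whose members occupy Gray positions
    0..N-1, the joining node gets logical address G(N); its
    neighbors are the occupied labels one bit away."""
    new_label = gray(len(cube))
    bits = max(1, new_label.bit_length())
    nbrs = {new_label ^ (1 << b) for b in range(bits)} & cube
    return new_label, nbrs

cube = {0b000, 0b001, 0b010, 0b011, 0b110}   # the five-node cube of the slides
label, nbrs = join(cube)
print(format(label, "03b"), sorted(format(x, "03b") for x in nbrs))
# 111 ['011', '110']
```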


HyperCast Protocol

1. A new node joins

2. A node fails

A node may fail unexpectedly.

[Figure: hypercube with nodes 000, 001, 010, 011, 110, 111; node 001 fails]


Holes in the hypercube fabric are discovered via lack of response to pings.

[Figure: pings to the failed node 001 go unanswered]

The HRoot and nodes which are missing neighbors send beacons.

[Figure: 111 and the nodes missing a neighbor send beacons]


Nodes which are missing neighbors can move the HRoot to a new (lower) logical address with a ping message.

[Figure: a node sends “ping (as 001)” to the HRoot 111]

The HRoot sends leave messages as it leaves the 111 logical address.

[Figure: 111 sends leave messages to its neighbors]


The HRoot takes on the new logical address and pings back.

[Figure: the former HRoot re-joins as 001 and pings]

The “old” 111 is now node 001, and takes its place in the cube.

[Figure: hypercube with nodes 000, 001, 010, 011, 110]


[Figure: intermediate views of the cube as 001 settles into place]

The new HRoot and the nodes which are missing neighbors will beacon.

[Figure: beacons in the repaired cube]


Upon receiving the beacons, the neighboring nodes ping each other to finish the repair operation.

[Figure: neighboring nodes exchange pings]

The repair operation is now complete, and the hypercube is once again stable.

[Figure: stable hypercube with nodes 000, 001, 010, 011, 110]


State Diagram of a Node

[Figure: state diagram of a node. States: Start, Joining Wait, Joining, Incomplete, Stable, HRoot/Stable, HRoot/Incomplete, HRoot/Repair, Repair, Leaving, NIL. Transitions include: an outside node that wants to join (from Any State); a node becomes HRoot / a new HRoot appears; timeout for finding any neighbor; timeout for finding an HRoot; has ancestor / has no ancestor; neighborhood becomes complete / incomplete; beacon from a Joining node received; timeout for beacons from Joining nodes; timeout while attempting to contact a neighbor; Rejoin; Depart; a node leaves via the Leaving state.]


Experiments

• Experimental Platform: Centurion cluster at UVA (cluster of Linux PCs)
– 2 to 1024 hypercube nodes (on 32 machines)
– current record: 10,080 nodes on 106 machines

• Experiment:
– Add N new nodes to a hypercube with M nodes

• Performance measures:
– Time until hypercube reaches a stable state
– Rate of unicast transmissions
– Rate of multicast transmissions


Experiment 1: Time for Join Operation vs. Size of Hypercube and Number of Joining Nodes

[Figure: time (in heartbeats) as a function of log2(M, # of nodes in hypercube) and log2(N, # of nodes joining)]


Experiment 1: Unicast Traffic for Join Operation vs. Size of Hypercube and Number of Joining Nodes

[Figure: unicast byte rate (bytes/heartbeat) as a function of log2(M, # of nodes in hypercube) and log2(N, # of nodes joining)]


Experiment 1: Multicast Traffic for Join Operation vs. Size of Hypercube and Number of Joining Nodes

[Figure: multicast byte rate (bytes/heartbeat) as a function of log2(M, # of nodes in hypercube) and log2(N, # of nodes joining)]


Work in progress: Triangulations

[Figure: overlay nodes 000–111 connected across backbone networks]

• Hypercube topology does not consider geographical proximity

• A triangulation can ensure that logical neighbors are “close by”


Voronoi Regions

The Voronoi region of a node is the region of the plane that is closer to this node than to any other node.

[Figure: Voronoi regions of nodes (12,0), (10,8), (5,2), (4,9), (0,6)]


Delaunay Triangulation

The Delaunay triangulation has edges between nodes in neighboring Voronoi regions.

[Figure: Delaunay triangulation of nodes (12,0), (10,8), (5,2), (4,9), (0,6)]
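For small node sets, the triangulation can be computed brute-force from the empty-circumcircle property, which is equivalent to the Voronoi-neighbor definition above; a sketch using the slide's five coordinates:

```python
from itertools import combinations

def ccw(a, b, c):
    """Positive if a, b, c make a counter-clockwise turn."""
    return (b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0])

def in_circumcircle(a, b, c, d):
    """True if d lies strictly inside the circumcircle of the
    counter-clockwise triangle a, b, c (in-circle determinant)."""
    ax, ay = a[0]-d[0], a[1]-d[1]
    bx, by = b[0]-d[0], b[1]-d[1]
    cx, cy = c[0]-d[0], c[1]-d[1]
    return ((ax*ax + ay*ay) * (bx*cy - cx*by)
            - (bx*bx + by*by) * (ax*cy - cx*ay)
            + (cx*cx + cy*cy) * (ax*by - bx*ay)) > 0

def delaunay_edges(pts):
    """Brute force: a triangle is Delaunay iff its circumcircle
    contains no other point; collect the edges of all such triangles."""
    edges = set()
    for a, b, c in combinations(pts, 3):
        if ccw(a, b, c) < 0:
            a, b = b, a          # orient counter-clockwise
        if ccw(a, b, c) <= 0:
            continue             # collinear triple
        if all(not in_circumcircle(a, b, c, d)
               for d in pts if d not in (a, b, c)):
            edges |= {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))}
    return edges

pts = [(12, 0), (10, 8), (5, 2), (4, 9), (0, 6)]
edges = delaunay_edges(pts)
print(len(edges))  # 7
```

This O(n^4) check is only for illustration; production code would use an incremental or divide-and-conquer algorithm (or a library such as SciPy's `Delaunay`).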


A Leader is a node with a Y-coordinate higher than any of its neighbors. The Leader periodically broadcasts a beacon.

[Figure: the Leader beacons to the triangulation]

Each node sends ping messages to its neighbors periodically.

[Figure: nodes exchange periodic pings along triangulation edges]


A node that wants to join the triangulation will send a beacon message announcing its presence.

[Figure: new node (8,4) beacons]

The new node is located in one node’s Voronoi region. This node (5,2) updates its Voronoi region, and the triangulation.

[Figure: (5,2) updates its Voronoi region to account for (8,4)]


(5,2) sends a ping which contains info for contacting its clockwise and counterclockwise neighbors.

[Figure: (5,2) pings the new node (8,4)]

(8,4) contacts these neighbors ...

[Figure: (8,4) pings (4,9) and (12,0)]


… which update their respective Voronoi regions.

[Figure: (4,9) and (12,0) update their Voronoi regions]

(4,9) and (12,0) send pings and provide info for contacting their respective clockwise and counterclockwise neighbors.

[Figure: (4,9) and (12,0) ping (8,4)]


(8,4) contacts the new neighbor (10,8) ...

[Figure: (8,4) pings (10,8)]

… which updates its Voronoi region ...

[Figure: (10,8) updates its Voronoi region]


… and responds with a ping.

[Figure: (10,8) pings (8,4)]

This completes the update of the Voronoi regions and the Delaunay triangulation.

[Figure: final triangulation including (8,4)]


Problem with Delaunay Triangulations

• Delaunay triangulation considers location of nodes, but not the network topology

• 2 heuristics to achieve a better mapping


Hierarchical Delaunay Triangulation

• 2-level hierarchy of Delaunay triangulations

• The node with the lowest x-coordinate in a domain DT is a member of 2 triangulations


Multipoint Delaunay Triangulation

• Different (“implicit”) hierarchical organization

• “Virtual nodes” are positioned to form a “bounding box” around a cluster of nodes. All traffic to nodes in a cluster goes through one of the virtual nodes


Work in Progress: Evaluation of Overlays

• Simulation:
– Network with 1024 routers (“Transit-Stub” topology)
– 2 to 512 hosts

• Performance measures for trees embedded in an overlay network:

– Degree of a node in an embedded tree

– “Relative Delay Penalty”: Ratio of delay in overlay to shortest path delay

– “Stress”: Number of overlay links over a physical link
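The last two measures can be sketched directly, assuming a hypothetical routing oracle `routed_over` that maps each overlay link to the physical links its packets traverse (names and data here are illustrative, not from the talk):

```python
def relative_delay_penalty(overlay_delay, unicast_delay):
    """RDP per destination: overlay path delay / shortest-path delay."""
    return {dst: overlay_delay[dst] / unicast_delay[dst]
            for dst in overlay_delay}

def stress(overlay_links, routed_over):
    """Stress per physical link: how many overlay links are routed
    over it. `routed_over` maps an overlay link to the physical
    links its packets traverse (a hypothetical routing oracle)."""
    count = {}
    for link in overlay_links:
        for phys in routed_over[link]:
            count[phys] = count.get(phys, 0) + 1
    return count

# Two overlay links sharing the access link "A-R1":
s = stress([("A", "B"), ("A", "C")],
           {("A", "B"): ["A-R1", "R1-B"], ("A", "C"): ["A-R1", "R1-C"]})
print(s["A-R1"])  # 2

print(relative_delay_penalty({"B": 30.0}, {"B": 20.0}))  # {'B': 1.5}
```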


Transit-Stub Network

• GA Tech graph generator

• 4 transit domains

• 4×16 stub domains

• 1024 total routers

• 128 hosts on stub domain


Overlay Topologies

• Delaunay Triangulation and variants
– Hierarchical DT
– Multipoint DT
• Degree-6 Graph – similar to graphs generated in Narada
• Degree-3 Tree – similar to graphs generated in Yoid
• Logical MST – Minimum Spanning Tree
• Hypercube


Maximum Relative Delay Penalty


Maximum Node Degree


Maximum Average Node Degree (over all trees)


Max Stress (Single node sending)


Maximum Average Stress (all nodes sending)


Time To Stabilize


90th Percentile Relative Delay Penalty


Work in progress: Provide an API

• Goal: Provide a socket-like interface to applications

[Figure: OLSocket architecture. The application sends and receives through the OLSocket interface; the OLSocket contains a Group Manager, Forwarding Engine, ARQ Engine, Stats, and a Receive Buffer, and attaches to the overlay protocol through the OL_Node interface. Transmission modes: unconfirmed datagram, confirmed datagram, TCP.]


Summary

• Overlay network for many-to-many multicast applications using Hypercubes and Delaunay Triangulations

• Performance evaluation of trade-offs via analysis, simulation, and experimentation

• “Socket-like” API is close to completion
• Proof-of-concept applications:
– MPEG-1 streaming
– file transfer
– interactive games

• Scalability to groups with > 100,000 members appears feasible