Jörg Liebeherr, 2001
HyperCast -
Support of Many-To-Many Multicast Communications
Jörg Liebeherr, University of Virginia
Acknowledgements
• Collaborators:
– Bhupinder Sethi
– Tyler Beam
– Burton Filstrup
– Mike Nahas
– Dongwen Wang
– Konrad Lorincz
– Neelima Putreeva
• This work is supported in part by the National Science Foundation
Many-to-Many Multicast Applications
[Chart: applications plotted by Group Size (1 to 1,000,000) vs. Number of Senders (1 to 1,000,000): Streaming, Content Distribution, Collaboration Tools, Games, Distributed Systems, Peer-to-Peer Applications]
Need for Multicasting?
• Maintaining unicast connections is not feasible
• Infrastructure or services need to support a “send to group” primitive
Problem with Multicasting
• Feedback Implosion: A node is overwhelmed with traffic or state
– One-to-many multicast with feedback (e.g., reliable multicast)
– Many-to-one multicast (Incast)
[Figure: a sender flooded by NAK messages from many receivers]
Multicast support in the network infrastructure (IP Multicast)
• Reality Check (after 10 years of IP Multicast):
– Deployment has encountered severe scalability limitations in both the size and number of groups that can be supported
– IP Multicast is still plagued with concerns pertaining to scalability, network management, deployment and support for error, flow and congestion control
Host-based Multicasting
• Logical overlay resides on top of the Layer-3 network
• Data is transmitted between neighbors in the overlay
• No IP Multicast support needed
• Overlay topology should match the Layer-3 infrastructure
Host-based multicast approaches (all after 1998)
• Build an overlay mesh network and embed trees into the mesh:
– Narada (CMU)
– RMX/Gossamer (UCB)
• Build a shared tree:
– Yallcast/Yoid (NTT)
– Banana Tree Protocol (UMich)
– AMRoute (Telcordia, UMD – College Park)
– Overcast (MIT)
• Other: Gnutella
Our Approach
• The virtual overlay is built as a graph with known properties:
– N-dimensional (incomplete) hypercube
– Delaunay triangulation
• Advantages:
– Routing in the overlay is implicit
– Good load-balancing
– Exploits symmetry
• Claim: Can improve scalability of multicast applications by orders of magnitude over existing solutions
Multicasting with Overlay Network
Approach:
• Organize group members into a virtual overlay network.
• Transmit data using trees embedded in the virtual network.
Introducing the Hypercube
• An n-dimensional hypercube has N = 2^n nodes, where
– each node is labeled k_n k_{n-1} … k_1 (k_i = 0, 1)
– two nodes are connected by an edge if and only if their labels differ in exactly one position.
[Figure: hypercubes of dimension n = 1 (nodes 0, 1), n = 2 (00, 01, 10, 11), n = 3 (000 … 111), and n = 4]
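The adjacency rule can be checked mechanically: flipping each of the n bits of a label yields the n neighbors. A short illustrative sketch (Python; the function name is ours, not part of HyperCast):

```python
def neighbors(label: int, n: int) -> list[int]:
    """Neighbors of a node in an n-dimensional hypercube.

    Two nodes are adjacent iff their labels differ in exactly one bit,
    so flipping each of the n bits yields the n neighbors.
    """
    return [label ^ (1 << i) for i in range(n)]

# Node 000 in the 3-cube is adjacent to 001, 010, and 100.
print([format(v, "03b") for v in neighbors(0b000, 3)])
```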
We Use a Gray Code for Ordering Nodes
• A Gray code, denoted by G(.), is defined by
– G(i) and G(i+1) differ in exactly one bit.
– G(2^d − 1) and G(0) differ in only one bit.

i:             0    1    2    3    4    5    6    7
“normal”       000  001  010  011  100  101  110  111
Gray ordering  000  001  011  010  110  111  101  100
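The reflected binary Gray code in the table can be computed with a single XOR, and inverted by cascading XORs. An illustrative sketch (Python; function names are ours):

```python
def gray(i: int) -> int:
    # Reflected binary Gray code: G(i) and G(i+1) differ in exactly one bit.
    return i ^ (i >> 1)

def gray_inverse(g: int) -> int:
    # Recover i from G(i) by XOR-ing in all right-shifted copies of g.
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

# Reproduces the Gray-ordering row of the table above for d = 3.
print([format(gray(i), "03b") for i in range(8)])
```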
Nodes are added in a Gray order
[Figure: the 3-cube grown by adding nodes in Gray order: 000, 001, 011, 010, 110, 111, 101, 100]
Tree Embedding Algorithm
Input: G(i) := I = I_n … I_2 I_1,  G(r) := R = R_n … R_2 R_1
Output: Parent of node I in the embedded tree rooted at R.

Procedure Parent(I, R)
  If (G^-1(I) < G^-1(R))
    Parent := I_n I_{n-1} … I_{k+1} (1 − I_k) I_{k-1} … I_2 I_1
    with k = min { i : I_i ≠ R_i }
  Else
    Parent := I_n I_{n-1} … I_{k+1} (1 − I_k) I_{k-1} … I_2 I_1
    with k = max { i : I_i ≠ R_i }
  Endif
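The Parent procedure translates almost line-for-line into code. A hedged sketch (Python; `gray_inverse` implements G^-1, and standard bit tricks pick the lowest/highest differing bit position):

```python
def gray_inverse(g: int) -> int:
    # G^-1: recover the Gray rank i from the codeword G(i).
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

def parent(I: int, R: int) -> int:
    """Parent of node I in the tree embedded in the hypercube, rooted at R.

    Flip the lowest differing bit of I and R if G^-1(I) < G^-1(R),
    otherwise the highest differing bit (the Parent procedure above).
    """
    if I == R:
        raise ValueError("the root has no parent")
    diff = I ^ R
    if gray_inverse(I) < gray_inverse(R):
        k = (diff & -diff).bit_length() - 1  # lowest differing bit position
    else:
        k = diff.bit_length() - 1            # highest differing bit position
    return I ^ (1 << k)

# In the tree rooted at 000, node 111 hangs below 011.
print(format(parent(0b111, 0b000), "03b"))
```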
Tree Embedding
• Node 000 is root:
[Figure: the 3-cube and the embedded tree rooted at 000 — levels: 000; 001, 010; 110, 101, 011; 111]
Another Tree Embedding
• Node 111 is root:
[Figure: the 3-cube and the embedded tree rooted at 111 — levels: 111; 110, 101, 011; 010, 001; 000]
Performance Comparison (Part 1)
• Compare tree embeddings of:
– Shared Tree
– Hypercube
Comparison of Hypercubes with Shared Trees(Infocom 98)
• T_l is a control tree with root l
• w_k(T_l): the number of children of node k in tree T_l
• v_k(T_l): the number of descendants in the subtree below node k in tree T_l
• p_k(T_l): the path length from node k in tree T_l to the root of the tree
w_k := (1/N) · Σ_{l=1..N} w_k(T_l)    w_avg := (1/N) · Σ_{k=1..N} w_k    w_max := max_k w_k

v_k := (1/N) · Σ_{l=1..N} v_k(T_l)    v_avg := (1/N) · Σ_{k=1..N} v_k    v_max := max_k v_k

p_avg := (1/N) · Σ_{k=1..N} p_k    p_max := max_k p_k
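These per-node averages and maxima can be computed directly from a set of embedded trees. An illustrative sketch (Python; `tree_stats` and the parent-map encoding of trees are our assumptions, not from the paper):

```python
def tree_stats(trees):
    """Average and maximum children (w), descendants (v), path length (p).

    trees: dict mapping each root l to a parent map {node: parent},
    where the root's parent is None -- one embedded tree T_l per node.
    Per-node values are averaged over all N trees, as in the definitions.
    """
    N = len(trees)
    nodes = list(trees)
    w = dict.fromkeys(nodes, 0.0)  # average number of children
    v = dict.fromkeys(nodes, 0.0)  # average number of descendants
    p = dict.fromkeys(nodes, 0.0)  # average path length to the root
    for parent in trees.values():
        for k in nodes:
            depth, cur = 0, k
            while parent[cur] is not None:
                cur = parent[cur]
                depth += 1
                v[cur] += 1 / N          # k is a descendant of cur
            p[k] += depth / N
            if parent[k] is not None:
                w[parent[k]] += 1 / N    # k is a child of parent[k]
    stats = {}
    for name, vals in (("w", w), ("v", v), ("p", p)):
        stats[name + "_avg"] = sum(vals.values()) / N
        stats[name + "_max"] = max(vals.values())
    return stats
```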
Average number of descendants in a tree
HyperCast Protocol
• Goal: The goal of the protocol is to organize the members of a multicast group in a logical hypercube.
• Design criteria for scalability:
– Soft-state (state is not permanent)
– Decentralized (every node is aware only of its neighbors in the cube)
– Must handle dynamic group membership efficiently
HyperCast Protocol
• The HyperCast protocol maintains a stable hypercube:
• Consistent: No two nodes have the same logical address
• Compact: The dimension of the hypercube is as small as possible
• Connected: Each node knows the physical address of each of its neighbors in the hypercube
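The three conditions translate into a simple invariant check over a snapshot of the group. A hedged sketch (Python; `is_stable` and the `known` neighbor map are illustrative, not part of the HyperCast protocol specification):

```python
def gray(i: int) -> int:
    # Reflected binary Gray code used for the node ordering.
    return i ^ (i >> 1)

def is_stable(addresses, known):
    """Check the three stability conditions on a snapshot of the cube.

    addresses: logical addresses of the current members
    known: dict address -> set of neighbor addresses the node can contact
    """
    N = len(addresses)
    # Consistent: no two nodes share a logical address.
    if len(set(addresses)) != N:
        return False
    # Compact: members occupy the first N positions of the Gray ordering,
    # so the dimension of the cube is as small as possible.
    if set(addresses) != {gray(i) for i in range(N)}:
        return False
    # Connected: every node knows each of its neighbors that is present.
    present = set(addresses)
    n = max(1, (N - 1).bit_length())  # dimension of the (incomplete) cube
    for a in addresses:
        expected = {a ^ (1 << i) for i in range(n)} & present
        if not expected <= known.get(a, set()):
            return False
    return True
```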
HyperCast Protocol
1. A new node joins
2. A node fails
110
010
000 001
011
The node with the highest logical address (HRoot) sends out beacons
The beacon message is received by all nodes
Other nodes use the beacon to determine if they are missing neighbors
110
010
000 001
011
Each node sends ping messages to its neighbors periodically
[Figure: ping messages exchanged on every edge of the cube]
New
A node that wants to join the hypercube will send a beacon message announcing its presence
110
010
000 001
011
110
010
000 001
011
New
The HRoot receives the beacon and responds with a ping containing the new logical address
ping (as 111)
110
010
000 001
011
111
The joining node takes on the logical address given to it and adds “110” as its neighbor
ping (as 111)
110
010
000 001
011
111
The new node responds with a ping.
ping (as 111)
ping
110
010
000 001
011
111
The new node is now the new HRoot, and so it will beacon
110
010
000 001
011
111
Upon receiving the beacon, some nodes realize that they have a new neighbor and ping in response
ping
110
010
000 001
011
111
The new node responds to a ping from each new neighbor
The join operation is now complete, and the hypercube is once again in a stable state
110
010
000 001
011
111
HyperCast Protocol
1. A new node joins
2. A node fails
110
010
000 001
011
111
A node may fail unexpectedly
110
010
000
011
111
Holes in the hypercube fabric are discovered via lack of response to pings
[Figure: pings to the failed node go unanswered]
110
010
000
011
111
The HRoot and nodes which are missing neighbors send beacons
110
010
000
011
111
Nodes which are missing neighbors can move the HRoot to a new (lower) logical address with a ping message
ping (as 001)
110
010
000
011
111
The HRoot sends leave messages as it leaves the 111 logical address
leave
leave
110
010
000
011
001
The HRoot takes on the new logical address and pings back
ping
110
010
000
011
001
The “old” 111 is now node 001, and takes its place in the cube
110
010
000 001
011
The new HRoot and the nodes which are missing neighbors will beacon
110
010
000 001
011
Upon receiving the beacons, the neighboring nodes ping each other to finish the repair operation
110
010
000 001
011
The repair operation is now complete, and the hypercube is once again stable
State Diagram of a Node
[Figure: node state diagram — states: Start Hypercube, NIL, Joining, JoiningWait, Incomplete, Stable, HRoot/Stable, HRoot/Incomplete, HRoot/Repair, Repair, Rejoin, Depart, Leaving; transitions include: “Outside node wants to join” (from any state), “Node becomes HRoot”, “New HRoot”, “Has ancestor” / “Has no ancestor”, “Neighborhood becomes complete” / “Neighborhood becomes incomplete”, “Timeout for finding any neighbor”, “Timeout for finding an HRoot”, “Timeout while attempting to contact neighbor”, “Timeout for beacons from Joining nodes”, “Beacon from Joining node received”, “Node leaves”]
Experiments
• Experimental Platform: Centurion cluster at UVA (cluster of Linux PCs)
– 2 to 1024 hypercube nodes (on 32 machines)
– current record: 10,080 nodes on 106 machines
• Experiment:– Add N new nodes to a hypercube with M nodes
• Performance measures:
– Time until hypercube reaches a stable state
– Rate of unicast transmissions
– Rate of multicast transmissions
Experiment 1: Time for Join Operation vs. Size of Hypercube and Number of Joining Nodes
[Plot: x-axes log2(# of nodes in hypercube, M) and log2(# of nodes joining hypercube, N); z-axis: Time (heartbeats)]
Experiment 1: Unicast Traffic for Join Operation vs. Size of Hypercube and Number of Joining Nodes
[Plot: x-axes log2(M) and log2(N); z-axis: Unicast Byte Rate (bytes/heartbeat)]
Experiment 1: Multicast Traffic for Join Operation vs. Size of Hypercube and Number of Joining Nodes
[Plot: x-axes log2(M) and log2(N); z-axis: Multicast Byte Rate (bytes/heartbeat)]
Work in progress: Triangulations
[Figure: triangulation overlay with backbone links connecting clusters of nodes]
• Hypercube topology does not consider geographical proximity
• A triangulation can ensure that logical neighbors are “close by”
12,0
10,8
5,2
4,9
0,6
The Voronoi region of a node is the region of the plane that is closer to this node than to any other node.
Voronoi Regions
The Delaunay triangulation has edges between nodes in neighboring Voronoi regions.
12,0
10,8
5,2
4,9
0,6
Delaunay Triangulation
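Equivalently, a triangulation is Delaunay exactly when no node lies inside the circumcircle of any of its triangles, which the classic in-circle determinant tests. An illustrative sketch (Python; not taken from the HyperCast implementation):

```python
def in_circle(a, b, c, d):
    """Positive if d lies strictly inside the circle through a, b, c
    (with a, b, c listed counterclockwise); negative if outside; zero
    if on the circle. In a Delaunay triangulation this is never
    positive for any node d and any triangle (a, b, c).
    """
    # Translate so d is the origin, then take the 3x3 determinant.
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    return (
        (ax * ax + ay * ay) * (bx * cy - by * cx)
        - (bx * bx + by * by) * (ax * cy - ay * cx)
        + (cx * cx + cy * cy) * (ax * by - ay * bx)
    )
```

For example, with a, b, c on the unit circle, the center (0, 0) gives a positive value and a far-away point a negative one.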
12,0
10,8
5,2
4,9
0,6
beacon
A Leader is a node with a Y-coordinate higher than any of its neighbors. The Leader periodically broadcasts a beacon.
Each node sends ping messages to its neighbors periodically
12,0
5,2
4,9
0,6
10,8
[Figure: pings exchanged along all edges of the triangulation]
A node that wants to join the triangulation will send a beacon message announcing its presence
12,0
10,8
5,2
4,9
0,6
8,4
New node
The new node is located in one node’s Voronoi region. This node (5,2) updates its Voronoi region, and the triangulation.
12,0
4,9
0,6
8,4
10,8
5,2
(5,2) sends a ping which contains info for contacting its clockwise and counterclockwise neighbors
ping
12,0
4,9
0,6
8,4
10,8
5,2
12,0
4,9
0,6
(8,4) contacts these neighbors ...
8,4
10,8
5,2
ping
ping
12,0
4,9
12,0
4,9
… which update their respective Voronoi regions.
0,6
10,8
5,2
4,9
12,0
8,4
12,0
4,9
0,6
(4,9) and (12,0) send pings and provide info for contacting their respective clockwise and counterclockwise neighbors.
10,8
5,2
4,9
12,0
8,4
ping
ping
12,0
4,9
0,6
(8,4) contacts the new neighbor (10,8) ...
10,8
5,2
4,9
12,0
8,4
ping
10,8
12,0
4,9
0,6
…which updates its Voronoi region...
5,2
4,9
12,0
10,8
8,4
12,0
4,9
0,6
…and responds with a ping
5,2
4,9
12,0
10,8
8,4
ping
12,0
4,9
This completes the update of the Voronoi regions and the Delaunay Triangulation
0,6
5,2
4,9
12,0
10,8
8,4
Problem with Delaunay Triangulations
• Delaunay triangulation considers location of nodes, but not the network topology
• 2 heuristics to achieve a better mapping
Hierarchical Delaunay Triangulation
• 2-level hierarchy of Delaunay triangulations
• The node with the lowest x-coordinate in a domain DT is a member of 2 triangulations
Multipoint Delaunay Triangulation
• Different (“implicit”) hierarchical organization
• “Virtual nodes” are positioned to form a “bounding box” around a cluster of nodes. All traffic to nodes in a cluster goes through one of the virtual nodes
Work in Progress: Evaluation of Overlays
• Simulation:
– Network with 1024 routers (“Transit-Stub” topology)
– 2 to 512 hosts
• Performance measures for trees embedded in an overlay network:
– Degree of a node in an embedded tree
– “Relative Delay Penalty”: Ratio of delay in overlay to shortest path delay
– “Stress”: Number of overlay links over a physical link
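Both metrics follow directly from the overlay edge set and the underlying unicast routing. A hedged sketch (Python; `route` is a hypothetical function returning the physical links of the unicast path between two hosts):

```python
from collections import Counter

def stress(overlay_edges, route):
    """Stress of each physical link: the number of overlay edges whose
    unicast path crosses that link."""
    counts = Counter()
    for u, v in overlay_edges:
        for link in route(u, v):
            counts[link] += 1
    return counts

def relative_delay_penalty(overlay_delay, shortest_delay):
    """Ratio of the delay through the overlay to the shortest-path delay."""
    return overlay_delay / shortest_delay
```

For instance, if two overlay edges are both routed through the same router, the shared physical link has stress 2.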
Transit-Stub Network
• “Transit-Stub” graph from the GA Tech graph generator
• 4 transit domains
• 4×16 stub domains
• 1024 total routers
• 128 hosts on a stub domain
Overlay Topologies
• Delaunay Triangulation and variants:
– Hierarchical DT
– Multipoint DT
• Degree-6 Graph – similar to graphs generated in Narada
• Degree-3 Tree – similar to graphs generated in Yoid
• Logical MST – Minimum Spanning Tree
• Hypercube
Maximum Average Node Degree (over all trees)
Max Stress (Single node sending)
Maximum Average Stress (all nodes sending)
Time To Stabilize
90th Percentile Relative Delay Penalty
Work in progress: Provide an API
• Goal: Provide a socket-like interface to applications
[Figure: OLSocket architecture — the application attaches through the OLSocket Interface; inside the OLSocket: Group Manager, Forwarding Engine, ARQ Engine, Stats, and a Receive Buffer; transmission services: unconfirmed datagram, confirmed datagram, and TCP transmission; the OL_Node Interface connects to the overlay protocol node (OL_Node)]
Summary
• Overlay network for many-to-many multicast applications using Hypercubes and Delaunay Triangulations
• Performance evaluation of trade-offs via analysis, simulation, and experimentation
• “Socket like” API is close to completion
• Proof-of-concept applications:
– MPEG-1 streaming
– file transfer
– interactive games
• Scalability to groups with > 100,000 members appears feasible