Tapestry Deployment and Fault-tolerant Routing
Ben Y. Zhao
L. Huang, S. Rhea, J. Stribling, A. D. Joseph, J. D. Kubiatowicz
Berkeley Research Retreat, January 2003
Scaling Network Applications
Complexities of global deployment:
- Network unreliability: BGP slow convergence, redundancy unexploited
- Lack of administrative control over components constrains protocol deployment: multicast, congestion control
- Management of large-scale resources / components: locate and utilize resources despite failures
Enabling Technology: DOLR (Decentralized Object Location and Routing)
[Figure: a DOLR overlay routing messages to objects named by GUIDs; GUID1 appears at two locations (replicas), alongside a second object GUID2.]
What is Tapestry?
DOLR driving OceanStore global storage (Zhao, Kubiatowicz, Joseph et al. 2000)
Network structure:
- Nodes are assigned nodeIds from the namespace 0 to 2^160, interpreted in some radix (e.g., 16); keys come from the same namespace
- Keys dynamically map to one unique live node: the root
Base API (sketched below):
- Publish / Unpublish (ObjectID)
- RouteToNode (NodeId)
- RouteToObject (ObjectID)
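In Java (the deck's implementation language), the base API could be sketched as the following interface; the GUID wrapper and method signatures are illustrative assumptions, not Tapestry's actual code:

import java.math.BigInteger;

// Hypothetical 160-bit identifier drawn from the 0..2^160 namespace.
class GUID {
    final BigInteger id;            // 0 <= id < 2^160
    GUID(BigInteger id) { this.id = id; }
}

// The DOLR base API from the slide; method shapes are a sketch.
interface Dolr {
    void publish(GUID objectId);                    // announce a local replica
    void unpublish(GUID objectId);                  // withdraw the announcement
    void routeToNode(GUID nodeId, byte[] msg);      // exact-match delivery
    void routeToObject(GUID objectId, byte[] msg);  // deliver to a nearby replica
}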
Tapestry Mesh
[Figure: a Tapestry routing mesh over hexadecimal NodeIDs (0xEF34, 0xEF31, 0xEFBA, 0xEF37, 0xEF32, 0xEF97, 0xEF44, 0xEF40, 0xE932, 0xE324, 0xE555, 0xE530, 0xE399, 0xFF37, 0x0921, 0x0999, 0x099F); edges are labeled 1-4 with the routing-table level they belong to.]
Talk Outline
Introduction
Architecture
Node architecture
Node implementation
Deployment Evaluation
Fault-tolerant Routing
Single Node Architecture
[Diagram, bottom to top:]
- Transport Protocols
- Network Link Management
- Router (Routing Table & Object Pointer DB) and Dynamic Node Management
- Application Interface / Upcall API
- Applications: Decentralized File Systems, Application-Level Multicast, Approximate Text Matching
Single Node Implementation
[Diagram: Applications sit on the Application Programming Interface (API calls down, upcalls back). Inside the node, the Dynamic Tapestry core (enter/leave Tapestry, state maintenance, node insert/delete), the Router (route to node / object), and Patchwork (fault detection via heartbeat messages) exchange messages; the Network Stage maintains a distance map via UDP pings and handles routing link maintenance and node insert/delete. All stages run on the SEDA event-driven framework over the Java Virtual Machine.]
Deployment Status
- C simulator: packet-level simulation, scales up to 10,000 nodes
- Java implementation: 50,000 semicolons of Java, 270 class files; deployed on a local-area cluster (40 nodes) and on the PlanetLab global network (~100 distributed nodes)
Talk Outline
Introduction
Architecture
Deployment Evaluation
Micro-benchmarks
Stable network performance
Single and parallel node insertion
Fault-tolerant Routing
Micro-benchmark Methodology
- Experiment run in a LAN over GBit Ethernet
- Sender sends 60,001 messages at full speed; measure the inter-arrival time of the last 50,000 msgs
- Discarding the first ~10,000 msgs removes cold-start effects; averaging over 50,000 msgs removes network jitter effects (a timing sketch follows)
[Diagram: sender control → Tapestry node → LAN link → Tapestry node → receiver control]
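A receiver-side timing probe in this spirit might look like the following sketch; this is not the project's actual harness, and the onMessage receive hook is an assumption:

// Keep the last 50,000 inter-arrival gaps, discarding the first
// 10,000 messages as warm-up, per the methodology above.
class InterArrivalProbe {
    private static final int WARMUP = 10_000, KEEP = 50_000;
    private final long[] gapsNs = new long[KEEP];
    private long lastNs = -1;
    private int seen = 0;

    void onMessage() {                       // called once per received message
        long now = System.nanoTime();
        if (seen >= WARMUP && lastNs >= 0 && seen - WARMUP < KEEP)
            gapsNs[seen - WARMUP] = now - lastNs;   // record one gap
        lastNs = now;
        seen++;
    }
}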
Micro-benchmark Results
[Plots: message processing latency (time/msg, ms) and sustainable throughput (MB/s), each vs. message size from 0.01 KB to 10,000 KB on log scales.]
- Constant processing overhead of ~50 μs per message
- Latency dominated by byte copying
- For 5 KB messages, throughput ≈ 10,000 msgs/sec
Large Scale Methodology
- PlanetLab global network: 101 machines at 42 institutions in North America, Europe, and Australia (~60 machines utilized); 1.26 GHz PIII (1 GB RAM) and 1.8 GHz P4 (2 GB RAM); North American machines (2/3) on Internet2
- Tapestry Java deployment: 6-7 nodes on each physical machine; IBM Java JDK 1.3.0; node virtualization inside the JVM and SEDA; scheduling between virtual nodes increases latency
Node to Node Routing
- Ratio of end-to-end overlay routing latency to the shortest ping distance between nodes (RDP)
- All node pairs measured, placed into buckets
[Plot: RDP (min, median, 90th percentile) vs. internode RTT ping time, 0-300 ms in 5 ms buckets; annotated: median = 31.5, 90th percentile = 135.]
Object Location
- Ratio of end-to-end latency for object location to the shortest ping distance between client and object's location
- Each node publishes 10,000 objects; lookups performed on all objects
[Plot: RDP (min, median, 90th percentile) vs. client-to-object RTT ping time, 0-200 ms in 1 ms buckets; annotated: 90th percentile = 158.]
Latency to Insert Node
- Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing Tapestry
- Humps due to the expected filling of each routing level
[Plot: integration latency (ms), 0-2,000, vs. size of existing network, 0-500 nodes.]
Bandwidth to Insert Node
- Cost in bandwidth of dynamically inserting a node into the Tapestry, amortized over each node in the network
- Per-node bandwidth decreases with the size of the network
[Plot: control traffic bandwidth (KB), 0-1.4, vs. size of existing network, 0-400 nodes.]
Parallel Insertion Latency
- Latency for nodes to dynamically insert in unison into an existing Tapestry of 200 nodes
- Shown as a function of the ratio of insertion group size to network size
[Plot: latency to convergence (ms), 0-20,000, vs. ratio of insertion group size to network size, 0-0.3; annotated: 90th percentile = 55,042 ms.]
Talk Outline
Introduction
Architecture
Deployment Evaluation
Fault-tolerant Routing
Tunneling through scalable overlays
Example using Tapestry
Adaptive and Resilient Routing Goals
- Reachability as a service
- Agility / adaptability in routing
- Scalable deployment
- Useful for all client endpoints
Existing Redundancy in DOLR/DHTs
Fault detection via soft-state beacons:
- Periodically sent to each node in the routing table; scales logarithmically with the size of the network
- Worst-case overhead: 2^40 nodes, 160-bit (20-byte) hex IDs; 1 beacon/sec at 100 B each ≈ 240 kbps; can minimize bandwidth with better techniques (Hakim, Shelley)
Precomputed backup routes:
- Intermediate hops in the overlay path are flexible; keep a list of backups for outgoing hops (e.g., 3 node pointers for each route entry in Tapestry) — see the sketch after this list
- Maintain backups using node-membership algorithms (no additional overhead)
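A sketch of one way to hold those precomputed backups (three pointers per route entry, primary first; the class names and liveness flag are assumptions):

// One routing-table entry: a primary next hop plus two precomputed backups.
// On a detected fault we fail over in place, with no route recomputation.
class RouteEntry {
    private final NodeRef[] hops = new NodeRef[3];  // [0]=primary, [1..2]=backups

    NodeRef nextHop() {
        for (NodeRef n : hops)
            if (n != null && n.isAlive()) return n;   // liveness from beacons
        return null;  // all three down: fall back to surrogate routing
    }
}

// Minimal stand-in for a neighbor pointer; fields are illustrative.
class NodeRef {
    volatile boolean alive = true;    // updated by soft-state beacons
    boolean isAlive() { return alive; }
}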
Bootstrapping Non-overlay Endpoints
Goal: allow non-overlay nodes to benefit; endpoints communicate via overlay proxies.
Example with legacy nodes L1, L2:
- L_i registers w/ a nearby overlay proxy P_i
- P_i assigns L_i a proxy name D_i s.t. D_i is the closest possible unique name to P_i (e.g., start w/ P_i, increment for each node) — sketched below
- L1 and L2 exchange their new proxy names; messages route to the nodes using the proxy names
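The increment rule could be sketched as follows (BigInteger arithmetic wrapping modulo 2^160, and the assigned-name set, are assumptions):

import java.math.BigInteger;
import java.util.Set;

// Sketch: a proxy hands each legacy client the closest unused name at or
// above its own ID, scanning upward through the namespace.
class ProxyNamer {
    static final BigInteger SPACE = BigInteger.valueOf(2).pow(160);
    private final BigInteger proxyId;
    private final Set<BigInteger> assigned;   // names already handed out

    ProxyNamer(BigInteger proxyId, Set<BigInteger> assigned) {
        this.proxyId = proxyId; this.assigned = assigned;
    }

    BigInteger nextProxyName() {
        BigInteger d = proxyId;
        while (assigned.contains(d))
            d = d.add(BigInteger.ONE).mod(SPACE);  // wrap around 2^160
        assigned.add(d);
        return d;
    }
}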
Tunneling through an Overlay
- L1 registers with P1 as document D1; L2 registers with P2 as document D2
- Traffic tunnels through the overlay via the proxies
[Figure: L1 attaches to proxy P1 and L2 to proxy P2 at the edge of the overlay network; traffic between L1 and L2 routes inside the overlay toward the proxy names D1 and D2.]
Bandwidth Overhead for Misroute
Increase in latency for one misroute (fail-over to a secondary route), by position of the branch in the path.
[Plot: proportional increase to path latency (0-1.8) vs. position of branch (hop 0-4), one curve per end-to-end path latency: 20 ms, 26.66 ms, 60 ms, 80 ms, 93.33 ms.]
Status: under deployment on PlanetLab
For more information …
Tapestry and related projects (and these slides): http://www.cs.berkeley.edu/~ravenben/tapestry
OceanStore: http://oceanstore.cs.berkeley.edu
Related papers: http://oceanstore.cs.berkeley.edu/publications and http://www.cs.berkeley.edu/~ravenben/publications
The Naming Problem
Tracking modifiable objects:
- Examples: email, Usenet articles, tagged audio
- Goal: verifiable names, robust to small changes
Current approaches: content-based hashed naming; content-independent naming.
ADOLR Project (Feng Zhou, Li Zhuang): approximate names based on feature vectors; leverage them to match / search for similar content.
Approximation Extension to DOLR/DHT
Publication using features:
- Objects are described by a set of features: AO ≡ Feature Vector (FV) = {f1, f2, f3, …, fn}
- Locating AOs in the DOLR ≡ finding all AOs in the network with |FV* ∩ FV| ≥ Thres, where 0 < Thres ≤ |FV|
Driving application: decentralized spam filter
- Humans are the only fool-proof spam filter
- Users mark spam and publish it by its text feature vector; incoming mail is filtered by an FV query on the P2P overlay (a match sketch follows)
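The match predicate |FV* ∩ FV| ≥ Thres is plain set intersection; a sketch (representing hashed text features as long values is an assumption):

import java.util.Set;

// Sketch of the approximate-match test: an incoming mail's feature
// vector matches a published spam vector when they share >= thres features.
class FvMatch {
    static boolean matches(Set<Long> queryFv, Set<Long> publishedFv, int thres) {
        int common = 0;
        for (long f : queryFv)
            if (publishedFv.contains(f) && ++common >= thres) return true;
        return false;
    }
}

With the thresholds on the next slide, requiring 3 of 10 features caught 97.56% of modified spam copies.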
Evaluation on Real Emails
Accuracy of feature-vector matching on real emails:
- Spam: 29,631 junk emails from www.spamarchive.org; 14,925 unique; 86% of spam ≤ 5 KB
- Normal emails: 9,589 total = 50% newsgroup posts, 50% personal emails
Status: prototype implemented as an Outlook plug-in, interfacing w/ the Tapestry overlay; http://www.cs.berkeley.edu/~zf/spamwatch

"Similarity" test (3,440 modified copies of 39 emails):

  Thres   Detected   Fail   %
  3/10    3356       84     97.56
  4/10    3172       268    92.21

"False positive" test (9,589 normal × 14,925 spam pairs):

  Match   FP #   Pair probability
  2/10    4      2.79e-8
  >2/10   0      0
State of the Art Routing
High-dimensionality and coordinate-based P2P routing: Tapestry, Pastry, Chord, CAN, etc.
- Sub-linear storage and # of overlay hops per route
- Properties dependent on random name distribution
- Optimized for uniform mesh-style networks
Reality
[Figure: a P2P overlay spanning autonomous systems AS-1, AS-2, AS-3; an overlay path from sender S to receiver R repeatedly crosses domain boundaries.]
- Transit-stub topology, disparate resources per node
- Result: inefficient inter-domain routing (bandwidth, latency)
Landmark Routing on P2P
Brocade: exploit non-uniformity to minimize wide-area routing hops / bandwidth.
- Secondary overlay on top of Tapestry; select super-nodes by administrative domain
- Divide the network into cover sets; super-nodes form a secondary Tapestry
- Advertise each cover set as local objects; Brocade routes directly into the destination's local network, then resumes P2P routing (see the sketch below)
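The super-node forwarding decision might be sketched as follows (the cover-set lookup and the Overlay interface are assumptions):

// Sketch: at a super-node, traffic bound outside the local cover set is
// handed to the secondary (Brocade) Tapestry, which routes on cover sets
// advertised as objects; intra-domain traffic stays on the base overlay.
class BrocadeRouter {
    private final java.util.Set<String> localCoverSet;  // nodeIds in this domain
    private final Overlay baseTapestry, brocadeTapestry;

    BrocadeRouter(java.util.Set<String> cover, Overlay base, Overlay brocade) {
        localCoverSet = cover; baseTapestry = base; brocadeTapestry = brocade;
    }

    void route(String destId, byte[] msg) {
        if (localCoverSet.contains(destId))
            baseTapestry.routeToNode(destId, msg);        // normal P2P routing
        else
            brocadeTapestry.routeToObject(destId, msg);   // jump to dest's domain
    }
}

interface Overlay {
    void routeToNode(String id, byte[] msg);
    void routeToObject(String id, byte[] msg);
}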
Brocade Routing
[Figure: the same AS-1/AS-2/AS-3 P2P network with a Brocade layer above it; the original route from S to D wanders across domains, while the Brocade route jumps through the super-node layer directly into D's domain.]
Overlay Routing Networks
CAN: Ratnasamy et al. (ACIRI / UCB)
- Uses a d-dimensional coordinate space to implement a distributed hash table; routes to the neighbor closest to the destination coordinate
- Fast insertion / deletion; constant-sized routing state; unconstrained # of hops; overlay distance not proportional to physical distance
Chord: Stoica, Morris, Karger, et al. (MIT / UCB)
- Linear namespace modeled as a circular address space; "finger table" points to a logarithmic # of increasingly remote hosts
- Simplicity in algorithms; fast fault-recovery; log2(N) hops and routing state; overlay distance not proportional to physical distance
Pastry: Rowstron and Druschel (Microsoft / Rice)
- Hypercube routing similar to PRR97; objects replicated to servers by name
- Fast fault-recovery; log(N) hops and routing state; data replication required for fault-tolerance
Routing in Detail
Example: octal digits, 2^12 namespace, routing from node 2175 to node 0157. Each hop resolves one more digit of the destination ID against its per-level routing table (one row of 8 entries, digits 0-7, per level):
  2175 → 0880 → 0123 → 0154 → 0157
matching 0, 01, 015, and finally 0157 (a lookup sketch follows).
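A per-hop table lookup matching this example might look like the following sketch; the prefix-oriented digit order follows this figure (other slides in this deck describe suffix matching), and all names are illustrative:

// Sketch: resolve one more digit of the destination at each hop.
// table[level][digit] holds a neighbor agreeing with this node on the
// first `level` digits and having `digit` at the next position.
class DigitRouter {
    static final int RADIX = 8;     // octal digits, as in the example
    String selfId;                  // e.g. "2175"
    String[][] table;               // table[level][digit] -> neighbor nodeId, or null

    String nextHop(String destId) {
        int level = sharedPrefixLen(selfId, destId);
        if (level == destId.length()) return null;   // we are the destination
        int digit = destId.charAt(level) - '0';
        return table[level][digit];   // null => fall back to surrogate routing
    }

    static int sharedPrefixLen(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}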
Publish / Lookup Details
Publish object with ObjectID:
  // route towards "virtual root," ID = ObjectID
  for (i = 0; i < Log2(N); i += j) {   // j = # of bits per digit (e.g., j = 4 for hex digits)
    insert entry into nearest node that matches on last i bits;
    if no match is found, deterministically choose an alternative;
    the real root node is found when no external routes are left
  }
Lookup object: traverse the same path to the root as publish, except search for a location entry at each node:
  for (i = 0; i < Log2(N); i += j) {
    search for a cached object location;
    once found, route via IP or Tapestry to the object
  }
(a sketch of the publish walk follows)
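A compact sketch of that publish walk; the helper names storePointer and nextTowardRoot are assumptions, not Tapestry's actual interfaces:

// Sketch of publish: walk toward the object's virtual root, leaving a
// location pointer (objectId -> this server) at every hop.
class Publisher {
    static final int DIGIT_BITS = 4;            // j = 4 for hex digits

    void publish(long objectId, Node self, int logN) {
        Node hop = self;
        for (int i = 0; i < logN && hop != null; i += DIGIT_BITS) {
            hop.storePointer(objectId, self);        // cache location pointer
            hop = hop.nextTowardRoot(objectId, i);   // match i more bits; null at root
        }
    }
}

interface Node {
    void storePointer(long objectId, Node server);
    // Next node matching objectId on (i + DIGIT_BITS) low-order bits,
    // choosing a deterministic alternative when no exact match exists;
    // returns null when this node is the root (no external routes left).
    Node nextTowardRoot(long objectId, int i);
}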
Dynamic Insertion
1. Build up the new node's routing map: send messages to each hop along the path from the gateway to the current node N' that best approximates N; the i-th hop along the path sends its i-th level route table to N, and N optimizes those tables where necessary (sketched below)
2. Notify, via acknowledged multicast, the nodes with null entries for N's ID
3. Each notified node issues republish messages for the relevant objects
4. Notify local neighbors
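Step 1 could be sketched as follows; the path representation and the levelRow interface are assumptions:

// Sketch of step 1: the new node N copies its initial routing table from
// the path gateway -> ... -> N' (closest existing ID). Hop i contributes
// its level-i row, since it already matches N in i digits.
class TableBuilder {
    RouteRow[] buildFor(String newId, java.util.List<RemoteNode> path) {
        RouteRow[] table = new RouteRow[path.size()];
        for (int i = 0; i < path.size(); i++)
            table[i] = path.get(i).levelRow(i);  // then optimize via distance probes
        return table;
    }
}

interface RemoteNode { RouteRow levelRow(int level); }
class RouteRow { /* one neighbor pointer per digit value */ }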
Dynamic Insertion Example
[Figure: new node 0x143FE joins through gateway 0xD73FF into an existing mesh of nodes (0x243FE, 0x913FE, 0x0ABFE, 0x71290, 0x5239E, 0x973FE, 0x779FE, 0xA23FE, 0xB555E, 0xC035E, 0x244FE, 0x09990, 0x4F990, 0x6993E, 0x704FE); edges are labeled 1-4 with routing-table levels.]
Dynamic Root Mapping
Problem: choosing a root node for every object that is deterministic over network changes and globally consistent.
Assumption: all nodes with the same matching suffix contain the same null/non-null pattern in the next level of their routing maps.
Requires: consistent knowledge of nodes across the network.
PRR Solution
Given desired ID N:
- Find the set S of existing network nodes matching the most # of suffix digits with N
- Choose S_i = the node in S with the highest-valued ID
Issues:
- The mapping must be generated statically using global knowledge
- It must be kept as hard state in order to operate in a changing environment
- The mapping is not well distributed; many nodes get no mappings
Tapestry Solution
Globally consistent distributed algorithm:
- Attempt to route to the desired ID N_i
- Whenever a null entry is encountered, choose the next "higher" non-null pointer entry
- If the current node S holds the only non-null pointer in the rest of the route map, terminate the route: f(N) = S (see the sketch below)
Assumes:
- Routing maps across the network are up to date
- Null/non-null properties are identical at all nodes sharing the same suffix
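A sketch of that next-higher-non-null rule at a single routing level; the table layout and wrap-around behavior are assumptions consistent with the slide:

// Sketch of surrogate routing: when the exact digit entry is null,
// deterministically take the next "higher" non-null entry at the same
// level, wrapping modulo the radix. If no pointer other than this node
// remains, this node terminates the route as the root: f(N) = S.
class SurrogateRouter {
    static final int RADIX = 16;

    static String surrogateHop(String[] levelRow, int wantedDigit, String selfId) {
        for (int k = 0; k < RADIX; k++) {
            String hop = levelRow[(wantedDigit + k) % RADIX];
            if (hop != null && !hop.equals(selfId)) return hop;
        }
        return selfId;  // no outgoing pointer left: we are the root
    }
}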
Analysis
Globally consistent deterministic mapping:
- Null entry ⇒ no node in the network with that suffix
- Consistent ⇒ identical null entries across the route maps of nodes w/ the same suffix
Additional hops compared to the PRR solution reduce to a coupon-collector problem. Assuming random distribution: with n ln(n) + cn entries, P(all coupons) = 1 - e^-c. For n = b, c = b - ln(b): P = 1 - b/e^b ≈ 1 - 1.8×10^-6 (for b = 16), and the # of additional hops is at most log_b(b^2) = 2 (worked out below).
A distributed algorithm with minimal additional hops.
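Written out in LaTeX, the calculation behind those numbers (a restatement of the slide, with radix b = 16):

% Coupon collector: after m = n\ln n + cn draws, all n coupons are seen
% with probability 1 - e^{-c}.  Take n = b and c = b - \ln b:
P \;=\; 1 - e^{-(b - \ln b)} \;=\; 1 - b\,e^{-b} \;\approx\; 1 - 1.8\times 10^{-6} \qquad (b = 16)
% so a digit value is missing with probability b e^{-b}, and the
% surrogate detour adds at most \log_b(b^2) = 2 extra hops.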
Dynamic Mapping Border Cases
- A node vanishes undetected: routing proceeds on the invalid link and fails; with no backup router, proceed to surrogate routing
- A node enters the network undetected, so messages keep going to its surrogate: the new node checks with the surrogate after all such nodes have been notified, and the route info at the surrogate is moved to the new node
Network Assumption
Nearest neighbor is hard in a general metric. Assume the following:
- A ball of radius 2r contains only a factor of c more nodes than a ball of radius r
- Also, b > c^2
(Both assumed by PRR.) Start knowing one node; allow distance queries.
Algorithm Idea
- Call a node a level-i node if it matches the new node in i digits
- The whole network is contained in a forest of trees rooted at the highest possible level i_max
- Let list[i_max] contain the roots of all trees; then, starting at i_max: while i > 1, list[i-1] = getChildren(list[i])
- Certainly, list[i] contains the level-i neighbors
We Reach The Whole Network
[Figure: the same Tapestry mesh of hexadecimal NodeIDs as before (0xEF34, 0xEF31, 0xEFBA, ...); following child pointers level by level from the roots reaches every node, with edges labeled 1-4 by level.]
The Real Algorithm
The simplified version touches ALL nodes in the network, but far-away nodes are not likely to have close descendants, so trim the list at each step. New version: while i > 1: list[i-1] = getChildren(list[i]); trim(list[i-1]) (sketched below).
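A sketch of this trimmed descent; the childrenAtLevel and distance-query primitives are the assumed interface, and k is the trim width:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: start from the level-i_max roots, repeatedly expand to children
// one level down, and keep only the k closest candidates at each step.
class NeighborSearch {
    final int k;                        // trim width, k = O(log n)
    NeighborSearch(int k) { this.k = k; }

    List<Peer> findNeighbors(List<Peer> roots, int iMax, Peer self) {
        List<Peer> list = roots;
        for (int i = iMax; i > 1; i--) {
            List<Peer> next = new ArrayList<>();
            for (Peer p : list) next.addAll(p.childrenAtLevel(i - 1));
            next.sort(Comparator.comparingDouble(self::distanceTo));
            list = next.subList(0, Math.min(k, next.size()));   // trim
        }
        return list;
    }
}

interface Peer {
    List<Peer> childrenAtLevel(int level);
    double distanceTo(Peer other);      // network distance query
}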
How to Trim
Consider a circle of radius r containing at least one level-i node. A level-(i-1) node in the little circle must point to some level-i node in the big circle (radius < 2r).
Want: list[i] has radius three times that of list[i-1], and list[i-1] contains at least one level-i node.
True in Expectation
Want: list[i] has radius three times that of list[i-1], and list[i-1] contains one level-i node.
- Suppose list[i-1] has k elements and radius r
- Expect a ball of radius 4r to contain k·c^2/b nodes
- A ball of radius 3r contains fewer than k nodes, so keeping k elements all along is enough
- To work with high probability, take k = O(log n)
Steps of Insertion
1. Find the node with the closest matching ID (the surrogate) and get a preliminary neighbor table; if the surrogate's table is hole-free, so is this one
2. Find all nodes that need to put the new node in their routing tables, via multicast
3. Optimize the neighbor table; w.h.p. the nodes contacted while building the table are the only ones that need to update their own tables
Need: no fillable holes; keep objects reachable
Need-to-know Nodes
• Need-to-know = a node with a hole in its neighbor table that is filled by the new node
• If 1234 is the new node, and no 123* nodes existed before, we must notify the 12?? nodes
• Use acknowledged multicast to reach all matching nodes
Acknowledged Multicast Algorithm
Locates and contacts all nodes with a given prefix:
• Create a tree based on IDs as we go
• Nodes send acks when all of their children have been reached, so the starting node knows when all nodes have been reached
[Figure: a node covering 543?? (e.g., 54345) sends to one node in each of 5430?, 5431?, …, 5434? (e.g., 54340) if possible, each of which recurses on its own subtree.]
(a sketch follows)
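A sketch of this recursion; hex digits and the anyNodeMatching table lookup are assumptions:

// Sketch: to reach every node matching a prefix, a covering node fans
// out to one contact per next-digit value (543?? -> 5430?, 5431?, ...),
// recurses, and acks its parent only after all children have acked.
class AckedMulticast {
    static final int RADIX = 16;    // assumed digit base

    boolean multicast(String prefix, byte[] msg) {
        deliverLocally(msg);
        boolean allAcked = true;
        for (int d = 0; d < RADIX; d++) {
            String childPrefix = prefix + Integer.toHexString(d);
            RemotePeer child = anyNodeMatching(childPrefix);  // null if subtree empty
            if (child != null)
                allAcked &= child.multicastAndAwaitAck(childPrefix, msg);
        }
        return allAcked;   // our ack to the parent: entire subtree confirmed
    }

    void deliverLocally(byte[] msg) { /* application upcall */ }
    RemotePeer anyNodeMatching(String prefix) { return null; /* routing-table lookup */ }
}

interface RemotePeer { boolean multicastAndAwaitAck(String prefix, byte[] msg); }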