
Tapestry Deployment and Fault-tolerant Routing
Ben Y. Zhao, L. Huang, S. Rhea, J. Stribling, A. D. Joseph, J. D. Kubiatowicz
Berkeley Research Retreat, January 2003


Scaling Network Applications

Complexities of global deployment:
- Network unreliability: BGP slow convergence, redundancy unexploited
- Lack of administrative control over components: constrains protocol deployment (multicast, congestion control)
- Management of large-scale resources / components: locate and utilize resources despite failures


Enabling Technology: DOLR (Decentralized Object Location and Routing)

[Diagram: clients route through the DOLR layer to objects identified by GUIDs (GUID1, GUID2)]


What is Tapestry?

- DOLR driving OceanStore global storage (Zhao, Kubiatowicz, Joseph et al. 2000)
- Network structure:
  - Nodes assigned bit-sequence nodeIds from the namespace 0 to 2^160, based on some radix (e.g. 16)
  - Keys drawn from the same namespace
  - Keys dynamically map to one unique live node: the root
- Base API:
  - Publish / Unpublish (ObjectID)
  - RouteToNode (NodeId)
  - RouteToObject (ObjectID)
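A minimal sketch of what this base API might look like as a Java interface; the names and signatures are illustrative, not the actual Tapestry interface, and IDs are modeled as raw byte arrays.

    // Hypothetical sketch of the base API listed above; not the deployed interface.
    interface DolrApi {
        // Announce that the local node stores a replica of the object.
        void publish(byte[] objectId);

        // Withdraw the local node's location mapping for the object.
        void unpublish(byte[] objectId);

        // Route a message to the live node responsible for nodeId (its root).
        void routeToNode(byte[] nodeId, byte[] message);

        // Route a message to some replica of the object via location pointers.
        void routeToObject(byte[] objectId, byte[] message);
    }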


Tapestry Mesh

[Diagram: Tapestry routing mesh of nodes with hex nodeIds (0xEF34, 0xEF31, 0xEFBA, 0x0921, 0xE932, and others); edges labeled 1-4 indicate the routing level resolved at each hop]


Object Location


Talk Outline

- Introduction
- Architecture
  - Node architecture
  - Node implementation
- Deployment Evaluation
- Fault-tolerant Routing


Single Node Architecture

[Layer diagram, bottom to top: Transport Protocols; Network Link Management; Router with Routing Table & Object Pointer DB, alongside Dynamic Node Management; Application Interface / Upcall API; applications such as Decentralized File Systems, Application-Level Multicast, Approximate Text Matching]


Single Node Implementation

[Component diagram: Applications over an Application Programming Interface; a dynamic Tapestry core with Router, Patchwork, and a Network Stage / Distance Map; all running on the SEDA event-driven framework over the Java Virtual Machine. Interactions shown: enter/leave Tapestry, state maintenance and node insert/delete, routing link maintenance, messages, UDP pings, route to node/object, API calls, upcalls, fault detect, heartbeat msgs]


Deployment Status

- C simulator: packet-level simulation; scales up to 10,000 nodes
- Java implementation: 50,000 semicolons of Java, 270 class files
  - Deployed on a local-area cluster (40 nodes)
  - Deployed on the PlanetLab global network (~100 distributed nodes)


Talk Outline

- Introduction
- Architecture
- Deployment Evaluation
  - Micro-benchmarks
  - Stable network performance
  - Single and parallel node insertion
- Fault-tolerant Routing


Micro-benchmark Methodology

- Experiment run in a LAN over GBit Ethernet
- Sender sends 60,001 messages at full speed; measure inter-arrival time for the last 50,000 msgs
  - Discarding the first 10,000 msgs removes cold-start effects
  - Averaging over 50,000 msgs removes network jitter effects

[Diagram: Sender Control - Tapestry sender - LAN link - Tapestry receiver - Receiver Control]


Micro-benchmark Results

[Graphs: message processing latency (time/msg, ms) and sustainable throughput (MB/s) vs. message size, 0.01 KB to 10,000 KB, log scale]

- Constant processing overhead ~50 µs
- Latency dominated by byte copying
- For 5 KB messages, throughput ~10,000 msgs/sec


Large Scale Methodology

- PlanetLab global network
  - 101 machines at 42 institutions, in North America, Europe, Australia (~60 machines utilized)
  - 1.26 GHz PIII (1 GB RAM), 1.8 GHz P4 (2 GB RAM)
  - North American machines (2/3) on Internet2
- Tapestry Java deployment
  - 6-7 nodes on each physical machine
  - IBM Java JDK 1.3.0
  - Node virtualization inside the JVM and SEDA
  - Scheduling between virtual nodes increases latency


Node to Node Routing

- Ratio of end-to-end routing latency to shortest ping distance between nodes (RDP)
- All node pairs measured, placed into buckets

[Graph: RDP (min, median, 90th percentile) vs. internode RTT ping time in 5 ms buckets; median = 31.5, 90th percentile = 135]


Object Location

- Ratio of end-to-end latency for object location to shortest ping distance between client and object location (RDP)
- Each node publishes 10,000 objects; lookups performed on all objects

[Graph: RDP (min, median, 90th percentile) vs. client-to-object RTT ping time in 1 ms buckets; 90th percentile = 158]


Latency to Insert Node

- Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing Tapestry
- Humps due to expected filling of each routing level

[Graph: integration latency, 0-2000 ms, vs. size of existing network, 0-500 nodes]


Bandwidth to Insert Node

- Cost in bandwidth of dynamically inserting a node into the Tapestry, amortized over each node in the network
- Per-node bandwidth decreases with the size of the network

[Graph: control traffic bandwidth, 0-1.4 KB, vs. size of existing network, 0-400 nodes]


Parallel Insertion Latency

- Latency to dynamically insert nodes in unison into an existing Tapestry of 200 nodes
- Shown as a function of the ratio of insertion group size to network size

[Graph: latency to convergence (ms) vs. ratio of insertion group size to network size, 0-0.3; 90th percentile = 55,042 ms]


Talk Outline

- Introduction
- Architecture
- Deployment Evaluation
- Fault-tolerant Routing
  - Tunneling through scalable overlays
  - Example using Tapestry


Adaptive and Resilient Routing Goals

- Reachability as a service
- Agility / adaptability in routing
- Scalable deployment
- Useful for all client endpoints


Existing Redundancy in DOLR/DHTs

- Fault-detection via soft-state beacons
  - Periodically sent to each node in the routing table
  - Scales logarithmically with the size of the network
  - Worst-case overhead: 2^40 nodes, 160-bit IDs, 20 hex digits; 1 beacon/sec at 100 B each = 240 kbps (roughly 300 routing-table neighbors x 100 B/s = 30 KB/s)
  - Can minimize bandwidth with better techniques (Hakim, Shelley)
- Precomputed backup routes
  - Intermediate hops in the overlay path are flexible
  - Keep a list of backups for outgoing hops (e.g. 3 node pointers for each route entry in Tapestry); see the sketch below
  - Maintain backups using node membership algorithms (no additional overhead)
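A minimal Java sketch of a routing-table entry that keeps a primary next hop plus two backups and fails over when a hop is reported dead; the class and method names are illustrative, not taken from the Tapestry codebase.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative route entry with precomputed backups; not the actual data structure.
    class RouteEntry {
        // Primary next hop first, then backups (e.g. 3 pointers total per entry).
        private final Deque<String> nextHops = new ArrayDeque<>();

        void addHop(String nodeId) {
            if (nextHops.size() < 3) {
                nextHops.addLast(nodeId);
            }
        }

        // On beacon timeout, drop the failed hop; a backup becomes the new primary.
        void markFailed(String nodeId) {
            nextHops.remove(nodeId);
        }

        // Current primary next hop, or null if the entry has no live pointers left.
        String nextHop() {
            return nextHops.peekFirst();
        }
    }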


Bootstrapping Non-overlay Endpoints

- Goal: allow non-overlay nodes to benefit; endpoints communicate via overlay proxies
- Example: legacy nodes L1, L2
  - Li registers with a nearby overlay proxy Pi
  - Pi assigns Li a proxy name Di, s.t. Di is the closest possible unique name to Pi (e.g. start with Pi, increment for each node)
  - L1 and L2 exchange their new proxy names; messages route to the nodes using the proxy names (see the sketch below)
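A minimal sketch of the "start with Pi, increment for each node" assignment rule in Java; the class, the flat long ID space, and the in-memory set of assigned names are illustrative assumptions, not the actual registration code.

    import java.util.HashSet;
    import java.util.Set;

    // Illustrative proxy-name assignment: give each legacy endpoint the closest
    // free name at or after the proxy's own ID, wrapping around the namespace.
    class ProxyNamer {
        private final long proxyId;           // this proxy's overlay ID
        private final long idSpace;           // size of the ID namespace
        private final Set<Long> assigned = new HashSet<>();

        ProxyNamer(long proxyId, long idSpace) {
            this.proxyId = proxyId;
            this.idSpace = idSpace;
        }

        // Assign the next unused name starting from the proxy's own ID.
        long register() {
            long candidate = proxyId;
            while (assigned.contains(candidate)) {
                candidate = (candidate + 1) % idSpace;
            }
            assigned.add(candidate);
            return candidate;
        }
    }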


Tunneling through an Overlay

- L1 registers with P1 as document D1
- L2 registers with P2 as document D2
- Traffic tunnels through the overlay via the proxies

[Diagram: L1 - P1 - overlay network - P2 - L2, with documents D1 and D2 published at the proxies]


Failure Avoidance in Tapestry


Routing Convergence


Bandwidth Overhead for Misroute

[Graph: proportional increase in path latency for one misroute (secondary route) vs. position of the branch (hop 0-4), for paths of 20 ms, 26.66 ms, 60 ms, 80 ms, and 93.33 ms]

Status: under deployment on PlanetLab


For more information …

- Tapestry and related projects (and these slides): http://www.cs.berkeley.edu/~ravenben/tapestry
- OceanStore: http://oceanstore.cs.berkeley.edu
- Related papers: http://oceanstore.cs.berkeley.edu/publications and http://www.cs.berkeley.edu/~ravenben/publications

[email protected]


Backup Slides Follow…


The Naming Problem

- Tracking modifiable objects
  - Example: email, Usenet articles, tagged audio
  - Goal: verifiable names, robust to small changes
- Current approaches: content-based hashed naming; content-independent naming
- ADOLR project (Feng Zhou, Li Zhuang)
  - Approximate names based on feature vectors
  - Leverage to match / search for similar content


Approximation Extension to DOLR/DHT

- Publication using features
  - Objects are described using a set of features: AO ≡ Feature Vector (FV) = {f1, f2, f3, …, fn}
  - Locate AOs in the DOLR ≡ find all AOs in the network with |FV* ∩ FV| ≥ Thres, 0 < Thres ≤ |FV|
- Driving application: decentralized spam filter
  - Humans are the only fool-proof spam filter
  - Mark spam, publish spam by text feature vector
  - Incoming mail filtered by FV query on the P2P overlay (see the sketch below)
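A minimal sketch of the overlap test above in Java. This only illustrates the |FV* ∩ FV| ≥ Thres check; the real ADOLR/SpamWatch system publishes and queries features through the overlay rather than comparing sets locally.

    import java.util.HashSet;
    import java.util.Set;

    // Illustrative check: does a candidate's feature vector overlap the query's
    // feature vector in at least `thres` features?
    final class ApproxMatch {
        static boolean matches(Set<String> queryFv, Set<String> candidateFv, int thres) {
            Set<String> overlap = new HashSet<>(queryFv);
            overlap.retainAll(candidateFv);      // FV* ∩ FV
            return overlap.size() >= thres;      // |FV* ∩ FV| >= Thres
        }
    }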


Evaluation on Real Emails

- Accuracy of feature-vector matching on real emails
- Spam: 29,631 junk emails from www.spamarchive.org; 14,925 unique; 86% of spam ≤ 5 KB
- Normal emails: 9,589 total = 50% newsgroup posts, 50% personal emails
- Status: prototype implemented as an Outlook plug-in; interfaces with the Tapestry overlay; http://www.cs.berkeley.edu/~zf/spamwatch

"Similarity" test (3,440 modified copies of 39 emails):
  THRES   Detected   Fail   %
  3/10    3356       84     97.56
  4/10    3172       268    92.21

"False positive" test (9,589 normal x 14,925 spam pairs):
  Match   FP #   Pair probability
  2/10    4      2.79e-8
  >2/10   0      0


State of the Art Routing

- High dimensionality and coordinate-based P2P routing: Tapestry, Pastry, Chord, CAN, etc.
- Sub-linear storage and # of overlay hops per route
- Properties dependent on random name distribution
- Optimized for uniform mesh-style networks


Reality

[Diagram: P2P overlay network spanning autonomous systems AS-1, AS-2, AS-3, with sender S and receiver R]

- Transit-stub topology, disparate resources per node
- Result: inefficient inter-domain routing (bandwidth, latency)


Landmark Routing on P2P

- Brocade: exploit non-uniformity
  - Minimize wide-area routing hops / bandwidth
- Secondary overlay on top of Tapestry
  - Select super-nodes by administrative domain
  - Divide the network into cover sets
  - Super-nodes form a secondary Tapestry
- Advertise each cover set as local objects
  - Brocade routes directly into the destination's local network, then resumes P2P routing


Brocade Routing

[Diagram: P2P network across AS-1, AS-2, AS-3 with a Brocade layer above; the original route from S to D hops through the P2P network, while the Brocade route cuts across the Brocade layer into the destination's domain]


Overlay Routing Networks

- CAN: Ratnasamy et al. (ACIRI / UCB)
  - Uses a d-dimensional coordinate space to implement a distributed hash table
  - Route to the neighbor closest to the destination coordinate
  - Properties: fast insertion / deletion; constant-sized routing state; unconstrained # of hops; overlay distance not proportional to physical distance
- Chord: Stoica, Morris, Karger, et al. (MIT / UCB)
  - Linear namespace modeled as a circular address space
  - "Finger table" points to a logarithmic # of increasingly remote hosts
  - Properties: simplicity in algorithms; fast fault-recovery; log2(N) hops and routing state; overlay distance not proportional to physical distance
- Pastry: Rowstron and Druschel (Microsoft / Rice)
  - Hypercube routing similar to PRR97
  - Objects replicated to servers by name
  - Properties: fast fault-recovery; log(N) hops and routing state; data replication required for fault-tolerance


Routing in Detail

Example: octal digits, 2^12 namespace, routing from node 2175 to node 0157.

Each node's routing table holds one entry per digit value (0-7) per level. The message resolves one more digit of the destination ID at each hop: 2175 -> 0880 -> 0123 -> 0154 -> 0157.
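A minimal sketch of this digit-by-digit next-hop selection in Java; the table layout, the class name, and the representation of IDs as digit arrays are illustrative assumptions, not the deployed router.

    // Illustrative next-hop selection: at routing level `level`, forward to the
    // table entry for the destination's digit at that level.
    // table[level][digit] holds a neighbor ID that shares `level` digits with this
    // node and has `digit` as its next digit; null means no such neighbor is known.
    final class DigitRouter {
        private final String[][] table;   // [level][digit] -> neighbor ID or null

        DigitRouter(String[][] table) {
            this.table = table;
        }

        // Next hop toward `dest`, or null if no entry exists at this level
        // (surrogate routing, described later, would take over here).
        String nextHop(int[] destDigits, int level) {
            return table[level][destDigits[level]];
        }
    }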


Publish / Lookup Details

Publish object with ObjectID (a Java sketch follows below):
  // route towards the "virtual root", whose ID = ObjectID
  For (i = 0; i < Log2(N); i += j) {   // defines the hierarchy; j is the # of bits per digit (e.g. for hex digits, j = 4)
    Insert an entry into the nearest node that matches on the last i bits
    If no match is found, deterministically choose an alternative
    The real root node is found when no external routes are left
  }

Lookup object:
  // traverse the same path to the root as publish, except search for an entry at each node
  For (i = 0; i < Log2(N); i += j) {
    Search for a cached object location
    Once found, route via IP or Tapestry to the object
  }
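A compact Java sketch of the publish walk above, illustrative only: the Overlay interface and the pointerDb map stand in for the routing layer and the object-pointer DB, and are not the actual implementation.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative publish walk: deposit a location pointer (objectId -> server)
    // at every node on the path from the publishing server toward the object's root.
    final class Publisher {
        // Each node's local object-pointer DB: objectId -> serverId.
        static final Map<String, Map<String, String>> pointerDb = new HashMap<>();

        interface Overlay {
            // Next node on the path toward rootId, or null if `current` is the root.
            String nextHop(String current, String rootId);
        }

        static void publish(Overlay overlay, String serverId, String objectId) {
            String current = serverId;
            while (current != null) {
                pointerDb.computeIfAbsent(current, k -> new HashMap<>())
                         .put(objectId, serverId);            // leave a pointer here
                current = overlay.nextHop(current, objectId); // step toward the root
            }
        }
    }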


Dynamic Insertion

1. Build up the new node N's routing map
   - Send messages to each hop along the path from the gateway to the current node N' that best approximates N
   - The ith hop along the path sends its ith-level route table to N
   - N optimizes those tables where necessary
2. Notify, via acknowledged multicast, the nodes with null entries for N's ID
3. Notified nodes issue republish messages for relevant objects
4. Notify local neighbors


Dynamic Insertion Example

[Diagram: new node 0x143FE joins through gateway 0xD73FF; numbered hops (1-4) walk through existing nodes such as 0x243FE, 0x973FE, and 0xA23FE, matching progressively more digits of the new node's ID at each level]


Dynamic Root Mapping

- Problem: choosing a root node for every object
  - Deterministic over network changes
  - Globally consistent
- Assumptions
  - All nodes with the same matching suffix contain the same null/non-null pattern in the next level of the routing map
  - Requires consistent knowledge of nodes across the network


PRR Solution

- Given desired ID N, find the set S of existing nodes whose IDs match N in the most suffix digits
- Choose Si = the node in S with the highest-valued ID
- Issues:
  - The mapping must be generated statically using global knowledge
  - It must be kept as hard state in order to operate in a changing environment
  - The mapping is not well distributed; many nodes get no mappings


Tapestry Solution

- Globally consistent distributed algorithm:
  - Attempt to route to the desired ID Ni
  - Whenever a null entry is encountered, choose the next "higher" non-null pointer entry
  - If the current node S holds the only non-null pointer in the rest of the route map, terminate the route: f(N) = S
- Assumes:
  - Routing maps across the network are up to date
  - Null/non-null properties are identical at all nodes sharing the same suffix

(A sketch of this surrogate-routing rule follows below.)
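A minimal Java sketch of the "next higher non-null entry" rule. Wrapping around at the top of the digit range is an assumption made for the sketch; the slide does not spell out that detail.

    // Illustrative surrogate-routing step: prefer the entry for the desired digit;
    // on a null entry, scan upward (wrapping) for the next non-null one.
    // Returns null only if the whole level is empty, in which case the current
    // node terminates the route as the root: f(N) = S.
    final class SurrogateStep {
        static String pick(String[] levelEntries, int desiredDigit) {
            int base = levelEntries.length;               // e.g. 16 for hex digits
            for (int offset = 0; offset < base; offset++) {
                String entry = levelEntries[(desiredDigit + offset) % base];
                if (entry != null) {
                    return entry;
                }
            }
            return null;                                  // no non-null entries left
        }
    }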


Analysis

- Globally consistent deterministic mapping
  - Null entry => no node in the network with that suffix
  - Consistent map => identical null entries across the route maps of nodes with the same suffix
- Additional hops compared to the PRR solution: reduces to the coupon collector problem
  - Assuming a random distribution: with n ln(n) + cn entries, P(all coupons) = 1 - e^(-c)
  - For n = b and c = b - ln(b): with b^2 nodes, P(all coupons) = 1 - b/e^b (miss probability ~ 1.8 x 10^-6 for b = 16)
  - # of additional hops: log_b(b^2) = 2
- Distributed algorithm with minimal additional hops
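The substitution behind those numbers, written out as a short derivation; this uses the standard coupon-collector tail bound and is a sketch, not text from the original slides.

    With $n \ln n + cn$ random entries, $\Pr[\text{all } n \text{ coupons}] \ge 1 - e^{-c}$.
    For $n = b$ and $c = b - \ln b$: entries $= b \ln b + (b - \ln b)\,b = b^{2}$,
    so $\Pr[\text{all coupons}] \ge 1 - e^{-(b - \ln b)} = 1 - b/e^{b}$,
    and for $b = 16$: $b/e^{b} = 16/e^{16} \approx 1.8 \times 10^{-6}$.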


Dynamic Mapping Border Cases

- Node vanishes undetected
  - Routing proceeds on the invalid link and fails
  - No backup router, so proceed to surrogate routing
- Node enters the network undetected; messages go to the surrogate node instead
  - The new node checks with the surrogate after all such nodes have been notified
  - Route info at the surrogate is moved to the new node


SPAA slides follow


Network Assumption

- Nearest neighbor is hard in a general metric
- Assume the following:
  - A ball of radius 2r contains only a factor of c more nodes than a ball of radius r
  - Also, b > c^2
  - [Both assumed by PRR]
- Start knowing one node; allow distance queries


Algorithm Idea

- Call a node a level-i node if it matches the new node in i digits.
- The whole network is contained in a forest of trees rooted at the highest possible level i_max.
- Let list[i_max] contain the roots of all trees. Then, starting at i_max:
    while i > 1:
        list[i-1] = getChildren(list[i])
- Certainly, list[i] contains the level-i neighbors.


We Reach The Whole Network

[Diagram: the Tapestry mesh of hex nodeIds from earlier (0xEF34, 0xEF31, 0xEFBA, and others), with edges labeled by level 1-4, illustrating that walking the child lists level by level reaches the whole network]


The Real Algorithm

- The simplified version touches ALL nodes in the network, but far-away nodes are not likely to have close descendants.
- Trim the list at each step. New version (see the sketch below):
    while i > 1:
        list[i-1] = getChildren(list[i])
        Trim(list[i-1])
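A minimal Java sketch of this loop. The Oracle interface (getChildren plus a distance oracle) and the trim size k are illustrative assumptions standing in for details the slides leave abstract.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative nearest-neighbor list construction: walk down from the highest
    // level, expand each node into its children, and keep only the k nodes closest
    // to the joining node (the Trim step).
    final class NeighborSearch {
        interface Oracle {
            List<String> getChildren(String nodeId, int level); // level-(i-1) nodes known to nodeId
            double distance(String nodeId);                     // network distance to the new node
        }

        static List<String> build(Oracle oracle, List<String> roots, int maxLevel, int k) {
            List<String> list = new ArrayList<>(roots);         // list[i_max]
            for (int i = maxLevel; i > 1; i--) {
                Set<String> next = new LinkedHashSet<>();
                for (String node : list) {
                    next.addAll(oracle.getChildren(node, i));   // list[i-1] = getChildren(list[i])
                }
                List<String> trimmed = new ArrayList<>(next);
                trimmed.sort(Comparator.comparingDouble(oracle::distance));
                list = trimmed.subList(0, Math.min(k, trimmed.size())); // Trim(list[i-1])
            }
            return list;                                        // closest low-level neighbors
        }
    }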


How to Trim

- Consider a circle of radius r containing at least one level-i node.
- A level-(i-1) node in the little circle must point to a level-i node in the big circle.
- Want: list[i] has three times the radius of list[i-1], and list[i-1] contains at least one level-i node.

[Diagram: small circle of radius r inside a larger circle of radius < 2r]




True in Expectation

- Want: list[i] has three times the radius of list[i-1], and list[i-1] contains at least one level-i node.
- Suppose list[i-1] has k elements and radius r.
- Expect the ball of radius 4r to contain about kc^2/b nodes.
- The ball of radius 3r contains fewer than k nodes, so keeping k elements all along is enough.
- To work with high probability, k = O(log n).


Steps of Insertion

1. Find the node with the closest matching ID (the surrogate) and get a preliminary neighbor table
   - If the surrogate's table is hole-free, so is this one.
2. Find all nodes that need to put the new node in their routing tables, via multicast
3. Optimize the neighbor table
   - w.h.p. the nodes contacted while building the table are the only ones that need to update their own tables
- Need: no fillable holes; keep objects reachable


Need-to-know Nodes

- Need-to-know = a node with a hole in its neighbor table that is filled by the new node
- If 1234 is the new node, and no 123? nodes existed before, the 12?? nodes must be notified
- Acknowledged multicast to all matching nodes


Acknowledged Multicast Algorithm

Locates and contacts all nodes with a given prefix:
- Create a tree based on IDs as we go
- Nodes send acks when all of their children have been reached
- The starting node knows when all nodes have been reached

[Diagram: the 543?? node sends to any 5430?, any 5431?, any 5434?, etc. if possible; those nodes in turn reach nodes such as 54340 and 54345]

(A sketch of this acknowledged multicast follows below.)
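A minimal Java sketch of the acknowledged multicast described above. The Overlay interface is an illustrative assumption that abstracts the lookup of prefix extensions and message delivery, and synchronous recursion stands in for the real asynchronous ack messages.

    import java.util.List;

    // Illustrative acknowledged multicast: starting from a prefix, contact one node
    // for each one-digit extension of the prefix, recurse, and report completion
    // ("ack") only when the whole subtree has been reached.
    final class AckedMulticast {
        interface Overlay {
            // Known one-digit extensions of `prefix` for which some node exists.
            List<String> childPrefixes(String prefix);
            // Deliver the multicast message to a node matching `prefix`.
            void deliver(String prefix, String message);
        }

        // Returns once every reachable node under `prefix` has been contacted,
        // which is the point at which the real protocol would ack its parent.
        static void multicast(Overlay overlay, String prefix, String message) {
            overlay.deliver(prefix, message);
            for (String child : overlay.childPrefixes(prefix)) {
                multicast(overlay, child, message);   // child acks when its subtree is done
            }
        }
    }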