Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2008
Principles of Reliable Distributed Systems
Lecture 2: Distributed Hash Tables (DHT), Chord
Spring 2008 Idit Keidar
Today’s Material
• Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
– Stoica et al.
Reminder: Peer-to-Peer Lookup
• Insert (key, file)
• Lookup (key)
– Should find keys inserted in any node
Reminder: Overlay Networks
• A virtual structure imposed over the physical network (e.g., the Internet)
– over the Internet, there is a (IP level) link between every pair of nodes
– an overlay uses a fixed subset of these
• Why restrict to a subset?
Routing/Lookup in Overlays
• How does one route a packet to its destination in an overlay?
• How about lookup (key)?
• Unstructured overlay: (last week)
– Flooding or random walks
• Structured overlay: (today)
– The links are chosen according to some rule
– Tables define next-hop for routing and lookup
Structured Lookup Overlays
• Many academic systems:
– CAN, Chord, D2B, Kademlia, Koorde, Pastry, Tapestry, Viceroy, …
• OverNet based on the Kademlia algorithm
• Symmetric, no hierarchy
• Decentralized self management
• Structured overlay: data stored in a defined place, search goes on a defined path
• Implement Distributed Hash Table (DHT) abstraction
Reminder: Hashing
• Data structure supporting the operations:
– void insert( key, item )
– item search( key )
• Implementation uses hash function for mapping keys to array cells
• Expected search time O(1)
– provided that there are few collisions
Distributed Hash Tables (DHTs)
• Nodes store table entries
– The role of array cells
• Good abstraction for lookup?
– Why?
The DHT Service Interface
• lookup( key ) returns the location of the node currently responsible for this key
• key is usually numeric (in some range)
Using the DHT Interface
• How do you publish a file?
• How do you find a file?
• Requirements for an application being able to use DHTs?
– Data identified with unique keys
– Nodes can (agree to) store keys for each other
• location of object (pointer) or actual object (data)
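One way to read the two questions above is that both publishing and finding reduce to a lookup followed by a local store or fetch at the returned node. The sketch below is only an illustration: `ToyDHT`, `publish`, and `find` are hypothetical names, and the hash-modulo placement rule is an assumption made to keep the example short (it is not dynamic and is not Chord's rule).

```python
import hashlib


class ToyDHT:
    """Hypothetical in-process stand-in for a DHT (not a real library)."""

    def __init__(self, node_names):
        self.nodes = sorted(node_names)
        # Each node keeps a local table, playing the role of an array cell.
        self.storage = {name: {} for name in self.nodes}

    def lookup(self, key):
        # Assumed placement rule for this sketch only: SHA-1 modulo node count.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest, "big") % len(self.nodes)]


def publish(dht, key, value):
    """Publish: ask the DHT who is responsible, then store at that node."""
    node = dht.lookup(key)
    dht.storage[node][key] = value
    return node


def find(dht, key):
    """Find: the same lookup, followed by a fetch from the returned node."""
    node = dht.lookup(key)
    return dht.storage[node].get(key)


dht = ToyDHT(["nodeA", "nodeB", "nodeC"])
owner = publish(dht, "song.mp3", "pointer-or-data")
print(owner, find(dht, "song.mp3"))
```

The stored value can be either the object itself or a pointer to where it lives, which is the choice the last bullet refers to.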
What Does a DHT Implementation Need to Do?
• Map keys to nodes
– Needs to be dynamic as nodes join and leave
– How does this affect the service interface?
• Route a request to the appropriate node
– Routing on the overlay
Lookup Example
[Figure: a set of nodes, each holding a key-value (K, V) table; insert(K1,V1) stores the pair (K1,V1) at one node, and lookup(K1) is issued for the same key.]
Mapping Keys to Nodes
• Goal: load balancing
– Why?
• Typical approach:
– Give an m-bit id to each node and each key (e.g., using SHA-1 on the key, IP address)
– Map key to node whose id is “close” to the key (need distance function)
– How is load balancing achieved?
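The typical approach above can be sketched in a few lines. The helper names `chord_id` and `clockwise_distance` are hypothetical, and keeping the most significant m bits of the digest is an assumption of this sketch (any fixed way of reducing the 160-bit digest to m bits would serve). The distance shown is the clockwise one used on a ring; other DHTs use other distance functions.

```python
import hashlib


def chord_id(name: str, m: int = 160) -> int:
    """Map a string (a key, or a node's IP address) to an m-bit identifier."""
    if not 1 <= m <= 160:
        raise ValueError("SHA-1 yields 160 bits, so m must be in 1..160")
    digest = int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")
    return digest >> (160 - m)  # keep the m most significant bits


def clockwise_distance(a: int, b: int, m: int) -> int:
    """Distance from a to b going clockwise on a ring of 2**m identifiers."""
    return (b - a) % (2 ** m)


print(chord_id("10.0.0.1", m=8), chord_id("my-file.txt", m=8))
print(clockwise_distance(60, 4, m=6))  # wraps around the ring: 8
```

Because SHA-1 output behaves like a uniform random value, node and key identifiers spread roughly evenly over the ID space, which is where the load balancing comes from.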
Routing Issues
• Each node must be able to forward each lookup query to a node closer to the destination
• Maintain routing tables adaptively
– Each node knows some other nodes
– Must adapt to changes (joins, leaves, failures)
– Goals?
Handling Join/Leave
• When a node joins it needs to assume responsibility for some keys
– Ask the application to move these keys to it
– How many keys will need to be moved?
• When a node fails or leaves, its keys have to be moved to others
– What else is needed in order to implement this?
P2P System Interface
• Lookup
• Join
• Move keys
Chord
Stoica, Morris, Karger, Kaashoek, and Balakrishnan
Chord Logical Structure
• m-bit ID space ($2^m$ IDs), usually m=160.
• Think of nodes as organized in a logical ring according to their IDs.
[Figure: ring of nodes N1, N8, N10, N14, N21, N30, N38, N42, N48, N51, N56]
Consistent Hashing: Assigning Keys to Nodes
• Key k is assigned to first node whose ID equals or follows k – successor(k)
[Figure: ring of nodes N1, N8, N10, N14, N21, N30, N38, N42, N48, N51, N56 with key K54]
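The assignment rule can be tried out on the ring from the figure. This is a minimal sketch, assuming a 6-bit ID space (enough to hold IDs up to 56) and a globally known sorted list of node IDs, which a real node would not have; `successor` here is a plain function written for illustration.

```python
from bisect import bisect_left


def successor(ring, k, m=6):
    """Return the first node ID that equals or follows k on a ring of 2**m IDs.

    `ring` must be a sorted list of node IDs.
    """
    i = bisect_left(ring, k % (2 ** m))  # index of the first ID >= k
    return ring[i % len(ring)]           # past the largest ID, wrap to the smallest


RING = [1, 8, 10, 14, 21, 30, 38, 42, 48, 51, 56]
print(successor(RING, 54))  # 56
print(successor(RING, 8))   # 8  (an ID equal to the key counts)
print(successor(RING, 60))  # 1  (wraps around the ring)
```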
Moving Keys upon Join/Leave
• When a node joins, it becomes responsible for some keys previously assigned to its successor
– Local change
– Assuming load is balanced, how many keys should move?
• And what happens when a node leaves?
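To see that a join is a local change, the sketch below compares key assignments before and after a node joins. The ring is the one used on the earlier slides; the key set and the joining node N26 are made up for illustration.

```python
from bisect import bisect_left


def successor(ring, k):
    """First node ID that equals or follows k; `ring` is a sorted list."""
    i = bisect_left(ring, k)
    return ring[i % len(ring)]


def keys_moved_on_join(ring, keys, new_node):
    """Return {key: (old_owner, new_owner)} for keys whose owner changes."""
    new_ring = sorted(ring + [new_node])
    moved = {}
    for k in keys:
        old, new = successor(ring, k), successor(new_ring, k)
        if old != new:
            moved[k] = (old, new)
    return moved


RING = [1, 8, 10, 14, 21, 30, 38, 42, 48, 51, 56]
KEYS = [5, 12, 22, 24, 26, 28, 33, 54, 60]  # hypothetical keys
print(keys_moved_on_join(RING, KEYS, 26))
# {22: (30, 26), 24: (30, 26), 26: (30, 26)}
```

Only keys between the new node's predecessor (N21) and the new node itself change hands, and all of them come from a single node, the successor N30. A leave is the mirror image: the departing node's keys go to its successor.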
Consistent Hashing Guarantees
• For any set of N nodes and K keys, w.h.p.:
– Each node is responsible for at most $(1 + \epsilon)K/N$ keys
– When an (N + 1)st node joins or leaves, responsibility for O(K/N) keys changes hands (only to or from the joining or leaving node)
• For the scheme described above, $\epsilon = O(\log N)$
• $\epsilon$ can be reduced to an arbitrarily small constant by having each node run on the order of log N virtual nodes, each with its own identifier
Simple Routing Solutions
• Each node knows only its successor
– Routing around the circle
– Good idea?
• Each node knows all other nodes
– O(1) routing
– Cost?
Chord Skiplist Routing
• Each node has “fingers” to nodes ½ way around the ID space from it, ¼ the way…
• finger[i] at n contains successor($n + 2^{i-1}$)
• successor is finger[1]
[Figure: ring of nodes N0, N8, N10, N14, N21, N30, N38, N42, N48, N51, N56]
How many entries in the finger table?
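The finger definition can be evaluated directly on the ring from the figure. This is a sketch under the assumption of a 6-bit ID space (m = 6) and a globally known sorted node list; `finger_table` is a hypothetical helper, and its Python list is 0-indexed, so index 0 holds finger[1].

```python
from bisect import bisect_left


def successor(ring, k, m):
    """First node ID that equals or follows k on a ring of 2**m IDs."""
    i = bisect_left(ring, k % (2 ** m))
    return ring[i % len(ring)]


def finger_table(ring, n, m):
    """finger[i] = successor(n + 2**(i-1)) for i = 1..m."""
    return [successor(ring, n + 2 ** (i - 1), m) for i in range(1, m + 1)]


RING = [0, 8, 10, 14, 21, 30, 38, 42, 48, 51, 56]
fingers = finger_table(RING, 8, m=6)
print(fingers)            # [10, 10, 14, 21, 30, 42]
print(len(fingers))       # 6 entries, one per bit of the ID
print(len(set(fingers)))  # 5 distinct nodes
```

The table has one entry per identifier bit, but nearby entries often point to the same node, so the number of distinct fingers is smaller than m.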
Example: Chord Fingers
[Figure: ring of nodes N0, N10, N21, N30, N47, N72, N82, N90, N114, with fingers labeled finger[1..4], finger[5], finger[6], finger[7]]
• m entries
• log N distinct fingers with high probability
Chord Data Structures (At Each Node)
• Finger table
• First finger is successor
• Predecessor
Forwarding Queries
• Query for key k is forwarded to finger with highest ID not exceeding k
[Figure: Lookup(K54) on the ring of nodes N0, N8, N10, N14, N21, N30, N38, N42, N48, N51, N56 with key K54]
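The forwarding rule can be simulated on the ring from the figure. This is a sketch, not the paper's pseudocode: it assumes m = 6, computes every finger table from a global node list instead of exchanging RPCs, and reads "highest ID not exceeding k" in ring order, that is, the farthest finger that still precedes the key clockwise. All function names are hypothetical.

```python
from bisect import bisect_left

M = 6                     # identifier bits, so IDs live in 0..63
SIZE = 2 ** M
RING = [0, 8, 10, 14, 21, 30, 38, 42, 48, 51, 56]  # node IDs from the figure


def successor(k):
    i = bisect_left(RING, k % SIZE)
    return RING[i % len(RING)]


def fingers(n):
    return [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)]


def in_open(x, a, b):
    """x in (a, b) going clockwise; (a, a) is the whole ring except a."""
    span, d = (b - a) % SIZE, (x - a) % SIZE
    return d != 0 if span == 0 else 0 < d < span


def in_half_open(x, a, b):
    """x in (a, b] going clockwise; (a, a] is the whole ring."""
    span = (b - a) % SIZE
    return span == 0 or 0 < (x - a) % SIZE <= span


def lookup(start, key):
    """Return (nodes visited, node responsible for key)."""
    n, path = start, [start]
    while True:
        table = fingers(n)
        succ = table[0]
        if in_half_open(key, n, succ):  # key falls between n and its successor
            return path, succ
        # Forward to the farthest finger that still precedes the key.
        n = next((f for f in reversed(table) if in_open(f, n, key)), succ)
        path.append(n)


print(lookup(8, 54))  # ([8, 42, 51], 56)
```

Each hop lands on a node that is closer to the key, and the last node visited is the key's predecessor, whose successor is the answer.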
How long does it take?
Remote Procedure Call (RPC)
Routing Time
• Node n looks up a key stored at node p
• p is in n’s i-th interval: $p \in [(n + 2^{i-1}) \bmod 2^m,\ (n + 2^i) \bmod 2^m)$
• n contacts f = finger[i]
– The interval is not empty (because p is in it), so $f \in [(n + 2^{i-1}) \bmod 2^m,\ (n + 2^i) \bmod 2^m)$
– RPC f
• f is at least $2^{i-1}$ away from n
• p is at most $2^{i-1}$ away from f
• The distance is halved: maximum m steps
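The halving claim can be spelled out as a short derivation. The notation $d(a, b) = (b - a) \bmod 2^m$ for clockwise distance is introduced here for convenience and does not appear on the slide. The only facts used are that f lies between n and p going clockwise, that f is at least $2^{i-1}$ away from n, and that p is at most $2^i$ away from n.

```latex
\begin{align*}
d(n,f) &\ge 2^{i-1}, \qquad d(n,p) \le 2^{i} \\
d(f,p) &= d(n,p) - d(n,f) \\
       &\le d(n,p) - 2^{i-1} \\
       &\le d(n,p) - \tfrac{1}{2}\, d(n,p) \;=\; \tfrac{1}{2}\, d(n,p)
\end{align*}
```

Since the initial distance is less than $2^m$, after at most m forwarding steps the remaining distance is below 1, so the query has reached p.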
Routing Time Refined
• Assuming uniform node distribution around the circle, the number of nodes in the search space is halved at each step:
– Expected number of steps: log N
• Note that:
– m = 160
– For 1,000,000 nodes, log N ≈ 20 (base-2 logarithm)
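What About Network Distance?
[Figure: Lookup(K54) on the ring of nodes N0, N8, N10, N14, N21, N30, N38, N42, N48, N51, N56 with key K54; nodes are annotated with physical locations Haifa, Texas, and China]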
Joining Chord
• Goals?
• Required steps:
– Find your successor
– Initialize finger table and predecessor
– Notify other nodes that need to change their finger table and predecessor pointer
  • $O(\log^2 N)$
– Learn the keys that you are responsible for; notify others that you assume control over them
Join Algorithm: Take II
• Observation: for correctness, successors suffice
– Fingers only needed for performance
• Upon join, update successor only
• Periodically,
– Check that successors and predecessors are consistent
– Fix fingers
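The lazy join can be sketched as a small in-memory simulation in the spirit of the stabilization scheme described in the Chord paper. Everything here is a simplification and an assumption of the sketch: nodes are Python objects calling each other directly instead of using RPCs, the ID space has 6 bits, fingers are left out entirely (lookups walk successor pointers), and the node IDs are chosen arbitrarily.

```python
SIZE = 2 ** 6  # 6-bit ID space, assumed for this sketch


def in_open(x, a, b):
    """x in (a, b) going clockwise; (a, a) is the whole ring except a."""
    span, d = (b - a) % SIZE, (x - a) % SIZE
    return d != 0 if span == 0 else 0 < d < span


def in_half_open(x, a, b):
    """x in (a, b] going clockwise; (a, a] is the whole ring."""
    span = (b - a) % SIZE
    return span == 0 or 0 < (x - a) % SIZE <= span


class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.successor = self    # a newly created ring contains only this node
        self.predecessor = None

    def find_successor(self, key):
        # Successor-only routing: correct, but linear in the number of nodes.
        n = self
        while not in_half_open(key, n.id, n.successor.id):
            n = n.successor
        return n.successor

    def join(self, known):
        # Upon join, set the successor only; everything else is fixed lazily.
        self.predecessor = None
        self.successor = known.find_successor(self.id)

    def stabilize(self):
        # Periodic: adopt a closer successor if one has appeared, then notify it.
        x = self.successor.predecessor
        if x is not None and in_open(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate):
        # candidate believes it might be our predecessor.
        if self.predecessor is None or in_open(candidate.id, self.predecessor.id, self.id):
            self.predecessor = candidate


def successor_cycle(start):
    """Follow successor pointers from start until the walk returns to it."""
    ids, n = [start.id], start.successor
    while n is not start and len(ids) <= SIZE:
        ids.append(n.id)
        n = n.successor
    return ids


first = Node(8)
nodes = [first] + [Node(i) for i in (21, 42, 56, 30)]
for n in nodes[1:]:
    n.join(first)            # all joins happen before any stabilization
for _ in range(10):          # periodic stabilization rounds
    for n in nodes:
        n.stabilize()
print(successor_cycle(first))  # [8, 21, 30, 42, 56]
```

Right after the joins every new node simply points at N8; the repeated stabilize and notify calls are what gradually sort the successor and predecessor pointers into a single ring.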
Creation and Join
Join Example
[Figure: four panels showing the join in order: joiner finds successor; gets keys; stabilize fixes successor; stabilize fixes predecessor]
Join Stabilization Guarantee
• If any sequence of join operations is executed interleaved with stabilizations,
– Then at some time after the last join
– The successor pointers form a cycle on all the nodes in the network
• Model assumptions?
Performance with Concurrent Joins
• Assume a stable network with N nodes with correct finger pointers
• Now, another set of up to N nodes joins the network,
– And all successor pointers (but perhaps not all finger pointers) are correct,
• Then lookups still take O(log N) time w.h.p.
• Model assumptions?
Failure Handling
• Periodically fixing fingers
• List of r successors instead of one successor
• Periodically probing predecessors
Failure Detection
• Each node has a local failure detector module
• Uses periodic probes and timeouts to check liveness of successors and fingers
– If the probed node does not respond by a designated timeout, it is suspected to be faulty
• A node that suspects its successor (finger) finds a new successor (finger)
• False suspicion - the suspected node is not faulty
– Suspected due to communication problems
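A timeout-based failure detector of the kind described above might look like the following. This is a sketch only: `FailureDetector` and its methods are hypothetical names, probing itself is left out, and time is passed in as plain numbers so that the example is deterministic.

```python
class FailureDetector:
    """Local failure detector: suspect a node whose last reply is too old."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_reply = {}  # monitored node -> time of its latest reply

    def monitor(self, node, now):
        """Start monitoring a successor or finger."""
        self.last_reply[node] = now

    def on_reply(self, node, now):
        """Record that a probe to `node` was answered at time `now`."""
        self.last_reply[node] = now

    def suspects(self, now):
        """Nodes that have not answered within the designated timeout."""
        return {n for n, t in self.last_reply.items() if now - t > self.timeout}


fd = FailureDetector(timeout=3)
fd.monitor("N14", now=0)
fd.monitor("N21", now=0)
fd.on_reply("N14", now=4)
print(fd.suspects(now=5))  # {'N21'}
fd.on_reply("N21", now=6)  # the reply was only delayed: a false suspicion
print(fd.suspects(now=7))  # set()
```

The second half of the example is the false-suspicion case: N21 was alive all along, and only a slow message made it look faulty.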
The Model?
• Reliable messages among correct nodes
– No network partitions
• Node failures can be accurately detected!
– No false suspicions
• Properties hold as long as failure is bounded:
– Assume a list of r successors, with r on the order of log N
– Start from stable state and then each node fails with prob. 1/2
– Then w.h.p. find successor returns the closest living successor to the query key
– And the expected time to execute find successor is O(log N)
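A back-of-the-envelope calculation suggests why a successor list of logarithmic length is enough. The sketch assumes that nodes fail independently with probability 1/2, as on the slide, and introduces a hypothetical constant $c > 1$ with $r = c \log_2 N$; it is an intuition, not the paper's proof.

```latex
\begin{align*}
\Pr[\text{all } r \text{ successors of a given node fail}]
    &= \left(\tfrac{1}{2}\right)^{r} = 2^{-r} \\
r = c \log_2 N \;&\Rightarrow\; 2^{-r} = N^{-c} \\
\Pr[\text{some node loses all } r \text{ successors}]
    &\le N \cdot N^{-c} = N^{1-c}
\end{align*}
```

For $c > 1$ this bound goes to 0 as N grows, so with high probability every surviving node still knows at least one live successor and the ring stays connected.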
What Can Partitions Do?
[Figure: ring of nodes N0, N8, N10, N14, N21, N30, N38, N42, N48, N51, N56, with three “Suspect successor” annotations]
What About Moving Keys?
• Left up to the application
• Solution: keep soft state, refreshed periodically
– Every refresh operation performs lookup(key) before storing the key in the right place
• How can we increase reliability for the time between failure and refresh?
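The soft-state idea can be sketched as entries that expire unless they are refreshed, with each refresh repeating the lookup so that the entry follows changes in responsibility. `SoftStateStore`, `refresh`, the time-to-live value, and the node names are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class SoftStateStore:
    """Per-node table whose entries disappear unless refreshed in time."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.entries = {}  # key -> (value, expiry time)

    def put(self, key, value, now):
        self.entries[key] = (value, now + self.ttl)

    def get(self, key, now):
        item = self.entries.get(key)
        if item is None or item[1] <= now:
            self.entries.pop(key, None)  # drop expired state
            return None
        return item[0]


def refresh(stores, lookup, key, value, now):
    """Look up the node currently responsible for key, then store it there."""
    owner = lookup(key)
    stores[owner].put(key, value, now)
    return owner


stores = {"A": SoftStateStore(ttl=10), "B": SoftStateStore(ttl=10)}
responsible = {"k1": "A"}        # stand-in for the DHT's lookup(key)
refresh(stores, responsible.get, "k1", "v1", now=0)
responsible["k1"] = "B"          # responsibility changes, e.g. after a join
refresh(stores, responsible.get, "k1", "v1", now=8)
print(stores["A"].get("k1", now=11), stores["B"].get("k1", now=11))  # None v1
```

Between the change at the DHT level and the next refresh, the new owner does not yet hold the key, which is the window the last question on the slide asks about.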
Summary: DHT Advantages
• Peer-to-peer: no centralized control or infrastructure
• Scalability: O(log N) routing, routing tables, join time
• Load-balancing
• Overlay robustness
DHT Disadvantages
• No control where data is stored
• In practice, organizations want:
– Content Locality – explicitly place data where we want (inside the organization)
– Path Locality – guarantee that local traffic (a user in the organization looks for a file of the organization) remains local
• No prefix search