Constructing Distributed Doubly Linked List without Distributed Locking

IEEE Peer-to-Peer Conference 2015 Sep 23rd–24th, 2015

Kota Abe, Osaka City University / NICT, Japan
Mikio Yoshida, BBR Inc., Japan

Outline

- Background
  - What is a distributed doubly linked list
  - Conventional approaches
- The DDLL algorithm
  - Procedure for node insertion, deletion and traversal
  - Procedure for recovery from failure
- Evaluation
  - Comparison with conventional algorithms
- Conclusion


Distributed Doubly Linked List

- Also known as a bidirectional ring
- Commonly used in structured P2P networks
  - Chord, Chord#, Skip Graph, SkipNet, etc.
- Structure
  - Each node holds a pointer (e.g. IP address) to the next (successor) node and the previous (predecessor) node
    - We call them the right and left pointers
  - Sorted by node-specific key
  - Circular

[Figure: a ring of eight nodes with keys 0, 10, 20, 30, 40, 50, 60 and 70]
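The structure described on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration, not part of DDLL itself: the `Node` class and the `build_ring` and `walk` helpers are hypothetical names, and plain object references stand in for the pointers (e.g. IP addresses) mentioned above. The keys are the ones shown in the figure.

```python
class Node:
    """One list element: a key plus left/right pointers (illustrative only)."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def build_ring(keys):
    """Link nodes in key order and close the list into a ring."""
    nodes = [Node(k) for k in sorted(keys)]
    n = len(nodes)
    for i, node in enumerate(nodes):
        node.right = nodes[(i + 1) % n]   # successor, wrapping at the largest key
        node.left = nodes[(i - 1) % n]    # predecessor, wrapping at the smallest key
    return nodes

def walk(start, step):
    """Follow one kind of pointer until the walk returns to the start node."""
    keys, node = [], start
    while True:
        keys.append(node.key)
        node = step(node)
        if node is start:
            return keys

ring = build_ring([0, 20, 60, 40, 70, 10, 50, 30])
rightward = walk(ring[0], lambda n: n.right)
leftward = walk(ring[0], lambda n: n.left)
print(rightward)
print(leftward)
```

Walking by right pointers visits the keys in ascending order, and walking by left pointers visits them in descending order, which is the invariant the rest of the talk tries to maintain under concurrent updates.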

Maintaining Distributed Doubly Linked List

- Challenges
  - Nodes are distributed and may be simultaneously and independently inserted and deleted
  - Nodes may fail

[Figure: four operations on the list: insertion, deletion, recovery and traversal, illustrated with nodes p, q, r and u]

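Conventional Approaches (1/2)

- Eventual Consistency Approach (e.g. Chord)
  - Node insertion and deletion temporarily break the list structure
  - A stabilizing procedure recovers it
- Distributed Locking Approach (e.g. Atomic Ring Maintenance (Ghodsi))
  - Use a lock 🔒 to mutually exclude node insertion / deletion

[Figure: insertion of u between p and q under each approach. Panel labels: Chord; Atomic Ring Maintenance (Ghodsi), whose message sequence uses JoinReq, JoinPoint, NewSucc, NewSuccAck and JoinDone with lock 🔒 and unlock 🔓 events]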

Conventional Approach (2/2)

- Eventual Consistency Approach
  - Pros 👍
    - Easy to recover from failure
  - Cons 👎
    - No lookup consistency: lookup results may differ depending on the querying node
- Distributed Locking Approach
  - Pros 👍
    - Lookup consistency
  - Cons 👎
    - A lock disturbs the insertion / deletion of other nodes
    - When a node fails, the locking duration may be quite long
    - Recovery procedure is rather complicated
    - Locks are released by timeout, which may be premature

→ Locks should not be used if possible


Our Contribution — DDLL Algorithm

DDLL = Distributed algorithm for constructing distributed doubly linked lists

- Acronym of “Distributed Doubly Linked List”
- Guarantees lookup consistency without using distributed locking (in the absence of failure)
- Simple and efficient
- Proved correctness (insertion and deletion procedures)
- Practical
  - Works with non-FIFO channels (e.g. UDP)
  - Used in our PIAX P2P platform as a foundation of Skip Graph and Chord# implementations

Node Insertion

u is going to be inserted between p and q

(1) u.l := p, u.r := q
(2) Update right link: change p’s right link to u
(3) Update left link: change q’s left link to u

[Figure: four snapshots of p, u and q, showing the initial state and the state after each step]

Updating Right Link (1/3)

We want to change p’s right link only if there is no conflict

- Conflicts
  - Another node has been inserted between p and q
  - p has been deleted
  - q has been deleted

[Figure: the three conflict cases for a node u that wants to join between p and q]

- A SetR message is used for updating a right link
- A SetR message contains:
  - the new right node
  - the expected right node of the recipient node
- When a SetR request is accepted, p returns a SetRAck message
- Otherwise, p returns a SetRNak message


Updating Right Link (2/3)

- u sends SetR(u, q) to p: “Please change your right link to me (u) if your right link still points to q and you have not initiated deletion”
- p replies SetRAck: “Ok!”
- Right links are always correct without using locking

[Figure: p’s right link changes from q to u]


Updating Right Link (3/3)

Conflict case example: another node (v) has been inserted between p and q

- u sends SetR(u, q) to p
- p.r != q, so p replies SetRNak: “Sorry!”

[Figure: p, v, q and u, with u’s request rejected]
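The SetR exchange on these slides behaves like a compare-and-set on p's right link. The following is a minimal sketch under stated assumptions: the `Node` class, the `on_set_r` method and the `deleting` flag are hypothetical names, messages are modeled as direct method calls, and the sequence numbers introduced later are left out.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.right = None
        self.deleting = False   # True once this node has initiated its own deletion

    def on_set_r(self, new_right, expected_right):
        """Accept the update only if the sender's view of our right link is current."""
        if self.deleting or self.right is not expected_right:
            return "SetRNak"    # conflict: the sender must retry
        self.right = new_right  # the list changes at this single step
        return "SetRAck"

p, q, u, v = (Node(n) for n in "pquv")
p.right = q

first = p.on_set_r(v, q)    # v asks first, and p.right is still q
second = p.on_set_r(u, q)   # u's request is now stale because p.right is v
print(first, second, p.right.name)
```

Because the check and the update happen inside one message handler on p, two concurrent requests cannot both succeed, which is why no lock is needed for right links.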

Updating Left Link (1/3)

[Figure: message sequence and topology change for two insertions next to q. Messages shown: SetR(u, q), SetRAck, SetL(u), SetR(v, q), SetRAck, SetL(v); q receives both SetL messages (!?)]

- Problem:
  - Multiple SetL messages arrive from different nodes in arbitrary order (because we do not want to use locking)
  - A node must determine which SetL message is newer


Updating Left Link (2/3)

- Solution:
  - A SetL message contains a sequence number (seq)
  - Each node holds a sequence number for its right node (rseq)
    - rseq is transferred using SetRAck
  - Each node holds the max sequence number of SetL messages received so far (lseq)
  - A SetL message is accepted only if msg.seq > lseq

[Figure: sequence numbers during successive insertions to the left of q. Initially rseq = 0 and lseq = 0. First insertion: SetRAck(1) and SetL(u, 1), after which rseq = 1 and lseq = 1. Second insertion (node v): SetRAck(2) and a SetL message with sequence number 2, after which rseq = 2 and lseq = 2]


Updating Left Link (3/3)

How our scheme solves the previous case

[Figure: message sequence and topology change with sequence numbers. u sends SetR(u, q, 0) and receives SetRAck(1); v sends SetR(v, q) and receives SetRAck(2); SetL(v, 2) reaches q before SetL(u, 1). q’s lseq goes from 0 directly to 2, while the rseq of the link toward q goes 0, 1, 2]

- The late SetL(u, 1) message is stale and ignored
- A lock is not necessary!
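The acceptance rule for SetL (accept only if msg.seq > lseq) can be checked with a small sketch. The `Node` class, `on_set_l` and `deliver` are hypothetical names, and node identities are plain strings; the two messages are the SetL(u, 1) and SetL(v, 2) from the slides.

```python
from itertools import permutations

class Node:
    def __init__(self, left):
        self.left = left
        self.lseq = 0           # max sequence number of SetL messages accepted so far

    def on_set_l(self, new_left, seq):
        """Accept a SetL message only if it is newer than anything seen so far."""
        if seq > self.lseq:
            self.left, self.lseq = new_left, seq
        # otherwise the message is stale and silently ignored

def deliver(messages):
    """Deliver SetL messages to a fresh q (left = p) in the given order."""
    q = Node(left="p")
    for new_left, seq in messages:
        q.on_set_l(new_left, seq)
    return q.left, q.lseq

# Try both arrival orders of SetL(u, 1) and SetL(v, 2).
results = {deliver(order) for order in permutations([("u", 1), ("v", 2)])}
print(results)
```

Both arrival orders leave q with the same left link and lseq, so the outcome does not depend on channel ordering.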

Node Insertion Sequence

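[Figure: message sequence and topology change for inserting u between p and q, where p.rseq = q.lseq = i beforehand]

- u → p: SetR(u, q, 0)
- p → u: SetRAck(i+1)
- p → q: SetL(u, i+1)
- Afterwards: p.rseq = u.lseq = 0 and u.rseq = q.lseq = i+1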

Node Deletion Sequence

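[Figure: message sequence and topology change for deleting u from between p and q, where p.rseq = u.lseq = i1 and u.rseq = q.lseq = i2 beforehand]

- u → p: SetR(q, u, i2+1)
- p → u: SetRAck(i1+1) (i1+1 is not used)
- p → q: SetL(p, i2+1)
- Afterwards: p.rseq = q.lseq = i2+1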

Insertion and Deletion

- 3 messages are required for insertion/deletion
- A node is atomically inserted/deleted when the SetR message is accepted
- If a SetRNak message is received, the application retries insertion/deletion
- Right links are always correct
- Left links are correct when there is no SetL message in transmission
- No distributed locking
- Does not require FIFO channels (UDP friendly)

Traversals

- Every inserted node can be looked up either rightward or leftward
- Traversing rightward: easy
- Traversing leftward: left links are not always correct
  1. Node X visits q and fetches q.l (= p)
  2. X visits p and fetches p.l and p.r (= u)
  3. X detects that u is missed (because p.r != q) and X visits u

[Figure: X traversing leftward from q. q’s left link incorrectly points to p although u sits between p and q; X visits q, then p, then u]
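The leftward traversal steps above can be sketched as follows. This is a simplified single-process variant, not the authors' implementation: `true_left` and `walk_left` are hypothetical helpers, and the sketch assumes that right links are correct and that a stale left link still points to a node that is in the list. It reports nodes in descending order rather than in visiting order.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def true_left(node):
    """Start from the possibly stale left link and follow right links
    until reaching the node whose right link points at `node`."""
    cand = node.left
    while cand.right is not node:   # a node was missed (e.g. p.r != q)
        cand = cand.right
    return cand

def walk_left(start):
    order, node = [start.key], true_left(start)
    while node is not start:
        order.append(node.key)
        node = true_left(node)
    return order

# Ring p -> u -> q -> p, where q's left link has not been updated to u yet.
p, u, q = Node("p"), Node("u"), Node("q")
p.right, u.right, q.right = u, q, p
p.left, u.left = q, p
q.left = p                          # incorrect left link, as in the figure
print(walk_left(q))
```

The correction relies only on right links, which is the property the slides guarantee at all times.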

Insertion Retry Optimization

- Insertion requires pointers to the immediate left and right nodes
- When an inserting node receives SetRNak, the node retries
- Optimization: SetRNak contains the pointer to the right node
  - Extra messages can be eliminated if p has not initiated deletion and u ∈ (p, p.r)

[Figure: message sequences among q, p, v and u, where v is inserted between p and q and u’s first SetR is rejected.
 Unoptimized: SetR, SetRNak, GetR, MyR(v), SetR(u, v), SetRAck, SetL.
 Optimized: SetR, SetRNak(v), SetR(u, v), SetRAck, SetL]
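The condition u ∈ (p, p.r) is an interval test on a circular key space. A minimal sketch of such a test is shown below, assuming comparable keys; the function name `in_open_interval` is hypothetical, and treating (a, a) as "every key except a" is an assumption made here so that a second node can join a single-node ring.

```python
def in_open_interval(x, a, b):
    """True if key x lies strictly between a and b when moving rightward on the ring."""
    if a < b:
        return a < x < b
    # The interval wraps past the largest key; a == b covers the whole ring except a.
    return x > a or x < b

print(in_open_interval(20, 10, 30))   # ordinary interval
print(in_open_interval(5, 70, 10))    # wraps around the ring
print(in_open_interval(40, 70, 10))   # outside the wrapped interval
```

With this test, an inserting node that receives SetRNak(v) can decide locally whether retrying with SetR(u, v) at the same left node is worthwhile.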

Handling failure

- So far, no failure is assumed
- DDLL algorithm considers:
  - Crash failure
  - Omission failure
  - Timing failure
- In an asynchronous network, it is impossible to distinguish slow nodes from failed nodes
  - Erroneously suspected nodes are temporarily removed but eventually recovered

[Slide annotation: a brace marks part of this slide as “Omitted in this presentation”]

Recovery | Basic

- Each node maintains a neighbor node set N
  - N contains a sufficient number of left-side nodes
- Each node u periodically finds the closest live left-side node v
  - u obtains v.r and v.rseq
  - If (v = u.l) ∧ (v.r = u) ∧ (v.rseq = u.lseq) then OK
  - Otherwise, start recovery

[Figure: recovery example with nodes A, B and C: after B fails, C sends SetR(C, B, ?) to A; the sequence number to use is discussed next]
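The periodic check on this slide is a three-part predicate. A minimal sketch, assuming node identities are plain strings and that the values v.r and v.rseq have already been fetched from v; `NodeState` and `needs_recovery` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    name: str
    l: str      # current left link
    lseq: int   # left sequence number

def needs_recovery(u, v_name, v_r, v_rseq):
    """Return True unless (v = u.l) and (v.r = u) and (v.rseq = u.lseq)."""
    ok = (v_name == u.l) and (v_r == u.name) and (v_rseq == u.lseq)
    return not ok

c = NodeState(name="C", l="B", lseq=5)

# B is alive and consistent with C: nothing to do.
healthy = needs_recovery(c, v_name="B", v_r="C", v_rseq=5)
# B has failed, so the closest live left-side node is A, whose right link is still B.
broken = needs_recovery(c, v_name="A", v_r="B", v_rseq=3)
print(healthy, broken)
```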

Recovery | Sequence Number (1)

Let’s consider the sequence number of the recovered link

- Assigning C.lseq + 1?

[Figure: A, B and C, where the link into C carries sequence number i. After B fails, C sends SetR(C, B, i+1) to A and the recovered link between A and C carries i+1]

Recovery | Sequence Number (2)

Subtle case:

1. X inserts between B and C
2. B fails while the SetL to C is still in transmission
3. C starts recovery without noticing X, using SetR(C, B, i+1)
4. Both A and X have the same right node (C) and the same rseq (i+1)

C’s left link may roll back!

[Figure: nodes A, B, X and C with sequence numbers i and i+1, showing the SetL messages in transmission toward C]


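- Solution:
  - Extend the sequence number to (recovery-number, seq)
  - The recovery number is increased only on recovery
  - Left links do not roll back!

[Figure: the same scenario with extended sequence numbers. Links start at (0, i), X’s insertion uses (0, i+1), and C’s recovery uses SetR(C, B, (1, 0)), so the recovered link between A and C carries (1, 0) while X’s link carries (0, i+1)]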
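One way to read the extended sequence number is as a pair compared lexicographically, so that any number issued after a recovery outranks every number issued before it. The slides do not spell out the comparison rule, so the lexicographic order is an assumption here; Python tuples happen to compare that way. `Node`, `on_set_l` and `final_left` are hypothetical names, and i = 7 is an arbitrary example value.

```python
class Node:
    def __init__(self, left, lseq):
        self.left, self.lseq = left, lseq

    def on_set_l(self, new_left, seq):
        if seq > self.lseq:          # tuples compare lexicographically
            self.left, self.lseq = new_left, seq

i = 7

def final_left(messages):
    """Deliver SetL messages to C, which starts with left = B and lseq = (0, i)."""
    c = Node(left="B", lseq=(0, i))
    for new_left, seq in messages:
        c.on_set_l(new_left, seq)
    return c.left

recovery = ("A", (1, 0))             # link repaired by C's recovery
delayed = ("X", (0, i + 1))          # SetL for X's insertion, still in transmission

print(final_left([recovery, delayed]), final_left([delayed, recovery]))
```

In both arrival orders C ends up pointing at A, so the repaired left link is never overwritten by the older message. X is left out for the moment, which matches the spare slide stating that X is excluded but eventually returns.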


Evaluation

- Comparison
  - DDLL (without optimization)
  - DDLL (with optimization)
  - Atomic Ring Maintenance (distributed locking)
    - A. Ghodsi, “Distributed k-ary System: Algorithms for distributed hash tables,” PhD Dissertation, KTH Royal Institute of Technology, 2006.
  - Li’s algorithm (distributed locking, no finger table)
    - X. Li, et al., “Concurrent maintenance of rings,” Distributed Comp., vol. 19, no. 2, pp. 126–148, 2006.
  - Chord (eventual consistency, no finger table)
    - I. Stoica, et al., “Chord: A scalable peer-to-peer lookup protocol for internet applications,” IEEE/ACM Trans. on Net., vol. 11, no. 1, pp. 17–32, 2003.

Eval | Insertion Sequence

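[Figure: message sequences for inserting u between p and q in three algorithms.
 Li’s: messages Join(u), Ack(p, q), Grant(u) and Done, with lock 🔒 and unlock 🔓 events.
 Atomic Ring Maintenance: messages JoinReq, JoinPoint, NewSucc, NewSuccAck and JoinDone, with lock 🔒 and unlock 🔓 events.
 DDLL: messages SetR, SetRAck and SetL]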

Eval | Time for Concurrent Insertion

- Simulated on a discrete event simulator
- Insert an initial node
- Insert n nodes in parallel (n = 1 to 100)
- Measured the time required for all links to converge
  - Time includes lookup messages for searching the node insertion position
- DDLL(Opt) converges quickly

[Figure: time to converge. x-axis: # of simultaneously inserting nodes (0 to 100); y-axis: time (0 to 120), where one time unit = one-way message transmission time. Series: DDLL(Opt), DDLL(NoOpt), Atomic, Li's, Chord]

Eval | # of Msgs for Concurrent Insertion

- Measured the # of messages required for all links to converge
- DDLL(Opt) uses fewer messages

[Figure: # of messages to converge. x-axis: # of simultaneously inserting nodes (0 to 100); y-axis: # of messages (x1000) (0 to 5). Series: DDLL(Opt), DDLL(NoOpt), Atomic, Li's, Chord]


Conclusion

DDLL algorithm for constructing distributed doubly linked lists

- No distributed locking
- Right links are always correct, left links converge quickly
- Maintains lookup consistency (in the absence of failure)
- More efficient than conventional algorithms
- Recovery procedure is provided
- No FIFO channel is required
- Correctness proofs for insertion and deletion procedures

DDLL is suitable for ring-based structured P2P networks

- Real example: DDLL is used as a foundation of Skip Graph and Chord# implementations in the PIAX P2P platform

Spare Slides

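X is excluded from the linked list but eventually returns

[Figure: nodes A, B, X and C after C’s recovery. X’s link toward C carries (0, i+1) while the recovered link between A and C carries (1, 0). X then sends SetR(X, C, (0, 0)) and receives SetRAck((1, 1)), after which the link between A and X carries (0, 0) and the link between X and C carries (1, 1)]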

DDLL pseudo code

process u
var s: {out, ins, in, del}
    l, r: {pointer to a node or nil}
    lseq, rseq: {integer or nil}
init s = out; l = r = nil; lseq = 0; rseq = nil
begin
  {Create a linked list}
  (A1) receive Create() from app →
         l, r, s, lseq, rseq := u, u, in, 0, 0
  {Insert between p and q}
  [] (A2) receive Insert(p, q) from app →
         if (s ≠ out ∨ u ∉ (p, q)) then error; fi
         l, r, s := p, q, ins
         send SetR(u, r, lseq) to l
  {Delete}
  [] (A3) receive Delete() from app →
         if (s ≠ in) then error
         else if (u = r) then {in case of the last node}
           s := out
         else s := del; send SetR(r, u, rseq + 1) to l; fi
  [] (A4) receive SetR(rnew, rcur, rnewseq) from v →
         if (s = in ∧ r = rcur) then
           if (rnew = v) then {insertion case}
             send SetL(rnew, rseq + 1) to r
           else {deletion case}
             send SetL(u, rnewseq) to rnew; fi
           send SetRAck(rseq + 1) to v
           r, rseq := rnew, rnewseq
         else send SetRNak() to v; fi
  [] (A5) receive SetRAck(rnewseq) from v →
         if (s = ins) then
           s, rseq := in, rnewseq
         else if (s = del) then
           s := out; fi
  [] (A6) receive SetRNak() from v →
         if (s = ins) then
           s := out; error {app retries insertion later}
         else if (s = del) then
           s := in; error; fi {app retries deletion later}
  [] (A7) receive SetL(lnew, seq) from v →
         if (lseq < seq) then l, lseq := lnew, seq; fi
end

Fig. 1: DDLL algorithm (without optimization)
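For experimentation, the actions (A1) to (A7) of Fig. 1 can be transcribed into a small single-process simulation. This is a sketch under stated assumptions rather than the authors' implementation: the `Node` class, the `run` loop and the `keys` helper are hypothetical, messages travel through one in-memory FIFO queue (the algorithm itself does not need FIFO delivery), failures are not modeled, and the interval check u ∉ (p, q) of (A2) is omitted.

```python
from collections import deque

class Node:
    def __init__(self, key, net):
        self.key, self.net = key, net
        self.s, self.l, self.r = "out", None, None
        self.lseq, self.rseq = 0, None

    def send(self, dst, msg, *args):
        self.net.append((self, dst, msg, args))

    def create(self):                                  # (A1)
        self.l, self.r, self.s, self.lseq, self.rseq = self, self, "in", 0, 0

    def insert(self, p, q):                            # (A2), interval check omitted
        if self.s != "out":
            raise RuntimeError("already inserted or inserting")
        self.l, self.r, self.s = p, q, "ins"
        self.send(p, "SetR", self, q, self.lseq)

    def delete(self):                                  # (A3)
        if self.s != "in":
            raise RuntimeError("not inserted")
        if self.r is self:                             # last node in the list
            self.s = "out"
        else:
            self.s = "del"
            self.send(self.l, "SetR", self.r, self, self.rseq + 1)

    def on_SetR(self, v, rnew, rcur, rnewseq):         # (A4)
        if self.s == "in" and self.r is rcur:
            if rnew is v:                              # insertion case
                self.send(self.r, "SetL", rnew, self.rseq + 1)
            else:                                      # deletion case
                self.send(rnew, "SetL", self, rnewseq)
            self.send(v, "SetRAck", self.rseq + 1)
            self.r, self.rseq = rnew, rnewseq
        else:
            self.send(v, "SetRNak")

    def on_SetRAck(self, v, rnewseq):                  # (A5)
        if self.s == "ins":
            self.s, self.rseq = "in", rnewseq
        elif self.s == "del":
            self.s = "out"

    def on_SetRNak(self, v):                           # (A6), the caller retries later
        if self.s == "ins":
            self.s = "out"
        elif self.s == "del":
            self.s = "in"

    def on_SetL(self, v, lnew, seq):                   # (A7)
        if self.lseq < seq:
            self.l, self.lseq = lnew, seq

def run(net):
    """Deliver queued messages until the network is quiet."""
    while net:
        src, dst, msg, args = net.popleft()
        getattr(dst, "on_" + msg)(src, *args)

def keys(start, link):
    """Collect keys by following the 'r' or 'l' links once around the ring."""
    out, node = [], start
    while True:
        out.append(node.key)
        node = getattr(node, link)
        if node is start:
            return out

net = deque()
a, b, c, d = (Node(k, net) for k in (10, 30, 20, 25))
a.create()
b.insert(a, a); run(net)
c.insert(a, b); d.insert(a, b); run(net)   # concurrent requests: d receives SetRNak
d_state_after_conflict = d.s
d.insert(c, b); run(net)                   # d retries with its current neighbors
c.delete(); run(net)
print(keys(a, "r"), keys(a, "l"))
```

In this run the two concurrent insertions at a are serialized by (A4) without any lock, d's retry succeeds, and after c is deleted the right and left walks agree on the ring 10, 25, 30.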

are executed.

(A2) u sets u’s left link and right link to p and q, respectively. u also sets u.s to ins to indicate that u is inserting. u sends a SetR message to p, which contains u (as the new right node), q (as the expected current right node, or rcur), and zero (as the new right sequence number, or rnewseq).

(A4) On receiving the SetR message, p checks whether its status is in and rcur equals p.r. If the former is false, either p has not received a SetRAck message after its insertion (as we describe next, a SetRAck message informs a node that its insertion or deletion has succeeded), or p has started its deletion. If the latter is false, it indicates either that another node has been inserted at the right side of p, or that q has been deleted. In either case, p rejects the request and sends a SetRNak message to u to notify that the insertion failed. Otherwise, p sends a SetL message to p’s right node (q in this case) to update its left link to u. The SetL message contains u (as the new left node) and p.rseq+1 (= i+1) (as the sequence number of the SetL message). Next, p sends a SetRAck message to u to notify that the insertion was successful. Because left(q) is changed from p to u, the incremented right sequence number for q should be transferred from p to u. For this purpose, the SetRAck message contains p.rseq+1 (= i+1). Finally, p changes p.r to u and p.rseq to 0 (rnewseq). Because u’s right link has already been set to q, the rightward linked list is never interrupted, even for a moment. Note that at this moment, p.rseq = u.lseq holds.

(A5) On receiving the SetRAck message, u confirms that u is successfully inserted. Node u updates u.s to in to indicate that u is inserted, and sets u.rseq to i+1.

(A7) On receiving the SetL message, q compares the sequence number of the SetL message with q.lseq. If the former is larger (we assume this case), q updates q.l to u and q.lseq to i+1. Otherwise, q ignores the message.

In the scenario above, it is assumed that a SetRAck message is sent to u in A4. If a SetRNak message is sent (i.e., in the case of insertion failure), then (A6) u.s is reverted to out and u retries the insertion procedure from locating its insertion position.

Note that a node u might receive a SetL message before receiving a SetRAck message. This happens, for example, when another node is inserted between p and u while the SetRAck message from p to u is still in transmission. This is normal and the algorithm can handle this situation. In fact, we consider that a node u becomes inserted at the moment when a SetRAck message is sent to u (see Section V).

Figure 3 depicts the situation where two nodes send a SetL message to the same node. There are 4 nodes A, B, C and D (A < B < C < D) and nodes A and D are initially inserted. A.rseq and D.lseq are i. Nodes B and C are then inserted in this order. When D receives the SetL message from C, its left link is updated to C and its left sequence number is updated to i+2. When D later receives the SetL message from B, D ignores it because its sequence number (i+1) is smaller than D’s left sequence number (i+2). Thus, the receiving order of the SetL messages does not affect the final results.

E. Deletion

Let us assume that node u, which is inserted between p and q, is going to be deleted. We also assume that both p.rseq and u.lseq are i1 and that both u.rseq and q.lseq are i2 (Fig. 4). To delete node u, the application sends a Delete() message to u. Then, the following actions are executed.

(A3) If u.s is not in, deletion is rejected because it is uncertain whether u is inserted. If u is the last node (i.e.,
