
Lecture 4 – Locking and Message Representation

Chubby and Protocol Buffers

922EU3870 – Cloud Computing and Mobile Platforms,

Autumn 2009 (2009/10/5)

Ping Yeh (葉平 ), Google, Inc.


Outline

• Overview

• Distributed consensus problem

• The Paxos Algorithm

• Chubby – the distributed lock service in Google

• Protocol Buffers – cross platform language-neutral data representation


References

• Paxos

– Lamport, L., “The part-time parliament”. ACM Trans. on Computer Systems 16, 2 (1998), 133–169; “Paxos made simple”. ACM SIGACT News 32, 4 (Dec. 2001), 18–25.

– David Mazieres, “Paxos Made Practical”, http://www.cs.cornell.edu/courses/cs6464/2009sp/papers/paxos_practical.pdf

– Chandra, T.; Griesemer, R.; Redstone, J., “Paxos Made Live - An Engineering Perspective”. Proc. 26th ACM Symp. on Principles of Distributed Computing (2007), pp. 398-407

• Chubby:

– Burrows, M., “The Chubby lock service for loosely-coupled distributed systems”. Proc. 7th USENIX Symp. on Operating Systems Design and Implementation (2006), pp. 335-350

• Consensus Algorithm

– Fischer, M. J., Lynch N. A., Paterson M. S., “Impossibility of Distributed Consensus with One Faulty Process”, JACM 32 (2) (1985), 374–382.


Overview


Distributed Lock Service

• A lock service is useful in a distributed system

– Locking of distributed resources

– Master election: processes compete for a lock, whoever gets it is the new master

• The lock service must be...

– Available: the show must go on even if one lock server is down

– Consistent: multiple lock servers must have the same data

– Fault-tolerant: survives disk crashes, message loss/delay, and servers going down and coming back up

• Chubby is Google's distributed lock service using the Paxos consensus algorithm


Consensus Problem

• Consensus problem: reaching agreement among a collection of N processes that can propose values

– Each process can have a value V[i], i = 1 … M

– At the end of the consensus algorithm, every process has chosen the same value V[k] for some k in 1 ... M

– Can't choose more than one value, or a value that was never proposed

– Can't learn that a value is chosen until it is actually chosen

• A hard problem in the presence of failures

– Crash failures: processes may crash and recover, disks may get corrupted

– Omission failures: messages may be lost

– Byzantine failures: processes may lie

• There is an extensive literature on the problem


State Machine Replication

• Any server is essentially a state machine

– Disk, RAM, CPU, registers form the state

– Instructions transition among states

– User requests cause instructions to be executed, so cause transitions among states

• Replicate state machine on multiple hosts

– Every replica must see same operations in same order

– If deterministic, replicas end in same state

• Use a consensus algorithm to build a fault-tolerant log on all replicas, even when there are failures.

– Paxos is the most widely used consensus algorithm for fault-tolerant agreement in state machine replication


The Paxos Algorithm


Terminology

• Value: any computer datum (e.g., a string, int, float, …)

• Process: a program that participates in the algorithm

• Consensus: a value agreed by a collection of processes

• Quorum: absolute majority of processes agreeing on a value

• Consensus Algorithm: an algorithm that produces a consensus from a collection of processes

• Safety: the algorithm never goes into an illegal state (e.g., data consistency is maintained across processes)

• Liveness: algorithm makes progress


Consensus

[Figure: six replicas, Replica A through Replica F; an X appears next to Replica F.]


Approach

• Build a fault-tolerant consensus algorithm

– A set of replicas agrees on a given value

– The “value” may be an operation on a database

• Apply consensus algorithm repeatedly to build a fault-tolerant log

– The log may contain a sequence of operations

• Starting from an identical database state, applying an identical sequence of operations results in the same database state


[Figure: Replica A and Replica B each keep a log (Log A, Log B) and a database (DB A, DB B). Both logs hold the same entries in the same order: 1. insert lion, 2. insert wombat, 3. delete lion, 4. insert bumblebee bat. After all four entries are applied, both databases contain wombat and bumblebee bat.]
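The replicated-log idea can be sketched in a few lines of Python. This is only an illustration: the set-based database and the `apply_log` helper are assumptions made for the example, and the log entries are the ones shown in the figure.

```python
def apply_log(log):
    """Apply a sequence of (operation, key) entries to an empty database."""
    db = set()
    for op, key in log:
        if op == "insert":
            db.add(key)
        elif op == "delete":
            db.discard(key)
        else:
            raise ValueError(f"unknown operation: {op}")
    return db


# The agreed-upon log, in the order fixed by the consensus algorithm.
LOG = [
    ("insert", "lion"),
    ("insert", "wombat"),
    ("delete", "lion"),
    ("insert", "bumblebee bat"),
]

db_a = apply_log(LOG)  # replica A replays the log
db_b = apply_log(LOG)  # replica B replays the same log
print(db_a == db_b, sorted(db_a))
```

Replaying the same entries in a different order (for example, applying "delete lion" before "insert lion") leaves lion in the database, which is why the replicas must agree on the order and not only on the set of operations.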


Replica Log Framework

• Clients Submit values

• Consensus algorithm issues client Callbacks on each replica for agreed upon values

[Figure: an application submits a value at Replica A; once the value is agreed upon, Replicas A, B, and C each issue an application callback.]


FLP Impossibility

• Famous paper by Fischer, Lynch and Paterson, "Impossibility of Distributed Consensus with One Faulty Process," Journal of the ACM, April 1985, 32(2):374-382.

– Winner of Dijkstra Prize 2001.

• “It is impossible for a set of processors in an asynchronous distributed system to agree on a binary value, even if only a single processor is subject to an unannounced crash.”

– Intuition


The Paxos Parliament

• Early in this millennium, the Aegean island of Paxos was a thriving mercantile center.

• Wealth led to political sophistication

– the Paxons replaced their ancient theocracy with a parliamentary form of government.

• But trade came before civic duty, and no one in Paxos was willing to devote his life to Parliament.

• The Paxon Parliament had to function even though legislators continually wandered in and out of the parliamentary Chamber

• How did the Parliament decide on anything?


Features of the Paxos Algorithm

• Invented by Leslie Lamport; published in 1998 as “The part-time parliament”

• Solves the distributed consensus problem with the following features

– 3 roles in the algorithm: proposers, acceptors, learners

– Asynchronous: agents operate at arbitrary speed, messages may take arbitrarily long to deliver

– Fault-tolerant

• Agents may stop and may restart

• Messages can be duplicated or lost, but never corrupted

• Doesn't handle Byzantine failures: agents must recover their state after a restart (a big requirement!)

– Always safe

– Live if a quorum of processes are OK (up and connected with each other)


Paxos: the Simplest Scenario

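[Sequence diagram: one proposer and two acceptors]

• Phase 1: lobby phase (prepare phase)

– proposer sends propose to each acceptor

– each acceptor replies promise

• Phase 2: vote phase

– if promised by a quorum, the proposer sends accept, v to each acceptor

– each acceptor replies ack

• Phase 3: commit phase

– proposer sends commit to each acceptor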


Multiple Proposers

• There may be more than one proposer in the system, each proposing a value

– Single proposer can fail

– Every node must be willing to become a proposer

• All nodes wait a maximum period (timeout) for messages they expect

– Upon timeout, a node declares itself a proposer and initiates a new Phase 1 of algorithm


Ballot Numbers

• To which proposer should an acceptor promise?

• Solution: use ballot numbers to identify proposals

– Each proposer has its own ballot numbers

– No 2 proposers can use the same ballot number

– Usually: proposer #k uses ballot numbers B such that B % N = k

• How to reach a consensus?

– If a proposal with value V is chosen, all higher numbered ballots must propose the same value V
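A minimal sketch of the `B % N = k` numbering scheme, assuming proposers are numbered 0 to N-1. The function name `next_ballot` and its parameters are hypothetical, introduced only for this example.

```python
def next_ballot(highest_seen: int, k: int, n: int) -> int:
    """Return the smallest ballot number B > highest_seen with B % n == k.

    k is this proposer's id (0 <= k < n) and n is the number of proposers,
    so two different proposers can never produce the same ballot number.
    """
    if not 0 <= k < n:
        raise ValueError("proposer id k must satisfy 0 <= k < n")
    b = highest_seen + 1
    return b + (k - b) % n


# Proposer 2 of 5 has seen ballots up to 6; its next ballot is 7.
print(next_ballot(6, k=2, n=5))
```

Because each proposer draws from its own residue class, uniqueness needs no coordination between proposers; a proposer that times out simply calls the function again with the highest ballot number it has seen.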


Acceptors in the Lobby Phase

• An acceptor doesn't have to respond to any message

– No response has the same effect as lost messages

– But it should try to respond in order to make progress

• Responding with a “promise” to a lobby message means the acceptor won't help proposals with smaller ballot numbers.

• When receiving a proposal with ballot number B:

– If the acceptor has promised to a higher numbered ballot, ignore ballot B

– Otherwise it promises to help ballot B, with optional data

• If the acceptor has voted for value V from ballot X, optional data = (V, X)

• Otherwise the optional data is empty
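The acceptor's lobby-phase rule can be sketched as follows. The `Acceptor` class, its field names, and the message tuple layout are assumptions made for illustration; a real acceptor would also write its state to disk before replying.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                       # highest ballot promised so far
    voted: Optional[Tuple[Any, int]] = None  # (V, X): value voted for and its ballot

    def on_propose(self, ballot: int):
        """Lobby-phase handler: return a promise, or None to ignore the ballot."""
        if ballot < self.promised:
            return None  # already promised to a higher-numbered ballot
        self.promised = ballot
        # The optional data is (V, X) if the acceptor has voted, else None.
        return ("promise", ballot, self.voted)


a = Acceptor()
print(a.on_propose(5))  # promise with empty optional data
print(a.on_propose(3))  # ignored: ballot 3 is lower than the promised ballot 5
```

Returning `None` models "no response", which the slide notes has the same effect as a lost message.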


The Vote Phase

• If the proposer gets promises from a quorum, it starts the vote phase

– If no promise contains accepted values, send its own value

– If some promises contain accepted values, send the value that was accepted in the largest ballot number

• If it doesn't get enough promises

– Propose a higher numbered ballot, possibly after a back-off period

– When should the proposer stop?

• An acceptor accepts value from ballot B that it promised to help, unless it has promised to help a higher numbered ballot.
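The proposer's choice of value at the start of the vote phase might look like the sketch below. The helper names `has_quorum` and `choose_value`, and the representation of each promise's optional data as `None` or a `(value, ballot)` pair, are assumptions for this example.

```python
def has_quorum(n_promises: int, n_acceptors: int) -> bool:
    """True if the promises come from an absolute majority of acceptors."""
    return n_promises > n_acceptors // 2


def choose_value(promises, own_value):
    """Pick the value to send in the vote phase.

    promises holds the optional data of each promise received:
    None if that acceptor has not voted, else (value, ballot).
    """
    voted = [p for p in promises if p is not None]
    if not voted:
        return own_value  # nobody has voted yet: free to send our own value
    value, _ballot = max(voted, key=lambda p: p[1])
    return value          # otherwise adopt the value from the largest ballot


promises = [None, ("lion", 3), ("wombat", 7)]
if has_quorum(len(promises), n_acceptors=5):
    print(choose_value(promises, own_value="bat"))
```

Adopting the value from the largest ballot is what lets a later proposer preserve a value that may already have been chosen.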


Commit Phase

• If the proposer receives ack messages from a majority of acceptors

– send commit message to all participants

• If acceptor receives commit

– done = true

– agreement reached; agreed-on value is V


Definition of chosen

• A value is chosen at ballot number B iff a majority of acceptors accept that value in the vote phase of that ballot.

• How about other acceptors?
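The definition of "chosen" can be written as a small predicate over the votes cast. The record layout `(acceptor_id, ballot, value)` and the function name `chosen_at` are assumptions for this sketch.

```python
def chosen_at(votes, n_acceptors, ballot):
    """Return the value chosen at `ballot`, or None if no value is chosen there.

    votes is an iterable of (acceptor_id, ballot, value) records from the
    vote phase; duplicate records from the same acceptor are counted once.
    """
    by_value = {}
    for acceptor_id, b, value in votes:
        if b == ballot:
            by_value.setdefault(value, set()).add(acceptor_id)
    for value, acceptors in by_value.items():
        if len(acceptors) > n_acceptors // 2:
            return value
    return None


votes = [(1, 4, "v"), (2, 4, "v"), (3, 4, "v"), (3, 4, "v")]
print(chosen_at(votes, n_acceptors=5, ballot=4))
```

Acceptors outside the majority need not have accepted anything for the value to count as chosen; they learn it later.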


Paxos’s properties

• P1: Any ballot number is unique.

• P2: Any two quorums (majority sets) of acceptors have at least one acceptor in common.

• P3: the value sent out in the vote phase is the value of the highest-numbered ballot among all the responses in the lobby phase.
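Property P2 holds for absolute majorities and can be checked by brute force for a small cell. The helper names below are hypothetical; the check simply enumerates every majority subset and confirms that each pair overlaps.

```python
from itertools import combinations


def majorities(acceptors):
    """Yield every subset of acceptors that forms an absolute majority."""
    n = len(acceptors)
    for size in range(n // 2 + 1, n + 1):
        for subset in combinations(acceptors, size):
            yield frozenset(subset)


def all_quorums_intersect(acceptors):
    """True if every pair of majority subsets shares at least one acceptor."""
    quorums = list(majorities(acceptors))
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))


print(all_quorums_intersect("ABCDE"))
```

The reason is counting: two sets that each hold more than half of N acceptors together hold more than N members, so at least one acceptor is in both. Sets smaller than a majority give no such guarantee; {A, B} and {C, D} are disjoint.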


Proof of safety

• Claim: if a value V is chosen at ballot number N, the value sent out in the vote phase of any later ballot must also be V.

• Proof (by contradiction):

– Let m be the first ballot number later than N for which the value sent out in the vote phase is not V


Proof

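[Figure: table with columns “#”, “value”, and “pool of acceptors”. Rows in increasing ballot number: n with value v, n+1 with value v, …, m-1 with value v, m with value v’. An acceptor labeled a is marked twice in the pool-of-acceptors column. Annotations: “the highest # that a accept ≥ n”; “the highest # chosen in phase 2 ≥” (cut off in the source).]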


Paxos: more complete picture

[Sequence diagram: one proposer and two acceptors]

• Phase 1: lobby phase (prepare phase)

– proposer sends propose(B) to each acceptor

– the acceptors reply promise [B*, V*] and promise [B', V']

• Phase 2: vote phase

– if promised by a quorum, the proposer chooses a value V to vote on and sends accept(B, V); each acceptor replies ack

– otherwise it picks a larger B and proposes again, maybe after a back-off time

• Phase 3: commit phase

– proposer sends commit to each acceptor
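The three phases can be put together in a small single-process simulation. This is a sketch under strong assumptions: calls are synchronous, nothing is lost or persisted, and the `Acceptor` class and `run_ballot` function are hypothetical names, not part of any library.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1  # highest ballot number promised so far
        self.voted = None   # (ballot, value) of the latest vote, if any

    def propose(self, b):
        """Phase 1: return ("promise", prior_vote), or None to ignore."""
        if b < self.promised:
            return None
        self.promised = b
        return ("promise", self.voted)

    def accept(self, b, v):
        """Phase 2: vote for (b, v) unless promised to a higher ballot."""
        if b < self.promised:
            return False
        self.promised = b
        self.voted = (b, v)
        return True  # ack


def run_ballot(acceptors, b, own_value):
    """Run one ballot; return the committed value, or None if it failed."""
    quorum = len(acceptors) // 2 + 1
    # Phase 1: lobby (prepare) phase
    promises = [p for p in (a.propose(b) for a in acceptors) if p is not None]
    if len(promises) < quorum:
        return None  # caller should pick a larger B, maybe after a back-off
    prior = [vote for _tag, vote in promises if vote is not None]
    value = max(prior, key=lambda t: t[0])[1] if prior else own_value
    # Phase 2: vote phase
    acks = sum(a.accept(b, value) for a in acceptors)
    if acks < quorum:
        return None
    # Phase 3: commit phase (modelled here by returning the value)
    return value


acceptors = [Acceptor() for _ in range(3)]
first = run_ballot(acceptors, b=1, own_value="lion")
second = run_ballot(acceptors, b=2, own_value="wombat")
stale = run_ballot(acceptors, b=1, own_value="bat")
print(first, second, stale)
```

The second ballot proposes "wombat" but commits "lion", because the promises it collects report the earlier vote. The stale ballot gets no promises at all, since every acceptor has already promised to ballot 2.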


How about 2 Proposers?

[Figure: five processes, proc 1 through proc 5; two of them are marked P (proposer).]


How about network partition?

[Figure: five processes, proc 1 through proc 5; one of them is marked P (proposer).]


How about late-comers?

[Figure: five processes, proc 1 through proc 5; four of them are marked V.]


Liveness

• Two proposers that each keep sending out a higher numbered ballot before the other has a chance to send accept race each other indefinitely, and no progress is made

• Solution: Choose a leader (lead proposer)

– The leader is the only one issuing proposals

• Question:

– Can all nodes have a consensus on who the leader is?

– Chicken and egg?

– FLP impossibility implies that randomness or real time must be used to elect a leader

– What if the leader crashes?


Learning a chosen value

• There are some options:

– Each acceptor, whenever it accepts a proposal, informs all the learners.

– Acceptors inform a distinguished learner (usually the proposer) and let the distinguished learner broadcast the result.


Persistency requirements

• Each agent must write to disk what it's about to send before sending a message, otherwise it may renege on its promises after a restart.

– Necessary to avoid Byzantine failures

– What if the disk fails?

– What if the data is corrupt?


Chubby

Burrows, M., “The Chubby lock service for loosely-coupled distributed systems”. Proc. 7th USENIX Symp. on Operating Systems Design and Implementation (2006), pp. 335-350


Chubby

• Central reliable, low-throughput service for distributed coordination

– store coordination information

– helps synchronization

– master election: many servers try to get the same lock, the one who gets it is the master

• Used by many Google technologies like GFS and BigTable


What is Chubby

• A distributed file service, providing locks and small files.

• It comprises:

– a client library

– partitioned, replicated servers

• A single partition:

[Figure: clients (three shown) talking to a single partition that consists of a master and replicas (two shown).]


Major Features

• Lock service (and not consensus library)

– Coarse-grained locks

• File system

– Serve small files

– Caching of files (consistent caching)

• Event notification mechanism

• Support large-scale concurrent file viewing

• Security (access control)


Chubby Cell

• A set of replicas (typically 5)

• Use Paxos to elect master (master = leader in Paxos)

– won't elect new master for some time (master lease)

– master renews the lease with replicas before it expires

• Maintain copies of simple database

• Writes satisfied by majority quorum

• Reads satisfied by master alone

– not feasible if there are 2 proposers (why?)

• Replacement system for failed replicas

– DNS change, data catch up

– can't vote until it has processed a write request


Chubby Clients

• Link against library

• Master location requests are sent to any replica

– a replica that is not the master replies with the master's location

• All other requests are sent directly to master

– until it ceases to respond or says it is no longer the master


Chubby file

• Chubby exports a file system with GFS-like names

– /ls/cellname/directory/directory/filename

• Files are consistently cached by clients

• Files and directories have associated locks

• Locks are advisory shared/exclusive locks

– Do not prevent other clients from accessing the file

– Do prevent other clients from getting the lock in a different mode

• Clients talk to servers periodically or lose their locks


Namespace

• /ls/cellname/pathname

– …

• /ls/global/pathname

– replicas span data centers, so a single data center outage doesn't affect availability

– useful for ACL

• /ls/local/pathname

– local cell in each data center, for most accesses


Partitioning

• Namespace is partitioned by directory name

– N partitions, each with master and replicas

– Everything in a directory is in the same partition

• node D/file stored on partition P = hash(D) mod N

• meta-data for D may be on different partition

• allows for simpler/faster directory reads and rename if wanted

• possible imbalance

• Little cross-partition communication is desirable; it is still needed for

– permission checks

– directory deletion

– caching helps mitigate this
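The rule P = hash(D) mod N can be sketched as below. The slides do not say which hash function is used, so SHA-256 here is purely an assumption chosen because it is stable across runs; `partition_of` is a hypothetical helper name.

```python
import hashlib
import posixpath


def partition_of(node_path: str, n_partitions: int) -> int:
    """Return P = hash(D) mod N, where D is the directory holding the node."""
    directory = posixpath.dirname(node_path)
    # Assumption: any stable hash works for the illustration; SHA-256 is used
    # here only because Python's built-in hash() varies between runs.
    digest = hashlib.sha256(directory.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions


a = partition_of("/ls/cell/dir/file_a", 8)
b = partition_of("/ls/cell/dir/file_b", 8)
print(a == b)  # files in the same directory share a partition
```

Under this rule the node for the directory `/ls/cell/dir` itself is placed by hashing its parent `/ls/cell`, which is one way to read the remark that the meta-data for D may live on a different partition.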


Consistent Caching

• Files are cached by the client library

• Writer: write-through cache

• Other clients:

– Clients are notified of updates

– An update proceeds only after clients acknowledge that they've invalidated their cache


Files and Directories

• File system interface

– specialized API, also via interface used by GFS

• Features derived from partitioning and caching

– don't support moving files between 2 directories

– don't maintain directory modified time

– file permission doesn't depend on parent directories

– no file last-access time for easier caching of metadata (why? hint: distributed)


Nodes

• A node = a file or a directory

• Ephemeral nodes

– such files are deleted when no client has them open

– such directories are deleted when empty

– useful for temp and heartbeat

• Metadata


Metadata of a Node

• ACL:

– 3 names of ACLs per node (Read, Write, Change ACL name)

– Write_ACL(File F) == foo and user bar is listed in ACL file foo ⇒ user bar can write file F (analog: group membership)

– authentication built into RPC

• 4 monotonically increasing 64-bit numbers for versioning

– instance, content, lock, ACL.

• 64-bit file-content checksum
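The ACL lookup described above amounts to one level of indirection: the node names an ACL file, and the ACL file lists users. A minimal sketch, with hypothetical dictionary layouts and a hypothetical `is_allowed` helper:

```python
def is_allowed(user: str, mode: str, node_meta: dict, acl_files: dict) -> bool:
    """Check whether `user` may perform `mode` ("read", "write", "change_acl").

    node_meta maps each mode to the NAME of an ACL file; acl_files maps an
    ACL file name to the set of users listed in it.
    """
    acl_name = node_meta[mode]
    return user in acl_files.get(acl_name, set())


# File F: its write ACL is the ACL file named "foo".
meta_f = {"read": "everyone", "write": "foo", "change_acl": "admins"}
acl_files = {"foo": {"bar"}, "everyone": {"bar", "baz"}, "admins": set()}

print(is_allowed("bar", "write", meta_f, acl_files))  # bar is listed in foo
print(is_allowed("baz", "write", meta_f, acl_files))  # baz is not
```

As with group membership, changing who may write many files only requires editing the one ACL file they all name.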


Handles

• Analogous to UNIX file descriptors

– including check digits to prevent client from forging handles

• Support for use across master changes

– sequence number per master

– mode information for recreating state


Granularity of Locks

• Coarse-grained locks

– should be used rarely (e.g., elect a new master)

– a lock lasts a long time (days/weeks)

– less load on lock server

– less delay when lock server fails

– should survive lock server failures

– fewer lock servers and less availability required

• Fine-grained locks

– usually used often (e.g., lock a region of a file)

– a lock lasts a short time (seconds/minutes)

– heavier lock server load, more client stalling on fail

– can be implemented on client side


Locks

• Any node can act as lock (shared or exclusive)

• Advisory (vs. mandatory)

– protected resources are of other services: would need complicated coupling to implement mandatory locks

– don't want to force apps to shut down for debugging / administration

– no value in extra guards by mandatory locks

• Write permission needed to acquire

– prevents unprivileged reader blocking progress


Sequencers

• Use sequence #’s in interactions using locks

• Sequencer

– opaque byte-string

– state of lock immediately after acquisition

– passed by client to servers, servers validate

• Alternative: lock-delay
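The sequencer idea can be sketched with an in-memory stand-in for the lock service. Everything here is hypothetical: the `LockService` class is not Chubby's API, and the choice of fields inside the byte-string (lock name, mode, generation number) is an assumption used to show why a stale holder is rejected.

```python
import json


class LockService:
    """Hypothetical in-memory stand-in for a lock service (not Chubby's API)."""

    def __init__(self):
        self._generation = {}  # lock name -> generation number
        self._holder = {}      # lock name -> current holder, or None

    def acquire(self, name, client):
        if self._holder.get(name) is not None:
            return False
        self._holder[name] = client
        self._generation[name] = self._generation.get(name, 0) + 1
        return True

    def release(self, name, client):
        if self._holder.get(name) == client:
            self._holder[name] = None

    def get_sequencer(self, name, client) -> bytes:
        """Describe the lock state right after acquisition as a byte-string."""
        if self._holder.get(name) != client:
            raise PermissionError("client does not hold the lock")
        state = {"name": name, "mode": "exclusive",
                 "generation": self._generation[name]}
        return json.dumps(state, sort_keys=True).encode()

    def check_sequencer(self, sequencer: bytes) -> bool:
        """True only if the sequencer matches the lock's current generation."""
        state = json.loads(sequencer)
        name = state["name"]
        return (self._holder.get(name) is not None
                and self._generation.get(name) == state["generation"])


svc = LockService()
svc.acquire("/ls/cell/lock", "old-primary")
old_seq = svc.get_sequencer("/ls/cell/lock", "old-primary")
svc.release("/ls/cell/lock", "old-primary")
svc.acquire("/ls/cell/lock", "new-primary")
print(svc.check_sequencer(old_seq))  # the old holder's sequencer is stale
```

A server that receives a request carrying `old_seq` can reject it, even if the old primary still believes it holds the lock.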


Events

• At open time, clients can request notification of:

– file contents or ACL modified

– child node added / removed / modified

– Chubby master failed over / session expiry

– handle / lock have become invalid

– lock acquired / conflicting lock request (rarely used)

• Delivered asynchronously to application via callback from Chubby client library

• A client can maintain an up-to-date view


API

• Open()

– the only call that takes a node name as an argument; all other calls operate on handles

– specify how handle will be used (access checks here)

– events to subscribe to

– lock-delay

– whether new file/directory should be created

• Close() vs. Poison()

• Other operations:

– GetContentsAndStat(), SetContents(), Delete(), Acquire(), TryAcquire(), Release(), GetSequencer(), SetSequencer(), CheckSequencer()


Primary election

• All candidates attempt to open lock file / get lock

– winner writes identity with SetContents()

– replicas find out with GetContentsAndStat(), possibly after file-modification event

• After a candidate becomes the primary:

– it obtains a sequencer (GetSequencer())

– it passes it to application server

– the app calls CheckSequencer() to make sure it is still valid
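The election steps can be walked through with a toy lock file. The `FakeLockFile` class and the `elect` function are hypothetical single-process stand-ins whose method names only mirror the Chubby calls listed on the slide; they do not reproduce Chubby's client library.

```python
class FakeLockFile:
    """Hypothetical in-memory stand-in for one lock file (single process only)."""

    def __init__(self):
        self.holder = None
        self.contents = b""

    def try_acquire(self, client) -> bool:
        """Mirror of TryAcquire(): succeed only if nobody holds the lock."""
        if self.holder is None:
            self.holder = client
            return True
        return False

    def set_contents(self, data: bytes):
        """Mirror of SetContents()."""
        self.contents = data

    def get_contents(self) -> bytes:
        """Mirror of the contents part of GetContentsAndStat()."""
        return self.contents


def elect(lock_file, candidates):
    """Every candidate tries for the lock; the winner writes its identity."""
    for candidate in candidates:
        if lock_file.try_acquire(candidate):
            lock_file.set_contents(candidate.encode())
    # The other candidates learn who won by reading the file.
    return lock_file.get_contents().decode()


lock_file = FakeLockFile()
print(elect(lock_file, ["server-1", "server-2", "server-3"]))
```

In this sequential toy the first candidate always wins; in a real cell the candidates race, and the losers would typically wait for a file-modification event before reading the winner's identity.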


Sessions and KeepAlives

• Session maintained through KeepAlive handshakes

• Handles, locks and cached data remain valid while the session is valid

– client must acknowledge cache invalidation messages (see below)

• Terminated explicitly, or after lease timeout

• Lease timeout advanced by master when

– session is created

– master responds to KeepAlive RPC

– master fail-over occurs


KeepAlives

• Master responds:

– close to lease timeout for sending new lease timeout

• longer timeouts when master is busy

– or earlier for sending events and cache invalidation signals

• Client sends another KeepAlive immediately

– likely blocked at the master

– client must acknowledge invalidate to maintain session

– RPCs flow from client to master

– allows operation through firewalls


Session Lifetime

• Client maintains local lease timeout

– conservative approximation

– must assume known restrictions on clock skew

• When local lease expires

– client empties and disables cache

– session is in jeopardy, client waits in grace period

– cache enabled on reconnect before grace period ends

– otherwise the session expires and the RPC call returns an error

• Application informed about jeopardy event, safe event, expired event

• All operations on handles fail when the session expires
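The client-side session lifetime reads naturally as a small state machine. The sketch below is an illustration only: the class and method names are hypothetical, time is passed in explicitly instead of read from a clock, and the lease and grace-period lengths in the example are arbitrary numbers.

```python
from enum import Enum


class SessionState(Enum):
    VALID = "valid"
    JEOPARDY = "jeopardy"
    EXPIRED = "expired"


class ClientSession:
    def __init__(self, lease_expiry: float, grace_period: float):
        self.lease_expiry = lease_expiry  # conservative local lease timeout
        self.grace_period = grace_period
        self.state = SessionState.VALID
        self.cache_enabled = True
        self.events = []                  # events delivered to the application

    def tick(self, now: float):
        """Advance the client's local view of the session to time `now`."""
        if self.state is SessionState.VALID and now >= self.lease_expiry:
            self.state = SessionState.JEOPARDY
            self.cache_enabled = False    # empty and disable the cache
            self.events.append("jeopardy")
        if (self.state is SessionState.JEOPARDY
                and now >= self.lease_expiry + self.grace_period):
            self.state = SessionState.EXPIRED
            self.events.append("expired")

    def on_keepalive_reply(self, now: float, new_lease_expiry: float):
        """A KeepAlive reply arrived carrying a new lease timeout."""
        self.tick(now)
        if self.state is SessionState.EXPIRED:
            return                        # too late: the session is gone
        if self.state is SessionState.JEOPARDY:
            self.cache_enabled = True     # reconnected within the grace period
            self.events.append("safe")
        self.state = SessionState.VALID
        self.lease_expiry = new_lease_expiry


session = ClientSession(lease_expiry=10, grace_period=45)
session.tick(12)                     # local lease ran out: jeopardy
session.on_keepalive_reply(20, 32)   # master answered in time: safe
session.tick(80)                     # no contact past lease + grace: expired
print(session.events)
```

Once the state is `EXPIRED`, a late KeepAlive reply cannot revive the session, matching the rule that all operations on handles fail after expiry.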


Caching

• Client caches file data, node meta-data, handles and locks

– a write-through cache held in memory

– use events to inform clients of conflicting lock requests

• Strict consistency

– weaker models are harder for programmers to learn and use

– do not want to alter diverse preexisting communication protocols: using sequence numbers in every message for synchronization is not feasible


Cache Invalidation

• Invalidation

– master keeps list of what clients may have cached

– write requests are blocked

• master sends invalidations

• clients flush changed data, acknowledge with KeepAlive

– writes proceed after all clients acknowledge or cache-expire

– data uncachable until invalidation acknowledged

– allows reads to happen without delay

• Invalidates data but does not update

– update may be arbitrarily inefficient
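The invalidation protocol can be modelled in one process as below. This is a sketch under simplifying assumptions: the `Master` and `Client` classes are hypothetical, acknowledgements are direct return values instead of being carried on KeepAlives, and the cache-expiry path is left out.

```python
class Client:
    def __init__(self, name, master):
        self.name, self.master, self.cache = name, master, {}

    def read(self, file):
        if file not in self.cache:
            self.cache[file] = self.master.read(self.name, file)
        return self.cache[file]

    def invalidate(self, file):
        self.cache.pop(file, None)  # drop the entry; do not fetch the new value
        return self.name            # acknowledgement


class Master:
    def __init__(self):
        self.data = {}      # file name -> contents
        self.cachers = {}   # file name -> clients that may have it cached

    def read(self, client_name, file):
        self.cachers.setdefault(file, set()).add(client_name)
        return self.data.get(file)

    def write(self, file, value, clients):
        """Apply the write only after every caching client has acknowledged."""
        waiting = set(self.cachers.get(file, set()))
        for client in clients:
            if client.name in waiting:
                waiting.discard(client.invalidate(file))
        if waiting:
            raise RuntimeError("write blocked: unacknowledged invalidations")
        self.cachers[file] = set()
        self.data[file] = value


master = Master()
master.data["config"] = "v1"
alice, bob = Client("alice", master), Client("bob", master)
alice.read("config"), bob.read("config")   # both now cache "v1"
master.write("config", "v2", [alice, bob])
print(alice.cache, alice.read("config"))
```

Clients are told only that their copy is stale; each one fetches the new contents on its next read, which is the "invalidates but does not update" choice from the slide.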


Fail-over

• When the master fails...

– In-memory state discarded (sessions, handles, locks, etc.)

– Lease timer “stops” (doesn't expire)

– Re-elect a master

• Quick re-election

– clients reconnect before leases expire

• Slow re-election

– clients disable cache, enter grace period

– allows sessions across fail-overs


Sequence of Events in a Fail-over


New Master's actions

Steps of newly-elected master:

1. Pick new epoch number

2. Respond only to master location requests

3. Build in-memory state for sessions / locks from DB

4. Respond to KeepAlives

5. Emit fail-over events to caches

6. Wait for acknowledgements / session expiry

7. Allow all operations to proceed

8. If a handle created before the fail-over is used:

– master recreates it in memory and honors the call

– if it is then closed, master records that in memory

9. Delete ephemeral files w/o open handles after an interval

Backup

• Every few hours

• Snapshot of database to GFS server

– in a different building, to tolerate building damage and avoid cyclic dependencies

• Disaster recovery

• Initialize new replica

– avoid load on in-service replicas

Mirroring

• Collection of files mirrored across cells

• Mostly for configuration files

– /ls/global/master mirrored to /ls/cell/slave

• global cell’s replicas spread around world

– Chubby’s own ACLs

– Files advertising presence / location

– pointers to Bigtable cells

– etc.

Proxies

• Proxies pass requests from clients to cell

• Can handle KeepAlives and reads

• Not writes, but they are << 1% of workload

• KeepAlive traffic by far most dominant

• Disadvantages:

– additional RPC for writes / first time reads

– increased unavailability probability

– fail-over strategy not ideal (will come back to this)

Mechanisms for Scaling

• Achieved scalability: observed 90,000 client processes for a single master

– Note that server machines are identical to client ones

• Mechanisms for scaling: reducing communication is the key

– Partitions

– Caching

– Proxies

– Dynamically increase lease time

– Protocol-conversion servers
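
Since KeepAlives dominate the traffic, a back-of-envelope calculation shows why longer leases and proxies help. This is a rough sketch that assumes each session sends about one KeepAlive per lease period; the lease lengths (12 s and 60 s) and the figure of 100 clients per proxy are illustrative assumptions, not values from the slides. Only the 90,000-client figure comes from the slide above.

```python
import math


def keepalives_per_second(num_sessions, lease_seconds):
    """Approximate KeepAlive rate at the master: one per session per lease period."""
    return num_sessions / lease_seconds


def sessions_seen_by_master(num_clients, clients_per_proxy):
    """With proxies, the master sees one session per proxy instead of one per client."""
    return math.ceil(num_clients / clients_per_proxy)


clients = 90_000
short_lease = keepalives_per_second(clients, 12)   # assumed short lease
long_lease = keepalives_per_second(clients, 60)    # assumed lengthened lease
via_proxies = keepalives_per_second(sessions_seen_by_master(clients, 100), 12)
```

Under these assumptions, lengthening the lease by a factor of five cuts the KeepAlive rate by the same factor, and putting N clients behind one proxy cuts it by roughly N.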

Use and Observations

• Many files for naming

• Config, ACL, meta-data common

• 10 clients use each cached file, on avg.

• Few locks held, no shared locks

• KeepAlives dominate RPC traffic

Outages

• Sample of cells

– 61 outages over a few weeks (700 cell-days)

– due to network congestion, maintenance, overload, errors in software, hardware, operators

• 52 outages under 30s

– applications not significantly affected

• A few dozen cell-years of operation

– data lost on 6 occasions (bugs & operator error)

Java Clients

• Most of Google infrastructure is in C++

• Growing # of Java applications

• Googlers dislike JNI

– would rather translate library to Java

– maintaining it would require great expense

• Java users run protocol-conversion server

– exports protocol similar to Chubby’s client API

Name Service

• DNS uses TTL values

– entries must be refreshed within that time

– huge (and variable) load on DNS server

• Chubby's most popular use is as a name service (a DNS replacement)

– provides name service for most Google systems

– invalidations, no polling

• client builds up needed entries in cache

• name entries further grouped in batches

– full consistency not needed

– reduce load with protocol-conversion server
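
The load difference between TTL-based refreshing and invalidation can be made concrete with a little arithmetic. This is a sketch under stated assumptions: the function names are made up for illustration, the numbers (3,000 processes that all talk to each other, a 60 s TTL, one name change per second) are illustrative, and KeepAlive traffic is ignored.

```python
def ttl_refreshes_per_second(num_clients, names_per_client, ttl_seconds):
    """Steady-state lookups when every cached entry is re-fetched once per TTL."""
    return num_clients * names_per_client / ttl_seconds


def invalidations_per_second(name_changes_per_second, cachers_per_name):
    """With invalidation, traffic is proportional to changes, not to cached entries."""
    return name_changes_per_second * cachers_per_name


# 3,000 processes, each caching the names of all 3,000, refreshed every 60 s.
polling_load = ttl_refreshes_per_second(3_000, 3_000, 60)

# The same job, if one name changes per second and all 3,000 processes cache it.
invalidation_load = invalidations_per_second(1, 3_000)
```

The polling load grows with the square of the job size even when nothing changes, whereas the invalidation load is zero while names are stable.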

Abusive Clients

• Company environment assumed

• Requests to use Chubby thoroughly reviewed

• Abuses:

– lack of aggressive caching

• e.g. not caching the absence of files, not reusing open file handles

– lack of quotas

• 256kB limit on file size introduced

• encouraged use of appropriate storage systems

– publish/subscribe

Lessons Learned

• Developers rarely consider availability

– should plan for short Chubby outages

– some applications crashed on receiving a fail-over event

• Fine-grained locking not essential

• Poor API choices

– handles that can acquire locks cannot be shared

• RPC use affects transport protocols

– forced to send KeepAlives over UDP for timeliness

Related Work

Chubby

• locks, storage system, session/lease in one service

• target audience – wide range

• higher-level interface

• lost lock expensive for clients

• could use locks and sequencers with other systems

Boxwood

• 3 separate services

– lock, Paxos, failure detection

– could be used independently

• fewer, more sophisticated developers

• different default parameters

• lacks grace period

• uses locks primarily within

Summary

• Distributed lock service

– coarse-grained synchronization for Google’s distributed systems

• Design based on well-known ideas

– distributed consensus, caching, notifications, file-system interface

• Primary internal name service

• Repository for files requiring high availability

Protocol Buffers

http://code.google.com/apis/protocolbuffers/

The following 5 slides are taken from Facebook's presentation of Thrift (http://wiki.apache.org/thrift/Slides?action=AttachFile&do=get&target=Thrift-FSOSS-Seneca-2007.ppt)

High-Level Goal: Enable transparent interaction between these (programming-language logos on the slide, e.g. php).

…and some others too.

Problems

• Type systems

• IO libraries

• Serialization

• Runtimes

• Performance

• PHP

No problem can stand the assault of sustained thinking.

Voltaire

Solution

Create an IDL. Statically generate code.

Facebook.thrift

Facebook.php Facebook.py Facebook.cpp Facebook.java

“Sounds like a lot of work, is that really necessary?”

Yes. It is.

Hasn’t this been done before? (yes.)

• SOAP: XML, XML, and more XML

• CORBA: Bloated? Remote bindings?

• COM: Face-Win32ClientSoftware.dll-Book

• Pillar: Slick! But no versioning/abstraction.

• Protocol Buffers: (formerly closed, now) open source Google deliciousness.

What is Protocol Buffers?

Protocol Buffers defines two things:

• A text-based description language (.proto)

• A compact binary serialization format (pb)

And provides two software parts:

• .proto parser / code generator

• Runtime serialization library (object to pb and back)

Open source (BSD-like license)

What is a .proto?

Yet another description language...

message SearchRecord {
  required float score = 1;
  required fixed64 docid = 2;
  optional WebResult details = 3;
  ...
};

message SearchResult {
  required int32 estimated_results = 1;
  optional string error_message = 2;
  repeated SearchRecord record = 3;
  ...
};

See http://code.google.com/apis/protocolbuffers/docs/proto.html for details.

Features

• Compact serialized form (Tag-Length-Value):

– Lots of performance work

• minimal framing overhead

• no string field names

• uses variable-length encoding where sensible

– Format used to store data persistently (not just for RPCs)

– Details: http://code.google.com/apis/protocolbuffers/docs/encoding.html.

• Portable:

– Platform-neutral

– Automatic generation of wrappers for many languages

• C++, Java, Python (by Google)

• Perl, Ruby, Javascript, C#, Erlang, Haskell, … (by community)
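
To make "no string field names" and "variable-length encoding" concrete, here is a minimal pure-Python sketch of the wire format for two simple field types. It is a hand-written illustration rather than the official runtime library, and the helper names are made up. It follows the publicly documented encoding: each field starts with a key, `(field_number << 3) | wire_type`, written as a base-128 varint.

```python
VARINT = 0            # wire type for int32, int64, bool, enum, ...
LENGTH_DELIMITED = 2  # wire type for string, bytes, embedded messages


def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """The tag: a field number and wire type, never the field's name."""
    return encode_varint((field_number << 3) | wire_type)


def encode_int_field(field_number: int, value: int) -> bytes:
    return encode_key(field_number, VARINT) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return encode_key(field_number, LENGTH_DELIMITED) + encode_varint(len(data)) + data


# A SearchResult with estimated_results = 150 and error_message = "testing".
wire = encode_int_field(1, 150) + encode_string_field(2, "testing")
```

The two fields take 3 and 9 bytes respectively: the names `estimated_results` and `error_message` never appear on the wire, only the tag numbers 1 and 2.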

Features (cont)

• Graceful client and server upgrades

– systems ignore tags they don't understand, but pass the information through (gradual upgrades)

• Fast: efficient to parse

• Also allow service specifications:

service Search {
  rpc DoSearch(SearchRequest) returns (SearchResponse);
  rpc DoSnippets(SnippetRequest) returns (SnippetResponse);
  rpc Ping(EmptyMessage) returns (EmptyMessage) { protocol=udp; }
};
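
The graceful-upgrade bullet above (ignore unknown tags, but pass them through) follows from the tag-based encoding: a reader can skip any field by its wire type without knowing what it means. The following is a minimal sketch with made-up helper names, not the real library; it keeps only the last value of a repeated field and handles only the four common wire types.

```python
def decode_varint(buf: bytes, pos: int):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def split_fields(buf: bytes, known_fields):
    """Return (known, unknown).

    known maps field number -> raw value for fields this reader understands.
    unknown holds the bytes of all other fields, preserved verbatim so they
    can be appended again when the message is re-serialized.
    """
    known, unknown, pos = {}, bytearray(), 0
    while pos < len(buf):
        start = pos
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:        # varint
            value, pos = decode_varint(buf, pos)
        elif wire_type == 1:      # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:      # length-delimited
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:      # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number in known_fields:
            known[field_number] = value
        else:
            unknown += buf[start:pos]
    return known, bytes(unknown)
```

An old server that knows only fields 1 and 2 can therefore read a message from a newer client that also sets field 4, and forward field 4 untouched.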

Other technologies

• Corba

• COM/DCOM

• WSDL

• XML

• Thrift