an introduction to consensus with raft diego ongaro john ousterhout stanford university
TRANSCRIPT
![Page 1: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/1.jpg)
An Introduction toConsensus with Raft
Diego Ongaro John Ousterhout
Stanford University
http://raftconsensus.github.io
![Page 2: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/2.jpg)
April 2014 Raft Consensus Algorithm Slide 2
![Page 3: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/3.jpg)
April 2014 Raft Consensus Algorithm Slide 3
Distributed Systems
availability or consistency
![Page 4: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/4.jpg)
● TODO: eliminate single point of failure
● An ad hoc algorithm “This case is rare and typically occurs as a result
of a network partition with replication lag.” Watch out for @aphyr
– OR –
● A consensus algorithm (built-in or library) Paxos, Raft, …
● A consensus service ZooKeeper, etcd, consul, …
April 2014 Raft Consensus Algorithm Slide 4
Inside a Consistent System
![Page 5: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/5.jpg)
● Agreement on shared state (single system image)
● Recovers from server failures autonomously Minority of servers fail: no problem Majority fail: lose availability, retain consistency
April 2014 Raft Consensus Algorithm Slide 5
What is Consensus?
Servers
![Page 6: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/6.jpg)
● Key to building consistent storage systems
● Top-level system configuration Which server is my SQL master? What shards exist in my storage system? Which servers store shard X?
● Sometimes used to replicate entire database state (e.g., Megastore, Spanner)
April 2014 Raft Consensus Algorithm Slide 6
Why Is Consensus Needed?
![Page 7: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/7.jpg)
● Replicated log replicated state machine All servers execute same commands in same order
● Consensus module ensures proper log replication
● System makes progress as long as any majority of servers are up
● Failure model: fail-stop (not Byzantine), delayed/lost messagesApril 2014 Raft Consensus Algorithm Slide 7
Goal: Replicated Log
x3 y2 x1 z6Log
ConsensusModule
StateMachine
Log
ConsensusModule
StateMachine
Log
ConsensusModule
StateMachine
Servers
Clients
x 1
y 2
z 6
x3 y2 x1 z6
x 1
y 2
z 6
x3 y2 x1 z6
x 1
y 2
z 6
z6
![Page 8: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/8.jpg)
● Leslie Lamport, 1989
● Nearly synonymous with consensus
● Hard to understand“The dirty little secret of the NSDI community is that at most five people really, truly understand every part of Paxos ;-).” – Anonymous NSDI reviewer
● Bad foundation for building systems“There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system…the final system will be based on an unproven protocol.” – Chubby authors
April 2014 Raft Consensus Algorithm Slide 8
Paxos Protocol
![Page 9: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/9.jpg)
● We wanted the best algorithm for building real systems Must be correct, complete, and perform well Must also be understandable
● “What would be easier to understand or explain?” Fundamentally different decomposition than Paxos Less complexity in state space Less mechanism
April 2014 Raft Consensus Algorithm Slide 9
Raft’s Design for Understandability
![Page 10: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/10.jpg)
User study
April 2014 Raft Consensus Algorithm Slide 10
Quiz Grades
Survey Results
![Page 11: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/11.jpg)
1. Leader election Select one of the servers to act as leader Detect crashes, choose new leader
2. Log replication (normal operation) Leader takes commands from clients, appends them to its log Leader replicates its log to other servers (overwriting
inconsistencies)
3. Safety Only elect leaders with all committed entries in their logs
April 2014 Raft Consensus Algorithm Slide 11
Raft Overview
![Page 12: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/12.jpg)
● At any given time, each server is either: Follower: completely passive replica (issues no RPCs, responds
to incoming RPCs) Candidate: used to elect a new leader Leader: handles all client interactions, log replication
● At most one viable leader at a time
April 2014 Raft Consensus Algorithm Slide 12
Server States
Follower Candidate Leader
start
time out,start election
receive votes frommajority of servers
![Page 13: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/13.jpg)
● Time divided into terms: Election Normal operation under a single leader
● At most one leader per term● Each server maintains current term value● Key role of terms: identify obsolete information
April 2014 Raft Consensus Algorithm Slide 13
Terms
Term 1 Term 2 Term 3 Term 4 Term 5
time
Elections Normal OperationSplit Vote
![Page 14: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/14.jpg)
Leaders send heartbeats to maintain authority.
Upon election timeout, start new election:● Increment current term● Change to Candidate state● Vote for self
● Send Request Vote RPCs to all other servers,wait until either:
1. Receive votes from majority of servers:● Become leader, send heartbeats to all other servers
2. Receive RPC from valid leader:● Return to follower state
3. No-one wins election (election timeout elapses):● Increment term, start new election
Slide 14
Leader Election
![Page 15: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/15.jpg)
● The Secret Lives of Data http://thesecretlivesofdata.com
● Visualizes distributed algorithms, starting with Raft
● Project by Ben Johnson (author of go-raft)
April 2014 Raft Consensus Algorithm Slide 15
Leader Election Visualization
![Page 16: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/16.jpg)
● If we choose election timeouts randomly,
● One server usually times out and wins election before others wake up
Slide 16
Randomized Timeouts
![Page 17: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/17.jpg)
● Log replication
● Client interaction
● Cluster membership changes
● Log compaction
● To appear: 2014 USENIX Annual Technical Conf. June 19-20 in Philadelphia Draft on Raft website
April 2014 Raft Consensus Algorithm Slide 17
Raft Paper
![Page 18: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/18.jpg)
kanaka/raft.js JS Joel Martin
go-raft Go Ben Johnson (Sky) and Xiang Li (CoreOS)
hashicorp/raft Go Armon Dadgar (HashiCorp)
LogCabin C++ Diego Ongaro (Stanford)
ckite Scala Pablo Medina
peterbourgon/raft Go Peter Bourgon
rafter Erlang Andrew Stone (Basho)
barge Java Dave Rusek
py-raft Python Toby Burress
ocaml-raft OCaml Heidi Howard (Cambridge)
…
April 2014 Raft Consensus Algorithm Slide 18
Raft Implementations
![Page 19: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/19.jpg)
April 2014 Raft Consensus Algorithm Slide 19
Best Logo: go-raft
by Brandon Philips (CoreOS)
![Page 20: An Introduction to Consensus with Raft Diego Ongaro John Ousterhout Stanford University](https://reader035.vdocuments.mx/reader035/viewer/2022062802/56649eef5503460f94bff835/html5/thumbnails/20.jpg)
● Consensus is key to building consistent systems
● Design for understandability
● Raft separates leader election from log replication Leader election uses voting and randomized timeouts
More at http://raftconsensus.github.io:
● Paper draft, other talks
● 10 to 50+ implementations
● raft-dev mailing list
Diego Ongaro @ongardie
Slide 20
Summary