distributed consensus a.k.a. "what do we eat for lunch?"
DESCRIPTION
Distributed Consensus is everywhere! Even if not obvious at first, most apps nowadays are distributed systems, and these sometimes have to "agree on a value". This is where consensus algorithms come in. In this session we'll look at the general problem and solve a few example cases using the Raft algorithm, implemented using Akka's Actor and Cluster modules.
TRANSCRIPT
Konrad 'ktoso' Malawski GeeCON 2014 @ Kraków, PL
Konrad `@ktosopl` Malawski
Distributed Consensus A.K.A.
“What do we eat for lunch?”
real world edition
hAkker @
typesafe.com geecon.org
Java.pl / KrakowScala.pl sckrk.com / meetup.com/Paper-Cup @ London
GDGKrakow.pl meetup.com/Lambda-Lounge-Krakow
You?
Distributed systems?
What is this talk about?
The network.
How to think about distributed systems.
Some healthy madness.
Code in slides covers only “simplest possible case”.
Ordering[T]
Slightly chronological.
By no means is it “worst to best”.
Consensus
Consensus - informal
“we all agree on something”
Consensus - formal
Termination: Every correct process decides some value.
Validity: If all correct processes propose the same value v, then all correct processes decide v.
Integrity: If a correct process decides v, then v must have been proposed by some correct process.
Agreement: Every correct process must agree on the same value.
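The four properties can be written down as checks over one finished run. This is only a sketch in plain Scala (no Akka): the `Run` type and the function names are made up for illustration, and it assumes every process listed is a correct one.

```scala
object ConsensusProperties {
  // One finished run: what each correct process proposed, and what it decided (if anything).
  final case class Run[V](proposals: Map[String, V], decisions: Map[String, V])

  // Termination: every correct process decides some value.
  def termination[V](run: Run[V]): Boolean =
    run.proposals.keySet.forall(run.decisions.contains)

  // Validity: if everyone proposed the same v, everyone who decided, decided v.
  def validity[V](run: Run[V]): Boolean = {
    val proposed = run.proposals.values.toSet
    proposed.size != 1 || run.decisions.values.forall(_ == proposed.head)
  }

  // Integrity: a decided value must have been proposed by someone.
  def integrity[V](run: Run[V]): Boolean = {
    val proposed = run.proposals.values.toSet
    run.decisions.values.forall(proposed.contains)
  }

  // Agreement: no two processes decide differently.
  def agreement[V](run: Run[V]): Boolean =
    run.decisions.values.toSet.size <= 1
}
```

A lunch where everyone ends up with pizza passes all four checks; a lunch where two people each stick to their own proposal still satisfies integrity but fails agreement.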
Distributed Consensus
What is a distributed system anyway?
Distributed system definition
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
— Leslie Lamport
http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt
Distributed system definition
A system in which participants communicate asynchronously using messages.
Distributed Systems - failure detection
Jim had quit CorpSoft a while ago, but no-one ever told Bob…
Failure detection:
• can only rely on external knowledge
• but what if there’s no-one to tell you?
• thus: must be in-some-way time based
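"Time based" can be as simple as remembering when each node was last heard from. The sketch below is a hypothetical, minimal detector in plain Scala; `Detector`, `heartbeat` and `isSuspected` are illustrative names, and timestamps are passed in explicitly so the behaviour stays deterministic.

```scala
object TimeoutFailureDetector {
  // Remembers when each node was last heard from (milliseconds on some local clock).
  final case class Detector(timeoutMillis: Long, lastSeen: Map[String, Long] = Map.empty) {

    // Record a sign of life from `node`.
    def heartbeat(node: String, nowMillis: Long): Detector =
      copy(lastSeen = lastSeen.updated(node, nowMillis))

    // A node is only *suspected*: it may be dead, or merely slow or partitioned away.
    def isSuspected(node: String, nowMillis: Long): Boolean =
      lastSeen.get(node) match {
        case Some(seenAt) => nowMillis - seenAt > timeoutMillis
        case None         => true // never heard from it at all
      }
  }
}
```

Note that a timeout can only ever produce a suspicion, never a certainty, which is why the choice of timeout is a trade-off between slow detection and false alarms.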
Two Generals Problem
Two Generals Problem
Yellow and Blue armies must attack Pink City.
They must attack together, otherwise they’ll die in vain. Now they must agree on the exact time of the attack.
They can only send messengers, which Pink may intercept and kill.
Two Generals Problem - happy case
I need to inform blue about my attack plan.
I don’t know when yellow will attack…
1) Initial message not lost
I don’t know if Blue will also attack at 13:37… I’ll wait until I hear back from him.
Why?
2) Message might not have reached Blue
Blue must confirm the reception of the command
1) Yellow is now sure, but Blue isn’t!
Why?
2) Blue’s confirmation might have been lost!
This exactly mirrors the initial situation!
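The "mirror" argument can be spelled out in a few lines: whoever sent the most recent messenger never knows whether he arrived, no matter how many confirmations came before. A tiny sketch in plain Scala follows; `senderOf` and `unsureAfter` are illustrative names, and it assumes messages strictly alternate, starting with Yellow.

```scala
object TwoGenerals {
  // Messages alternate: Yellow sends the plan (1), Blue confirms (2), Yellow confirms that (3), ...
  def senderOf(messageNr: Int): String =
    if (messageNr % 2 == 1) "yellow" else "blue"

  // Whoever sent the most recent message cannot know whether it was delivered.
  def unsureAfter(messagesSent: Int): String = {
    require(messagesSent >= 1, "at least the initial plan must be sent")
    senderOf(messagesSent)
  }
}
```

Adding one more confirmation only flips who is unsure; it never makes both generals sure at once.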
2 Generals Problem Translated to Akka
2 Generals translated to Akka:
Akka Actors implement the Actor Model:
Actors:
• communicate via messages
• create other actors
• change their behaviour on receiving a msg
Gains? Distribution / separation / modelling abstraction
2 Generals translated to Akka:
case class AttackAt(when: Date)
Presentation–sized–snippet = does not cover all cases
2 Generals translated to Akka:

import java.util.Date
import akka.actor.{Actor, ActorRef}

class General(general: Option[ActorRef]) extends Actor {

  val WhenIWantToAttack: Date = ???

  // If we know the other general, send him our attack plan right away
  general foreach { _ ! AttackAt(WhenIWantToAttack) }

  def receive = {
    case AttackAt(when) =>
      println(s"General ${otherGeneralName} attacks at $when")
      println("I must confirm this!")
      // The confirmation is itself an AttackAt, so the other side will confirm it again...
      sender() ! AttackAt(when)
  }

  def otherGeneralName =
    if (self.path.name == "blue") "yellow" else "blue"
}
2 Generals translated to Akka:
val system = ActorSystem("two-generals")

val blue =
  system.actorOf(Props(new General(general = None)), name = "blue")

val yellow =
  system.actorOf(Props(new General(Some(blue))), name = "yellow")

The blue general attacks at 13:37, I must confirm this!
The yellow general attacks at 13:37, I must confirm this!
The blue general attacks at 13:37, I must confirm this!
...
8 Fallacies of Distributed Computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn’t change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Peter Deutsch “The Eight Fallacies of Distributed Computing” https://blogs.oracle.com/jag/resource/Fallacies.html
Failure Models
Failure models:
Fail – Stop
Fail – Recover
Byzantine
2-phase commit
2PC - step 1: Propose value
2PC - step 1: Promise to agree on write
2PC - step 2: Commit the write
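The happy path above boils down to one decision rule on the coordinator side: commit only if every participant promised. Here is a sketch of just that rule in plain Scala (no Akka); `Vote`, `Decision` and `decide` are illustrative names, and a missing reply (timeout) is modelled as `None`.

```scala
object TwoPhaseCommit {
  sealed trait Vote
  case object Prepared extends Vote
  case object Conflict extends Vote

  sealed trait Decision
  case object Commit extends Decision
  case object Abort extends Decision

  // votes: one entry per participant; None means "no reply before the timeout".
  // Commit only if there is at least one participant and all of them answered Prepared.
  def decide(votes: Seq[Option[Vote]]): Decision =
    if (votes.nonEmpty && votes.forall(_.contains(Prepared))) Commit else Abort
}
```

This covers the decision only. What happens when the coordinator itself dies between the two steps is exactly the problem the following slides deal with.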
2PC - step 1: Propose value, and die
2PC - step 1: Propose value to 1 node, and die
2PC: Prepare needs timeouts
2PC: Timeouts + recovery committer
Still can’t tolerate if the “accepted value” Actor dies
2 Phase Commit translated to Akka
2PC translated to Akka
// transactionId added so the message matches its use in the Proposer
case class Prepare(transactionId: Int, value: Any)
case object Commit

sealed class AcceptorStatus
case object Prepared extends AcceptorStatus
case object Conflict extends AcceptorStatus
2PC translated to Akka

class Proposer(acceptors: List[ActorRef]) extends Actor {
  var transactionId = 0
  var preparedAcceptors = 0

  def receive = {
    case value: String =>
      transactionId += 1
      acceptors foreach { _ ! Prepare(transactionId, value) }

    case Prepared =>
      preparedAcceptors += 1

      if (preparedAcceptors == acceptors.size)
        acceptors foreach { _ ! Commit }

    case Conflict =>
      context stop self
  }
}
2PC with ResumeProposer in Akka
// transactionId added so the message matches its use in the Proposer
case class Prepare(transactionId: Int, value: Any)
case object Commit

sealed class AcceptorStatus
case object Prepared extends AcceptorStatus
case object Conflict extends AcceptorStatus
case class Committed(value: Any) extends AcceptorStatus
2PC with ResumeProposer in Akka

// Asks an acceptor for its current status (definition assumed; not shown on the slides)
case object StatusPlz

class ResumeProposer(
  proposer: ActorRef,
  acceptors: List[ActorRef]) extends Actor {

  // Get a Terminated message when the original proposer dies
  context watch proposer

  var anyAcceptorCommitted = false

  def receive = {
    case Terminated(`proposer`) =>
      println("Proposer died! Try to finish the transaction...")
      acceptors foreach { _ ! StatusPlz }

    case _: AcceptorStatus =>
      // impl of recovery here
  }
}
Quorum
Quorum voting
From the perspective of the Omnipotent Observer *
* does not exist in a running system
Quorum voting – split votes
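In numbers, a quorum is a strict majority, and a split vote is simply a round in which nobody reaches it. A small sketch in plain Scala; `majority` and `winner` are illustrative names, and it assumes each node casts at most one vote.

```scala
object Quorum {
  // Smallest number of votes that is more than half of the cluster.
  def majority(clusterSize: Int): Int = clusterSize / 2 + 1

  // Some(candidate) if one candidate reached a majority, None on a split vote.
  // With at most one vote per node, at most one candidate can reach a majority.
  def winner(clusterSize: Int, votes: Map[String, Int]): Option[String] =
    votes.collectFirst {
      case (candidate, count) if count >= majority(clusterSize) => candidate
    }
}
```

With 4 nodes a 2:2 result is a split vote (a majority needs 3), which is one reason odd cluster sizes are popular.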
James Mickens “The Saddest Moment” http://research.microsoft.com/en-us/people/mickens/thesaddestmoment.pdf
Paxos
Basic Paxos =
“choose exactly one value”
Paxos: a high-level overview
It’s the distributed systems algorithm
Paxos: a high-level overview
JavaZone had a full session on Paxos already today…
A few Paxos whitepapers
“Reaching Agreement in the Presence of Faults” – Lamport, 1980
…
“FLP Impossibility Result” – Fischer et al, 1985
“The Part Time Parliament” – Lamport, 1998
…
“Paxos made Simple” – Lamport, 2001
“Fast Paxos” – Lamport, 2005
…
“Paxos made Live” – Chandra et al, 2007
…
“Paxos made Moderately Complex” – van Renesse, 2011 ;-)
Lamport’s “Replicated State Machine”
Paxos: The cast
Consensus time! Choose a value (raise your hand):
v1 = Basic Paxos + Raft
v2 = Just Raft (if enough time, Paxos)
Basic Paxos simple example
Paxos: Proposals
ProposalNr must:
• be greaterThan any prev proposalNr used by this Proposer
• example: [roundNr|serverId]
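One way to read [roundNr|serverId] is as a pair compared round first, with the server id as tie-breaker, so two proposers can never produce the same number. A sketch in plain Scala; `ProposalNr`, `ordering` and `next` are illustrative names, not taken from any library.

```scala
object ProposalNumbers {
  // [roundNr|serverId]: the round is the high part, the server id breaks ties.
  final case class ProposalNr(roundNr: Long, serverId: Int)

  // Compare by round first, then by server id.
  implicit val ordering: Ordering[ProposalNr] =
    Ordering.by((p: ProposalNr) => (p.roundNr, p.serverId))

  // Next number for this proposer: strictly greater than the highest number it has seen.
  def next(highestSeen: ProposalNr, myServerId: Int): ProposalNr =
    ProposalNr(highestSeen.roundNr + 1, myServerId)
}
```

Bumping the round past the highest number seen (not just the proposer's own previous one) is a slightly stronger choice than the slide requires, and it helps a proposer catch up quickly after being rejected.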
Paxos: 2 phases
Phase 1: Prepare
Phase 2: Accept
Paxos, Prepare Phase
n = nextSeqNr()
acceptors ! Prepare(n, value)
case Prepare(n, value) =>
  if (n > minProposal) {
    minProposal = n
    accVal = value
  }

  sender() ! Accepted(minProposal, accVal)
value = highestN(responses).accVal
// replace my value, with accepted value
Paxos, Accept Phase
acceptors ! Accept(n, value)
case Accept(n, value) =>
  if (n >= minProposal) {
    // chained assignment is not valid Scala, so assign both explicitly
    minProposal = n
    acceptedProposal = n
    acceptedValue = value
  }

  learners ! Learn(value)
  sender() ! minProposal
if (acceptedN > n) restartPaxos()
else println(n + " was chosen!")
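The two phases can also be tried out without actors. Below is a sketch of a single-value acceptor as an immutable state machine in plain Scala. It follows the commonly taught formulation of Basic Paxos (the promise carries back any previously accepted proposal) rather than the exact slide snippets, and all names (`Acceptor`, `prepare`, `accept`, `chooseValue`) are illustrative.

```scala
object BasicPaxos {
  final case class Acceptor(
      minProposal: Int = 0,                 // highest proposal number promised so far
      acceptedProposal: Int = 0,            // number of the accepted proposal (0 = none)
      acceptedValue: Option[String] = None) {

    // Phase 1: promise not to accept anything below n, and report what was accepted so far.
    def prepare(n: Int): (Acceptor, (Int, Option[String])) = {
      val next = if (n > minProposal) copy(minProposal = n) else this
      (next, (acceptedProposal, acceptedValue))
    }

    // Phase 2: accept unless a higher number was promised; always reply with minProposal.
    def accept(n: Int, value: String): (Acceptor, Int) =
      if (n >= minProposal) (Acceptor(n, n, Some(value)), n)
      else (this, minProposal)
  }

  // Proposer side: if any acceptor already accepted a value, adopt the one with the highest number.
  def chooseValue(responses: Seq[(Int, Option[String])], myValue: String): String = {
    val accepted = responses.collect { case (n, Some(v)) => (n, v) }
    if (accepted.isEmpty) myValue else accepted.maxBy(_._1)._2
  }
}
```

A proposer that gets back a number higher than its own `n` from `accept` knows it was rejected and has to restart with a larger proposal number, which is the check shown on the slide above.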
Basic Paxos
Basic Paxos needs extensions for the “real world”.
Additions:
• “stable leader”
• performance (basic = 2 * broadcast roundtrip)
• ensure full replication
• configuration changes
Multi Paxos
“Basically everyone does it, but everyone does it differently.”
• Keeps the Leader
• Clients find and talk to the Leader
• Skips Phase 1, in stable state
• 2 delays instead of 4, until learning a value
Raft
Raft – inspired by Paxos
Paxos is great. Multi-Paxos is great, but no “common understanding”.
Raft wants to be understandable and just as solid.
"In Search of an Understandable Consensus Algorithm" (2013)
Raft – inspired by Paxos
• Leader based
• Fewer processes than Paxos
• Its goal is simplicity
• “Basic” includes snapshotting / membership
Raft - summarised on one page
Diego Ongaro & John Ousterhout – In Search of an Understandable Consensus Algorithm
Raft - starting the cluster
Raft - Election timeout
Raft - 1st election
Raft - Election Timeout
Raft - Election Phase
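During the election phase each node has to decide whether to grant its vote. The sketch below shows that rule as described in the Raft paper, in plain Scala; it is not the akka-raft API, and the `Follower` and `RequestVote` types here are simplified stand-ins.

```scala
object RaftVoting {
  final case class RequestVote(term: Int, candidateId: String, lastLogTerm: Int, lastLogIndex: Int)

  final case class Follower(
      currentTerm: Int,
      votedFor: Option[String],
      lastLogTerm: Int,
      lastLogIndex: Int) {

    // Returns the updated follower state and whether the vote was granted.
    def handle(req: RequestVote): (Follower, Boolean) = {
      // A newer term makes us adopt it and forget the vote cast in the older term.
      val updated =
        if (req.term > currentTerm) copy(currentTerm = req.term, votedFor = None) else this

      // The candidate's log must be at least as up-to-date as ours.
      val logOk =
        req.lastLogTerm > updated.lastLogTerm ||
          (req.lastLogTerm == updated.lastLogTerm && req.lastLogIndex >= updated.lastLogIndex)

      // At most one vote per term (voting again for the same candidate is fine).
      val canVote = updated.votedFor.forall(_ == req.candidateId)

      val granted = req.term == updated.currentTerm && canVote && logOk
      (if (granted) updated.copy(votedFor = Some(req.candidateId)) else updated, granted)
    }
  }
}
```

The "one vote per term" rule is what guarantees at most one leader per term; the randomised election timeout only makes split votes unlikely, it is not needed for safety.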
Raft – heartbeat = empty entries
Akka–Raft
(community project) (work in progress)
Raft, reminder:
Raft translated to Akka
abstract class RaftActor
  extends Actor
  with FSM[RaftState, Metadata]
Raft translated to Akka
onTransition {

  case Follower -> Candidate =>
    self ! BeginElection
    resetElectionDeadline()

  // ...
}
Raft translated to Akka
  case Event(BeginElection, m: ElectionMeta) =>
    log.info("Init election (among {} nodes) for {}",
      m.config.members.size, m.currentTerm)

    val request = RequestVote(m.currentTerm, m.clusterSelf, replicatedLog.lastTerm, replicatedLog.lastIndex)

    m.membersExceptSelf foreach { _ ! request }

    val includingThisVote = m.incVote
    stay() using includingThisVote.withVoteFor(m.currentTerm, m.clusterSelf)
}
Raft Heartbeat using Akka
akka-raft is a work in progress community project – it may change a lot
sendHeartbeat(m)
log.info("Starting heartbeat, with interval: {}", heartbeatInterval)
setTimer(HeartbeatName, SendHeartbeat, heartbeatInterval, repeat = true)
val leaderBehaviour = {
  // ...
  case Event(SendHeartbeat, m: LeaderMeta) =>
    sendHeartbeat(m)
    stay()
}
Akka-Raft in User-Land //alpha!!!
class WordConcatRaftActor extends RaftActor {

  type Command = Cmnd

  var words = Vector[String]()

  /** Applied when command committed by Raft consensus */
  def apply = {
    case AppendWord(word) =>
      words = words :+ word
      word

    case GetWords =>
      log.info("Replying with {}", words.toList)
      words.toList
  }
}
FLP Impossibility
FLP Impossibility Result
Impossibility of Distributed Consensus with One Faulty Process
1985 by Fischer, Lynch, Paterson
ktoso @ typesafe.com twitter: ktosopl github: ktoso blog: project13.pl team blog: letitcrash.com
JavaZone @ Oslo 2014
Takk! Dzięki! Thanks! ありがとう!
akka.io
Happy Byzantine Lunch-time!
©Typesafe 2014 – All Rights Reserved
Links
1. http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf
2. http://hydra.infosys.tuwien.ac.at/teaching/courses/AdvancedDistributedSystems/download/1975_Akkoyunlu,%20Ekanadham,%20Huber_Some%20constraints%20and%20tradeoffs%20in%20the%20design%20of%20network%20communications.pdf
3. http://research.microsoft.com/en-us/people/mickens/thesaddestmoment.pdf
4. http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf
5. http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
6. http://the-paper-trail.org/blog/consensus-protocols-paxos/
7. http://static.googleusercontent.com/media/research.google.com/en//archive/paxos_made_live.pdf
8. http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
9. https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf
10. Recent Leslie Lamport interview: http://www.se-radio.net/2014/04/episode-203-leslie-lamport-on-distributed-systems/
11. http://book.mixu.net/distsys/
12. http://codahale.com/you-cant-sacrifice-partition-tolerance/
Links
1. Excellent Paxos lecture by Diego Ongaro: https://www.youtube.com/watch?v=JEpsBg0AO6o
2. Fallacies, actual paper: http://www.rgoarchitects.com/Files/fallacies.pdf
3. Diego Ongaro & John Ousterhout – In Search of an Understandable Consensus Algorithm
4. http://macs.citadel.edu/rudolphg/csci604/ImpossibilityofConsensus.pdf
Images / drawings
1. Paxos Island Photo – Luigi Piazzi (CC license) https://www.flickr.com/photos/photolupi/3686769346/in/photolist-6BME5J-orKHL2-58qmez-58uz7s-7bRwTj-7bRvHY-6DdRC2-fBqFFU-35KTg7-8vbe23-bsBGL7-58qq6z-58uAjG-8vbeCd-d1Sqqw-d1Smsj-d1Sqi5-d1SoMA-d1SmBE-d1SpVo-d1Sk2U-d1SoBQ-d1SoXu-d1SoqN-d1Spqu-d1Sq4w-d1SpLU-d1SKDG-d1Skcu-d1Sp8f-d1Sqaq-d1SpCw-75YaVN-d1SLs1-d1SK15-d1SJiC-d1Suiu-d1SKtS-d1SjQS-d1StyU-d1SKi1-d1SxGS-d1Sm6j-d1Sxdh-d1SKMN-d1SxAq-d1SwgC-d1Smgj-d1SvhJ-d1SjC7
2. Drawings – myself (use-them-at-will-unless-mocking-my-horrible-drawing-skills-license)