DSN 2008 1
Byzantine Replication Under Attack
Yair Amir, Jonathan Kirsch, John Lane (Johns Hopkins University)
Brian Coan (Telcordia Technologies)
Motivation
• Society depends on large-scale, distributed computer systems for critical infrastructure.
• Insider attacks are a real threat, even for systems designed with security in mind.
• Byzantine replication provides fault tolerance by protecting against partial system compromises.
  - An attacker must compromise more than some threshold fraction of the system to cause inconsistency or prevent the system from functioning.
  - Systems perform well in fault-free or benign-fault runs.
  - What about performance when under attack?
The Downside of Asynchrony
• Existing correctness criteria: safety and liveness.
  - Safety: servers remain consistent.
  - Liveness: each update is eventually executed.
• Protocols are designed to be safe in all executions.
  - Do not rely on synchrony for safety!
  - Guarantee liveness only when the network is sufficiently stable.
• Real systems are not completely asynchronous.
  - Systems can satisfy much stronger performance guarantees than liveness during stable periods.
• Consequence: performance attacks!
  - An attacker can exploit the gap between what is promised during stable periods (liveness) and what is possible.
Performance Attacks: A First-Hand Look
• Red-team attack on Steward [DSN 06].
• Goal was to violate safety or liveness.
• Steward survived all of the attacks!
• Most did not affect performance.
• The system was slowed down in one experiment.
  - Speed of update ordering was slowed down by a factor of 5.
• Big problem:
  - A better attack could slow the system down by a factor of 100.
  - But the system is still considered live!
• Liveness is a necessary but insufficient correctness criterion for practical systems on wide-area networks.
Byzantine Performance Failures
• If the adversary cannot violate safety and liveness, the next best thing is to slow down the system beyond usefulness.
• Performance failures: send correct messages slowly but without triggering timeouts.

Previously Considered Byzantine Failures
Failure Type  | Failure Behavior                                     | Mitigated by
Value Domain  | Sending incorrect, conflicting, or invalid messages  | Cryptography, agreement protocols
Time Domain   | Messages arrive after timeouts or not at all         | Timeouts, view change
A New Problem: Performance Under Attack
• Existing systems are vulnerable to performance attacks.
  - A small number of faulty servers can cause the system to make progress at an extremely slow rate, indefinitely!
• Leader-based protocols are vulnerable to performance attacks by a malicious leader.
  - Problem is magnified in wide-area networks, where it is difficult to predict the performance that should be expected of the leader.
• Main challenges:
  - Developing meaningful performance metrics for evaluating Byzantine replication protocols.
  - Designing protocols that perform well according to these metrics, even when the system is under attack.
Outline
• Motivation
• Byzantine Performance Failures
• Relevant Prior Work
• Case Study: BFT Under Attack
• The Prime Replication System
• Bounded-Delay
• Protocol Overview
• Experimental Results
• Summary
Relevant Prior Work
• Leader-based Byzantine replication
  - BFT [Castro, Liskov 99]
  - Separating agreement from execution [Yin et al. 03]
  - Fast Byzantine Consensus [Martin, Alvisi 05]
  - Zyzzyva [Kotla et al. 07]
• Randomized Byzantine replication
  - SINTRA [Cachin, Poritz 02]
  - RITAS [Moniz et al. 06]
• Quorum-based Byzantine replication
  - Q/U [Abd-El-Malek et al. 05]
  - HQ [Cowling et al. 06]
Case Study: BFT Under Attack [Castro and Liskov 99]
[Figure: BFT message flow among a client and servers 0 (leader), 1, 2, and 3: request, pre-prepare, prepare, commit, reply.]
• Attack 1: Pre-Prepare Delay
  - A malicious leader can add delay into the ordering path by withholding its Pre-Prepare.
  - Non-leaders maintain a FIFO queue of pending updates.
    • Use timeouts to monitor the leader.
    • Timeout placed on execution of first update in queue.
  - A malicious leader can stay in power by ordering one update per queue per timeout period!
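The Pre-Prepare delay attack can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name, the `margin_s` parameter, and the single-queue model are hypothetical and are not part of BFT. The 300 ms timeout is the aggressive BFT timeout quoted in the experimental setup.

```python
def simulate_delay_attack(num_updates, timeout_s, margin_s=0.01):
    """Model a leader that withholds each Pre-Prepare until just before
    the non-leaders' timeout on the head of their queue would fire.

    Returns (total_time_s, suspected). Hypothetical single-queue model.
    """
    clock = 0.0
    suspected = False
    for _ in range(num_updates):
        wait = timeout_s - margin_s  # delay added by the malicious leader
        if wait >= timeout_s:
            # The monitor fires only if the head of the queue is not
            # executed within the timeout.
            suspected = True
            break
        clock += wait
    return clock, suspected


total, suspected = simulate_delay_attack(num_updates=10, timeout_s=0.3)
throughput = 10 / total  # updates per second while under attack
```

In this model the leader is never suspected, yet throughput stays near one update per timeout period, which is the gap between liveness and useful performance.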
Case Study: BFT Under Attack [Castro and Liskov 99]
• Attack 2: Timeout Manipulation
  - Timeout doubles every time the leader is replaced.
  - Use a denial of service attack to increase the timeout, then stop on a malicious leader.
• Each update is eventually executed, but performance is much worse than if there were only correct servers.
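The effect of timeout manipulation follows from the doubling rule. The arithmetic can be sketched as below; the function name and the choice of a 300 ms starting value (the aggressive BFT timeout from the experiments) are assumptions made for illustration.

```python
def timeout_after_view_changes(initial_timeout_s, view_changes):
    """Timeout after the leader has been replaced `view_changes` times,
    assuming the timeout doubles on every replacement."""
    if view_changes < 0:
        raise ValueError("view_changes must be non-negative")
    return initial_timeout_s * (2 ** view_changes)


# Five forced leader replacements inflate a 0.3 s timeout to 9.6 s.
inflated = timeout_after_view_changes(0.3, 5)
```

A malicious leader that takes over after the inflation can then delay each update by almost the full inflated timeout without being replaced.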
The Prime Replication System
• Prime: Performance-Oriented Replication in Malicious Environments
  - Leader-based protocol providing Bounded-Delay, a stronger guarantee than liveness, when the network is stable.
• System components:
  - Prime Ordering Protocol (Preordering phase, Global ordering phase).
  - Suspect-Leader Protocol for detecting malicious leaders.
• Main ideas:
  - Resources needed by the leader to do its job are bounded and independent of system throughput.
    • Leader has "no excuse" for not sending timely messages.
  - Non-leader servers compute a threshold level of acceptable performance that the leader should meet.
    • Upper-bounded by a function of the latency between correct servers after the network stabilizes.
Bounded Delay
• Prime-Stability: There is a time after which the following condition holds for a set of at least 2f+1 correct servers (the stable servers):
  - For each pair of stable servers r and s, there exists a value Min_Lat(r,s), unknown to the servers, such that if r sends a message to s, it will arrive with delay [symbol missing], where [bound missing].
• Bounded-Delay: There exists a time after which the update latency for any update initiated by a stable server is upper-bounded.
Prime: Ordering Protocol
• Preordering (PO) Phase:
  - Each server, o, disseminates its updates to the other servers (PO-Request).
  - Agreement protocol binds update u to preorder identifier (o, i), where u is the ith update originated by server o (PO-ACK).
  - Each server cumulatively acknowledges the updates it preorders (PO-ARU).
[Figure: message flow with no attack: PO-Request, PO-ACK, PO-ARU, Pre-Prepare, Prepare, Commit. L = leader, O = originator; the legend also marks aggregation delay.]
Prime: Ordering Protocol
Preordering Protocol example:
  Server 1 originates ua, ub, uc: (1, 1, ua), (1, 2, ub), (1, 3, uc)
  Server 2 originates ud, ue: (2, 1, ud), (2, 2, ue)
  Server 3 originates uf: (3, 1, uf)
  Server 4 originates ug, uh, ui: (4, 1, ug), (4, 2, uh), (4, 3, ui)
  PO-ARU: 3 2 1 3
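The preordering example can be reproduced with a minimal sketch. It assumes a single local dictionary in place of the PO-Request and PO-ACK message exchange, so it shows only the identifier binding and the cumulative acknowledgement; the function names `bind` and `po_aru` are hypothetical.

```python
def bind(bound, originator, update):
    """Bind `update` to the next preorder identifier (o, i) of its originator.

    `bound` maps (o, i) -> update. In the protocol this binding is agreed on
    through PO-ACK messages; here it is a local assignment (assumption).
    """
    i = sum(1 for (o, _) in bound if o == originator) + 1
    bound[(originator, i)] = update
    return (originator, i)


def po_aru(bound, servers):
    """Cumulative acknowledgement: for each originator o, the highest i such
    that (o, 1) .. (o, i) have all been preordered, with no gaps."""
    vector = []
    for o in servers:
        i = 0
        while (o, i + 1) in bound:
            i += 1
        vector.append(i)
    return vector


bound = {}
originated = {1: ["ua", "ub", "uc"], 2: ["ud", "ue"], 3: ["uf"], 4: ["ug", "uh", "ui"]}
for o, updates in originated.items():
    for u in updates:
        bind(bound, o, u)

vector = po_aru(bound, [1, 2, 3, 4])  # [3, 2, 1, 3], as in the example
```

Because the acknowledgement is cumulative, a gap stops the count: a server holding (1, 1) and (1, 3) but not (1, 2) acknowledges only 1 for originator 1.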
Prime: Ordering Protocol
• Global Ordering Phase:
  - Similar to BFT (Pre-Prepare, Prepare, Commit).
  - Leader periodically sends a Pre-Prepare containing a proof matrix (vector of PO-ARU messages).
  - Each globally ordered Pre-Prepare maps to a batch of preordered updates based on contents of proof matrix.
  - Final total order is obtained by deterministically ordering the updates in each batch based on preorder identifier.
[Figure: message flow with no attack: PO-Request, PO-ACK, PO-ARU, Pre-Prepare, Prepare, Commit. L = leader, O = originator; the legend also marks aggregation delay.]
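How a proof matrix could map to a batch of updates can be sketched as follows. Two points here are assumptions not stated on the slide: the `quorum` threshold (an update counts as covered once `quorum` PO-ARU rows acknowledge it) and the use of plain tuple sorting as the deterministic order. The function names are hypothetical.

```python
def covered_upto(proof_matrix, quorum):
    """For each originator (column), the highest sequence number that at
    least `quorum` PO-ARU rows acknowledge. The threshold is an assumption."""
    columns = len(proof_matrix[0])
    result = []
    for col in range(columns):
        acks = sorted((row[col] for row in proof_matrix), reverse=True)
        result.append(acks[quorum - 1])  # quorum-th highest value
    return result


def batch(previous, current):
    """Preorder identifiers (o, i) newly covered by this Pre-Prepare, sorted
    by identifier so that every server derives the same order."""
    ids = []
    for o, (lo, hi) in enumerate(zip(previous, current), start=1):
        ids.extend((o, i) for i in range(lo + 1, hi + 1))
    return sorted(ids)


# Rows are PO-ARU vectors from servers 1..4 (f = 1, quorum = 2f + 1 = 3).
matrix = [
    [3, 2, 1, 3],
    [3, 2, 1, 2],
    [2, 2, 1, 3],
    [3, 1, 0, 3],
]
covered = covered_upto(matrix, quorum=3)
new_batch = batch(previous=[1, 2, 0, 3], current=covered)
```

Every server that sees the same ordered Pre-Prepare computes the same batch and the same order within it, which is what yields a single total order.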
Prime: Ordering Protocol
[Figure: Global Ordering Protocol. Pre-Prepare 1 carries PO-ARU1 through PO-ARU4 and Pre-Prepare 2 carries PO-ARU1' through PO-ARU4'; the ordered Pre-Prepares PP1, PP2, ... yield the final total order.]
Attack Analysis
• Key points:
  - Preordering phase for updates sent by correct servers cannot be slowed down by faulty servers.
  - Once all correct servers receive a Pre-Prepare, global ordering cannot be slowed down by faulty servers.
• Possible attacks:
  1. Leader sends its Pre-Prepare to only some correct servers.
  2. Leader sends a Pre-Prepare with out-of-date PO-ARUs.
  3. Leader delays its Pre-Prepare.
Addition 1: Pre-Prepare Flooding
• Intuition:
  1. The leader must withhold the Pre-Prepare from all correct servers to significantly impact latency.
  2. If we can force the leader to send timely, up-to-date Pre-Prepares to at least one correct server, we can ensure timely ordering!
[Figure: message flow (PO-Request, PO-ACK, PO-ARU, Pre-Prepare, Prepare, Commit) with no attack and under attack. L = leader, O = originator.]
Addition 2: Proof Matrix Messages
• Each server periodically sends a Proof-Matrix message, containing the latest PO-ARU messages it has received, to the leader.
  - A correct server expects a leader to include, in its next Pre-Prepare, PO-ARU messages that are at least as up-to-date as those in the Proof-Matrix message.
• Why is this expectation justified?
  - A correct leader can simply adopt any PO-ARU messages that are more up to date than what it currently has.
[Figure: message flow under attack with the added Proof-Matrix message: PO-Request, PO-ACK, PO-ARU, Proof-Matrix, Pre-Prepare, Prepare, Commit.]
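The expectation placed on the leader can be written as a simple check. This is a sketch, assuming each PO-ARU message is modeled as a vector of cumulative sequence numbers and that "more up to date" means entry-by-entry greater or equal; the function names are hypothetical.

```python
def at_least_as_recent(a, b):
    """True if PO-ARU vector `a` acknowledges everything `b` does."""
    return all(x >= y for x, y in zip(a, b))


def covers(pre_prepare_rows, proof_matrix_rows):
    """True if every PO-ARU in the Pre-Prepare is at least as up to date as
    the corresponding PO-ARU in the Proof-Matrix message."""
    return all(
        at_least_as_recent(pp, pm)
        for pp, pm in zip(pre_prepare_rows, proof_matrix_rows)
    )


def adopt(leader_rows, proof_matrix_rows):
    """What a correct leader can do: keep, per sender, whichever PO-ARU is
    more recent. Assumes PO-ARUs from one sender only ever grow."""
    return [
        pm if at_least_as_recent(pm, mine) else mine
        for mine, pm in zip(leader_rows, proof_matrix_rows)
    ]


leader_rows = [[3, 2, 1, 3], [2, 2, 1, 2]]
proof_matrix = [[3, 2, 1, 3], [3, 2, 1, 3]]

stale = covers(leader_rows, proof_matrix)                       # False
fresh = covers(adopt(leader_rows, proof_matrix), proof_matrix)  # True
```

Adoption costs the leader nothing beyond replacing its stored messages, so a Pre-Prepare that fails the check points to a leader that chose not to be up to date.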
Key Idea: Turn-Around Time
• Turn-around time
  - Time between sending a Proof-Matrix message, PM, and receiving a Pre-Prepare "covering" all of the PO-ARU messages in PM.
• Key observation:
  - The resources required by the leader to send a Pre-Prepare (bandwidth, CPU) are bounded and independent of system throughput.
  - We can use turn-around time as a measure by which to judge the leader!
• Intuition: Force the leader to be timely by ensuring that it provides a fast enough turn-around time to at least one correct server.
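Measuring turn-around time at a non-leader server might look like the sketch below. The class and method names are hypothetical, timestamps are passed in explicitly rather than read from a clock, and PO-ARU messages are modeled as vectors of cumulative sequence numbers (an assumption).

```python
def covers(pre_prepare_rows, proof_matrix_rows):
    """True if each PO-ARU in the Pre-Prepare is entry-by-entry at least as
    up to date as the matching PO-ARU in the Proof-Matrix message."""
    return all(
        all(x >= y for x, y in zip(pp, pm))
        for pp, pm in zip(pre_prepare_rows, proof_matrix_rows)
    )


class TurnAroundTimer:
    """Tracks outstanding Proof-Matrix messages and records a turn-around
    time when a Pre-Prepare covering one of them arrives."""

    def __init__(self):
        self.pending = []       # (send_time_s, proof_matrix_rows)
        self.measurements = []  # turn-around times in seconds

    def on_send_proof_matrix(self, now_s, rows):
        self.pending.append((now_s, rows))

    def on_pre_prepare(self, now_s, rows):
        still_pending = []
        for sent_s, pm in self.pending:
            if covers(rows, pm):
                self.measurements.append(now_s - sent_s)
            else:
                still_pending.append((sent_s, pm))
        self.pending = still_pending


timer = TurnAroundTimer()
timer.on_send_proof_matrix(0.00, [[1, 0], [0, 1]])
timer.on_send_proof_matrix(0.02, [[2, 0], [0, 1]])
timer.on_pre_prepare(0.05, [[1, 0], [0, 1]])  # covers only the first PM
timer.on_pre_prepare(0.09, [[2, 1], [0, 1]])  # covers the second PM
```

A Pre-Prepare with out-of-date PO-ARUs does not stop the clock, so stale Pre-Prepares cannot be used to fake a fast turn-around.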
Suspect-Leader Protocol
• Protocol strategy:
  - Dynamically determine an acceptable turn-around time based on roundtrip measurements (TAT_acceptable).
  - Use turn-around times measured in the current view to compute a measure of the current leader's performance (TAT_leader).
  - Suspect the leader if TAT_leader > TAT_acceptable.
• Design challenges:
  - Malicious servers can lie to try to lower expectation of acceptable performance.
    • Leader could remain in power while going slowly.
  - Malicious servers can lie to make a correct leader look bad.
    • Would lead to continuous view changes.
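The suspicion rule itself is a single comparison, sketched below. How Prime actually derives the two quantities is not given here, so both helper functions are hypothetical stand-ins: the worst measured turn-around time for TAT_leader, and a slack factor times the worst roundtrip plus an aggregation delay for TAT_acceptable. The sketch uses only local measurements and does not address the design challenge of servers that lie.

```python
def tat_leader(measured_tats_s):
    """Hypothetical summary of the leader's performance in the current view:
    the worst turn-around time measured so far."""
    return max(measured_tats_s)


def tat_acceptable(rtt_samples_s, slack=2.0, aggregation_delay_s=0.0):
    """Hypothetical threshold derived from roundtrip measurements.
    The formula and the `slack` parameter are assumptions."""
    return slack * max(rtt_samples_s) + aggregation_delay_s


def should_suspect(measured_tats_s, rtt_samples_s):
    """Suspect the leader if TAT_leader > TAT_acceptable."""
    return tat_leader(measured_tats_s) > tat_acceptable(rtt_samples_s)


timely = should_suspect([0.04, 0.06], [0.05])  # False: 0.06 is within 0.10
slow = should_suspect([0.04, 0.25], [0.05])    # True: 0.25 exceeds 0.10
```

Because the threshold is tied to measured roundtrip times rather than to a timeout that grows with each view change, a slow leader cannot buy itself more room by forcing leader replacements.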
Suspect-Leader: Key Properties
• Any server that retains a role as leader must provide a TAT to at least one correct server that is no more than [bound missing].
  - Maximum update latency: [expression missing].
  - Legend: the symbols in these expressions denote the maximum delay between correct servers and the aggregation delay.
  - Bounded-Delay!
• There exists a set of at least f+1 correct servers that will not be suspected by any correct server if elected leader.
  - Aggressive but not overly aggressive.
Experimental Results
• 7 servers (f = 2)
• Symmetric network
  - 50 ms diameter, 10 Mbps links
• Leader performs just well enough to stay in power.
• BFT: aggressive timeout (300 ms)
• BFT: Pre-Prepare delay
• Prime:
  - Leader adds as much delay as possible.
  - Non-leader servers force as much reconciliation as possible.
[Figure: Update Throughput vs. Clients (50 ms diameter, 10 Mbps links). x-axis: Number of Clients (0 to 500); y-axis: Update Throughput (updates/sec, 0 to 900). Series: BFT - No Attack, Prime - No Attack, Prime - Under Attack, BFT - Under Attack.]
[Figure: Update Latency vs. Clients (50 ms diameter, 10 Mbps links). x-axis: Number of Clients (0 to 500); y-axis: Update Latency (s, 0 to 1.4). Series: BFT - No Attack, Prime - No Attack, Prime - Under Attack, BFT - Under Attack.]
Summary
• Existing leader-based Byzantine replication protocols are vulnerable to performance attacks.
  - Liveness is not a meaningful performance metric for evaluating Byzantine replication protocols.
• Bounded-Delay: a new performance metric.
  - Can we provide stronger guarantees?
  - Can we guarantee a minimum throughput?
• Prime: a new Byzantine replication protocol.
  - Achieves Bounded-Delay when the network is sufficiently stable.

Questions?