check before you change - usenix · server1 (s1) 172.28.228.21 indaas weighted maxsat solver + +...
TRANSCRIPT
Check before You Change: Preventing Correlated Failures in Service Updates
Ennan Zhai Ang Chen Ruzica Piskac Mahesh Balakrishnan
Bingchuan Tian Bo Song Haoliang Zhang
Background
• Cloud services ensure reliability by redundancy: - Storing data redundantly - Replicating service states across multiple nodes
• Examples: - Amazon AWS, AliCloud, Google Cloud, etc. replicate their
data and service states
However, cloud outages still occur
Why redundancy does not help?
An AWS Outage in 2018
Elastic Compute Cloud (EC2)
Elastic Block Store (EBS)
... ...... ...
EBS Cluster2EBS Cluster1
Elastic Compute Cloud (EC2)
... ...... ...
VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4
Netflix
EBS Cluster2EBS Cluster1
... ...... ...
VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4
Netflix
EBS Cluster2EBS Cluster1
... ...... ...
VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4
EBS Cluster2EBS Cluster1
Netflix
... ...... ...
VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4
EBS Cluster2EBS Cluster1
Netflix
... ...... ...
VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4
EBS Cluster2EBS Cluster1
Netflix
Correlated failures resulting from deep dependencies
Correlated Failures
• Correlated failures are harmful and epidemic: - Propagated to all the redundant instances - Undermine redundancy and fault tolerance efforts
Correlated failures are prevalent
Service initialization
Service Runtime
State of the Art
Service initialization
Service Runtime
Post-Failure Forensics
1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …
State of the Art
Service initialization
Service Runtime
Post-Failure Forensics
1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …
State of the Art
Proactive Auditing
1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …
CNF = (A2) ˄ (A1 ˅ A3)
let Server(“172.28.228.21”) -> s1let Server(“172.28.228.22”) -> s2let [s1, s2] -> replet FaultGraph(rep) -> ftlet RankRCG(ft, 2, NET, ft) -> ranklist
1. {Core1[“75.142.33.98”]}2. {Agg1[“10.0.0.1”], Agg2[“10.0.0.2”]}
Core Router1(Core1)
Core Router2(Core2)
Agg Switch3(Agg3)
Server2 (S2)172.28.228.22
Server3 (S3)172.28.228.23
10.0.0.1 10.0.0.2
75.142.33.98 75.142.33.99
Agg Switch1(Agg1)
Agg Switch2(Agg2) 10.0.0.3
Internet
Server4 (S4)172.28.228.24
HDFS
HBase
HDFS
HBase
Server1 (S1)172.28.228.21
INDaaS
Weighted MaxSAT solver
++
Replication
Replica1 Replica2
A2A1 A3 <Weight Vector>
Auditing Engine
Auditing Program in RAL
Auditing Results
Service Deployment (network/software stacks)
Proactive Auditing
• They did pre-deployment recommendations: - Step1: Automatically collecting dependency data - Step2: Modeling system stack in fault graph - Step3: Evaluating alternative deployments’ independence
CNF = (A2) ˄ (A1 ˅ A3)
let Server(“172.28.228.21”) -> s1let Server(“172.28.228.22”) -> s2let [s1, s2] -> replet FaultGraph(rep) -> ftlet RankRCG(ft, 2, NET, ft) -> ranklist
1. {Core1[“75.142.33.98”]}2. {Agg1[“10.0.0.1”], Agg2[“10.0.0.2”]}
Core Router1(Core1)
Core Router2(Core2)
Agg Switch3(Agg3)
Server2 (S2)172.28.228.22
Server3 (S3)172.28.228.23
10.0.0.1 10.0.0.2
75.142.33.98 75.142.33.99
Agg Switch1(Agg1)
Agg Switch2(Agg2) 10.0.0.3
Internet
Server4 (S4)172.28.228.24
HDFS
HBase
HDFS
HBase
Server1 (S1)172.28.228.21
INDaaS
Weighted MaxSAT solver
++
Replication
Replica1 Replica2
A2A1 A3 <Weight Vector>
Auditing Engine
Auditing Program in RAL
Auditing Results
Service Deployment (network/software stacks)
Proactive Auditing
• They did pre-deployment recommendations: - Step1: Automatically collecting dependency data - Step2: Modeling system stack in fault graph - Step3: Evaluating alternative deployments’ independence
SW
Net
CNF = (A2) ˄ (A1 ˅ A3)
let Server(“172.28.228.21”) -> s1let Server(“172.28.228.22”) -> s2let [s1, s2] -> replet FaultGraph(rep) -> ftlet RankRCG(ft, 2, NET, ft) -> ranklist
1. {Core1[“75.142.33.98”]}2. {Agg1[“10.0.0.1”], Agg2[“10.0.0.2”]}
Core Router1(Core1)
Core Router2(Core2)
Agg Switch3(Agg3)
Server2 (S2)172.28.228.22
Server3 (S3)172.28.228.23
10.0.0.1 10.0.0.2
75.142.33.98 75.142.33.99
Agg Switch1(Agg1)
Agg Switch2(Agg2) 10.0.0.3
Internet
Server4 (S4)172.28.228.24
HDFS
HBase
HDFS
HBase
Server1 (S1)172.28.228.21
INDaaS
Weighted MaxSAT solver
++
Replication
Replica1 Replica2
A2A1 A3 <Weight Vector>
Auditing Engine
Auditing Program in RAL
Auditing Results
Service Deployment (network/software stacks)
Proactive Auditing
• They did pre-deployment recommendations: - Step1: Automatically collecting dependency data - Step2: Modeling system stack in fault graph - Step3: Evaluating alternative deployments’ independence
CNF = (A2) ˄ (A1 ˅ A3)
let Server(“172.28.228.21”) -> s1let Server(“172.28.228.22”) -> s2let [s1, s2] -> replet FaultGraph(rep) -> ftlet RankRCG(ft, 2, NET, ft) -> ranklist
1. {Core1[“75.142.33.98”]}2. {Agg1[“10.0.0.1”], Agg2[“10.0.0.2”]}
Core Router1(Core1)
Core Router2(Core2)
Agg Switch3(Agg3)
Server2 (S2)172.28.228.22
Server3 (S3)172.28.228.23
10.0.0.1 10.0.0.2
75.142.33.98 75.142.33.99
Agg Switch1(Agg1)
Agg Switch2(Agg2) 10.0.0.3
Internet
Server4 (S4)172.28.228.24
HDFS
HBase
HDFS
HBase
Server1 (S1)172.28.228.21
INDaaS
Weighted MaxSAT solver
++
Replication
Replica1 Replica2
A2A1 A3 <Weight Vector>
Auditing Engine
Auditing Program in RAL
Auditing Results
Service Deployment (network/software stacks)
Proactive Auditing
• They did pre-deployment recommendations: - Step1: Automatically collecting dependency data - Step2: Modeling system stack in fault graph - Step3: Evaluating alternative deployments’ independence
Redundancy configuration fails
...
Server 2 failsServer 1 fails
Redundancy configuration fails
...
AND gate: all the sublayer nodes fail, the upper layer node fails
Net fails
+" +"
HW fails SW fails SW fails
Server 2 failsServer 1 fails
Net fails HW fails
Redundancy configuration fails
...
OR gate: one of the sublayer nodes fails, the upper layer node fails
Net fails
+" +"
Core1 Agg2Agg1
HW fails
+" +"
Path1 Path2 HBase HDFS
+"
SW fails SW fails
Server 2 failsServer 1 fails
Net fails HW fails
Redundancy configuration fails
... ...... ...
...
...... ...
Net fails
+" +"
Core1 Agg2Agg1
HW fails
+" +"
Path1 Path2 HBase HDFS
+"
SW fails SW fails
Server 2 failsServer 1 fails
Net fails HW fails
Redundancy configuration fails
... ...... ...
...
...... ...
Service initialization
Service Runtime
Post-Failure Forensics
1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …
State of the Art
Proactive Auditing
1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …
Service initialization
Service Runtime
Post-Failure Forensics
1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …
Correlated Failure Risks in Updates
Proactive Auditing
1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …
Changing network paths
Upgrading software components
Azure global outage: Our DNS update mangled domain records
Benjamin Treynor Sloss, Google's VP of engineering, explained that the root cause of last Sunday's outage was a configuration change for a small group of servers in one region being wrongly applied to a larger number of servers across several neighboring regions.
Service initialization
Service Runtime
Post-Failure Forensics
1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …
Problem 1: Inefficient Auditing in Updates
Proactive Auditing
1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …
Changing network paths
Upgrading software components
O(50) hours per auditing V.S.
One update every 3 hours
Service initialization
Service Runtime
Post-Failure Forensics
1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …
Problem 2: Lack of fixing risks
Proactive Auditing
1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …
Changing network paths
Upgrading software components
Fix ?Fix ?
Service initialization
Service Runtime
Post-Failure Forensics
1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …
Proactive Auditing
1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …
Changing network paths
Upgrading software components
Our Contribution
CloudCanary
Fast Audit & Fix
CloudCanary’s Workflow
SnapAudit
UpdatedService Snapshot
Reliability Goal
Dependency acquisition
and
Fault graph generator
Fault Graph
DepBooster
1. {CoreRouter-1}2. {Agg1, Agg2}… ...
Improvement PlansOperator
SnapAudit
UpdatedService Snapshot
Reliability Goal
Dependency acquisition
and
Fault graph generator
Fault Graph
DepBooster
1. {CoreRouter-1}2. {Agg1, Agg2}… ...
Improvement PlansOperator
CloudCanary’s Workflow
SnapAudit
UpdatedService Snapshot
Reliability Goal
Dependency acquisition
and
Fault graph generator
Fault Graph
DepBooster
1. {CoreRouter-1}2. {Agg1, Agg2}… ...
Improvement PlansOperator
CloudCanary’s Workflow
SnapAudit
UpdatedService Snapshot
Reliability Goal
Dependency acquisition
and
Fault graph generator
Fault Graph
DepBooster
1. {CoreRouter-1}2. {Agg1, Agg2}… ...
Improvement PlansOperator
• Challenge 1: SnapAudit • Challenge 2: DepBooster
CloudCanary’s Workflow
SnapAudit
UpdatedService Snapshot
Reliability Goal
Dependency acquisition
and
Fault graph generator
Fault Graph
DepBooster
1. {CoreRouter-1}2. {Agg1, Agg2}… ...
Improvement PlansOperator
• Challenge 1: SnapAudit • Challenge 2: DepBooster
CloudCanary’s Workflow
SnapAudit
UpdatedService Snapshot
Reliability Goal
Dependency acquisition
and
Fault graph generator
Fault Graph
DepBooster
1. {CoreRouter-1}2. {Agg1, Agg2}… ...
Improvement PlansOperator
• Challenge 1: SnapAudit • Challenge 2: DepBooster
CloudCanary’s WorkflowCloudCanary
E1 E1
A Fault Graph
E1 E1
Risk Groups in Fault Graphs
• A risk group means a set of leaf nodes whose simultaneous failures lead to the failure of root node
{A2} and {A1, A3} are risk groups {A1} or {A3} is not risk group
E1 E1
Risk Groups in Fault Graphs
• A risk group means a set of leaf nodes whose simultaneous failures lead to the failure of root node
{A2} and {A1, A3} are risk groups {A1} or {A3} is not risk group
E1 E1
Risk Groups in Fault Graphs
• A risk group means a set of leaf nodes whose simultaneous failures lead to the failure of root node
Identifying correlated failure risks can be reduced to
the problem of finding risk groups in the fault graph.
However, analyzing risk groups is
NP-complete problem
SnapAudit
UpdatedService Snapshot
Reliability Goal
Dependency acquisition
and
Fault graph generator
Fault Graph
DepBooster
1. {CoreRouter-1}2. {Agg1, Agg2}… ...
Improvement PlansOperator
• Challenge 1: SnapAudit • Challenge 2: DepBooster
CloudCanary’s WorkflowCloudCanary
Service initialization
Service Runtime
Changing network paths
Upgrading software components
The Insight of SnapAudit
… …
Δ is small Δ′�is small Δ′�′� is small
S S’ S’’
Service initialization
Service Runtime
Changing network paths
Upgrading software components
The Insight of SnapAudit
… …
Δ is small Δ′�is small Δ′�′� is small
FirstAudit
S S’ S’’
CloudCanary
Service initialization
Service Runtime
Changing network paths
Upgrading software components
The Insight of SnapAudit
… …
Δ is small Δ′�is small Δ′�′� is small
FirstAudit
S S’ S’’
CloudCanary
IncAudit IncAudit IncAudit
Service initialization
Service Runtime
Changing network paths
Upgrading software components
… …
Δ is small Δ′�is small Δ′�′� is small
FirstAudit
S S’ S’’
CloudCanary
IncAudit IncAudit IncAudit
SnapAudit: FirstAudit & IncAudit
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
X Y
+"
Z
… … … …
R
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {S, T}
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {S, T}
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {S, T}
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {A, B} - {A, C} - {B, B} - {B, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {S, T}
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {A, B} - {A, C} - {B, B} - {B, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {S, T}
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {A, B} - {A, C} - {B} - {B, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {S, T}
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {A, B} - {A, C} - {B} - {B, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {S, T}
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C} - {B} - {B}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {S, T}
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {S, T}H(F)=x31g
- {B} - {A, C}
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {S, T}
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {S, T}
FirstAudit Primitive
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {B} - {S, T}
Service initialization
Service Runtime
Changing network paths
Upgrading software components
SnapAudit: FirstAudit & IncAudit
… …
Δ is small Δ′�is small Δ′�′� is small
FirstAudit
S S’ S’’
CloudCanary
IncAudit IncAudit IncAudit
Our Insight
• Algorithm sketch: - Finding all the border nodes (black nodes) - Computing their risk groups - Merging these risk groups towards root
Update
Original Deployment
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=lktd
- {A} - {K} - {S, T}
Z
… … … …
RH(R)=aed8
- {A} - {K} - {B} - {S, T}
Updated Deployment
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
Z
… … … …
Q
R
H(Z)=lktd
- {A} - {K} - {S, T}
H(R)=aed8
- {A} - {K} - {B} - {S, T}
Updated Deployment
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=2xzb
- {A} - {K} - {S, T}
Z
… … … …
Q
R - {A} - {K} - {B} - {S, T}
H(R)=aed8
Updated Deployment
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=2xzb
- {A} - {K} - {S, T}
Z
… … … …
Q
RH(R)=45zc
- {A} - {K} - {B} - {S, T}
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=2xzb
- {A} - {K} - {S, T}
Z
… … … …
Q
RH(R)=45zc
- {A} - {K} - {B} - {S, T}
Step 1: Find Border Nodes
Border Nodes
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=2xzb
- {A} - {K} - {S, T}
Z
… … … …
Q
RH(R)=45zc
- {A} - {K} - {B} - {S, T}
Border Nodes
Step 2: Q’s Risk Groups
Boolean formula= E1∧E2
= (A1∨A2)∧(A2∨A3)
Our Insight
Boolean formula= E1∧E2
= (A1∨A2)∧(A2∨A3)
SAT solverSatisfying assignment: {A1=1, A2=0, A3=1}
Our Insight
Boolean formula= E1∧E2
= (A1∨A2)∧(A2∨A3)
SAT solver{A1=0, A2=1, A3=0}
• Problem: - Standard SAT solver outputs an arbitrary satisfying
assignment
Our Insight
Boolean formula= E1∧E2
= (A1∨A2)∧(A2∨A3)
SAT solver{A1=0, A2=1, A3=0}
• Problem: - Standard SAT solver outputs an arbitrary satisfying
assignment - What we want is top-k critical (minimal) risk groups
Our Insight
• Using MinCostSAT solver
- Satisfiable assignment with the least weights
- Obtain the least C = ∑ ci ∙ wi
- Very fast with 100% accuracy
• We can use Weighted Partial MaxSAT: - Solve 100 instances less than 100 sec - Each instance contains ~1000 clauses - Industry-scale competition - Pr
Identifying Risk Groups
• Using MinCostSAT solver
- Satisfiable assignment with the least weights
- Obtain the least C = ∑ ci ∙ wi
- Very fast with 100% accuracy
• We can use Weighted Partial MaxSAT: - Solve 100 instances less than 100 sec - Each instance contains ~1000 clauses - Industry-scale competition - Pr
We set the values of all the leaf nodes (i.e., Wi) as 1
Identifying Risk Groups
• Using MinCostSAT solver
- Satisfiable assignment with the least weights
- Obtain the least C = ∑ ci ∙ wi
- Very fast with 100% accuracy
• We can use Weighted Partial MaxSAT: - Solve 100 instances less than 100 sec - Each instance contains ~1000 clauses - Industry-scale competition - Pr
1 1 1
A1 A2 A3 Weight1 0 00 1 0 10 0 11 1 0 21 0 1 20 1 1 20 0 01 1 1 3
Identifying Risk Groups
• Using MinCostSAT solver
- Satisfiable assignment with the least weights
- Obtain the least C = ∑ ci ∙ wi
- Very fast with 100% accuracy
• We can use Weighted Partial MaxSAT: - Solve 100 instances less than 100 sec - Each instance contains ~1000 clauses - Industry-scale competition - Pr
1 1 1
A1 A2 A3 Weight1 0 00 1 0 10 0 11 1 0 21 0 1 20 1 1 20 0 01 1 1 3
Identifying Risk Groups
• Find out the top-k critical risk groups
- Use a ∧ to connect the current formula and the negation of the resulting assignment
• We can use Weighted Partial MaxSAT: - Solve 100 instances less than 100 sec - Each instance contains ~1000 clauses - Industry-scale competition - Pr
(A1∨A2)∧(A2∨A3) ∧ ¬(¬A1 ∧ A2 ∧ ¬A3)
Identifying Risk Groups
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=2xzb
- {A} - {K} - {S, T}
Z
… … … …
Q
RH(R)=45zc
- {A} - {K} - {B} - {S, T}
Step 2: Q’s Risk Groups
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=2xzb
- {A} - {K} - {S, T}
Z
… … … …
Q
RH(R)=45zc
- {A} - {K} - {B} - {S, T}
Step 2: Q’s Risk Groups
H(Q)=x1r7
- {S} - {K}
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=2xzb
- {A} - {K} - {S, T}
Z
… … … …
Q
RH(R)=45zc
- {A} - {K} - {B} - {S, T}
H(Q)=x1r7
- {S} - {K}
Step 3: Merging Changed Caches
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=2xzb
- {A} - {K} - {S}
Z
… … … …
Q
RH(R)=45zc
- {A} - {K} - {B} - {S, T}
H(Q)=x1r7
- {S} - {K}
Step 3: Merging Changed Caches
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=2xzb
- {A} - {K} - {S}
Z
… … … …
Q
RH(R)=45zc
- {A} - {K} - {B} - {S, T}
H(Q)=x1r7
- {S} - {K}
Step 3: Merging Changed Caches
A B
D
B C
E
+" +"
F
+"
H(D)=43cd
- {A} - {B}
H(E)=a4vo
- {B} - {C}
H(F)=x31g
- {B} - {A, C}
X YH(X)=xbn7
- {A} - {S,T} H(Y)=bbk9
- {K} - {A,S}
+"
H(Z)=2xzb
- {A} - {K} - {S}
Z
… … … …
Q
RH(R)=45zc
- {A} - {K} - {B} - {S}
H(Q)=x1r7
- {S} - {K}
Step 3: Merging Changed Caches
SnapAudit
UpdatedService Snapshot
Reliability Goal
Dependency acquisition
and
Fault graph generator
Fault Graph
DepBooster
1. {CoreRouter-1}2. {Agg1, Agg2}… ...
Improvement PlansOperator
• Challenge 1: SnapAudit • Challenge 2: DepBooster
CloudCanary’s WorkflowCloudCanary
Core Router1(Core1)
Core Router2(Core2)
Agg Switch3(Agg3)
Server1 (S1)172.28.228.21
Server2 (S2)172.28.228.22
Server3 (S3)172.28.228.23
10.0.0.1 10.0.0.2
75.142.33.98 75.142.33.99
Agg Switch1(Agg1)
Agg Switch2(Agg2)
10.0.0.3
Internet
Server4 (S4)172.28.228.24
HDFS
HBase
HDFS
HBase
Correlated Failure Risk Repairing
$Server -> 172.28.228.21, 172.28.228.22 goal(failProb(ft)<0.08 | ChNode | Agg3)
Specification:
Core Router1(Core1)
Core Router2(Core2)
Agg Switch3(Agg3)
Server1 (S1)172.28.228.21
Server2 (S2)172.28.228.22
Server3 (S3)172.28.228.23
10.0.0.1 10.0.0.2
75.142.33.98 75.142.33.99
Agg Switch1(Agg1)
Agg Switch2(Agg2)
10.0.0.3
Internet
Server4 (S4)172.28.228.24
HDFS
HBase
HDFS
HBase
Correlated Failure Risk Repairing
$Server -> 172.28.228.21, 172.28.228.22 goal(failProb(ft)<0.08 | ChNode | Agg3)
Specification:
Core Router1(Core1)
Core Router2(Core2)
Agg Switch3(Agg3)
Server1 (S1)172.28.228.21
Server2 (S2)172.28.228.22
Server3 (S3)172.28.228.23
10.0.0.1 10.0.0.2
75.142.33.98 75.142.33.99
Agg Switch1(Agg1)
Agg Switch2(Agg2)
10.0.0.3
Internet
Server4 (S4)172.28.228.24
HDFS
HBase
HDFS
HBase
Correlated Failure Risk Repairing
$Server -> 172.28.228.21, 172.28.228.22 goal(failProb(ft)<0.08 | ChNode | Agg3)
Specification:
Core Router1(Core1)
Core Router2(Core2)
Agg Switch3(Agg3)
Server1 (S1)172.28.228.21
Server2 (S2)172.28.228.22
Server3 (S3)172.28.228.23
10.0.0.1 10.0.0.2
75.142.33.98 75.142.33.99
Agg Switch1(Agg1)
Agg Switch2(Agg2)
10.0.0.3
Internet
Server4 (S4)172.28.228.24
HDFS
HBase
HDFS
HBase
Plan 1: Move replica from S1 -> S4 Plan 2: Move replica from S2 -> S4
DepBooster
Correlated Failure Risk Repairing
$Server -> 172.28.228.21, 172.28.228.22 goal(failProb(ft)<0.08 | ChNode | Agg3)
Specification:
Core Router1(Core1)
Core Router2(Core2)
Agg Switch3(Agg3)
Server1 (S1)172.28.228.21
Server2 (S2)172.28.228.22
Server3 (S3)172.28.228.23
10.0.0.1 10.0.0.2
75.142.33.98 75.142.33.99
Agg Switch1(Agg1)
Agg Switch2(Agg2)
10.0.0.3
Internet
Server4 (S4)172.28.228.24
HDFS
HBase
HDFS
HBase
Plan 1: Move replica from S1 -> S4 Plan 2: Move replica from S2 -> S4
Synthesis
Correlated Failure Risk Repairing
DepBooster
$Server -> 172.28.228.21, 172.28.228.22 goal(failProb(ft)<0.08 | ChNode | Agg3)
Specification:
Core Router1(Core1)
Core Router2(Core2)
Agg Switch3(Agg3)
Server1 (S1)172.28.228.21
Server2 (S2)172.28.228.22
Server3 (S3)172.28.228.23
10.0.0.1 10.0.0.2
75.142.33.98 75.142.33.99
Agg Switch1(Agg1)
Agg Switch2(Agg2)
10.0.0.3
Internet
Server4 (S4)172.28.228.24
HDFS
HBase
HDFS
HBase
Plan 1: Move replica from S1 -> S4 Plan 2: Move replica from S2 -> S4
Synthesis
Correlated Failure Risk Repairing
DepBooster
SnapAudit
UpdatedService Snapshot
Reliability Goal
Dependency acquisition
and
Fault graph generator
Fault Graph
DepBooster
1. {CoreRouter-1}2. {Agg1, Agg2}… ...
Improvement PlansOperator
• Challenge 1: SnapAudit • Challenge 2: DepBooster
CloudCanary’s WorkflowCloudCanary
Evaluation
• Comparing CloudCanary with the state of the art
• Evaluating CloudCanary’s practicality via real dataset
Evaluation
Accuracy Efficiency Improvement
INDaaS [OSDI’14]
ProbINDaaS [OSDI’14]
reCloud [CoNEXT’16]
RepAudit [OOPSLA’17]
CloudCanary
✔
✔
✘✘✔
✘✔✔✔✔
✘✘✘✘✔
Efficiency Comparison
CloudCanary
INDaaS
ReCloud (107)
6481
S0 S1 S2 S3 S4 S5
Auditing Time (hours)
Update Snapshots
S6
ProbINDaaS (107)
2 16 41 2 82 16 1 2 81 2 161 210 0 0 0 0 0 0
RepAudit
~8 hours
> 16 hours
Accuracy V.S. Efficiency
20
40
60
80
100
1 2 4 8 16 32 256 1024 4096
Acc
ura
cy
Turnaround time (minutes)
INDaaS
ProbINDaaS (107 rounds)reCloud (107 rounds)RepAudit
CloudCanary
• 20,608 switches; 524,288 servers; 638,592 software components • Auditing a random update affecting 20% components
20
40
60
80
100
1 2 4 8 16 32 256 1024 4096
Acc
ura
cy
Turnaround time (minutes)
INDaaS
ProbINDaaS (107 rounds)reCloud (107 rounds)RepAudit
CloudCanary
Accuracy V.S. Efficiency• 20,608 switches; 524,288 servers; 638,592 software components • Auditing a random update affecting 20% components
Our approach is 200x faster than state-of-the-arts, and offers 100% accurate results.
Evaluation
• We evaluated CloudCanary via real update trace:
Detected Num Confirmed Examples
Microservices 50+ 96% Authentication and access control systems introduce most risk groups
Power Sources 10+ 100%Primary and backup power sources are
carelessly assigned to multiple racks hosting a critical service
Network 30+ 100% Aggregation and ToR switches are easily updated to be risk groups
Conclusion
• CloudCanary is the first system for real-time auditing - SnapAudit primitive: Quickly auditing update snapshot - DepBooster: Quickly generating improvement plans
• We evaluated CloudCanary with real trace and large-scale emulations
Thanks, questions?
• CloudCanary is the first system for real-time auditing - SnapAudit primitive: Quickly auditing update snapshot - DepBooster: Quickly generating improvement plans
• We evaluated CloudCanary with real trace and large-scale emulations