check before you change - usenix · server1 (s1) 172.28.228.21 indaas weighted maxsat solver + +...

97
Check before You Change: Preventing Correlated Failures in Service Updates Ennan Zhai Ang Chen Ruzica Piskac Mahesh Balakrishnan Bingchuan Tian Bo Song Haoliang Zhang

Upload: others

Post on 19-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Check before You Change: Preventing Correlated Failures in Service Updates

Ennan Zhai Ang Chen Ruzica Piskac Mahesh Balakrishnan

Bingchuan Tian Bo Song Haoliang Zhang

Page 2: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Background

• Cloud services ensure reliability by redundancy: - Storing data redundantly - Replicating service states across multiple nodes

• Examples: - Amazon AWS, AliCloud, Google Cloud, etc. replicate their

data and service states

Page 3: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

However, cloud outages still occur

Why redundancy does not help?

Page 4: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

An AWS Outage in 2018

Page 5: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Elastic Compute Cloud (EC2)

Elastic Block Store (EBS)

Page 6: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

... ...... ...

EBS Cluster2EBS Cluster1

Elastic Compute Cloud (EC2)

Page 7: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

... ...... ...

VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4

Netflix

EBS Cluster2EBS Cluster1

Page 8: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

... ...... ...

VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4

Netflix

EBS Cluster2EBS Cluster1

Page 9: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

... ...... ...

VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4

EBS Cluster2EBS Cluster1

Netflix

Page 10: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

... ...... ...

VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4

EBS Cluster2EBS Cluster1

Netflix

Page 11: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

... ...... ...

VM1 VM2 VM1 VM3 VM1 VM2 VM3 VM4

EBS Cluster2EBS Cluster1

Netflix

Correlated failures resulting from deep dependencies

Page 12: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Correlated Failures

• Correlated failures are harmful and epidemic: - Propagated to all the redundant instances - Undermine redundancy and fault tolerance efforts

Page 13: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Correlated failures are prevalent

Page 14: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

State of the Art

Page 15: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Post-Failure Forensics

1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …

State of the Art

Page 16: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Post-Failure Forensics

1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …

State of the Art

Proactive Auditing

1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …

Page 17: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

CNF = (A2) ˄ (A1 ˅ A3)

let Server(“172.28.228.21”) -> s1let Server(“172.28.228.22”) -> s2let [s1, s2] -> replet FaultGraph(rep) -> ftlet RankRCG(ft, 2, NET, ft) -> ranklist

1. {Core1[“75.142.33.98”]}2. {Agg1[“10.0.0.1”], Agg2[“10.0.0.2”]}

Core Router1(Core1)

Core Router2(Core2)

Agg Switch3(Agg3)

Server2 (S2)172.28.228.22

Server3 (S3)172.28.228.23

10.0.0.1 10.0.0.2

75.142.33.98 75.142.33.99

Agg Switch1(Agg1)

Agg Switch2(Agg2) 10.0.0.3

Internet

Server4 (S4)172.28.228.24

HDFS

HBase

HDFS

HBase

Server1 (S1)172.28.228.21

INDaaS

Weighted MaxSAT solver

++

Replication

Replica1 Replica2

A2A1 A3 <Weight Vector>

Auditing Engine

Auditing Program in RAL

Auditing Results

Service Deployment (network/software stacks)

Proactive Auditing

• They did pre-deployment recommendations: - Step1: Automatically collecting dependency data - Step2: Modeling system stack in fault graph - Step3: Evaluating alternative deployments’ independence

Page 18: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

CNF = (A2) ˄ (A1 ˅ A3)

let Server(“172.28.228.21”) -> s1let Server(“172.28.228.22”) -> s2let [s1, s2] -> replet FaultGraph(rep) -> ftlet RankRCG(ft, 2, NET, ft) -> ranklist

1. {Core1[“75.142.33.98”]}2. {Agg1[“10.0.0.1”], Agg2[“10.0.0.2”]}

Core Router1(Core1)

Core Router2(Core2)

Agg Switch3(Agg3)

Server2 (S2)172.28.228.22

Server3 (S3)172.28.228.23

10.0.0.1 10.0.0.2

75.142.33.98 75.142.33.99

Agg Switch1(Agg1)

Agg Switch2(Agg2) 10.0.0.3

Internet

Server4 (S4)172.28.228.24

HDFS

HBase

HDFS

HBase

Server1 (S1)172.28.228.21

INDaaS

Weighted MaxSAT solver

++

Replication

Replica1 Replica2

A2A1 A3 <Weight Vector>

Auditing Engine

Auditing Program in RAL

Auditing Results

Service Deployment (network/software stacks)

Proactive Auditing

• They did pre-deployment recommendations: - Step1: Automatically collecting dependency data - Step2: Modeling system stack in fault graph - Step3: Evaluating alternative deployments’ independence

SW

Net

Page 19: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

CNF = (A2) ˄ (A1 ˅ A3)

let Server(“172.28.228.21”) -> s1let Server(“172.28.228.22”) -> s2let [s1, s2] -> replet FaultGraph(rep) -> ftlet RankRCG(ft, 2, NET, ft) -> ranklist

1. {Core1[“75.142.33.98”]}2. {Agg1[“10.0.0.1”], Agg2[“10.0.0.2”]}

Core Router1(Core1)

Core Router2(Core2)

Agg Switch3(Agg3)

Server2 (S2)172.28.228.22

Server3 (S3)172.28.228.23

10.0.0.1 10.0.0.2

75.142.33.98 75.142.33.99

Agg Switch1(Agg1)

Agg Switch2(Agg2) 10.0.0.3

Internet

Server4 (S4)172.28.228.24

HDFS

HBase

HDFS

HBase

Server1 (S1)172.28.228.21

INDaaS

Weighted MaxSAT solver

++

Replication

Replica1 Replica2

A2A1 A3 <Weight Vector>

Auditing Engine

Auditing Program in RAL

Auditing Results

Service Deployment (network/software stacks)

Proactive Auditing

• They did pre-deployment recommendations: - Step1: Automatically collecting dependency data - Step2: Modeling system stack in fault graph - Step3: Evaluating alternative deployments’ independence

Page 20: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

CNF = (A2) ˄ (A1 ˅ A3)

let Server(“172.28.228.21”) -> s1let Server(“172.28.228.22”) -> s2let [s1, s2] -> replet FaultGraph(rep) -> ftlet RankRCG(ft, 2, NET, ft) -> ranklist

1. {Core1[“75.142.33.98”]}2. {Agg1[“10.0.0.1”], Agg2[“10.0.0.2”]}

Core Router1(Core1)

Core Router2(Core2)

Agg Switch3(Agg3)

Server2 (S2)172.28.228.22

Server3 (S3)172.28.228.23

10.0.0.1 10.0.0.2

75.142.33.98 75.142.33.99

Agg Switch1(Agg1)

Agg Switch2(Agg2) 10.0.0.3

Internet

Server4 (S4)172.28.228.24

HDFS

HBase

HDFS

HBase

Server1 (S1)172.28.228.21

INDaaS

Weighted MaxSAT solver

++

Replication

Replica1 Replica2

A2A1 A3 <Weight Vector>

Auditing Engine

Auditing Program in RAL

Auditing Results

Service Deployment (network/software stacks)

Proactive Auditing

• They did pre-deployment recommendations: - Step1: Automatically collecting dependency data - Step2: Modeling system stack in fault graph - Step3: Evaluating alternative deployments’ independence

Page 21: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Redundancy configuration fails

...

Page 22: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Server 2 failsServer 1 fails

Redundancy configuration fails

...

AND gate: all the sublayer nodes fail, the upper layer node fails

Page 23: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Net fails

+" +"

HW fails SW fails SW fails

Server 2 failsServer 1 fails

Net fails HW fails

Redundancy configuration fails

...

OR gate: one of the sublayer nodes fails, the upper layer node fails

Page 24: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Net fails

+" +"

Core1 Agg2Agg1

HW fails

+" +"

Path1 Path2 HBase HDFS

+"

SW fails SW fails

Server 2 failsServer 1 fails

Net fails HW fails

Redundancy configuration fails

... ...... ...

...

...... ...

Page 25: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Net fails

+" +"

Core1 Agg2Agg1

HW fails

+" +"

Path1 Path2 HBase HDFS

+"

SW fails SW fails

Server 2 failsServer 1 fails

Net fails HW fails

Redundancy configuration fails

... ...... ...

...

...... ...

Page 26: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Post-Failure Forensics

1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …

State of the Art

Proactive Auditing

1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …

Page 27: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Post-Failure Forensics

1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …

Correlated Failure Risks in Updates

Proactive Auditing

1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …

Changing network paths

Upgrading software components

Azure global outage: Our DNS update mangled domain records

Benjamin Treynor Sloss, Google's VP of engineering, explained that the root cause of last Sunday's outage was a configuration change for a small group of servers in one region being wrongly applied to a larger number of servers across several neighboring regions.

Page 28: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Post-Failure Forensics

1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …

Problem 1: Inefficient Auditing in Updates

Proactive Auditing

1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …

Changing network paths

Upgrading software components

O(50) hours per auditing V.S.

One update every 3 hours

Page 29: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Post-Failure Forensics

1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …

Problem 2: Lack of fixing risks

Proactive Auditing

1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …

Changing network paths

Upgrading software components

Fix ?Fix ?

Page 30: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Post-Failure Forensics

1. Diagnosis (e.g., Sherlock [SIGCOMM’07]) 2. Accountability (e.g., AVM [OSDI’10]) 3. Provenance (e.g., DiffProv [SIGCOMM’16]) 4. ... …

Proactive Auditing

1. INDaas [OSDI’14] 2. reCloud [CoNEXT’16] 3. RepAudit [OOPSLA’17] 4. ... …

Changing network paths

Upgrading software components

Our Contribution

CloudCanary

Fast Audit & Fix

Page 31: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

CloudCanary’s Workflow

SnapAudit

UpdatedService Snapshot

Reliability Goal

Dependency acquisition

and

Fault graph generator

Fault Graph

DepBooster

1. {CoreRouter-1}2. {Agg1, Agg2}… ...

Improvement PlansOperator

Page 32: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

SnapAudit

UpdatedService Snapshot

Reliability Goal

Dependency acquisition

and

Fault graph generator

Fault Graph

DepBooster

1. {CoreRouter-1}2. {Agg1, Agg2}… ...

Improvement PlansOperator

CloudCanary’s Workflow

Page 33: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

SnapAudit

UpdatedService Snapshot

Reliability Goal

Dependency acquisition

and

Fault graph generator

Fault Graph

DepBooster

1. {CoreRouter-1}2. {Agg1, Agg2}… ...

Improvement PlansOperator

CloudCanary’s Workflow

Page 34: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

SnapAudit

UpdatedService Snapshot

Reliability Goal

Dependency acquisition

and

Fault graph generator

Fault Graph

DepBooster

1. {CoreRouter-1}2. {Agg1, Agg2}… ...

Improvement PlansOperator

• Challenge 1: SnapAudit • Challenge 2: DepBooster

CloudCanary’s Workflow

Page 35: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

SnapAudit

UpdatedService Snapshot

Reliability Goal

Dependency acquisition

and

Fault graph generator

Fault Graph

DepBooster

1. {CoreRouter-1}2. {Agg1, Agg2}… ...

Improvement PlansOperator

• Challenge 1: SnapAudit • Challenge 2: DepBooster

CloudCanary’s Workflow

Page 36: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

SnapAudit

UpdatedService Snapshot

Reliability Goal

Dependency acquisition

and

Fault graph generator

Fault Graph

DepBooster

1. {CoreRouter-1}2. {Agg1, Agg2}… ...

Improvement PlansOperator

• Challenge 1: SnapAudit • Challenge 2: DepBooster

CloudCanary’s WorkflowCloudCanary

Page 37: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

E1 E1

A Fault Graph

Page 38: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

E1 E1

Risk Groups in Fault Graphs

• A risk group means a set of leaf nodes whose simultaneous failures lead to the failure of root node

Page 39: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

{A2} and {A1, A3} are risk groups {A1} or {A3} is not risk group

E1 E1

Risk Groups in Fault Graphs

• A risk group means a set of leaf nodes whose simultaneous failures lead to the failure of root node

Page 40: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

{A2} and {A1, A3} are risk groups {A1} or {A3} is not risk group

E1 E1

Risk Groups in Fault Graphs

• A risk group means a set of leaf nodes whose simultaneous failures lead to the failure of root node

Identifying correlated failure risks can be reduced to

the problem of finding risk groups in the fault graph.

However, analyzing risk groups is

NP-complete problem

Page 41: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

SnapAudit

UpdatedService Snapshot

Reliability Goal

Dependency acquisition

and

Fault graph generator

Fault Graph

DepBooster

1. {CoreRouter-1}2. {Agg1, Agg2}… ...

Improvement PlansOperator

• Challenge 1: SnapAudit • Challenge 2: DepBooster

CloudCanary’s WorkflowCloudCanary

Page 42: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Changing network paths

Upgrading software components

The Insight of SnapAudit

… …

Δ is small Δ′�is small Δ′�′� is small

S S’ S’’

Page 43: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Changing network paths

Upgrading software components

The Insight of SnapAudit

… …

Δ is small Δ′�is small Δ′�′� is small

FirstAudit

S S’ S’’

CloudCanary

Page 44: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Changing network paths

Upgrading software components

The Insight of SnapAudit

… …

Δ is small Δ′�is small Δ′�′� is small

FirstAudit

S S’ S’’

CloudCanary

IncAudit IncAudit IncAudit

Page 45: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Changing network paths

Upgrading software components

… …

Δ is small Δ′�is small Δ′�′� is small

FirstAudit

S S’ S’’

CloudCanary

IncAudit IncAudit IncAudit

SnapAudit: FirstAudit & IncAudit

Page 46: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

X Y

+"

Z

… … … …

R

Page 47: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {S, T}

Page 48: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {S, T}

Page 49: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {S, T}

Page 50: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {A, B} - {A, C} - {B, B} - {B, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {S, T}

Page 51: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {A, B} - {A, C} - {B, B} - {B, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {S, T}

Page 52: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {A, B} - {A, C} - {B} - {B, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {S, T}

Page 53: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {A, B} - {A, C} - {B} - {B, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {S, T}

Page 54: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C} - {B} - {B}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {S, T}

Page 55: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {S, T}H(F)=x31g

- {B} - {A, C}

Page 56: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {S, T}

Page 57: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {S, T}

Page 58: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

FirstAudit Primitive

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {B} - {S, T}

Page 59: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Service initialization

Service Runtime

Changing network paths

Upgrading software components

SnapAudit: FirstAudit & IncAudit

… …

Δ is small Δ′�is small Δ′�′� is small

FirstAudit

S S’ S’’

CloudCanary

IncAudit IncAudit IncAudit

Page 60: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Our Insight

• Algorithm sketch: - Finding all the border nodes (black nodes) - Computing their risk groups - Merging these risk groups towards root

Update

Page 61: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Original Deployment

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=lktd

- {A} - {K} - {S, T}

Z

… … … …

RH(R)=aed8

- {A} - {K} - {B} - {S, T}

Page 62: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Updated Deployment

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

Z

… … … …

Q

R

H(Z)=lktd

- {A} - {K} - {S, T}

H(R)=aed8

- {A} - {K} - {B} - {S, T}

Page 63: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Updated Deployment

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=2xzb

- {A} - {K} - {S, T}

Z

… … … …

Q

R - {A} - {K} - {B} - {S, T}

H(R)=aed8

Page 64: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Updated Deployment

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=2xzb

- {A} - {K} - {S, T}

Z

… … … …

Q

RH(R)=45zc

- {A} - {K} - {B} - {S, T}

Page 65: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=2xzb

- {A} - {K} - {S, T}

Z

… … … …

Q

RH(R)=45zc

- {A} - {K} - {B} - {S, T}

Step 1: Find Border Nodes

Border Nodes

Page 66: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=2xzb

- {A} - {K} - {S, T}

Z

… … … …

Q

RH(R)=45zc

- {A} - {K} - {B} - {S, T}

Border Nodes

Step 2: Q’s Risk Groups

Page 67: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Boolean formula= E1∧E2

= (A1∨A2)∧(A2∨A3)

Our Insight

Page 68: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Boolean formula= E1∧E2

= (A1∨A2)∧(A2∨A3)

SAT solverSatisfying assignment: {A1=1, A2=0, A3=1}

Our Insight

Page 69: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Boolean formula= E1∧E2

= (A1∨A2)∧(A2∨A3)

SAT solver{A1=0, A2=1, A3=0}

• Problem: - Standard SAT solver outputs an arbitrary satisfying

assignment

Our Insight

Page 70: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Boolean formula= E1∧E2

= (A1∨A2)∧(A2∨A3)

SAT solver{A1=0, A2=1, A3=0}

• Problem: - Standard SAT solver outputs an arbitrary satisfying

assignment - What we want is top-k critical (minimal) risk groups

Our Insight

Page 71: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

• Using MinCostSAT solver

- Satisfiable assignment with the least weights

- Obtain the least C = ∑ ci ∙ wi

- Very fast with 100% accuracy

• We can use Weighted Partial MaxSAT: - Solve 100 instances less than 100 sec - Each instance contains ~1000 clauses - Industry-scale competition - Pr

Identifying Risk Groups

Page 72: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

• Using MinCostSAT solver

- Satisfiable assignment with the least weights

- Obtain the least C = ∑ ci ∙ wi

- Very fast with 100% accuracy

• We can use Weighted Partial MaxSAT: - Solve 100 instances less than 100 sec - Each instance contains ~1000 clauses - Industry-scale competition - Pr

We set the values of all the leaf nodes (i.e., Wi) as 1

Identifying Risk Groups

Page 73: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

• Using MinCostSAT solver

- Satisfiable assignment with the least weights

- Obtain the least C = ∑ ci ∙ wi

- Very fast with 100% accuracy

• We can use Weighted Partial MaxSAT: - Solve 100 instances less than 100 sec - Each instance contains ~1000 clauses - Industry-scale competition - Pr

1 1 1

A1 A2 A3 Weight1 0 00 1 0 10 0 11 1 0 21 0 1 20 1 1 20 0 01 1 1 3

Identifying Risk Groups

Page 74: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

• Using MinCostSAT solver

- Satisfiable assignment with the least weights

- Obtain the least C = ∑ ci ∙ wi

- Very fast with 100% accuracy

• We can use Weighted Partial MaxSAT: - Solve 100 instances less than 100 sec - Each instance contains ~1000 clauses - Industry-scale competition - Pr

1 1 1

A1 A2 A3 Weight1 0 00 1 0 10 0 11 1 0 21 0 1 20 1 1 20 0 01 1 1 3

Identifying Risk Groups

Page 75: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

• Find out the top-k critical risk groups

- Use a ∧ to connect the current formula and the negation of the resulting assignment

• We can use Weighted Partial MaxSAT: - Solve 100 instances less than 100 sec - Each instance contains ~1000 clauses - Industry-scale competition - Pr

(A1∨A2)∧(A2∨A3) ∧ ¬(¬A1 ∧ A2 ∧ ¬A3)

Identifying Risk Groups

Page 76: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=2xzb

- {A} - {K} - {S, T}

Z

… … … …

Q

RH(R)=45zc

- {A} - {K} - {B} - {S, T}

Step 2: Q’s Risk Groups

Page 77: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=2xzb

- {A} - {K} - {S, T}

Z

… … … …

Q

RH(R)=45zc

- {A} - {K} - {B} - {S, T}

Step 2: Q’s Risk Groups

H(Q)=x1r7

- {S} - {K}

Page 78: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=2xzb

- {A} - {K} - {S, T}

Z

… … … …

Q

RH(R)=45zc

- {A} - {K} - {B} - {S, T}

H(Q)=x1r7

- {S} - {K}

Step 3: Merging Changed Caches

Page 79: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=2xzb

- {A} - {K} - {S}

Z

… … … …

Q

RH(R)=45zc

- {A} - {K} - {B} - {S, T}

H(Q)=x1r7

- {S} - {K}

Step 3: Merging Changed Caches

Page 80: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=2xzb

- {A} - {K} - {S}

Z

… … … …

Q

RH(R)=45zc

- {A} - {K} - {B} - {S, T}

H(Q)=x1r7

- {S} - {K}

Step 3: Merging Changed Caches

Page 81: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

A B

D

B C

E

+" +"

F

+"

H(D)=43cd

- {A} - {B}

H(E)=a4vo

- {B} - {C}

H(F)=x31g

- {B} - {A, C}

X YH(X)=xbn7

- {A} - {S,T} H(Y)=bbk9

- {K} - {A,S}

+"

H(Z)=2xzb

- {A} - {K} - {S}

Z

… … … …

Q

RH(R)=45zc

- {A} - {K} - {B} - {S}

H(Q)=x1r7

- {S} - {K}

Step 3: Merging Changed Caches

Page 82: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

SnapAudit

UpdatedService Snapshot

Reliability Goal

Dependency acquisition

and

Fault graph generator

Fault Graph

DepBooster

1. {CoreRouter-1}2. {Agg1, Agg2}… ...

Improvement PlansOperator

• Challenge 1: SnapAudit • Challenge 2: DepBooster

CloudCanary’s WorkflowCloudCanary

Page 83: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Core Router1(Core1)

Core Router2(Core2)

Agg Switch3(Agg3)

Server1 (S1)172.28.228.21

Server2 (S2)172.28.228.22

Server3 (S3)172.28.228.23

10.0.0.1 10.0.0.2

75.142.33.98 75.142.33.99

Agg Switch1(Agg1)

Agg Switch2(Agg2)

10.0.0.3

Internet

Server4 (S4)172.28.228.24

HDFS

HBase

HDFS

HBase

Correlated Failure Risk Repairing

Page 84: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

$Server -> 172.28.228.21, 172.28.228.22 goal(failProb(ft)<0.08 | ChNode | Agg3)

Specification:

Core Router1(Core1)

Core Router2(Core2)

Agg Switch3(Agg3)

Server1 (S1)172.28.228.21

Server2 (S2)172.28.228.22

Server3 (S3)172.28.228.23

10.0.0.1 10.0.0.2

75.142.33.98 75.142.33.99

Agg Switch1(Agg1)

Agg Switch2(Agg2)

10.0.0.3

Internet

Server4 (S4)172.28.228.24

HDFS

HBase

HDFS

HBase

Correlated Failure Risk Repairing

Page 85: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

$Server -> 172.28.228.21, 172.28.228.22 goal(failProb(ft)<0.08 | ChNode | Agg3)

Specification:

Core Router1(Core1)

Core Router2(Core2)

Agg Switch3(Agg3)

Server1 (S1)172.28.228.21

Server2 (S2)172.28.228.22

Server3 (S3)172.28.228.23

10.0.0.1 10.0.0.2

75.142.33.98 75.142.33.99

Agg Switch1(Agg1)

Agg Switch2(Agg2)

10.0.0.3

Internet

Server4 (S4)172.28.228.24

HDFS

HBase

HDFS

HBase

Correlated Failure Risk Repairing

Page 86: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

$Server -> 172.28.228.21, 172.28.228.22 goal(failProb(ft)<0.08 | ChNode | Agg3)

Specification:

Core Router1(Core1)

Core Router2(Core2)

Agg Switch3(Agg3)

Server1 (S1)172.28.228.21

Server2 (S2)172.28.228.22

Server3 (S3)172.28.228.23

10.0.0.1 10.0.0.2

75.142.33.98 75.142.33.99

Agg Switch1(Agg1)

Agg Switch2(Agg2)

10.0.0.3

Internet

Server4 (S4)172.28.228.24

HDFS

HBase

HDFS

HBase

Plan 1: Move replica from S1 -> S4 Plan 2: Move replica from S2 -> S4

DepBooster

Correlated Failure Risk Repairing

Page 87: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

$Server -> 172.28.228.21, 172.28.228.22 goal(failProb(ft)<0.08 | ChNode | Agg3)

Specification:

Core Router1(Core1)

Core Router2(Core2)

Agg Switch3(Agg3)

Server1 (S1)172.28.228.21

Server2 (S2)172.28.228.22

Server3 (S3)172.28.228.23

10.0.0.1 10.0.0.2

75.142.33.98 75.142.33.99

Agg Switch1(Agg1)

Agg Switch2(Agg2)

10.0.0.3

Internet

Server4 (S4)172.28.228.24

HDFS

HBase

HDFS

HBase

Plan 1: Move replica from S1 -> S4 Plan 2: Move replica from S2 -> S4

Synthesis

Correlated Failure Risk Repairing

DepBooster

Page 88: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

$Server -> 172.28.228.21, 172.28.228.22 goal(failProb(ft)<0.08 | ChNode | Agg3)

Specification:

Core Router1(Core1)

Core Router2(Core2)

Agg Switch3(Agg3)

Server1 (S1)172.28.228.21

Server2 (S2)172.28.228.22

Server3 (S3)172.28.228.23

10.0.0.1 10.0.0.2

75.142.33.98 75.142.33.99

Agg Switch1(Agg1)

Agg Switch2(Agg2)

10.0.0.3

Internet

Server4 (S4)172.28.228.24

HDFS

HBase

HDFS

HBase

Plan 1: Move replica from S1 -> S4 Plan 2: Move replica from S2 -> S4

Synthesis

Correlated Failure Risk Repairing

DepBooster

Page 89: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

SnapAudit

UpdatedService Snapshot

Reliability Goal

Dependency acquisition

and

Fault graph generator

Fault Graph

DepBooster

1. {CoreRouter-1}2. {Agg1, Agg2}… ...

Improvement PlansOperator

• Challenge 1: SnapAudit • Challenge 2: DepBooster

CloudCanary’s WorkflowCloudCanary

Page 90: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Evaluation

• Comparing CloudCanary with the state of the art

• Evaluating CloudCanary’s practicality via real dataset

Page 91: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Evaluation

Accuracy Efficiency Improvement

INDaaS [OSDI’14]

ProbINDaaS [OSDI’14]

reCloud [CoNEXT’16]

RepAudit [OOPSLA’17]

CloudCanary

✘✘✔

✘✔✔✔✔

✘✘✘✘✔

Page 92: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Efficiency Comparison

CloudCanary

INDaaS

ReCloud (107)

6481

S0 S1 S2 S3 S4 S5

Auditing Time (hours)

Update Snapshots

S6

ProbINDaaS (107)

2 16 41 2 82 16 1 2 81 2 161 210 0 0 0 0 0 0

RepAudit

~8 hours

> 16 hours

Page 93: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Accuracy V.S. Efficiency

20

40

60

80

100

1 2 4 8 16 32 256 1024 4096

Acc

ura

cy

Turnaround time (minutes)

INDaaS

ProbINDaaS (107 rounds)reCloud (107 rounds)RepAudit

CloudCanary

• 20,608 switches; 524,288 servers; 638,592 software components • Auditing a random update affecting 20% components

Page 94: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

20

40

60

80

100

1 2 4 8 16 32 256 1024 4096

Acc

ura

cy

Turnaround time (minutes)

INDaaS

ProbINDaaS (107 rounds)reCloud (107 rounds)RepAudit

CloudCanary

Accuracy V.S. Efficiency• 20,608 switches; 524,288 servers; 638,592 software components • Auditing a random update affecting 20% components

Our approach is 200x faster than state-of-the-arts, and offers 100% accurate results.

Page 95: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Evaluation

• We evaluated CloudCanary via real update trace:

Detected Num Confirmed Examples

Microservices 50+ 96% Authentication and access control systems introduce most risk groups

Power Sources 10+ 100%Primary and backup power sources are

carelessly assigned to multiple racks hosting a critical service

Network 30+ 100% Aggregation and ToR switches are easily updated to be risk groups

Page 96: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Conclusion

• CloudCanary is the first system for real-time auditing - SnapAudit primitive: Quickly auditing update snapshot - DepBooster: Quickly generating improvement plans

• We evaluated CloudCanary with real trace and large-scale emulations

Page 97: Check before You Change - USENIX · Server1 (S1) 172.28.228.21 INDaaS Weighted MaxSAT solver + + Replication Replica1 Replica2 A1 A2 A3  Auditing Engine Auditing

Thanks, questions?

• CloudCanary is the first system for real-time auditing - SnapAudit primitive: Quickly auditing update snapshot - DepBooster: Quickly generating improvement plans

• We evaluated CloudCanary with real trace and large-scale emulations