Check before You Change - USENIX

Check before You Change: Preventing Correlated Failures in Service Updates

Ennan Zhai Ang Chen Ruzica Piskac Mahesh Balakrishnan

Bingchuan Tian Bo Song Haoliang Zhang

Background

• Cloud services ensure reliability by redundancy:
  - Storing data redundantly
  - Replicating service states across multiple nodes

• Examples:
  - Amazon AWS, AliCloud, Google Cloud, etc. replicate their data and service states

However, cloud outages still occur

Why does redundancy not help?

An AWS Outage in 2018

[Figure: Netflix, Elastic Compute Cloud (EC2) VMs (VM1 to VM4), and Elastic Block Store (EBS) with EBS Cluster1 and EBS Cluster2.]

Correlated failures resulting from deep dependencies

Correlated Failures

• Correlated failures are harmful and epidemic:
  - Propagated to all the redundant instances
  - Undermine redundancy and fault tolerance efforts

Correlated failures are prevalent

State of the Art

[Figure: timeline from service initialization to service runtime.]

Post-Failure Forensics

1. Diagnosis (e.g., Sherlock [SIGCOMM’07])
2. Accountability (e.g., AVM [OSDI’10])
3. Provenance (e.g., DiffProv [SIGCOMM’16])
4. ...

Proactive Auditing

1. INDaaS [OSDI’14]
2. reCloud [CoNEXT’16]
3. RepAudit [OOPSLA’17]
4. ...

Proactive Auditing

• They provide pre-deployment recommendations:
  - Step1: Automatically collecting dependency data
  - Step2: Modeling system stack in fault graph
  - Step3: Evaluating alternative deployments’ independence

[Figure: service deployment (network/software stacks). Internet; Core Router1 (Core1, 75.142.33.98) and Core Router2 (Core2, 75.142.33.99); Agg Switch1 (Agg1, 10.0.0.1), Agg Switch2 (Agg2, 10.0.0.2), Agg Switch3 (Agg3, 10.0.0.3); Server1 (S1, 172.28.228.21), Server2 (S2, 172.28.228.22), Server3 (S3, 172.28.228.23), Server4 (S4, 172.28.228.24); HDFS and HBase; replication with Replica1 and Replica2. Components shown: INDaaS, auditing program in RAL, auditing engine, weighted MaxSAT solver with a weight vector over A1, A2, A3, auditing results.]

CNF = (A2) ∧ (A1 ∨ A3)

Auditing program in RAL:

  let Server(“172.28.228.21”) -> s1
  let Server(“172.28.228.22”) -> s2
  let [s1, s2] -> rep
  let FaultGraph(rep) -> ft
  let RankRCG(ft, 2, NET, ft) -> ranklist

Auditing results:

  1. {Core1[“75.142.33.98”]}
  2. {Agg1[“10.0.0.1”], Agg2[“10.0.0.2”]}

[Figure: fault graph for the redundancy deployment. Root node: "Redundancy configuration fails", with children "Server 1 fails" and "Server 2 fails". Each server node depends on "Net fails", "HW fails", and "SW fails". Lower-level labels include Core1, Agg1, Agg2, Path1, Path2 (network) and HBase, HDFS (software).]

AND gate: the upper-layer node fails only if all of its sublayer nodes fail.

OR gate: the upper-layer node fails if any one of its sublayer nodes fails.

Correlated Failure Risks in Updates

• Changing network paths
• Upgrading software components

Azure global outage: Our DNS update mangled domain records

Benjamin Treynor Sloss, Google's VP of engineering, explained that the root cause of last Sunday's outage was a configuration change for a small group of servers in one region being wrongly applied to a larger number of servers across several neighboring regions.

Problem 1: Inefficient Auditing in Updates

O(50) hours per audit vs. one update every 3 hours

Problem 2: Lack of fixing risks

• Changing network paths: Fix?
• Upgrading software components: Fix?

Our Contribution

CloudCanary: Fast Audit & Fix

CloudCanary’s Workflow

[Figure: CloudCanary workflow. Inputs: an updated service snapshot and a reliability goal. Components: dependency acquisition and fault graph generator (producing the fault graph), SnapAudit, and DepBooster. Outputs to the operator: ranked risk groups, e.g. 1. {CoreRouter-1} 2. {Agg1, Agg2} ..., and improvement plans.]

• Challenge 1: SnapAudit
• Challenge 2: DepBooster

A Fault Graph

Risk Groups in Fault Graphs

• A risk group is a set of leaf nodes whose simultaneous failure leads to the failure of the root node

{A2} and {A1, A3} are risk groups; neither {A1} nor {A3} alone is a risk group.

Identifying correlated failure risks can be reduced to the problem of finding risk groups in the fault graph.

However, analyzing risk groups is an NP-complete problem.
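The risk-group definition can be checked on a small example. The sketch below is an illustration only: it assumes the example fault graph is the one whose Boolean formula, (A1 ∨ A2) ∧ (A2 ∨ A3), appears elsewhere in these slides, and the names `root_fails` and `minimal_risk_groups` are hypothetical. It enumerates leaf subsets by brute force and keeps only the minimal ones.

```python
from itertools import combinations

LEAVES = ["A1", "A2", "A3"]

def root_fails(failed):
    """Assumed example graph: root = E1 AND E2, E1 = A1 OR A2, E2 = A2 OR A3."""
    e1 = "A1" in failed or "A2" in failed
    e2 = "A2" in failed or "A3" in failed
    return e1 and e2

def minimal_risk_groups():
    """Try leaf subsets from smallest to largest; skip supersets of known groups."""
    found = []
    for size in range(1, len(LEAVES) + 1):
        for combo in combinations(LEAVES, size):
            group = set(combo)
            if root_fails(group) and not any(g <= group for g in found):
                found.append(group)
    return found

groups = minimal_risk_groups()
print(groups)
```

The loop visits every subset of the leaves, so its cost grows exponentially with the number of leaves. That is acceptable for three leaves and is in line with the hardness result stated above.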

The Insight of SnapAudit

[Figure: timeline from service initialization through service runtime. Updates (changing network paths, upgrading software components) take the service through snapshots S, S’, S’’, ... The differences between consecutive snapshots are small: Δ is small, Δ′ is small, Δ′′ is small. CloudCanary runs FirstAudit at service initialization and IncAudit on each later update.]

SnapAudit: FirstAudit & IncAudit

FirstAudit Primitive

[Figure: fault graph with root R. D has children A and B; E has children B and C; F is the parent of D and E; Z is the parent of X and Y; the subtrees below X and Y are elided. Each node caches a hash H(node) and its risk groups.]

H(D)=43cd: {A}, {B}
H(E)=a4vo: {B}, {C}
H(F)=x31g: {B}, {A, C}
H(X)=xbn7: {A}, {S, T}
H(Y)=bbk9: {K}, {A, S}
H(Z)=lktd: {A}, {K}, {S, T}
H(R)=aed8: {A}, {K}, {B}, {S, T}

Merging the caches of D and E into F:

1. Combine one group from each child: {A, B}, {A, C}, {B, B}, {B, C}
2. Remove duplicate elements: {A, B}, {A, C}, {B}, {B, C}
3. Keep only the minimal groups: {B}, {A, C}

Merging F into R adds {B} to R's cache: {A}, {K}, {S, T} becomes {A}, {K}, {B}, {S, T}.
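The bottom-up merge shown on the FirstAudit slides can be sketched as follows. This is a minimal illustration, not CloudCanary's implementation. The gate types (OR at D, E, Z, R and AND at F) are inferred here from the cached risk groups on the slides, and the internal nodes `ST` and `AS` under X and Y are hypothetical stand-ins for the elided subtrees.

```python
from itertools import product

def minimize(groups):
    """Remove duplicates and any group that strictly contains another group."""
    unique = set(groups)
    return {g for g in unique if not any(h < g for h in unique)}

def merge_or(parts):
    """OR gate: any child's risk group also fails the parent."""
    return minimize(g for part in parts for g in part)

def merge_and(parts):
    """AND gate: take one risk group from every child and union them."""
    return minimize(frozenset().union(*combo) for combo in product(*parts))

def first_audit(node, graph, cache):
    """Compute and cache minimal risk groups bottom-up."""
    if node in cache:
        return cache[node]
    if node not in graph:                      # leaf node
        groups = {frozenset([node])}
    else:
        gate, children = graph[node]
        parts = [first_audit(c, graph, cache) for c in children]
        groups = merge_and(parts) if gate == "AND" else merge_or(parts)
    cache[node] = groups
    return groups

# Assumed structure; "ST" and "AS" are hypothetical internal nodes.
GRAPH = {
    "D": ("OR", ["A", "B"]), "E": ("OR", ["B", "C"]), "F": ("AND", ["D", "E"]),
    "ST": ("AND", ["S", "T"]), "AS": ("AND", ["A", "S"]),
    "X": ("OR", ["A", "ST"]), "Y": ("OR", ["K", "AS"]),
    "Z": ("OR", ["X", "Y"]), "R": ("OR", ["F", "Z"]),
}

cache = {}
root_groups = first_audit("R", GRAPH, cache)
print(sorted(sorted(g) for g in cache["F"]))
print(sorted(sorted(g) for g in root_groups))
```

Storing groups as frozensets makes the {B, B} to {B} step automatic, and `minimize` performs the superset removal that turns {A, B}, {A, C}, {B}, {B, C} into {B}, {A, C}.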

Our Insight

• Algorithm sketch:
  - Finding all the border nodes (black nodes)
  - Computing their risk groups
  - Merging these risk groups towards root

Update

Original Deployment

H(D)=43cd: {A}, {B}
H(E)=a4vo: {B}, {C}
H(F)=x31g: {B}, {A, C}
H(X)=xbn7: {A}, {S, T}
H(Y)=bbk9: {K}, {A, S}
H(Z)=lktd: {A}, {K}, {S, T}
H(R)=aed8: {A}, {K}, {B}, {S, T}

Updated Deployment

A new node Q appears beneath Z. The hashes on the path from Q to the root change: H(Z) changes from lktd to 2xzb, and H(R) changes from aed8 to 45zc. The hashes and cached risk groups of D, E, F, X, and Y are unchanged.

Step 1: Find Border Nodes

[Figure: the updated fault graph with the border nodes highlighted.]

Step 2: Q’s Risk Groups

Our Insight

Boolean formula = E1 ∧ E2 = (A1 ∨ A2) ∧ (A2 ∨ A3)

SAT solver output, one satisfying assignment per run: {A1=1, A2=0, A3=1} or {A1=0, A2=1, A3=0}

• Problem:
  - Standard SAT solver outputs an arbitrary satisfying assignment
  - What we want is top-k critical (minimal) risk groups

Identifying Risk Groups

• Using MinCostSAT solver
  - Satisfiable assignment with the least weights
  - Obtain the least C = ∑ ci ∙ wi
  - Very fast with 100% accuracy

• We can use Weighted Partial MaxSAT:
  - Solve 100 instances in less than 100 sec
  - Each instance contains ~1000 clauses
  - Industry-scale competition
  - Pr

We set the weights of all the leaf nodes (i.e., wi) to 1.

  A1  A2  A3  Weight
  1   0   0   -
  0   1   0   1
  0   0   1   -
  1   1   0   2
  1   0   1   2
  0   1   1   2
  0   0   0   -
  1   1   1   3

(A dash marks an assignment that does not satisfy the formula.)

• Find out the top-k critical risk groups
  - Use a ∧ to connect the current formula and the negation of the resulting assignment

(A1 ∨ A2) ∧ (A2 ∨ A3) ∧ ¬(¬A1 ∧ A2 ∧ ¬A3)
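The top-k procedure above can be sketched on the three-variable example. The code below is an illustration under stated assumptions: it replaces a real MinCostSAT or Weighted Partial MaxSAT solver with brute-force enumeration, and the names `formula`, `min_cost_assignment`, and `top_k` are hypothetical. Each round finds a least-cost satisfying assignment and then blocks it, which has the same effect as conjoining its negation to the formula.

```python
from itertools import product

LEAVES = ["A1", "A2", "A3"]
WEIGHTS = {"A1": 1, "A2": 1, "A3": 1}   # all leaf weights set to 1, as on the slide

def formula(a):
    """(A1 OR A2) AND (A2 OR A3); a value of 1 means the leaf fails."""
    return bool((a["A1"] or a["A2"]) and (a["A2"] or a["A3"]))

def min_cost_assignment(blocked):
    """Brute-force stand-in for a MinCostSAT solver call."""
    best = None
    for bits in product([0, 1], repeat=len(LEAVES)):
        if bits in blocked:
            continue                      # excluded by an earlier negation
        a = dict(zip(LEAVES, bits))
        if not formula(a):
            continue
        cost = sum(WEIGHTS[x] * a[x] for x in LEAVES)
        if best is None or cost < best[0]:
            best = (cost, bits)
    return best

def top_k(k):
    """Repeat: solve, record the failed leaves, block the assignment."""
    blocked, results = set(), []
    for _ in range(k):
        best = min_cost_assignment(blocked)
        if best is None:
            break                         # no satisfying assignment left
        cost, bits = best
        results.append((cost, {x for x, b in zip(LEAVES, bits) if b}))
        blocked.add(bits)
    return results

print(top_k(3))
```

Blocking the full assignment follows the slide literally. With unit weights, later results may be supersets of earlier ones (for example, a group that contains A2), so a caller that only wants minimal risk groups would need an extra subset filter; that filter is an assumption here and not something the slides describe.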

Step 2: Q’s Risk Groups

H(Q)=x1r7: {S}, {K}

Step 3: Merging Changed Caches

Q's risk groups are merged upward along the changed path, and only the caches on that path are rewritten:

H(Z)=2xzb: {A}, {K}, {S, T} becomes {A}, {K}, {S}
H(R)=45zc: {A}, {K}, {B}, {S, T} becomes {A}, {K}, {B}, {S}

The caches of D, E, F, X, and Y are reused as they are.
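The three IncAudit steps can be sketched with a cache keyed by subtree hash. This is a hedged illustration, not the system's implementation. The hashing scheme (SHA-256 over the gate type and sorted child hashes), the gate types, the hypothetical internal nodes `ST` and `AS`, and the assumption that Q is an OR gate over leaves S and K are all introduced here to reproduce the cached groups shown on the slides.

```python
import hashlib
from itertools import product

def minimize(groups):
    """Remove duplicates and any group that strictly contains another group."""
    unique = set(groups)
    return {g for g in unique if not any(h < g for h in unique)}

def subtree_hash(node, graph):
    """Assumed scheme: hash the gate type plus the sorted child hashes."""
    if node not in graph:
        payload = "leaf:" + node
    else:
        gate, children = graph[node]
        payload = gate + ":" + ",".join(sorted(subtree_hash(c, graph) for c in children))
    return hashlib.sha256(payload.encode()).hexdigest()[:8]

def audit(node, graph, cache, recomputed):
    """Return minimal risk groups, reusing any subtree whose hash is cached."""
    key = subtree_hash(node, graph)
    if key in cache:
        return cache[key]                 # unchanged subtree: no recomputation
    if node not in graph:
        result = {frozenset([node])}
    else:
        gate, children = graph[node]
        parts = [audit(c, graph, cache, recomputed) for c in children]
        if gate == "AND":
            result = minimize(frozenset().union(*combo) for combo in product(*parts))
        else:
            result = minimize(g for part in parts for g in part)
        recomputed.append(node)           # record internal nodes that were merged
    cache[key] = result
    return result

ORIGINAL = {
    "D": ("OR", ["A", "B"]), "E": ("OR", ["B", "C"]), "F": ("AND", ["D", "E"]),
    "ST": ("AND", ["S", "T"]), "AS": ("AND", ["A", "S"]),
    "X": ("OR", ["A", "ST"]), "Y": ("OR", ["K", "AS"]),
    "Z": ("OR", ["X", "Y"]), "R": ("OR", ["F", "Z"]),
}
# The update adds Q under Z (assumed to be an OR gate over leaves S and K).
UPDATED = dict(ORIGINAL, Q=("OR", ["S", "K"]), Z=("OR", ["X", "Y", "Q"]))

cache, first_log, inc_log = {}, [], []
before = audit("R", ORIGINAL, cache, first_log)   # plays the role of FirstAudit
after = audit("R", UPDATED, cache, inc_log)       # plays the role of IncAudit
print(inc_log)
```

In the second call only Q, Z, and R miss the cache, which mirrors the slides: Q is computed fresh and its groups are merged into Z and then R. The recursive `subtree_hash` is recomputed on every call for brevity; a larger graph would memoize it.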

Correlated Failure Risk Repairing

[Figure: example topology. Internet; Core Router1 (Core1, 75.142.33.98) and Core Router2 (Core2, 75.142.33.99); Agg Switch1 (Agg1, 10.0.0.1), Agg Switch2 (Agg2, 10.0.0.2), Agg Switch3 (Agg3, 10.0.0.3); Server1 (S1, 172.28.228.21), Server2 (S2, 172.28.228.22), Server3 (S3, 172.28.228.23), Server4 (S4, 172.28.228.24); HDFS and HBase.]

Specification:

  $Server -> 172.28.228.21, 172.28.228.22
  goal(failProb(ft)<0.08 | ChNode | Agg3)

DepBooster (synthesis) output:

  Plan 1: Move replica from S1 -> S4
  Plan 2: Move replica from S2 -> S4

Evaluation

• Comparing CloudCanary with the state of the art

• Evaluating CloudCanary’s practicality via real dataset

[Table: columns Accuracy, Efficiency, Improvement; rows INDaaS [OSDI’14], ProbINDaaS [OSDI’14], reCloud [CoNEXT’16], RepAudit [OOPSLA’17], CloudCanary. Marks: ✘✘✔; ✘✔✔✔✔; ✘✘✘✘✔]

Efficiency Comparison

[Figure: auditing time (hours) per update snapshot, S0 to S6, for CloudCanary, INDaaS, ProbINDaaS (10^7 rounds), reCloud (10^7 rounds), and RepAudit. Annotations: ~8 hours; > 16 hours.]

Accuracy vs. Efficiency

• 20,608 switches; 524,288 servers; 638,592 software components
• Auditing a random update affecting 20% of the components

[Figure: accuracy (axis from 20 to 100) against turnaround time in minutes (axis from 1 to 4096) for INDaaS, ProbINDaaS (10^7 rounds), reCloud (10^7 rounds), RepAudit, and CloudCanary.]

Our approach is 200x faster than state-of-the-art systems and offers 100% accurate results.

Evaluation

• We evaluated CloudCanary via real update trace:

  Category | Detected Num | Confirmed | Examples
  Microservices | 50+ | 96% | Authentication and access control systems introduce most risk groups
  Power Sources | 10+ | 100% | Primary and backup power sources are carelessly assigned to multiple racks hosting a critical service
  Network | 30+ | 100% | Aggregation and ToR switches are easily updated to be risk groups

Conclusion

• CloudCanary is the first system for real-time auditing
  - SnapAudit primitive: Quickly auditing update snapshot
  - DepBooster: Quickly generating improvement plans

• We evaluated CloudCanary with real trace and large-scale emulations

Thanks, questions?
