Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions OSDI ‘20 Sebastien Levy†, Randolph Yao†, Youjiang Wu†, Yingnong Dang†, Peng Huang^, Zheng Mu†, Pu Zhao*, Tarun Ramani†, Naga Govindaraju†, Xukun Li†, Qingwei Lin*, Gil Lapid Shafriri†, Murali Chintalapati† † Microsoft Azure, ^ Johns Hopkins University, * Microsoft Research


Post on 15-Nov-2020


Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions

OSDI ‘20

Sebastien Levy†, Randolph Yao†, Youjiang Wu†, Yingnong Dang†, Peng Huang^, Zheng Mu†, Pu Zhao*, Tarun Ramani†, Naga Govindaraju†, Xukun Li†, Qingwei Lin*, Gil Lapid Shafriri†, Murali Chintalapati†

† Microsoft Azure, ^ Johns Hopkins University, * Microsoft Research


Azure global infrastructure

Azure regions

Heterogeneity

All nodes are different

Different characteristics: # cores, total memory …

Different hardware type / vendors

Different workload patterns

Different health history


Measuring Customer Experience

• We need to monitor VM availability: down time / up time

• But each VM interruption has a significant impact on the customer
  • Disrupts the user experience (e.g., gaming)
  • Applications take time to recover
  • Repeated reboots frustrate customers
  • Two short interruptions are more impactful than one longer one


Impact of Node Failure

[Diagram: two nodes, each hosting five VMs on shared disks, CPU, and memory; the second node has a FW/HW failure]

• Node-level failure ➔ bad impact for every VM it contains

• We need to predict failures and take the appropriate mitigation actions


Traditional Operation Workflow

[Diagram: a failed node and a healthy node, each with VMs on shared disks, CPU, and memory]

Node failure, then:

1. Detection
2. Diagnosis
3. Migration of the VMs to a healthy node; offline & repair the failed node

Outcome: all VMs reboot; long VM downtime

Attempt 1: Mitigation Before Diagnostics

[Diagram: a failed node and a healthy node, each with VMs on shared disks, CPU, and memory]

Node failure, then:

1. Detection
2. Migration of the VMs to a healthy node
3. Diagnosis; offline & repair the failed node

Outcome: all VMs reboot; short VM downtime

Mitigation can be better in some cases

Attempt 2: Prediction of Node Failure

[Diagram: a node predicted to fail and a healthy node, each with VMs on shared disks, CPU, and memory]

1. Node predicted to fail (wait)
2. Diagnosis
3. Migration of the VMs to a healthy node; offline & repair the failed node

Use prediction to speed up OS crash mitigation

Outcome after the node failure: all VMs reboot; shorter VM downtime

Mitigation suboptimal for wrong prediction

Attempt 3: Prediction + Static Mitigation

[Diagram: a node predicted to fail and a healthy node, each with VMs on shared disks, CPU, and memory]

1. Node predicted to fail
2. Block allocation (capacity impact); live migrate eligible VMs (VM pause impact)
3. Diagnosis; offline & repair node

Outcome: some VMs did not reboot; other VMs reboot; shorter VM downtime

Increased impact for false positives

Can we do better?

Mitigation Tradeoff

❖ Can a mitigation be effective for all kinds of scenarios?

❖ How can we confirm that a mitigation is effective?

❖ Even if a mitigation is effective, can we do better? Can its effectiveness change over time?


Mitigation Tradeoff: No One-Size-Fits-All

How to mitigate predicted node failures: a reasonable proposal

1. Block allocation
2. Live-migrate eligible VMs
3. Wait for 7 days
4. Force migration of the remaining VMs
5. Diagnosis + Repair

Impacts along the way: capacity impact, VM pause impact, VM interruption impact

• If failure is not imminent, do we need to completely block allocation?

• How long should we wait for customers to intentionally move their VMs?

• If the node has still not failed after 7 days, could it now be healthy?


How To Estimate Mitigation Efficacy?

• Heterogeneity: multiple factors impact the mitigation efficacy
  • Live migration success rate depends on available capacity, hardware health, and workload on the node
  • Allocation relies on complex optimization logic and on stochastic customer demand
  • Prediction false positives are unavoidable and can depend on unobservable signals


Key Insights

“In a heterogeneous and ever-changing cloud system, the effectiveness of a mitigation action is often probabilistic.”

“To select the (near-)optimal mitigation, each possible action needs to be compared at-scale with production workload.”


Solution

Continuous probabilistic online minimization of customer impact

Different mitigation actions depending on the type of predicted failure

Online testing of the different options

Adapt to ever-changing cloud behavior

Account for node heterogeneity of Azure’s fleet

Focus on impacting failures


Narya’s Overview

[Diagram: pipeline with the stages Prediction, Decision, Mitigation, and Impact Assessment. Labels: Prediction Rule, ML failure prediction, Choose action, Customer impact, Diagnostics, Offline Monitoring, Cost, Labels, Feedback loop]

Main Challenges

• Define customer impact as a generic metric

• Safely test actions in production

• Balance exploring actions and exploiting the best so far

• Adapt our decision to system changes

• Fast and scalable mitigation decision

Measuring Interruptions

• Interruption types:
  • VM reboot, IO pause, temporary VM freeze / blackout

• Service availability
  • Time-duration based
  • Short VM downtime DOES NOT imply short customer service downtime
  • Each VM service downtime requires the customer service to rehydrate its state

• Interruption count
  • Event-count based
  • Each VM interruption impacts the customer service once
  • More realistically reflects the customer pain


Defining Customer Impact

• VM interruptions are the main negative impact on customers

• We define the Annual Interruption Rate (AIR):

$$\mathrm{AIR} = \frac{\#\ \text{interruptions in } T}{\text{total up time in } T} \times 365\ \text{days} \times 100$$

When scoping it to a contribution per node and removing the constants, we get:

Cost = # VM interruptions per node
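As a worked illustration of the AIR formula, the sketch below assumes that total up time is measured in VM-days, so the result reads as interruptions per 100 VM-years. The function name and the input numbers are hypothetical and not taken from the talk.

```python
def annual_interruption_rate(interruptions: int, uptime_days: float) -> float:
    """AIR = (# interruptions in T / total up time in T) * 365 days * 100.

    Assumption: uptime_days is the total VM up time in T, expressed in days.
    """
    if uptime_days <= 0:
        raise ValueError("uptime_days must be positive")
    return interruptions / uptime_days * 365 * 100


# Hypothetical example: 5 interruptions over 36,500 VM-days of up time,
# i.e. about 100 VM-years, which gives an AIR of roughly 5.
air = annual_interruption_rate(5, 36_500)
print(f"AIR = {air:.2f}")
```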


Narya’s Prediction

Expert Rules ML Prediction


Motivation To Use ML Prediction

• Expert rule
  • Sets a threshold on a single predictive signal. Example: disk low spare space

• Limitations of expert rules
  • High variance of the prediction horizon: the actual failure time also depends on other factors
  • Difficult precision-recall tradeoff: strict rules result in low recall, loose rules yield low precision

• What if we look at all signals before the actual failure and use those signals to predict the failure?
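The precision-recall tradeoff of a single-signal expert rule can be seen on a toy example. The sketch below scores a rule of the form "predict failure when spare space drops below a threshold". The observations are made up for illustration, and the interpretation of the signal as a percentage is an assumption.

```python
def precision_recall(spare_space, failed, threshold):
    """Score the rule 'predict failure when spare space < threshold'."""
    predicted = [s < threshold for s in spare_space]
    tp = sum(p and f for p, f in zip(predicted, failed))
    fp = sum(p and not f for p, f in zip(predicted, failed))
    fn = sum((not p) and f for p, f in zip(predicted, failed))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy, made-up observations: spare space (%) and whether the node later failed.
spare_space = [2, 5, 8, 12, 15, 20, 30, 40]
failed = [True, True, False, True, False, True, False, False]

strict = precision_recall(spare_space, failed, threshold=6)   # flags few nodes
loose = precision_recall(spare_space, failed, threshold=25)   # flags many nodes
print("strict rule (precision, recall):", strict)
print("loose rule (precision, recall):", loose)
```

On this toy data the strict threshold flags only nodes that really fail but misses half of the failures, while the loose threshold catches every failure at the price of false positives.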


ML Model Highlights

• Comprehensive list of signals that cover OS-layer, driver-layer, and device-layer observations

• Predict server-level failures that have customer impact

• Non-handcrafted features
  • Spatial attention
  • Temporal convolution

Longer lead time, higher precision, higher recall

Diversity of Mitigation Actions

Allocation change:
• Allow allocation
• Avoid allocation*
• Block allocation

VM migration:
• Live migrate VMs* (unallocatable node only)
• Service heal VMs (unallocatable node only)

Node reboot:
• Soft kernel reboot* (unallocatable node only)
• Hard reboot
• Offline + repair (empty node only)

* Non-deterministic action

Impact legend: no impact, pause impact, soft capacity impact, hard capacity impact, VM reboot impact

Narya’s Mitigation Actions

• Use composite actions (sequence of actions)
  • Avoid illogical scenarios: migration without blocking allocation, repeated reboots …
  • Add safety constraint to minimize the cost of exploration
  • Easy override priority if several rules overlap
  • Simpler learning

Example: [diagram of a composite action with the branches "If OS crashes" and "If empty"]

Composite actions’ duration is typically on the order of days

Narya’s Decision Engine

• Maps each predicted bad node to the best composite mitigation action to minimize customer pain

• Two algorithms:
  • AB testing: explore all the different possibilities, observe the customer pain metric(s), then use the best action if one exists
  • Multi-Armed Bandit (Thompson Sampling): learn the right tradeoff between exploring new actions and exploiting the estimated best one, based on a single customer pain metric


AB Testing

1. Sample an action with preconfigured probabilities

2. Observe customer impact within an observation window

3. Run a hypothesis test on the costs of all actions

4. If the difference is statistically significant, use the optimal action
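The four steps above can be sketched for the two-action case as follows. This is only an illustration: the talk does not say which hypothesis test is used, so a permutation test on the difference of mean costs stands in for it, and all function names, the significance level, and the cost samples are hypothetical.

```python
import random


def sample_action(probabilities, rng):
    """Step 1: draw an action according to preconfigured probabilities."""
    actions = list(probabilities)
    weights = [probabilities[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(costs_a, costs_b, rng, rounds=2000):
    """Step 3 (assumed test): two-sided permutation test on mean cost."""
    observed = abs(_mean(costs_a) - _mean(costs_b))
    pooled = list(costs_a) + list(costs_b)
    n_a = len(costs_a)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:])) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)


def pick_action(costs_by_action, rng, alpha=0.05):
    """Step 4: return the cheaper action if significant, otherwise None."""
    (name_a, costs_a), (name_b, costs_b) = costs_by_action.items()
    if permutation_p_value(costs_a, costs_b, rng) >= alpha:
        return None  # keep exploring with the preconfigured probabilities
    return name_a if _mean(costs_a) < _mean(costs_b) else name_b


# Step 2 is represented by made-up per-node costs (VM interruptions per node)
# collected during the observation window.
observed_costs = {
    "A": [0] * 40 + [1] * 10,
    "B": [2] * 30 + [3] * 20,
}
decision = pick_action(observed_costs, random.Random(1))
print("selected action:", decision)
```

Returning `None` when the test is inconclusive keeps the experiment in its exploration phase instead of committing to an action too early.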

[Diagram: Action A vs. Action B, compare statistical impact]

Adapting AB Testing To Our Scenario

• Cost attribution: count all VM reboots during the observation window
  • It is not possible to know which interruptions are caused by the action

• Stickiness: the same node always uses the same action
  • Otherwise we would violate the i.i.d. assumption

• Decision vs. action: we observe the consequence of every decision
  • Even if it is skipped, delayed, or overridden
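One common way to get the stickiness property described above is to derive the action from a hash of the node identifier instead of a fresh random draw. The sketch below is an assumption about how this could be done, not a description of Narya's implementation; `sticky_action`, the experiment name, and the node IDs are hypothetical.

```python
import hashlib


def sticky_action(node_id, experiment, probabilities):
    """Deterministically map a node to an action with the given probabilities.

    The same (experiment, node_id) pair always yields the same action.
    """
    if not probabilities:
        raise ValueError("at least one action is required")
    digest = hashlib.sha256(f"{experiment}:{node_id}".encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a roughly uniform number in [0, 1].
    u = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for action, p in probabilities.items():
        cumulative += p
        if u < cumulative:
            return action
    return action  # guard against floating-point rounding at the upper edge


probs = {"NoOp": 0.5, "UA-LM-RH": 0.5}
first = sticky_action("node-42", "io-timeouts", probs)
second = sticky_action("node-42", "io-timeouts", probs)
print(first, second)  # the two values are always identical
```

Including the experiment name in the hashed key lets the same node land in different arms across unrelated experiments while staying fixed within one.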


Bandit Motivation

• AB testing limitations
  • Does not leverage early observation before statistical significance
  • Cannot adapt to change after statistical significance

• Multi-armed bandit: dynamically learn the probability based on observations
  ✓ Leverage early observation in exploration
  ✓ Adapt to change in exploitation


Bandit Framework

Reward = - Customer impact (Cost)

Machines = Available composite actions

Pull / Game = New node mitigation request

Each prediction rule is a different bandit experiment
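The framework above can be sketched with Thompson Sampling over composite actions. The talk does not state the reward model, so this sketch assumes the per-node cost is Poisson distributed with a Gamma prior on its mean, which is a convenient choice for counts. Rewards are observed immediately here, and the true mean costs are made up purely to drive the simulation.

```python
import math
import random


class GammaPoissonArm:
    """Posterior over one action's mean cost (VM interruptions per node)."""

    def __init__(self, prior_shape=1.0, prior_rate=1.0):
        self.shape = prior_shape  # pseudo-count of interruptions
        self.rate = prior_rate    # pseudo-count of mitigated nodes

    def update(self, cost):
        self.shape += cost
        self.rate += 1.0

    def sample_mean_cost(self, rng):
        # random.gammavariate takes (shape, scale), and scale = 1 / rate.
        return rng.gammavariate(self.shape, 1.0 / self.rate)


def choose_action(arms, rng):
    """One pull: sample a plausible cost per arm and take the lowest.

    Minimizing the sampled cost equals maximizing the reward (-cost).
    """
    samples = {name: arm.sample_mean_cost(rng) for name, arm in arms.items()}
    return min(samples, key=samples.get)


def poisson(lam, rng):
    """Knuth's method for drawing a Poisson-distributed count."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate(true_mean_cost, pulls, seed=0):
    rng = random.Random(seed)
    arms = {name: GammaPoissonArm() for name in true_mean_cost}
    counts = {name: 0 for name in true_mean_cost}
    for _ in range(pulls):  # each pull is a new node mitigation request
        action = choose_action(arms, rng)
        arms[action].update(poisson(true_mean_cost[action], rng))
        counts[action] += 1
    return counts


# Hypothetical true mean costs, used only for the simulation.
counts = simulate({"NoOp": 3.0, "UA-LM-RH": 1.0}, pulls=500)
print(counts)
```

Because each prediction rule is a separate experiment, one such set of arms would be kept per rule.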


Bandit Adaptation To Our Scenario

• Accommodate temporal change: exponential decay

• Delayed rewards

• Bandit stickiness

• Safeguards
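To illustrate how exponential decay lets an estimate follow temporal change, the sketch below keeps an exponentially weighted mean of observed costs. The decay factor, the class name, and the cost sequence are hypothetical, and the talk does not specify how Narya applies the decay.

```python
class DecayedMean:
    """Mean cost in which older observations count exponentially less."""

    def __init__(self, decay=0.9):
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        self.weighted_sum = 0.0
        self.weight = 0.0

    def update(self, cost):
        # Shrink the influence of everything seen so far, then add the new point.
        self.weighted_sum = self.decay * self.weighted_sum + cost
        self.weight = self.decay * self.weight + 1.0

    @property
    def value(self):
        return self.weighted_sum / self.weight


# Made-up cost history: the action is harmless at first, then starts costing 2.
history = [0] * 50 + [2] * 50
decayed = DecayedMean(decay=0.9)
for cost in history:
    decayed.update(cost)

plain_mean = sum(history) / len(history)
print(f"plain mean = {plain_mean:.2f}, decayed mean = {decayed.value:.2f}")
```

The plain mean still averages in the stale early observations, while the decayed mean sits close to the recent cost level.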


Ensuring Safe Exploration

• Only allow relevant composite actions

• Override logic if other more severe issues are detected

• Fall back to AB testing if there is not enough observation data

• Enforce minimum and maximum probability for each action
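One simple way to bound action probabilities is to mix the learned distribution with a uniform floor, which also caps every probability at 1 - (n - 1) * p_min for n actions. This is a sketch under that assumption; the talk does not describe Narya's actual mechanism, and the function name and the numbers are hypothetical.

```python
def apply_probability_floor(probabilities, p_min):
    """Return probabilities that are each at least p_min and still sum to 1.

    Assumes the input probabilities sum to 1. With n actions, each result is
    also at most 1 - (n - 1) * p_min, so the floor implies a ceiling.
    """
    n = len(probabilities)
    if n == 0 or not 0.0 <= p_min <= 1.0 / n:
        raise ValueError("p_min must be between 0 and 1 / number of actions")
    return {
        action: p_min + (1.0 - n * p_min) * p
        for action, p in probabilities.items()
    }


# Hypothetical bandit output that has collapsed onto a single action.
bounded = apply_probability_floor({"NoOp": 1.0, "UA-LM-RH": 0.0}, p_min=0.05)
print(bounded)
```

Keeping a small probability on every action means the bandit continues to collect observations for actions it currently considers worse, which is what allows it to notice later changes.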


Narya’s System Architecture

[Diagram labels: Mitigation Engine, ML Prediction, Pub / Sub, Real-time Node Monitoring, Agent, Action Orchestrator, Policy Generator, Request Handler, Model Serving Platform, Learner, VM Impact, Mitigation decision]

Scalable, fast, adaptive

Overall Savings: 26%

• Compare the use of AB testing/Bandit to previous static strategy

• Estimated AIR savings: [Observed AIR] – [Control group AIR mapped to whole fleet]

[Bar chart: % improvement (y-axis, 0 to 60) per experiment (x-axis). Label: Bandit improvement over AB testing]

Prediction Performance

• Precision – 79.5%

• Recall – 50.7%

• ML prediction: time to failure – 48 hours on average


Compare AB Testing and Bandits

• Counterfactual estimation

[Bar chart, axis: Improvement (%), 0 to 25. Values as extracted: 18.4, 7.7, 22.3, 16.6, 3.8, 7.1, 15.3. Prediction rule labels as extracted: 2 Ierr E500 E11 Tardigrade E7 30 Threshold 1 Ierr 63023 Orange ML Prediction]

Case Study: I/O Timeouts

• A/B testing then Bandit experiment on I/O timeouts prediction rule

• Unexpected system changes switched probability from using NoOp to using UA-LM-RH automatically

• Although change is not understood, bandit can automatically adjust

[Chart: bandit probability (y-axis, 0 to 1) over time (x-axis, 2/18/2020 to 4/7/2020). Series: UaProb, NoOpProb]

UA-LM-RH: Unallocatable + Live Migration + Reset Node Health


Operation Learnings

• Many factors influence efficiency: need for probabilistic approaches

• Decisions may go wrong: closely monitor the behavior of all components and use interpretable models

• Data quality is critical: watch for telemetry schema changes

• Customer impact is complex: a human in the loop helps guard against new types of impact


Summary / Takeaway Message

• Both failure prediction and proactive mitigation are critical

• There is no one-size-fits-all mitigation strategy: adapt different mitigation strategies in an online fashion

• Narya uses AB testing and multi-armed bandits to proactively and adaptively mitigate predicted bad nodes

• Narya achieved 26% improvement over previous static strategy


Thank you!

• Acknowledgement
  • Haryadi Gunawi, our shepherd, and the anonymous reviewers
  • All our collaborators within Azure

• Contact us

[email protected]

[email protected]

[email protected]

[email protected]
