Page 1: Exploring Failure Transparency and the Limits of Generic Recovery

Exploring Failure Transparency and the Limits of Generic Recovery

Dave Lowell, Compaq Western Research Lab

Subhachandra Chandra and Peter M. Chen, University of Michigan

Page 2

Introduction

Failure transparency: abstraction of failure-free operation

OS recovers app after hardware, OS, and application failures

– No programmer help
– No slowdown

Will explore theory, performance, and limitations

Page 3

Consistent recovery

Visible output equivalent to failure-free run

– equivalence: allows duplicates
– avoids “exactly once” problem

Failure transparency: consistent recovery with generic techniques
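The notion of equivalence that allows duplicates can be sketched in a few lines of Python. This is only an illustration under a stated assumption that is not in the slides: every visible event carries a unique identifier, so a repeated identifier can only be a replayed duplicate. The function name `equivalent_allowing_duplicates` is hypothetical.

```python
def equivalent_allowing_duplicates(failure_free, recovered):
    """Return True if `recovered` equals `failure_free` once repeats are dropped.

    Assumption (not from the slides): visible events have unique identifiers,
    so seeing an identifier again means the event was replayed after recovery.
    """
    seen = set()
    deduplicated = []
    for event in recovered:
        if event in seen:
            continue  # duplicate of an earlier visible event: tolerated
        seen.add(event)
        deduplicated.append(event)
    return deduplicated == list(failure_free)


failure_free_run = ["v1", "v2", "v3"]
print(equivalent_allowing_duplicates(failure_free_run, ["v1", "v2", "v2", "v3"]))  # True
print(equivalent_allowing_duplicates(failure_free_run, ["v1", "v3"]))              # False
```

A replayed "v2" is accepted, while a missing "v2" is not, which is why this definition sidesteps the “exactly once” problem.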

Page 4

Guaranteeing consistent recovery

Key players: non-deterministic events, visible events, commit events

Save-work invariant (simplified):
– There’s a commit after each non-deterministic event that happens-before a visible event.
– Full theorem handles liveness, distinguishes causality and ordering
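The simplified invariant can be checked mechanically on an event trace. The sketch below assumes a single process, so that happens-before reduces to program order, and reads the invariant as requiring the commit to fall between the non-deterministic event and the visible event. The event names ("nd", "visible", "commit", "other") and the function `save_work_violations` are illustrative, not taken from the talk.

```python
def save_work_violations(trace):
    """Return indices of visible events preceded by an uncommitted ND event.

    `trace` is a list of event kinds in program order for one process.
    Kinds other than "nd", "commit", and "visible" are ignored.
    """
    violations = []
    uncommitted_nd = False
    for index, kind in enumerate(trace):
        if kind == "nd":
            uncommitted_nd = True       # a non-deterministic result now needs saving
        elif kind == "commit":
            uncommitted_nd = False      # everything so far is saved
        elif kind == "visible" and uncommitted_nd:
            violations.append(index)    # output depends on unsaved non-determinism
    return violations


print(save_work_violations(["nd", "commit", "visible", "nd", "other"]))  # []
print(save_work_violations(["nd", "visible", "commit"]))                 # [1]
```

The trailing "nd" in the first trace is not a violation because no visible event follows it. The full theorem mentioned above covers cases this toy checker does not, such as liveness and multi-process causality.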

Page 5

[Figure: protocol space. Horizontal axis: effort to identify/convert ND events. Labels: Commit All, CAND, CAND-LOG.]

Page 6

[Figure: protocol space. Horizontal axis: effort to identify/convert ND events. Vertical axis: effort to commit only visible events. Protocols: CAND, CAND-LOG, CPVS, CPV-2PC, CBNDVS, CBNDV-2PC, CBNDVS-LOG.]

Page 7

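[Figure: protocol space. Horizontal axis: effort to identify/convert ND events. Vertical axis: effort to commit only visible events. Protocols: CAND, CAND-LOG, CPVS, CPV-2PC, CBNDVS, CBNDV-2PC, CBNDVS-LOG. Existing systems placed in the space: Coord. Checkpointing, Manetho, Optimistic Logging, Targon/32, SBL, Hypervisor.]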

Page 8

[Figure: trends across the protocol space. Horizontal axis: effort to identify/convert ND events. Vertical axis: effort to commit only visible events. Trend labels: increasing recovery time; application failure recovery; increasing simplicity; increasing performance.]

Page 9

Performance study

Discount Checking: fast checkpoints to reliable memory (Rio)
– Logging and two-phase commit
– Disk version

Mostly interactive applications
– Localized and distributed

Page 10

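Nvi Text Editor

[Figure: protocol space. Horizontal axis: effort to identify/convert ND events. Vertical axis: effort to commit only visible events. Each protocol is labeled with two percentages.]
– CAND: 1%, 43%
– CAND-LOG: 0%, 13%
– CPVS: 1%, 44%
– CBNDVS: 1%, 42%
– CBNDVS-LOG: 0%, 12%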

Page 11

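TreadMarks Barnes-Hut

[Figure: protocol space. Horizontal axis: effort to identify/convert ND events. Vertical axis: effort to commit only visible events. Each protocol is labeled with two percentages.]
– CAND: 199%, 11499%
– CAND-LOG: 126%, 7700%
– CPVS: 129%, 7346%
– CPV-2PC: 12%, 319%
– CBNDVS: 101%, 5743%
– CBNDV-2PC: 12%, 252%
– CBNDVS-LOG: 73%, 4973%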

Page 12

Have only considered “stop” failures

Committing everything is okay
– Save-work: when we must commit

Some failures affect application state
– Can we commit too much?

Page 13

Dangerous Paths

Page 14

Dangerous Paths

Page 15

Lose-work invariant

To recover from propagation failure, never commit on a “dangerous path”.

Save-work and Lose-work conflict!
– Visible event on dangerous path
– Can’t guarantee consistent recovery from propagation failures

Do we see this conflict in practice?
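The conflict can be illustrated with a toy model. The sketch below is not the talk's formal definition: it assumes a single process, takes the "on a dangerous path" flag of each event as a given input, and assumes that an inserted commit lies on the dangerous path whenever the event just before it does. All names are hypothetical. It tries every position for a single commit and reports the positions where both invariants hold.

```python
def save_work_ok(trace):
    """Toy Save-work: every ND event is committed before the next visible event."""
    uncommitted_nd = False
    for kind, _on_dangerous_path in trace:
        if kind == "nd":
            uncommitted_nd = True
        elif kind == "commit":
            uncommitted_nd = False
        elif kind == "visible" and uncommitted_nd:
            return False
    return True


def lose_work_ok(trace):
    """Toy Lose-work: no commit is flagged as lying on a dangerous path."""
    return not any(kind == "commit" and dangerous for kind, dangerous in trace)


def commit_positions_satisfying_both(trace):
    """Insert one commit at each position; keep positions where both invariants hold."""
    positions = []
    for pos in range(len(trace) + 1):
        # Assumption: the commit is on the dangerous path if the preceding event is.
        dangerous = pos > 0 and trace[pos - 1][1]
        candidate = trace[:pos] + [("commit", dangerous)] + trace[pos:]
        if save_work_ok(candidate) and lose_work_ok(candidate):
            positions.append(pos)
    return positions


# Visible event on the dangerous path, preceded by an ND event that is also on it.
conflict = [("other", False), ("nd", True), ("visible", True)]
# Same events, but nothing lies on a dangerous path.
no_conflict = [("other", False), ("nd", False), ("visible", False)]

print(commit_positions_satisfying_both(conflict))     # []
print(commit_positions_satisfying_both(no_conflict))  # [2]
```

In the first trace, the only commit position that satisfies Save-work (between the ND event and the visible event) is on the dangerous path, so no placement satisfies both invariants.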

Page 16

Measuring Lose-work violations

Fault-injection study: OS crashes
– injected faults into running kernel
– induced 350 OS crashes
– recovered nvi and postgres using Discount Checking

Results
– nvi: 15% of crashes violate Lose-work
– postgres: 3% of crashes violate Lose-work

Page 17

Application crashes

Fault-injection study: ND bugs
– nvi: 37% violate Lose-work
– postgres: 33% violate Lose-work

Published bug distributions: 85-95% of application bugs are deterministic
– intrinsically violate Lose-work

Perhaps > 90% of app crashes violate Lose-work!
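One way to arrive at an estimate of this size is to combine the two sets of numbers on this slide. The calculation below is a rough sketch under assumptions not stated in the talk: deterministic bugs always violate Lose-work, and the measured nvi and postgres rates (33-37%) apply to whatever share of bugs is non-deterministic. The function name is hypothetical.

```python
def crash_fraction_violating_lose_work(deterministic_share, nd_violation_rate):
    """Deterministic bugs always violate; ND bugs violate at the measured rate."""
    return deterministic_share + (1.0 - deterministic_share) * nd_violation_rate


# Pessimistic and optimistic corners of the ranges quoted on the slide.
low = crash_fraction_violating_lose_work(0.85, 0.33)
high = crash_fraction_violating_lose_work(0.95, 0.37)
print(f"{low:.0%} to {high:.0%}")  # 90% to 97%
```

Under these assumptions the estimate lands at roughly 90% or above across the whole quoted range.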

Page 18

Conclusions

Save-work and Lose-work invariants

Save-work protocol space

Invariants fundamentally conflict

Failure transparency performance:
– 0-12% overhead on reliable memory
– 13-40% overhead on disk (interactive apps)

> 90% of application failures violate Lose-work
