towards formal approaches to system resilience

34
Towards Formal Approaches to System Resilience Vishal Chandra Sharma * , Arvind Haran * , Zvonimir Rakamaric * , Ganesh Gopalakrishnan {vcsharma, haran, zvonimir, ganesh}@cs.utah.edu School of Computing, University of Utah * Supported in part by NSF Award CCF 1255776 and SRC contract 2013-TJ-2426. § Faculty Associate, SUPER (http://super-scidac.org/)

Upload: soleil

Post on 25-Feb-2016

69 views

Category:

Documents


0 download

DESCRIPTION

Towards Formal Approaches to System Resilience. Vishal Chandra Sharma * , Arvind Haran * , Zvonimir Rakamaric * , Ganesh Gopalakrishnan *§ { vcsharma , haran , zvonimir , ganesh }@cs.utah.edu School of Computing, University of Utah. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Towards Formal Approaches  to  System Resilience

Towards Formal Approaches to System Resilience

Vishal Chandra Sharma*, Arvind Haran*, Zvonimir Rakamaric*, Ganesh Gopalakrishnan*§

{vcsharma, haran, zvonimir, ganesh}@cs.utah.eduSchool of Computing, University of Utah

*Supported in part by NSF Award CCF 1255776 and SRC contract 2013-TJ-2426.§Faculty Associate, SUPER (http://super-scidac.org/)

Page 2: Towards Formal Approaches  to  System Resilience

2

Overview• Introduction• Fault Injector• Case Study• Fault Detector• Concluding Remarks

Page 3: Towards Formal Approaches  to  System Resilience

3

Motivation• Recent studies show resiliency as a growing area of

concern [arg13] [lanl05] • MTBF decreasing at a faster rate in exascale computing• Dynamic voltage/frequency scaling in low power

computing

• Our goal is to improve application-level resiliency

• Primary focus is to detect transient faults in a software

Silent data corruption (SDC)

Page 4: Towards Formal Approaches  to  System Resilience

4

Motivating Example

printf(“x=%d, y=%d” ,x ,y)

if (x < 3 && y > 10)y++;

int x = 2;int y = 11;

elsex++;

Rashmi Mishra
example with corrupting lower order bits
Page 5: Towards Formal Approaches  to  System Resilience

5

Motivating Example

printf(“x=%d, y=%d” ,x ,y)

int x = 2;int y = 11;

elsex++;

if (x < 3 && y > 10)y++;

Program output:x=2, y=12

Rashmi Mishra
example with corrupting lower order bits
Page 6: Towards Formal Approaches  to  System Resilience

6

Motivating Example

printf(“x=%d, y=%d” ,x ,y)

int x = 3;int y = 11;

elsex++;

if (x < 3 && y > 10)y++;

LSB position of x flipped

Rashmi Mishra
example with corrupting lower order bits
Page 7: Towards Formal Approaches  to  System Resilience

7

Motivating Example

printf(“x=%d, y=%d” ,x ,y)

int x = 3;int y = 11;

elsex++;

if (x < 3 && y > 10)y++;

Program output:x=4, y=11

LSB position of x flipped

SDC in the output value of x

Rashmi Mishra
example with corrupting lower order bits
Page 8: Towards Formal Approaches  to  System Resilience

8

Our Contribution• A LLVM-level fault Injector for evaluation purpose

[llvm04]

• A simple case study on sorting algorithms• Demonstrates effectiveness of our solution• Highlights importance of design space exploration w.r.t.

resiliency

• A software-level fault detector based on idea of predicate abstraction• Applying it in resiliency research is a novel direction!• Introduced by Ball to define a novel program coverage

metrics [pct05]

Rashmi Mishra
Confirm if it is a transient faults
Rashmi Mishra
one line bullet
Rashmi Mishra
Clearly state the problem
Rashmi Mishra
other prominent study..
Page 9: Towards Formal Approaches  to  System Resilience

9

Closely Related Work• Low-cost software level detectors

iSWAT by Sahoo et. al. uses likely program invariants [iswat08] Derives likely invariants by monitoring program properties Hardware-assisted framework to detect false positives

Error detector by Sloan et.al. [sloan13] Algorithm based error detector applied to linear solvers Utilizes algorithmic properties of linear solvers to detect and isolate

errors

• Software-level fault injectors LLVM-level fault injector developed by Kuijif et. al. [relax10]

Publicly unavailable A recent study done by a user suggests our fault injector has better

fine-grained options [schen13] LLFI fault injector by Thomas et. al. [thomas13]

Developed around same time as our fault injector, shares many similar features

Page 10: Towards Formal Approaches  to  System Resilience

10

Overview• Introduction• Fault Injector• Case Study• Fault Detector• Concluding Remarks

Page 11: Towards Formal Approaches  to  System Resilience

11

Fault InjectorKontrollable Utah’s LLVM based Fault Injector KULFI

KULFI Indian dessert

Page 12: Towards Formal Approaches  to  System Resilience

12

KULFI: Fault Injection LogicStart

Forall dynamic instructions

Inject Fault with user provided probability

Feasible?

Stop

Yes

No

Rashmi Mishra
Remove pointer vs data error distinction
Page 13: Towards Formal Approaches  to  System Resilience

13

KULFI: Fault Injection ProcessProgram

Clang

LLVM bitcode

LLVMKULFI

Dynamic Instruction

Count

Fault Injecting

LLVM bitcode

Program Input Vectors

LLVM

Execution Outcome

KULFI

SDCSegFaultBenign

Page 14: Towards Formal Approaches  to  System Resilience

14

Overview• Introduction• Fault Injector• Case Study• Fault Detector• Concluding Remarks

Page 15: Towards Formal Approaches  to  System Resilience

15

Case Study

1 Experiment 200 Fault Injection Campaigns

1 Fault Injection Campaign 100 Program Runs

Sorting routines - Bubblesort, Quicksort, Mergesort, Radixsort, Heapsort

Inject one fault during each program run

Total number of fault injections = 200*100 = 20,000!

Page 16: Towards Formal Approaches  to  System Resilience

16

Case Study• A dynamic instruction is chosen at random for fault

injection

• A single-bit fault in a random bit position of the dynamic instruction’s register

• For each fault injection campaign, categorize outcome into SDC, Benign, or Segmentation fault categories

Benign: 41, Segmentation: 29, SDC: 30

Rashmi Mishra
Provide details about experimentations
Page 17: Towards Formal Approaches  to  System Resilience

17

Case Study• Plot fault count from a fault category corresponding to

a fault injection campaign• X axis: Fault count corresponding to a fault injection

campaign• Y axis: Sorting routines

• Result shows strong clustering pattern with statistically significant distribution for each fault category

Rashmi Mishra
Provide details about experimentations
Page 18: Towards Formal Approaches  to  System Resilience

18

Results

Page 19: Towards Formal Approaches  to  System Resilience

19

Results

Page 20: Towards Formal Approaches  to  System Resilience

20

Results

Page 21: Towards Formal Approaches  to  System Resilience

21

Overview• Introduction• Fault Injector• Case Study• Fault Detector• Concluding Remarks

Page 22: Towards Formal Approaches  to  System Resilience

22

A Software-Level Approach to Fault Detection

• Predicates: Boolean program conditionals

• Predicate State: <PP,BV>

• PP: Program point between two successive program statements

• BV: Bit-vector representing concrete boolean values of program conditionals at a given program point

• Predicate State Transition: Current State Next State

Page 23: Towards Formal Approaches  to  System Resilience

23

A Software-Level Approach to Fault DetectionFoo(int x, int y){

PP0:If ( x<3 && y>10 ){PP1: y++;PP2:}else{PP3: x++;PP4:}PP5:printf(“x=%d, y=%d”,x , y)}

Predicates: x<3, y>10

Program Points: PP0, PP1, PP2, PP3, PP4, PP5

Input Vectors: x = 2, y = 11

Predicate State at PP0: <PP0, TT>

Predicate State at PP1: <PP1, TT>

Predicate State Transition:<PP0, TT> <PP1, TT>

<PP0,TT>

<PP1,TT>

Page 24: Towards Formal Approaches  to  System Resilience

A Software-Level Approach to Fault Detection

24

Start

Program P

Extract predicate transitions

Stop

Start

Program P

Get Predicate Transition

Stop

Fault Detected

Check if Valid ?

last transition

?

Yes

NoYes

No

Instrumented Program P1

Instrumented Program P2

Profile valid predicate transitions

Detect transient faults

Page 25: Towards Formal Approaches  to  System Resilience

25

Predicate Transition Diagram (PTD)Start

Program

Inject Fault

Track Predicate Transitions

Track Predicate Transitions

Merge

Predicate Transition Diagram

Stop

Rashmi Mishra
Mention that early on..
Page 26: Towards Formal Approaches  to  System Resilience

26

PTD of Foo()

Rashmi Mishra
Mention that early on..
Page 27: Towards Formal Approaches  to  System Resilience

27

PTD of dgstrf() in SuperLU [slu99,05,11] • SuperLU is a direct linear solver for sparse and nonsymmetric

systems of linear equations • Available at: http://crd-legacy.lbl.gov/~xiaoye/SuperLU/

Page 28: Towards Formal Approaches  to  System Resilience

28

PTD of BlkSchlsEqEuroNoDiv() in Blackscholes• Financial analysis using blackscholes PDE• Part of Parsec 3.0 benchmark suite [parsec08]

Page 29: Towards Formal Approaches  to  System Resilience

29

Overview• Introduction• Fault Injector• Case Study• Fault Detector• Concluding Remarks

Page 30: Towards Formal Approaches  to  System Resilience

30

Concluding Remarks• A novel software-level fault detector

• Enabling infrastructure for resiliency analysis and evaluation through KULFI

• Recommended use during design space exploration

• Try out KULFI: https://github.com/soar-lab/KULFI

Rashmi Mishra
should be more concise
Rashmi Mishra
design decisions.. design space exploration...evaluate programs
Page 31: Towards Formal Approaches  to  System Resilience

31

References[arg13] Snir, M., et al. Addressing Failures in Exascale Computing. No. ANL/MCS-TM-33. Argonne National Laboratory (ANL), 2013[lanl05] Michalak, Sarah E., et al. "Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer." IEEE Transactions on Device and Materials Reliability, 2005[llvm04] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in International Symposium on Code Generation and Optimization (CGO), 2004[pct05] T. Ball, “A theory of predicate-complete test coverage and generation,” in International Conference on Formal Methods for Components and Objects (FMCO), 2005[iswat08] S. K. Sahoo, M. lap Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, “Using likely program invariants to detect hardware errors,” in IEEE International Conference on Dependable Systems and Networks (DSN), 2008[sloan13] Sloan, Joseph, Rakesh Kumar, and Greg Bronevetsky. "An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance.“, in IEEE International Conference on Dependable Systems and Networks (DSN), 2013

Page 32: Towards Formal Approaches  to  System Resilience

32

References[slu99] Demmel, James W., et al. "A supernodal approach to sparse partial pivoting.“ SIAM Journal on Matrix Analysis and Applications, 1999

[slu05] Li, Xiaoye S. "An overview of SuperLU: Algorithms, implementation, and user interface." ACM Transactions on Mathematical Software (TOMS), 2005[slu11] Li, X. S., Demmel, J. W., Gilbert, J. R., Grigori, L., Shao, M., & Yamazaki, I. (2011). SuperLU Users’ Guide. url: http://crd. lbl. gov/~ xiaoye/SuperLU/superlu_ug. Pdf.

[sprs11] Davis, Timothy A., and Yifan Hu. "The University of Florida sparse matrix collection." ACM Transactions on Mathematical Software (TOMS), 2011

[parsec08] C. Bienia, S. Kumar, J. Singh, and K. Li, “The PARSEC benchmark suite: Characterization and architectural implications,” ser. PACT, 2008

[relax10] M. de Kruijf, S. Nomura, and K. Sankaralingam, “Relax: An ar- chitectural framework for software recovery of hardware faults,” in International Symposium on Computer Architecture (ISCA), 2010

[thomas13] Thomas, Anna, and Karthik Pattabiraman. "Error Detector Placement for Soft Computation." in International Conference on Dependable Systems and Networks (DSN), 2013.[schen13] S. Chen, personal communication, 2013.

Page 33: Towards Formal Approaches  to  System Resilience

33

Acknowledgements• Pedro Diniz• Prabhakar Kudva• Shuvendu Lahiri• Karthik Pattabiraman • Sui Chen• Anonymous reviewers of PRDC conference who reviewed

our paper

Page 34: Towards Formal Approaches  to  System Resilience

Thank you!

34