Towards a Hardware-Software Co-Designed Resilient System

Man-Lap (Alex) Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou

University of Illinois at Urbana-Champaign

In collaboration with Pradip Bose (IBM) and Subhasish Mitra (Stanford)

Pradeep Ramachandran, University of Illinois, Urbana Champaign

Motivation

• Failures will happen in the field

– Design defects

– Aging

– Soft errors

– Inadequate burn-in

– Aggressive design for power/performance/reliability

– …

• Low-cost method to detect/recover from all sources of failure?

– Reliability problem pervasive across many markets

– Traditional solutions (e.g. nMR) too expensive

– Must incur low performance, power overhead

A Low-Cost, Unified Reliability Solution

• Need to handle only faults that propagate to software

– Hardware faults appear as software bugs

– Leverage software reliability solutions for hardware?

• One-size-fits-all near-100% coverage often unnecessary

– Solution must be customizable to application needs

Outline

• Motivation of Framework

• Unified Framework for H/W + S/W Reliability

• Understanding the Impact of H/W Failures on S/W

• Future Work

Unified Framework for H/W + S/W Reliability

• Unified hardware/software co-designed framework

– Tackles hardware and software faults

– Software-centric solutions with near-zero H/W overhead

– Customizable to app needs, flexible for new error sources

[Figure: three fault timelines between checkpoints. (1) Error undetected: fault, then error. (2) Detection with more overhead: fault, error, testing, error detected, repair, recovery, no error. (3) Ideal, symptom-based detection: fault, error, symptom detected, recovery, diagnosis, repair.]

Framework Components

• Detection: Software symptoms, online testing

• Recovery: Software/hardware checkpoint and rollback

• Diagnosis: Firmware layer for rollback/replay, online testing

• Repair/reconfiguration: Redundant, reconfigurable hardware

• Need to understand how hardware faults propagate to S/W

– How do hardware faults become visible to software?

– What is the latency?

– Do H/W faults affect application and/or system state?

Methodology

• Microarchitecture-level fault injection

– Trade-off between accuracy and simulation time

– GEMS timing models for out-of-order processor, memory

– Simics full-system simulation of Solaris + UltraSPARC III

– SPEC workloads simulated for ten million instructions

• Fault model

– Stuck-at, bridging faults in many micro-arch structures

• Fault detection

– Crashes detected through hardware-generated fatal traps (misaligned memory access, RED state, watchdog reset, etc.)

– Hangs detected using simple hardware hang detector
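The stuck-at part of the fault model above can be illustrated with a minimal Python sketch. This is not the GEMS/Simics injection code used in the study; the function names `stuck_at` and `is_activated` are hypothetical, and bridging faults are left out.

```python
def stuck_at(value: int, bit: int, stuck_to: int) -> int:
    """Return `value` as seen through a line whose `bit` is stuck at 0 or 1."""
    if stuck_to not in (0, 1):
        raise ValueError("stuck_to must be 0 or 1")
    mask = 1 << bit
    # Stuck-at-1 forces the bit on; stuck-at-0 forces it off.
    return (value | mask) if stuck_to else (value & ~mask)


def is_activated(value: int, bit: int, stuck_to: int) -> bool:
    """A fault only changes the value if the bit did not already hold
    the stuck value; otherwise the fault has no effect on this value."""
    return stuck_at(value, bit, stuck_to) != value


if __name__ == "__main__":
    word = 0b1010
    print(bin(stuck_at(word, 0, 1)))   # bit 0 forced to 1
    print(bin(stuck_at(word, 1, 0)))   # bit 1 forced to 0
    print(is_activated(word, 1, 1))    # bit 1 is already 1
```

A value that already carries the stuck bit passes through unchanged, which is one simple way a permanent fault can stay invisible to software for a while.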

How do Hardware Faults Propagate to Software?

• 97% of faults (w/o FPU) detectable with simple H/W & S/W

– Need H/W support or S/W monitoring for FPU

[Figure: fault outcomes (0% to 100%) for s-a-0, s-a-1, B-gnd, and B-Vcc faults injected into the decoder, INT ALU, FP ALU, Reg Dbus, Int reg, ROB, and RAT. Legend: Mask, Crash, Hang, Other.]

How do Hardware Faults Propagate to Software?

• 97% of faults (w/o FPU) detectable with simple H/W & S/W

– Need H/W support or S/W monitoring for FPU

• > 50% crashes/hangs in OS

[Figure: fault outcomes (0% to 100%) for s-a-0, s-a-1, B-gnd, and B-Vcc faults injected into the decoder, INT ALU, FP ALU, Reg Dbus, Int reg, ROB, and RAT. Legend: Mask, Crash-App, Crash-OS, Hang-App, Hang-OS, Other.]

S/W Components Corrupted

• 62% of faults corrupt system state

– Need to recover system state

[Figure: software state corrupted (0% to 100%), shown separately for crashes and hangs, for faults in the decoder, INT ALU, Reg Dbus, Int reg, ROB, and RAT. Legend: None, System, App.]

Latency to Detection from Application Corruption

• 80% have latency < 100K instr, amenable to H/W recovery

– Buffering for 50µs on 2 GHz processor

• May need to use software checkpoint/recovery for others
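The 50µs figure above follows from simple arithmetic if the processor is assumed to retire about one instruction per cycle. That IPC value is an assumption made here, since the slide does not state one. A small sketch, with the hypothetical function name `buffering_time_us`:

```python
def buffering_time_us(instructions: int, clock_hz: float, ipc: float = 1.0) -> float:
    """Time in microseconds needed to execute `instructions` at the given
    clock rate, assuming a constant instructions-per-cycle rate (`ipc`)."""
    if clock_hz <= 0 or ipc <= 0:
        raise ValueError("clock_hz and ipc must be positive")
    # cycles = instructions / ipc; seconds = cycles / clock_hz
    return instructions * 1e6 / (ipc * clock_hz)


if __name__ == "__main__":
    # 100K instructions on a 2 GHz processor at an assumed IPC of 1
    print(buffering_time_us(100_000, 2e9))
```

At a higher sustained IPC the same 100K-instruction window would correspond to a shorter buffering interval.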

[Figure: total instructions executed between app state corruption and detection (0% to 100%), shown separately for crashes and hangs, for faults in the decoder, INT ALU, Reg Dbus, Int reg, ROB, and RAT. Latency bins: <100, <1k, <100k, <10M.]

Latency to Detection from OS Corruption

• 92% of injections result in latency of < 100K OS instructions

– Amenable to hardware recovery

[Figure: OS-only instructions executed between OS state corruption and detection (0% to 100%), shown separately for crashes and hangs, for faults in the decoder, INT ALU, Reg Dbus, Int reg, ROB, and RAT. Latency bins: <100, <1k, <100k, <10M.]

Summary so far

• Hardware faults highly visible

– Over 97% of faults in 6 structures result in crashes/hangs

– Simple H/W and S/W sufficient

• Recovery through checkpointing

– S/W and/or H/W checkpoints for application recovery

– H/W checkpoints and buffering for OS recovery

Next Steps (1 of 3)

• Improving understanding of fault propagation

– Accurate fault models, effect of transients, intermittents

– Lower-level simulations

– Better workloads

• Detection

– More software-level monitoring: software signals, invariants, perturbations, …

– H/W support to aid detection in some structures (e.g., FPU)

– Selective backup testing

• Recovery

– Enhanced detection may reduce latency

– Explore software vs. hardware, application customizability

Next Steps (2 of 3)

• Diagnosis

– Assume rollback/restart mechanism, multicore system

Bug detected: rollback to previous checkpoint, restart on original core

• Original symptom doesn’t recur

– Transient h/w bug, or non-deterministic s/w bug

– Continue execution…

• Original symptom recurs

– Deterministic s/w bug, or permanent h/w bug

– Rollback, restart on different core

– No symptom: permanent defect in original core

– Symptom: deterministic s/w bug
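The diagnosis flow on this slide can be written as a short decision procedure. The sketch below is only an illustration: the `Diagnosis` enum, the `diagnose` function, and the two replay callbacks are hypothetical names, and each callback is assumed to roll back to the previous checkpoint, re-execute, and return True if the original symptom recurs.

```python
from enum import Enum
from typing import Callable


class Diagnosis(Enum):
    TRANSIENT_HW_OR_NONDETERMINISTIC_SW = "transient h/w bug or non-deterministic s/w bug"
    PERMANENT_HW_DEFECT = "permanent defect in original core"
    DETERMINISTIC_SW_BUG = "deterministic s/w bug"


def diagnose(replay_on_original_core: Callable[[], bool],
             replay_on_different_core: Callable[[], bool]) -> Diagnosis:
    """Classify a detected bug by replaying from the last checkpoint.

    Each callback returns True if the original symptom recurs."""
    # Step 1: roll back and restart on the original core.
    if not replay_on_original_core():
        # Symptom gone: continue execution.
        return Diagnosis.TRANSIENT_HW_OR_NONDETERMINISTIC_SW
    # Step 2: symptom recurred, so roll back and restart on a different core.
    if replay_on_different_core():
        # Symptom follows the software, not the core.
        return Diagnosis.DETERMINISTIC_SW_BUG
    # Symptom is tied to the original core.
    return Diagnosis.PERMANENT_HW_DEFECT


if __name__ == "__main__":
    print(diagnose(lambda: True, lambda: False).value)
```

Passing the replays as callbacks means the second core is only used when the first replay reproduces the symptom, which matches the order of steps in the flow.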

Next Steps (3 of 3)

• Repair/reconfigure

– What is the right field-configurable unit: core, FU, or array entries?

• Avoidance

– Dynamic reliability management

• Implementation architecture

– Hardware + firmware + OS

– Itanium machine check architecture has hooks

Thank You

Questions?

Backup Slides

Types of fatal traps

• Faults cause different fatal traps to be thrown before crashes

– Junk data access leads to memory misalignment

– Repeated trapping leads to RED state

[Figure: breakdown of fatal traps (0% to 100%) for faults in the decoder, INT ALU, Reg Dbus, Int reg, ROB, and RAT. Legend: OS - Red state, OS - Memory Misaligned, OS - Watchdog Reset, OS - Illegal Instruction, OS - Division By Zero, OS - Data Acc. Exception, App - Memory Misaligned, App - Watchdog Reset, App - Illegal Instruction, App - Division By Zero, App - Data Acc. Exception.]