CSE 6510 (461) Fall 2010

Selected Notes on Fault-Tolerance (12)

Alexander A. Shvartsman
Computer Science and Engineering
University of Connecticut

Copyright © 2002-2010 Alexander Allister Shvartsman

Fault-Tolerance -- An Overview

• A fundamental property of distributed systems: potential for fault tolerance

• The main tool in achieving fault tolerance is redundancy

• Distributed systems consist of multiple components
  - When more than one resource is capable of performing a certain function, some fault tolerance is achievable

• Goal
  - Take advantage of the multiplicity of resources in constructing systems that tolerate failures

Fault Tolerance and Dependability

• A system specification may call for fault-tolerance
  - By stating that the system must perform correctly
  - Even if certain internal or external components fail to perform according to their specifications
  - Additionally, the degradation in performance due to failures must be “graceful”

• Dependability is a closely related notion
  - Trustworthiness of a computer system, i.e.,
  - Reliance can justifiably be placed on the system’s service
  - Dependability is achieved in part through fault-tolerance

Faults, Errors and Failures

• We distinguish among faults, errors and failures:
  - Fault (or defect): a component or a subsystem fails to perform according to its specification
  - Error: a computation enters an incorrect state as the result of a fault
  - Failure: a system fails to meet its specification as the result of an error

• Faults may or may not lead to an error

• Errors may or may not lead to a failure
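
The distinction can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than part of the notes: three replicated sensors, a faulty sensor stuck at 0.0, and a specification that the output must lie within `TOLERANCE` of the true value. It shows a fault that causes no error (the faulty sensor is never polled), an error that causes no failure (the median masks it), and an error that does lead to a failure (the mean is skewed).

```python
import statistics

TRUE_VALUE = 20.0   # hypothetical quantity being measured
TOLERANCE = 0.5     # hypothetical specification: output within 0.5 of TRUE_VALUE


def read_sensors(faulty, polled):
    """Read the polled sensors; a sensor listed in `faulty` is stuck at 0.0 (the fault)."""
    return [0.0 if i in faulty else TRUE_VALUE for i in polled]


def run(faulty, polled, combine):
    """Return (error, failure) for one run of the computation."""
    readings = read_sensors(faulty, polled)
    # Error: the computation's state contains an incorrect value.
    error = any(r != TRUE_VALUE for r in readings)
    output = combine(readings)
    # Failure: the delivered output violates the specification.
    failure = abs(output - TRUE_VALUE) > TOLERANCE
    return error, failure


# Sensor 2 is faulty in every scenario.
print(run({2}, [0, 1], statistics.mean))        # (False, False): fault, but no error
print(run({2}, [0, 1, 2], statistics.median))   # (True, False): error, but no failure
print(run({2}, [0, 1, 2], statistics.mean))     # (True, True): error leads to failure
```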

Fault-Tolerance -- Basic Approaches

• Fault prevention: eliminating faults before the system is put into use or during periodic preventive maintenance

• Fault tolerance: a system detects errors caused by faults, corrects its state and does not fail for as long as the faults and errors are within its design parameters

• Fault masking: a fault-tolerant system is capable of dealing with faults and errors in a way that is transparent to the users of the system’s services

Fault Classification

• Crash fault
  - Fail-stop processor (detectable crash)
  - Failure after a send/receive

• Omission fault
  - Communication, send or receive omission
  - Operation

• Timing fault
  - Processor delays
  - Link time-out

• Byzantine fault
  - Arbitrary fault
  - Malicious behavior

[Figure: fault classes in order of increasing severity: Crash, Omission, Timing, Byzantine]
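
The severity ordering of the four classes can be written down as an ordered enumeration. This is only a sketch: the names `FaultClass` and `covers` are hypothetical, and it assumes the classes nest, so that a design aimed at a more severe class also handles every less severe one.

```python
from enum import IntEnum


class FaultClass(IntEnum):
    """Fault classes, numbered by increasing severity (assumed to nest)."""
    CRASH = 1
    OMISSION = 2
    TIMING = 3
    BYZANTINE = 4


def covers(designed_for, observed):
    """True if a system designed for `designed_for` also handles `observed`."""
    return observed <= designed_for


print(covers(FaultClass.TIMING, FaultClass.CRASH))      # True
print(covers(FaultClass.CRASH, FaultClass.BYZANTINE))   # False
```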

Models of Processor Failures and Restarts

• Fail-stop processors

• Model assumptions, e.g.,
  - Shared memory
  - Robust interconnect
  - Resilient memory
  - Timing guarantees

[Figure: failure and restart models: undetectable restarts, detectable restarts, synchronous restarts, no restarts, initial faults]

Fault Tolerance, Redundancy and Efficiency

• Fault tolerance is achieved through redundancy

• Redundancy in components/resources -- space redundancy
  - Additional components (hardware or software) are provided or made available to deal with errors
  - Distributed systems have inherent redundancy

• Redundancy in computation, or time redundancy
  - Additional computation is performed to detect errors or to test components
  - Here the cost is performance
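
Both kinds of redundancy can be sketched in a few lines. The functions `vote` and `run_twice` and the toy replicas are hypothetical names chosen for illustration; majority voting and repeat-and-compare are common techniques, not necessarily the ones the course has in mind. `vote` uses extra components to mask a faulty replica, while `run_twice` spends extra computation to detect a transient error.

```python
from collections import Counter


def vote(replicas, x):
    """Space redundancy: run every replica and return the majority result."""
    results = [replica(x) for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def run_twice(compute, x):
    """Time redundancy: repeat the computation and compare the results."""
    first, second = compute(x), compute(x)
    if first != second:
        raise RuntimeError("results disagree: error detected")
    return first


def square(x):
    return x * x


def stuck_at_zero(x):
    """A faulty replica that always answers 0."""
    return 0


def flaky_square():
    """Build a square function that returns a wrong result on its first call only."""
    calls = {"n": 0}

    def compute(x):
        calls["n"] += 1
        return x * x + (1 if calls["n"] == 1 else 0)

    return compute


print(vote([square, square, stuck_at_zero], 3))   # 9: the faulty replica is outvoted
try:
    run_twice(flaky_square(), 3)
except RuntimeError as exc:
    print(exc)                                    # the transient error is detected
```

Note that repeating a computation detects only transient faults; a permanently faulty component gives the same wrong answer both times, which is where space redundancy helps.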

Combining Fault-Tolerance and Efficiency

• A fundamental conflict exists between efficiency and fault tolerance:
  - Efficiency implies low redundancy
  - Fault tolerance implies high redundancy

• Robustness
  - Property of a system that combines efficiency and fault tolerance, e.g., correctness under failures

• Achieving robustness is very challenging in many cases
  - Efficiency often must be traded off for fault tolerance

Strategies for Fault Tolerance

• Layered architecture: a structuring technique for achieving fault tolerance

• A failure of a lower-layer component may/will manifest itself as a fault to a higher layer

• An error at a lower layer may be contained or masked

• When this is not possible, the layer attempts to reduce the severity of the error so that it manifests itself through a more benign failure

Layer Architecture for Fault-Tolerance

[Figure: Layers N-1, N and N+1; within each layer a fault leads to an error and then to a failure, and the failure of one layer appears as a fault to the layer above]
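
The layered structure can be illustrated with a three-layer read path. All names here (`read_block`, `read_replicated`, `lookup`, and the two exception types) are hypothetical, and dictionaries stand in for disks. The failure of the lowest layer reaches the middle layer as a fault; the middle layer masks it by trying another replica, or, when that is impossible, fails in a more benign way by reporting unavailability instead of returning bad data.

```python
class DiskReadFailure(Exception):
    """Failure of the lowest layer (Layer N-1)."""


class Unavailable(Exception):
    """More benign failure exposed by Layer N when masking is impossible."""


def read_block(disk, key):
    """Layer N-1: fails when the block is missing or marked corrupt (None)."""
    value = disk.get(key)
    if value is None:
        raise DiskReadFailure(key)
    return value


def read_replicated(disks, key):
    """Layer N: a lower-layer failure arrives here as a fault."""
    for disk in disks:
        try:
            return read_block(disk, key)
        except DiskReadFailure:
            continue  # fault masked: try the next replica
    # Masking is not possible: fail explicitly rather than return bad data.
    raise Unavailable(key)


def lookup(disks, key, default):
    """Layer N+1: treats Unavailable as a fault and degrades to a default."""
    try:
        return read_replicated(disks, key)
    except Unavailable:
        return default


replicas = [{"a": None}, {"a": "data"}]
print(read_replicated(replicas, "a"))     # data: the first replica's failure is masked
print(lookup([{}, {}], "a", "fallback"))  # fallback: every replica failed
```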

Phases in Fault Tolerance

• Fault prevention and fault tolerance are complementary: both are needed for dependability

• Fault tolerance and its “phases”
  - Error detection
      • Tests, checks and diagnostics
  - Damage confinement
      • Dynamic assessment of damage boundaries
      • Static firewalls
  - Progress evaluation and error recovery
      • Backward recovery, checkpointing, roll back
      • Forward recovery and self-stabilization
      • Processor scheduling and load balancing
  - Fault treatment and continued system service
      • Fault location
      • System repair
      • Dynamic reconfiguration
      • Standby spare components
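
Backward recovery with checkpointing can be sketched as follows. The class `Checkpointed`, the function `apply_step` and the account example are hypothetical, and the state is assumed small enough to copy in full. Error detection is an invariant check after each step; on a detected error the state is rolled back to the last checkpoint.

```python
import copy


class Checkpointed:
    """Holds a state together with the most recent checkpoint of it."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._saved)


def apply_step(system, step, is_valid):
    """Run one step; on a detected error, roll back to the last checkpoint."""
    try:
        step(system.state)
        if not is_valid(system.state):   # error detection: invariant check
            raise ValueError("invariant violated")
    except Exception:
        system.rollback()                # backward recovery
        return False
    system.checkpoint()                  # progress made: record the new state
    return True


def deposit(state):
    state["balance"] += 50


def overdraw(state):
    state["balance"] -= 500


def non_negative(state):
    return state["balance"] >= 0


account = Checkpointed({"balance": 100})
print(apply_step(account, deposit, non_negative), account.state)    # True {'balance': 150}
print(apply_step(account, overdraw, non_negative), account.state)   # False {'balance': 150}
```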

Faults: Causes and Temporal Effects

• Faulty system -- a system with defects
  - Faulty requirements
  - Design faults
  - Hardware faults
  - Software . . . bugs (I don’t know who put it there)
  - Operational faults

• Faults -- temporal taxonomy
  - Transient fault -- limited duration
  - Intermittent fault -- occurs repeatedly
  - Permanent fault -- manifests itself until fixed

• Faults and fault masking
  - Is fault masking “good”?
  - If a system is capable of tolerating k faults, is masking 1 fault good? Masking k-1 faults?
  - Are faults “bad”?
  - Is a system containing faults necessarily defective?

Models of Failure: Overall Considerations

• Models need to capture/abstract/approximate reality

• Type of failures -- severity: fail-stop, malicious failures, memory contamination

• Kind of failure-causing adversary -- omniscient or oblivious; on-line adaptive or off-line

• Duration: no-restart <-> restartable

• Frequency of failures -- rate of processor attrition (one time, arbitrary, probabilistic)

• Fine/coarse granularity of failures -- components: processors / gates, processor / thread failures

• Magnitude of failures -- total number of failures (and recoveries) during computation

Designing for F/T: Evaluation Criteria

• What is the cost of failure? Is it bearable?

• How much is one willing to pay for fault tolerance?
  - Is slower response preferable to a failure?
  - Is higher HW cost acceptable?
  - Is lower HW cost acceptable as long as failures are masked?

• What is the goal of building in some fault tolerance?
  - Elimination of (some) failures?
  - Reduction in the severity of failures?
  - Error detection?

• When the failures are corrected,
  - Is a slower response time acceptable as long as the computation is correct?
  - Is a slight error acceptable as long as the computation completes within the required time?