CA226: Advanced Computer Architecture

Hardware Failures …
But first:
• we need to talk a little bit about probability

Note to self…
1. The Monty Hall problem [http://en.wikipedia.org/wiki/Monty_Hall_problem]
2. The two-children problem [http://www.maa.org/external_archive/devlin/devlin_04_10.html]
3. The terrorist problem [http://en.wikipedia.org/wiki/Base_rate_fallacy]
4. The geometric distribution [http://en.wikipedia.org/wiki/Geometric_distribution]
Note
Of these four topics, only the last is directly relevant to today’s material.

The Exponential Distribution
The exponential distribution:
• `f_lambda(x)\ =\ lambda e^{-lambda x}`
In which:
• `lambda` is known as the rate
• `e` is the base of natural logarithms, about `2.71828...`

The Exponential Distribution
The exponential distribution:
• `f_lambda(x)\ =\ lambda e^{-lambda x}`
For Poisson processes, `f_lambda(x)` is:
• the probability density of the time `x` that elapses until the next event of interest occurs
• e.g. how likely it is that a disk fails after about 10,000 hours

The Exponential Distribution
[Slides 6 to 9 show figures that are not captured in this transcript.]

Exponential Distribution: Properties
Mean:
• `1/lambda`
Note
The mean is inversely proportional to the rate (which is intuitively correct).
So, it is easy to convert between means and rates (and we’ll be doing a fair amount of that).

Exponential Distribution: Properties
Probability of exceeding a value:
• `P(x>t)\ =\ e^{-lambda t}`
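The two properties above (mean and probability of exceeding a value) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names and the 1,000,000-hour mean lifetime are illustrative assumptions.

```python
import math

def mean_to_rate(mean: float) -> float:
    """The rate is the reciprocal of the mean (and vice versa)."""
    return 1.0 / mean

def survival_probability(rate: float, t: float) -> float:
    """P(X > t) = e^(-rate * t) for an exponentially distributed X."""
    return math.exp(-rate * t)

# Hypothetical disk with a mean lifetime of 1,000,000 hours.
disk_rate = mean_to_rate(1_000_000)
p = survival_probability(disk_rate, 10_000)
print(f"P(disk survives 10,000 hours) = {p:.4f}")
```

With these assumed numbers the disk survives its first 10,000 hours with probability of about 0.99.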

Multiple Exponential Distributions
Assume two (independent) exponentially-distributed random variables `X_1` and `X_2` with rates `lambda_1` and `lambda_2`:
• `P(min(X_1,X_2) > t)\ =\ e^{-(lambda_1 + lambda_2)t}`
So:
• the smallest (or first) of two exponentially-distributed random events is itself exponentially distributed
• and the corresponding rate is just the sum of the individual rates
Recall that `P(x>t)\ =\ e^{-lambda t}`.
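The result follows from independence together with the formula just recalled. A short derivation, written in LaTeX rather than the AsciiMath used elsewhere in these slides:

```latex
\begin{align*}
P(\min(X_1, X_2) > t) &= P(X_1 > t \text{ and } X_2 > t) \\
                      &= P(X_1 > t)\, P(X_2 > t) && \text{(independence)} \\
                      &= e^{-\lambda_1 t}\, e^{-\lambda_2 t} \\
                      &= e^{-(\lambda_1 + \lambda_2) t}
\end{align*}
```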

Example: A Shopkeeper
Events:
1. mean time until a customer arrives is 5 minutes
2. mean time until the phone rings is 20 minutes
Assuming these are exponentially distributed:
• what is the mean time until either of these events occurs?

Answer
Rates:
1. `\ 1/{5\ "minutes"}`
2. `\ 1/{20\ "minutes"}`
Combined rate:
• `4/20 + 1/20\ =\ 5/20\ =\ 1/{4\ "minutes"}`

Answer
Events:
1. mean time until a customer arrives is 5 minutes
2. mean time until the phone rings is 20 minutes
Mean time to the first of these events:
• 4 minutes
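The shopkeeper calculation can be checked with a small Python sketch. The helper name `mean_time_to_first_event` is an illustrative assumption; it relies on the same assumptions as the slides (independent, exponentially distributed events).

```python
def mean_time_to_first_event(means):
    """Mean time until the first of several independent, exponentially
    distributed events, given the mean time of each event."""
    combined_rate = sum(1.0 / m for m in means)  # rates simply add
    return 1.0 / combined_rate                   # mean is the reciprocal of the rate

# A customer every 5 minutes on average, a phone call every 20 minutes on average.
print(mean_time_to_first_event([5, 20]))  # about 4 minutes
```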

Multiple Exponential Distributions
More generally (and obviously):
• `P(min(X_1,X_2,...,X_n) > t)\ =\ e^{-(lambda_1 + lambda_2 + ... + lambda_n)t}`
Again, recalling that `P(x>t)\ =\ e^{-lambda t}`.

Aside
The exponential distribution:
• is the only memoryless continuous probability distribution (its discrete counterpart, the geometric distribution, is the only memoryless discrete one)
Because exponential distributions are entirely characterised by their mean:
• they are often defined by just stating their mean (or their half-life)

Why is this relevant?
Many computer hardware failures are exponentially distributed:
• and, for those that aren’t: the exponential distribution is nevertheless a reasonable first approximation

And…
Given:
• `P(min(X_1,X_2,...,X_n) > t)\ =\ e^{-(lambda_1 + lambda_2 + ... + lambda_n)t}`
we can reason about failure rates of complex (multi-component) systems without knowing too much about the details of the exponential distribution itself

Failures
A system is in one of two states:
• functioning or not functioning
Transitions between these states are:
• failures and restorations

Metrics: MTTF
MTTF:
• mean time to failure
Examples:
1. MTTF is (perhaps) 1,000,000 hours for some hard disk
2. MTTF is (perhaps) 100,000 hours for a fan

Metrics: Failure Rate
The failure rate:
• is just the reciprocal of the MTTF
Examples:
1. if MTTF is `10^6` hours, then failure rate is `1//10^6` per hour
2. if MTTF is `10^5` hours, then failure rate is `1//10^5` per hour

MTTF: Example
What is the MTTF of a two-component system composed of:
1. a hard disk with an MTTF of 1,000,000 hours
2. and a fan with an MTTF of 100,000 hours?

MTTF: Example
If we assume failures are independent and exponentially distributed, then:
• The means are `10^{6}` and `10^{5}` hours
• So the rates are `10^{-6}` and `10^{-5}` per hour
• Since they’re exponentially distributed, we add these to get the overall rate: `10^{-6} + 10^{-5} = 1.1 times 10^{-5}` per hour
• Giving us the MTTF as the reciprocal of the rate: `1/{1.1 times 10^{-5}} ~~ 90909` hours

Metrics: Failure Rate
The failure rate:
• is often measured in failures per `10^9` (billion) hours (this is known as FIT, for failures in time)
1 FIT is one failure every 114155 years.
Examples (from previous slides):
1. for the disk, rate of `10^9//10^6\ =\ 1000\ "FIT"`
2. for the fan, rate of `10^9//10^5\ =\ 10000\ "FIT"`
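Converting between MTTF and FIT is a single division either way. A minimal sketch, with hypothetical helper names, reproducing the two examples above:

```python
HOURS_PER_BILLION = 10**9  # FIT counts failures per 10^9 hours

def mttf_to_fit(mttf_hours: float) -> float:
    """Failure rate in FIT for a component with the given MTTF (hours)."""
    return HOURS_PER_BILLION / mttf_hours

def fit_to_mttf(fit: float) -> float:
    """MTTF in hours for a component with the given failure rate in FIT."""
    return HOURS_PER_BILLION / fit

print(mttf_to_fit(10**6))  # disk: 1000 FIT
print(mttf_to_fit(10**5))  # fan: 10000 FIT
```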

Restorations
When a system fails, it must be repaired

Metrics: MTTR
MTTR:
• mean time to repair
Examples — if a power unit fails:
• it may take 24 hours (say) for it to be replaced
• or perhaps 168 hours (one week)

Metrics: MTBF
MTBF:
• mean time between failures
• `"MTTF" + "MTTR"`

Metrics: Availability
Availability:
• the proportion of time during which service is satisfactorily delivered
• `"MTTF" / {"MTTF" + "MTTR"}`
Availability is usually quoted as a percentage.

Availability: Example
If:
• MTTF is `10^5` hours
• MTTR is 168 hours
Then:
• availability is `10^5/{10^5+168}\ =\ 99.83%`
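The availability formula translates directly into code. A small sketch (the function name is an assumption), using the figures from this example:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

a = availability(10**5, 168)
print(f"{a:.2%}")  # 99.83%
```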

Systems
Computer systems consist of a number of components:
• e.g. processor, memory, bus, disk, fan, etc.
If components individually have (independent) exponentially distributed lifetimes, and the system fails as soon as any component fails:
• then so too does the system as a whole

Example
Assume a disk subsystem with the following components:
• 10 disks, each rated at `1 xx 10^6` MTTF (hours)
• 1 ATA controller, `5 xx 10^5` MTTF
• 1 power supply, `2 xx 10^5` MTTF
• 1 fan, `2 xx 10^5` MTTF
• 1 ATA cable, `1 xx 10^6` MTTF
Assuming exponentially-distributed lifetimes and independent failures:
• calculate the MTTF of the disk subsystem as a whole

Example: Failure rate of system …
Failure rate of system:
`10 xx 1/{1 xx 10^6} + 1/{5 xx 10^5} + 2 xx 1/{2 xx 10^5} + 1/{1 xx 10^6}`
`= {10 + 2 + 10 + 1} / {10^6}`
`= 23 / {10^6\ "hours"}`

Example: Failure rate of system …
Or:
• `23 / {10^6\ "hours"} xx 10^9 = 23000\ "FIT"`

Example: MTTF of the system as a whole?
The MTTF of the system as a whole is the inverse of the failure rate:
`1 / "failure rate of system"`
`= 10^6 / 23 ~~ 43500\ "hours"` (just under five years, approx.)

Example: Availability of system as a whole …
Assume any failed component will be repaired in 24 hours.
Availability:
`"MTTF" / {"MTTF" + "MTTR"}`
`= 43500 / { 43500 + 24 } = 99.945%`

Example: Availability of system as a whole …
Alternatively, assume any failed component will be repaired in 168 hours (one week).
Availability:
`"MTTF" / {"MTTF" + "MTTR"}`
`= 43500 / { 43500 + 168 } = 99.615%`
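The whole disk-subsystem example can be reproduced in one short Python sketch. The dictionary layout and function names are illustrative assumptions; the component counts and MTTF values are those given in the example. Note that the unrounded system MTTF comes out at about 43478 hours, which the slides round to 43500.

```python
# (count, MTTF in hours) for each component type in the disk subsystem.
components = {
    "disk": (10, 1e6),
    "ATA controller": (1, 5e5),
    "power supply": (1, 2e5),
    "fan": (1, 2e5),
    "ATA cable": (1, 1e6),
}

def system_failure_rate(parts):
    """Sum of the individual failure rates (failures per hour), assuming
    independent, exponentially distributed lifetimes."""
    return sum(count / mttf for count, mttf in parts.values())

def availability(mttf, mttr):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

rate = system_failure_rate(components)  # failures per hour
mttf = 1.0 / rate                       # hours

print(f"rate = {rate * 1e9:.0f} FIT")
print(f"MTTF = {mttf:.0f} hours")
for mttr in (24, 168):
    print(f"MTTR {mttr:>3} h: availability = {availability(mttf, mttr):.3%}")
```

The availabilities agree with the slides to three decimal places, so the rounding of the MTTF does not affect the conclusions.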

Key Points
This calculation is possible (and simple) because:
• of the assumptions of exponential distributions and independent failures
• of the simplicity of combining exponential distributions
• of the mean and the rate being merely the reciprocal of one another

Improving Reliability
One common approach to improving reliability is:
• redundancy

Improving Reliability: Example
What would be the MTTF of a power supply and its availability if:
• the MTTF of an individual power supply is `2xx10^5` hours
• we add an additional (redundant) power supply, and
• the MTTR for the power supply unit is 24 hours?

Well, let’s see …
Mean time to individual power supply failure:
• `"MTTF"_{"individual"}\ =\ 2xx10^5\ "hours"`
Mean time to any power supply failure:
• `"MTTF"_{"any"}\ =\ {"MTTF"_{"individual"}}/2`

And …
Probability of second failure before first failure is repaired (approximately, assuming the MTTR is much smaller than the MTTF):
• `{"MTTR"_{"individual"}}/{"MTTF"_{"individual"}}`

And …
MTTF of power supply pair:
• `"MTTF"_{"any"} times 1/{"probability of second failure before repair"}`
• `{"MTTF"_{"any"}}/{"probability of second failure before repair"}`
• `{{"MTTF"_{"individual"}}/2} / {({"MTTR"_{"individual"}}/{"MTTF"_{"individual"}})}`

And …
MTTF of power supply pair:
• `{{"MTTF"_{"individual"}}/2} / {({"MTTR"_{"individual"}}/{"MTTF"_{"individual"}})}`
• `{"MTTF"_"individual"^2} / {2xx"MTTR"_{"individual"}}`

So, let’s try that out …
Using the values from the previous example:
• `{"MTTF"_"individual"^2} / {2xx"MTTR"_{"individual"}}`
• `{ (2xx10^5)^2} / {2xx24} \ ~~\ 83xx10^7` hours
So the MTTF of the pair is about 4167 times that of a single power supply:
• redundancy works!
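The steps of the redundant-pair calculation can be written out as a short Python sketch. The function name is an assumption, and the code uses the same approximation as the slides: the chance of a second failure during a repair is taken to be MTTR / MTTF.

```python
def pair_mttf(mttf_individual: float, mttr: float) -> float:
    """Approximate MTTF of a redundant pair, following the steps above."""
    mttf_any = mttf_individual / 2                   # first failure of either unit
    p_second_before_repair = mttr / mttf_individual  # approximation used in the slides
    return mttf_any / p_second_before_repair         # equals MTTF^2 / (2 * MTTR)

single = 2e5  # hours, MTTF of one power supply
pair = pair_mttf(single, 24)
print(f"pair MTTF   = {pair:.3g} hours")
print(f"improvement = {pair / single:.0f}x")
```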

Now …
What would be the effect of adding a third power supply?

Third Power Supply …
Assuming an MTTR of 24 hours:
• `{8xx10^15} / { 9 xx 24} \ ~~ \ 37 xx 10^12\ "hours"`
• or … failures occur, on average, about 4 billion years apart
Assuming an MTTR of about 2 months:
• `{8xx10^15} / { 9 xx 1428} \ = \ 622 xx 10^9\ "hours"`
• or … failures occur, on average, about 71 million years apart

Note …
The analysis above applies to any form of redundancy:
• mirrored disks
• hot spare
Google use three replicas of all user data:
• hopefully in storage systems with independent failures

Data Centres: 1
Assume:
• a service provider runs `10^6` (one million) servers
• the MTTF for a server is `17520` hours (two years)
What is the MTTF for any server?
`1.0512` minutes!

Data Centres: 2
Assume:
• a service provider runs `10^6` (one million) servers
• the MTTF for a server is `35040` hours (four years)
What is the MTTF for any server?
`2.1024` minutes!
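Both data-centre figures come from the same rule: with independent, exponentially distributed failures, the fleet's failure rate is the sum of the per-server rates. A minimal sketch, with a hypothetical function name:

```python
def fleet_minutes_between_failures(servers: int, mttf_hours: float) -> float:
    """Mean time (in minutes) until some server in the fleet fails."""
    fleet_rate_per_hour = servers / mttf_hours  # identical rates add up
    return 60.0 / fleet_rate_per_hour

for mttf_hours in (17520, 35040):
    minutes = fleet_minutes_between_failures(10**6, mttf_hours)
    print(f"server MTTF {mttf_hours} h: a failure every {minutes:.4f} minutes")
```

Doubling the per-server MTTF only doubles the interval between failures across the fleet, which is why failures remain routine at this scale.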

So…
If you’re running large numbers of machines:
• expect failures: they’re normal and common, regardless of the quality of your hardware

Overall: Potential Problems?
Assumptions:
• failures may not be independent
• failures may not be exponentially distributed
Note
Note to self: the bathtub curve.

Last year’s exam…
An ax consists of a handle and a blade:
• Assume that the MTTF of a handle is 100 hours and the MTTF of a blade is 250 hours. Further assume that failures are exponentially distributed.
• Calculate the MTTF of an ax as a whole.
• Calculate the MTTF of an ax as a whole.
If the MTTR of an ax is one hour:
• Calculate the (system) availability of such axes. Express your answer as a percentage.

Done