failure rates mttr mttf


Upload: riadelectro

Post on 25-Oct-2015


Page 1: Failure Rates MTTR MTTF

Failure Rates

Terminology:

MTTF - Mean time to failure

MTBF - Mean time between failures

MTTR - Mean time to repair

Availability = MTTF/MTBF

Downtime = 1 - Availability
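As a small worked example of the two formulas above, the sketch below assumes the usual relationship MTBF = MTTF + MTTR (one failure cycle is the time up plus the time to repair), which the notes do not state explicitly. The function names and the sample numbers are illustrative only.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / MTBF.

    Assumes MTBF = MTTF + MTTR (an assumption, not stated in the notes).
    """
    mtbf_hours = mttf_hours + mttr_hours
    return mttf_hours / mtbf_hours


def downtime(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is down: 1 - availability."""
    return 1.0 - availability(mttf_hours, mttr_hours)


# Hypothetical system: runs 999 hours on average, then takes 1 hour to repair.
print(f"availability = {availability(999.0, 1.0):.3f}")
print(f"downtime     = {downtime(999.0, 1.0):.3f}")
```

With these made-up numbers the system is up 99.9% of the time, so shortening repairs (MTTR) improves availability just as lengthening MTTF does.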

MTTF of disk = 300,000 hours ~= 34 years

o This is a misleading measure; no one really knows the true figure, but it's probably lower.

AFR (Annualized Failure Rate) is a more meaningful measurement.

o A normal bathtub curve would be expected, but the data doesn't show one.

o In reality, the failure rate starts low (due to factory testing), then trends sharply upwards and levels off at about 10%.
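To see why the quoted MTTF is hard to reconcile with the observed failure rates, the sketch below converts an MTTF into an implied AFR. It assumes a constant failure rate (exponentially distributed lifetimes), which is a modelling assumption and exactly the kind of simplification the data above contradicts; the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760


def mttf_to_years(mttf_hours: float) -> float:
    """Express an MTTF given in hours as years."""
    return mttf_hours / HOURS_PER_YEAR


def afr_from_mttf(mttf_hours: float) -> float:
    """Probability of failing within one year, ASSUMING a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


print(f"MTTF        = {mttf_to_years(300_000):.1f} years")
print(f"implied AFR = {afr_from_mttf(300_000):.2%}")
```

Under that assumption a 300,000-hour MTTF implies an AFR of roughly 3%, well below the roughly 10% the notes say the measured rate levels off at.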

The RAID 5 failure rate increases more sharply and to a higher percentage. This is because if two drives fail, the data on the whole array is lost.

Many RAID boxes have a hot spare they switch to in case of a failure.

ASSUMPTION: Failures are independent.
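The independence assumption makes the array-level arithmetic easy to sketch. The code below is a deliberately crude model: it assumes each drive fails within a year independently with the same probability, and it ignores repair, rebuild time and hot spares, so it overstates the real risk of losing a RAID 5 array. The function names and the 5-drive, 10% figures are illustrative.

```python
def p_any_drive_fails(afr: float, n_drives: int) -> float:
    """Probability that at least one of n independent drives fails in a year."""
    return 1.0 - (1.0 - afr) ** n_drives


def p_two_or_more_fail(afr: float, n_drives: int) -> float:
    """Probability that two or more of n independent drives fail in a year.

    With RAID 5 this is the data-loss case, IF failed drives are never replaced.
    """
    p_none = (1.0 - afr) ** n_drives
    p_exactly_one = n_drives * afr * (1.0 - afr) ** (n_drives - 1)
    return 1.0 - p_none - p_exactly_one


# Hypothetical array: 5 drives, each with a 10% annual failure rate.
print(f"at least one failure : {p_any_drive_fails(0.10, 5):.4f}")
print(f"two or more failures : {p_two_or_more_fail(0.10, 5):.4f}")
```

If failures are correlated (same batch, same power supply, same room), the independence assumption breaks and the real numbers are worse.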

Distributed Systems

Unlike a system call, a request in a distributed system cannot be made by dropping into kernel mode, because the service runs on another machine. Instead, the two sides use message passing.

Specifically, they use remote procedure calls (RPC).

o An RPC is abstracted to look like a normal function call, like close(19). Under the hood, it's very different.

Hard modularity is obtained for free, because caller and callee run on separate machines.

On the other hand, we lose call by reference, because there is no shared address space.

It is limited by network bandwidth.

It is far less secure.

Messages can get lost or duplicated.

The server can become a bottleneck (however, this can happen to anything).

Machines may have different architectures.

o This can lead to the same data being interpreted differently, for example big-endian vs. little-endian byte order.
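A short illustration of the byte-order problem: the same four bytes decode to different integers depending on which convention the reader assumes. This uses only Python's built-in int conversions; the variable names are illustrative.

```python
# The integer 1 written as four bytes, most significant byte first (big-endian).
raw = (1).to_bytes(4, byteorder="big")

# A big-endian reader recovers the original value...
as_big = int.from_bytes(raw, byteorder="big")

# ...but a little-endian reader sees the bytes in the opposite order.
as_little = int.from_bytes(raw, byteorder="little")

print(raw.hex(), as_big, as_little)  # 00000001 1 16777216
```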

Dealing with architectural differences

Standard interchange format: XML text (this has considerable network and CPU overhead).

Have everybody know everybody else's architecture (this doesn't scale: O(N^2) conversions).

Standardize on big-endian (this is how it is done on the net).

o This is what is normally done.

o The same approach is used for every other architectural difference.

o In the case of big-endian vs. little-endian, x86 has a machine instruction (bswap) that swaps back and forth between the two.

Marshalling: Converting data in memory to a data format suitable for storage or transmission.

o This is also known as serialization, or pickling.

o It's a pain to do this by hand every time you make a function call.

o Stubs (automatically derived from a protocol description) do the marshalling for you.
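A minimal sketch of what a pair of hand-written stubs might do, using Python's struct module with the "!" prefix (network byte order, i.e. big-endian). The wire layout here is hypothetical and invented for illustration: two unsigned 16-bit coordinates, a one-byte name length, then an ASCII color name. A real RPC system would generate such code from a protocol description.

```python
import struct
from typing import Tuple

# Hypothetical wire format: x (uint16), y (uint16), name length (uint8).
# "!" selects network byte order (big-endian) with no padding.
_HEADER = struct.Struct("!HHB")


def marshal_draw(x: int, y: int, color: str) -> bytes:
    """Client-side stub helper: turn in-memory arguments into bytes."""
    name = color.encode("ascii")
    return _HEADER.pack(x, y, len(name)) + name


def unmarshal_draw(message: bytes) -> Tuple[int, int, str]:
    """Server-side stub helper: turn the bytes back into arguments."""
    x, y, name_len = _HEADER.unpack_from(message)
    start = _HEADER.size
    name = message[start:start + name_len].decode("ascii")
    return x, y, name


wire = marshal_draw(10, 20, "blue")
print(wire.hex())
print(unmarshal_draw(wire))
```

Because both sides agree on big-endian, the result is the same whatever the native byte order of either machine.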

Example of RPC

Scenario: A client talking to a window server.

Request: Draw 10 20 blue

Response: OK

Result: The pixel at coordinates (10,20) is turned blue.
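The exchange above can be sketched with a toy, in-process stand-in for the window server. Only the "Draw x y color" request and the "OK" response come from the notes; the class, the client stub and the error replies are hypothetical, and a real system would send the request over a network rather than call handle() directly.

```python
class WindowServer:
    """Toy stand-in for the window server; stores pixels in a dict."""

    def __init__(self) -> None:
        self.pixels = {}  # (x, y) -> color name

    def handle(self, request: str) -> str:
        """Parse one text request and return a text response."""
        parts = request.split()
        if len(parts) == 4 and parts[0] == "Draw":
            try:
                x, y = int(parts[1]), int(parts[2])
            except ValueError:
                return "ERROR bad coordinates"  # hypothetical error reply
            self.pixels[(x, y)] = parts[3]
            return "OK"
        return "ERROR unknown request"  # hypothetical error reply


def draw(server: WindowServer, x: int, y: int, color: str) -> bool:
    """Client stub: looks like a local call, but builds a message and parses the reply."""
    response = server.handle(f"Draw {x} {y} {color}")
    return response == "OK"


server = WindowServer()
print(draw(server, 10, 20, "blue"), server.pixels)
```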

To optimize, commands could be combined into more complicated commands.

o For example, there could be a rectangle command to draw a rectangle, instead of just a point.

o This complicates the API further, though.

RPC Failure Modes

Lost messages

Duplicated messages

Corrupted messages

Network may be down or slow

Server may be down or slow

Solutions for Lost Messages:

AT-LEAST-ONCE RPC: If no response, resend the request (this isn't always the correct action to take).

AT-MOST-ONCE RPC: If no response, report an error.

EXACTLY-ONCE RPC: This is the ideal. It is also very hard.
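The difference between the first two policies shows up when a reply is lost after the server has already done the work. The sketch below simulates that with a hypothetical server whose only operation increments a counter (a non-idempotent operation); returning None stands in for a client-side timeout. All names are illustrative.

```python
class LossyServer:
    """Toy server: executes every request, but loses the first few replies."""

    def __init__(self, drop_replies: int) -> None:
        self.counter = 0
        self.drop_replies = drop_replies

    def increment(self):
        self.counter += 1            # the request always executes...
        if self.drop_replies > 0:    # ...but the reply may be lost
            self.drop_replies -= 1
            return None              # None models a timeout at the client
        return self.counter


def at_least_once(server: LossyServer, max_tries: int = 5) -> int:
    """Resend until a reply arrives; the request may run more than once."""
    for _ in range(max_tries):
        reply = server.increment()
        if reply is not None:
            return reply
    raise TimeoutError("no response after retries")


def at_most_once(server: LossyServer) -> int:
    """Send once; on silence, report an error instead of resending."""
    reply = server.increment()
    if reply is None:
        raise TimeoutError("no response; the request may or may not have run")
    return reply


retrying = LossyServer(drop_replies=1)
print(at_least_once(retrying), retrying.counter)  # the increment ran twice

cautious = LossyServer(drop_replies=1)
try:
    at_most_once(cautious)
except TimeoutError as exc:
    print("error:", exc, "| counter =", cautious.counter)
```

Resending is harmless for idempotent requests such as drawing the same pixel blue again, which is why at-least-once is not always wrong; for the counter above it silently double-counts.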

Performance Issues

Travel time between computers is often the largest source of delay.

o To alleviate this, we can use asynchronous calls. However, this can cause other problems: if calls depend on each other, an earlier call failing can corrupt the state of the program.

o Another solution is to coalesce calls. For example, there could be a rectangle command to draw a rectangle, instead of just a point. This complicates the API further.
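To make the benefit of coalescing concrete, the sketch below counts messages for filling the same area pixel by pixel versus with one rectangle command. The server class and its method names are hypothetical, and each method call stands in for one network round trip.

```python
class CountingServer:
    """Toy server that counts how many messages (round trips) it receives."""

    def __init__(self) -> None:
        self.pixels = {}
        self.messages = 0

    def draw_pixel(self, x: int, y: int, color: str) -> None:
        self.messages += 1
        self.pixels[(x, y)] = color

    def draw_rect(self, x: int, y: int, width: int, height: int, color: str) -> None:
        self.messages += 1  # one message, however large the rectangle
        for dx in range(width):
            for dy in range(height):
                self.pixels[(x + dx, y + dy)] = color


# Fill a 3x2 area one pixel at a time: one round trip per pixel.
slow = CountingServer()
for dx in range(3):
    for dy in range(2):
        slow.draw_pixel(10 + dx, 20 + dy, "blue")

# Fill the same area with a single coalesced command.
fast = CountingServer()
fast.draw_rect(10, 20, 3, 2, "blue")

print(slow.messages, fast.messages, slow.pixels == fast.pixels)  # 6 1 True
```

The price is the one the notes mention: the server now has a larger API to implement and keep compatible.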

Professional Maintenance

Professional Maintenance, 25 November 2007

Moving from repair-driven maintenance, acting as firefighters, to a profession of preventers and experts aiming for zero breakdowns is a cultural change for:

management:

o recognizing what is not seen: preventive work,

o rather than what is seen: breakdown repairs

the maintenance manager: a different challenge

the maintenance technicians: a different job

The objectives of professional maintenance:

Maximize equipment reliability at an economical cost. Improving reliability also improves safety and product quality.

Eliminate unplanned, improvised maintenance activities.

Use the maintenance methods (periodic, condition-based, autonomous, …) according to the criticality of the machines, for the best cost.

Develop the skills of maintenance staff and operators to support the professional maintenance strategy.

Create a zero-failure culture.

Plan activities so as to minimize production stoppages.

"To do the minimum amount of work necessary to eliminate breakdowns and make maximum use of the equipment"

The 8 foundations of a maintenance department:

1. Establish the equipment classification: AA, A, B and C

2. Define the flows of information and parts (work flow management)

3. Develop the machine files: equipment know-how

4. Spare parts and storeroom management

5. Maintenance resources: 5S in the maintenance workshop, management of subcontractors, skills of the maintenance technicians

6. Lubrication management

7. Breakdown management: collecting information, analysis

8. Establishing and tracking key indicators

Maintenance indicators:

MTBF (Mean Time Between Failures): the mean time between two breakdowns measures the reliability of the equipment and the maintenance department's effectiveness at preventing breakdowns.

MTTR (Mean Time To Repair): the mean repair time measures the maintenance department's effectiveness at repairing.
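As an illustration of how these two indicators might be computed from a breakdown log, the sketch below takes the length of an observation period and the repair duration of each breakdown. Conventions differ between sources; this sketch assumes MTBF is the observation period divided by the number of breakdowns (so that MTBF = MTTF + MTTR, consistent with Availability = MTTF/MTBF earlier in these notes). The function name and the sample log are hypothetical.

```python
from typing import List, Tuple


def mtbf_and_mttr(period_hours: float, repair_hours: List[float]) -> Tuple[float, float]:
    """Return (MTBF, MTTR) in hours for one machine over one observation period.

    ASSUMPTION: MTBF = period / number of breakdowns; other conventions exist.
    """
    n = len(repair_hours)
    if n == 0:
        raise ValueError("no breakdowns recorded; MTBF and MTTR are undefined")
    mtbf = period_hours / n
    mttr = sum(repair_hours) / n
    return mtbf, mttr


# Hypothetical log: 1000 hours observed, three breakdowns repaired in 2, 4 and 6 hours.
mtbf, mttr = mtbf_and_mttr(1000.0, [2.0, 4.0, 6.0])
print(f"MTBF = {mtbf:.1f} h, MTTR = {mttr:.1f} h")
print(f"availability = {(mtbf - mttr) / mtbf:.3f}")
```

Tracked over time, a rising MTBF suggests prevention is working, while a falling MTTR suggests repairs are getting faster.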

The different types of maintenance:

And the use of the maintenance types according to equipment criticality: we will not deploy the same resources for a critical AA machine as for one that is not critical at all!