Before Terabytes Fall: Disk reliability in Windows Vista and beyond

Frank Shu, Program Manager, WDEG-Storage, Microsoft Corporation; Matthew Kerner, Program Manager, Windows Diagnosis, Microsoft Corporation


Page 1: Before Terabytes Fall Disk reliability in Windows Vista and beyond Frank Shu Program Manager WDEG-Storage Microsoft Corporation Matthew Kerner Program

Before Terabytes Fall: Disk reliability in Windows Vista and beyond

Frank Shu, Program Manager, WDEG-Storage, Microsoft Corporation

Matthew Kerner, Program Manager, Windows Diagnosis, Microsoft Corporation

Page 2:

Windows Storage Devices: Strategic pillars

Optical Platform (Client/Consumer): Timely, comprehensive, quality platform support for optical devices

Storage Fabrics (Server/Enterprise): Leading platform enabling storage fabric adoption

Personal Storage (Client/Consumer): Optimized platform features enabling your Windows experience, here and now

Preferred Storage Platform (Partner/Customer): Preferred platform for developing, deploying, and using storage devices

Page 3:

Session Outline

Introduction (Frank Shu)

Windows Vista Disk Diagnostics (Matthew Kerner)

Future Technology (Frank Shu)

Demo (Microsoft and Samsung)

Page 4:

What Matters Most To Our Users?

A consumer bought a new computer and it works great at work and at home. She couldn’t do her everyday tasks without it. What matters most to her?

a) CPU power

b) Network connection

c) Battery life

d) Something else…

Page 5:

The Answer Is…

The Data

Page 6:

Protecting Data: Windows Vista disk diagnostics

Matthew Kerner

Page 7:

Quantifying Disk Failures

Catastrophic disk failures

    ~200 disks replaced per week at Microsoft in 2003

    Top driver of Microsoft support’s hardware-related support calls in both client and server

    Based on Microsoft figures, disk failures cost many millions of dollars per year in enterprises

Localized failures (bad blocks)

    Kernel and user-mode crashes

        1.7% of customer-reported Microsoft Online Crash Analysis crashes are due to disk errors

    Application hangs during read recovery

Page 8:

Disk Failure Mitigations

Prevention

    Hybrid hard disks (mobile systems)

    RAID

Catastrophic failure recovery

    Data backup

    Disk replacement

Localized failure recovery

    Repair from redundant copy

    Restore from backup

Page 9:

Windows Vista Disk Diagnostics

Purpose: Save user data before catastrophic disk failure

Client SKUs

Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) polling triggers the diagnostic

    Uses S.M.A.R.T. trip status only; no threshold/attribute comparison

Warns the user of impending failure and walks them through backup and replacement

    Windows Vista backup improvements

Page 10:

Disk Diagnostics Details

Disk class driver polls S.M.A.R.T. status hourly, as it has done since Windows 2000

Based on industry feedback, no use of Disk Self-Test or attribute comparison

Failure triggers user-mode code

    Filter out duplicate failures

    Log SMART READ LOG details to OS event log

        Device error count from summary error log sector

        Life timestamp from most recent error log entry

    Trigger user-context interactive resolution

        Customizable by Group Policy

        Print instructions, walk user through backup
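The flow on this slide (poll, react only to the trip status, filter duplicates, log, then alert the user) can be sketched roughly as follows. This is an illustrative model only, not the Windows implementation: `SmartStatus`, `DiskDiagnostic`, `on_poll`, and all field names are hypothetical, and the poll result is simulated rather than read from a drive.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class SmartStatus:
    """Hypothetical snapshot of one poll result (not a real Windows structure)."""
    disk_id: str
    tripped: bool             # S.M.A.R.T. trip status reported by the drive
    device_error_count: int   # assumed to come from the summary error log sector
    life_timestamp: int       # assumed to come from the most recent error log entry


@dataclass
class DiskDiagnostic:
    """Hypothetical user-mode handler invoked after each poll."""
    notify_user: Callable[[str], None]
    event_log: List[str] = field(default_factory=list)
    _reported: Set[str] = field(default_factory=set)

    def on_poll(self, status: SmartStatus) -> bool:
        """Handle one poll result; return True if the user was alerted."""
        if not status.tripped:
            return False  # only the trip status matters, no attribute comparison
        if status.disk_id in self._reported:
            return False  # filter out duplicate failures for the same disk
        self._reported.add(status.disk_id)
        # Record the two details the slide says are logged.
        self.event_log.append(
            f"{status.disk_id}: device_error_count={status.device_error_count}, "
            f"life_timestamp={status.life_timestamp}"
        )
        # Hand off to the interactive, user-context resolution.
        self.notify_user(
            f"Disk {status.disk_id} reports an impending failure. Back up your data."
        )
        return True
```

Calling `on_poll` three times for the same disk (healthy, tripped, tripped again) would alert the user once and write one log entry; the second tripped poll is dropped as a duplicate.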

Page 11:

Startup Repair/Windows Recovery Environment

Purpose: Recover from non-bootable states, including those caused by disk failures

Automatic failover on boot failure to recovery partition

    Optionally deployed by OEM

    Available on installation media

Hands-free diagnosis and repair of top non-boot issues

Page 12:

Corrupted File Recovery

Purpose: Turn repeat user-mode crashes caused by corrupted system binaries into a one-time crash with silent repair from cache

Windows Error Reporting crash handler triggers the diagnostic on inpage error crashes due to bad blocks

Diagnoses corrupted system files

Silent repair from System File Cache

Page 13:

Windows Vista Disk Diagnostics

Matthew Kerner, Program Manager, Windows Diagnosis

Page 14:

Opportunities For Future Technology

Proactive failure prevention

Reduce scenario pain by enabling resolutions other than just data recovery

    Requires finer-grained failure description to help host choose the best resolution

Increase warning time before failures to allow users to save data

Page 15:

Future Technology: Protecting User Data And Preventing Hard Drive Failure Proactively

Frank Shu

Page 16:

What Is PRCS?

Proactive Reporting and Correcting Safeguard (PRCS) enables a device and host to correct failure conditions proactively

Device can report hostile conditions before damage or failure occurs

Host reacts to a device event in real time based on policy and user preference

A proposal for the PRCS protocol has been submitted to T13

Page 17:

Why Is PRCS Important?

Users’ digital data is more valuable than ever before

Disk drive capacity continues to increase

Not every PC user can afford RAID

Deliver on opportunities for improvements beyond S.M.A.R.T.

Page 18:

Goals Of PRCS

Proactively protect user data

Improve the user experience when data is at risk

Reduce OEM’s customer support costs

Reduce warranty costs for disk drive vendors

Page 19:

PRCS Features

Device monitors its own conditions in real time

    Reduce host monitoring performance impact

Device sends meaningful PRCS events to the host for correction of hostile conditions and data protection

    No translations or guesses required

Host acts on device’s PRCS event proactively according to policy and user preference
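The host side of this exchange can be pictured as a lookup from a reported condition to an action, with user preference taking priority over the default policy. The sketch below is only an illustration of that idea: the slides do not define PRCS event types or host actions, so `Condition`, `Action`, `DEFAULT_POLICY`, and `choose_action` are all hypothetical names, and letting the user override win is an assumption.

```python
from enum import Enum
from typing import Dict, Optional


class Condition(Enum):
    """Hypothetical hostile conditions a device might report."""
    HIGH_TEMPERATURE = "high_temperature"
    VIBRATION = "vibration"
    SHOCK = "shock"


class Action(Enum):
    """Hypothetical host responses."""
    IGNORE = "ignore"
    WARN_USER = "warn_user"
    THROTTLE_IO = "throttle_io"
    START_BACKUP = "start_backup"


# Assumed default policy; a real policy would come from the administrator.
DEFAULT_POLICY: Dict[Condition, Action] = {
    Condition.HIGH_TEMPERATURE: Action.THROTTLE_IO,
    Condition.VIBRATION: Action.WARN_USER,
    Condition.SHOCK: Action.START_BACKUP,
}


def choose_action(
    condition: Condition,
    policy: Optional[Dict[Condition, Action]] = None,
    user_overrides: Optional[Dict[Condition, Action]] = None,
) -> Action:
    """Pick the host action for a reported condition.

    User preference wins over policy; unknown conditions fall back to a warning.
    """
    if user_overrides and condition in user_overrides:
        return user_overrides[condition]
    active_policy = DEFAULT_POLICY if policy is None else policy
    return active_policy.get(condition, Action.WARN_USER)
```

Because the device names the condition directly, the host needs no translation step: `choose_action(Condition.SHOCK)` goes straight to whatever the policy prescribes for shock.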

Page 20:

PRCS Advantages

PRCS is proactive

    Taking a corrective action before errors occur

    Protecting data when it is at risk

PRCS is designed for end users, not just computer experts

    No need to understand a cryptic message to benefit from PRCS. For example: “The previous self-test completed having the electrical element of the test failed”

    PRCS enables transparent mitigation of a hostile condition or a recovery process

    Users do not need to configure a self-test mode or reporting method

    Users control policy as desired

Page 21:

Proactive Disk Diagnostics

Debasis Baral, Vice President of Engineering, Samsung

Page 22:

HDD Reliability 101

HDD reliability and performance are negatively impacted by extremes in the following operating conditions

Temperature (demo)

Vibration (demo)

Shock (demo)

Duty cycle

Altitude

Humidity

A combination of the above conditions

A history of the above combinations

Page 23:

Reliability Versus Temperature

HDD life decreases as temperature rises

Failure rates increase exponentially with temperature for all HDD suppliers

An environmental temperature increase from 25 °C to 100 °C could translate into 10 to 50x shorter life

Ref.: Samsung reliability tests

Samsung HDD Lab Engineering Sample Data
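To get a feel for what "10 to 50x shorter life" over a 75 °C rise implies, one can fit a simple exponential to the two endpoints. The sketch below assumes life is proportional to exp(-k·T) with T in degrees Celsius; that functional form is an assumption made for illustration (the slides only say the dependence is exponential), and the function names are hypothetical.

```python
import math


def decay_constant(life_ratio: float, t_low_c: float = 25.0, t_high_c: float = 100.0) -> float:
    """Per-degree constant k so that life(t_high) = life(t_low) / life_ratio."""
    return math.log(life_ratio) / (t_high_c - t_low_c)


def relative_life(temp_c: float, k: float, t_ref_c: float = 25.0) -> float:
    """Expected life at temp_c as a fraction of the life at t_ref_c."""
    return math.exp(-k * (temp_c - t_ref_c))


def halving_interval(k: float) -> float:
    """Temperature rise, in degrees Celsius, that halves the expected life."""
    return math.log(2) / k


if __name__ == "__main__":
    for ratio in (10, 50):
        k = decay_constant(ratio)
        print(f"{ratio}x over 75 C: k = {k:.4f} per C, "
              f"life halves every {halving_interval(k):.1f} C, "
              f"relative life at 50 C = {relative_life(50, k):.2f}")
```

Under this assumed model, the quoted range corresponds to expected life halving for roughly every 13 to 23 °C of added temperature.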

Page 24:

Performance Versus Vibration

Data throughput or drive performance can be significantly affected in the presence of vibration

Effect of vibration is reversible

The cumulative effect of vibration on long-term drive reliability is a subject of ongoing research

[Figure: Performance Loss With Vibration. Axes: vibration level (arbitrary units, 0.05 to 1.30), throughput in MB/s, and off-track as % of track pitch. Series: throughput (MB/s) and off-track. Source: Samsung HDD Lab Engineering Sample Data]

Page 25:

Reliability Versus Shock

Excessive shock is the major cause of failure in both PC and consumer electronics environments

[Figure: Shock Modeling. Courtesy: E. Jayson and Frank Talke, UC San Diego]

[Figure: Op. Shock Scratches. Damage by corners, leading edge, and side edges of the slider.]

Operating shock damage

Non-operating shock damage

Page 26:

Reliability Design Guidelines

Failure modes and failure rates of disk drives depend on their operating environments

Temperature and handling (shock and vibration) are major factors impacting HDD reliability

HDD reliability will be enhanced if the OS detects and manages reliability risks and stress events intelligently (PRCS)

Users can improve HDD data reliability by correctly responding to PRCS events

Page 27:

PRCS

Kai Chen, Microsoft Corporation

Debasis Baral, Samsung

Page 28:

Call To Action

Test your drives with Windows Vista Disk Diagnostics and send feedback

Ensure your drives comply with ATA-7 specs to surface device error count and life timestamp

Engage with the Startup Repair team to build a plan for Startup Repair in OEM factory processes

Participate in T13 discussions on PRCS

Plan your device designs in line with PRCS guidelines

Page 29:

Additional Resources

Whitepapers

    Windows Recovery Environment/Startup Repair/Built-in Diagnostics: http://www.microsoft.com/technet/windowsvista/evaluate/feat/relperf.mspx

Feedback/Questions

    Windows Vista Disk Diagnosis: Dfdfeed @ microsoft.com

    Corrupt File Recovery: Dfdfeed @ microsoft.com

    Windows Recovery Environment/Startup Repair: Recovery @ microsoft.com

    PRCS: Prcsdisc @ microsoft.com

Page 30:

© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
