DevAgent: Analyzing the Reliability of Storage Software Stack via Smart Devices
Mai Zheng, Computer Science Department, New Mexico State University
Background & Motivation
• New NVM-based components are revolutionizing traditional storage systems, potentially creating new failure modes that are difficult to understand
Results & Milestones
[Figure: two plots of per-request write status; x-axis: write requests (0 to 8000); first panel y-axis: 0 = uncommitted, 1 = committed; second panel y-axis: 0 = dropped, 1 = committed]
Acknowledgement
This material is based upon work supported in part by the National Science Foundation (NSF) under Grant Number 1566554 (CRII). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
[1] Understanding the Fault Resilience of File System Checkers (HotStorage’17)
[2] On Fault Resilience of File System Checkers (FAST’17, WiP)
[3] Do Not Blame Devices for All Failures (NVMW’17, poster)
[4] A Generic Framework for Testing Parallel File Systems (PDSW’16)
[5] Reliability Analysis of SSDs under Power Fault (TOCS’16)
[6] Emulating Realistic Flash Device Errors with High Fidelity (NAS’16, poster)
Observations & Approach
SSDs exhibit different numbers of errors when tested on different OSes:
OS (kernel)            SSD 1   SSD 2   SSD 3
Debian 6 (2.6.32)        317     991       2
Ubuntu 14 (3.16)          88       0       1
[Figure: storage software stack (user level, file system, block layer, device driver, device), with DevAgent-FW at the device]
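• Failure example 1: In June 2015, flash-based servers in Algolia data centers started corrupting files. Developers “spent a big portion of two weeks just isolating machines and restoring them as quickly as possible”. Samsung SSDs were mistakenly blamed and blacklisted until, one month later, the developers identified a kernel bug.
• Failure example 2: the same SSDs report very different error counts depending on the OS and kernel version (see the table above).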
• Limitation of existing analysis tools: they rely heavily on the kernel and assume the kernel is correct
• We can no longer completely trust the kernel => analysis tools should minimize their dependency on, and interference with, the kernel
• We can no longer focus on a single component => cross-layer analysis
• We need to focus on generic & fundamental operations => interfaces between layers
• Publications:
• Prototyping DevAgent-FW on the Cosmos OpenSSD platform [3]
• Emulating realistic device behaviors for testing higher software layers [5][6]
• Exposing vulnerabilities in popular file system checkers [1][2]
• Analyzing the fine-grained I/O behavior of parallel file systems [4]
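The emulation work listed above ([5][6]) can be illustrated with a minimal Python sketch. It assumes a simplified device model in which writes are acknowledged from a volatile cache and only become durable after a flush. On an emulated power fault, unflushed writes are dropped. Each request is recorded with the same encoding as the plots: 1 = committed, 0 = dropped.

`EmulatedDevice` and its methods are hypothetical names for illustration. They are not DevAgent's interface, and this toy cache model is not the fault model used in [5][6].

```python
from dataclasses import dataclass, field


@dataclass
class EmulatedDevice:
    """Toy block device with a volatile write-back cache (hypothetical model)."""

    durable: dict = field(default_factory=dict)  # LBA -> data that survives power loss
    cache: dict = field(default_factory=dict)    # LBA -> data acknowledged but not flushed
    status: dict = field(default_factory=dict)   # request id -> 1 (committed) or 0 (dropped)
    pending: list = field(default_factory=list)  # request ids whose data is only in cache

    def write(self, req_id: int, lba: int, data: bytes) -> None:
        # Acknowledge immediately, as a write-back cache would, but keep data volatile.
        self.cache[lba] = data
        self.pending.append(req_id)

    def flush(self) -> None:
        # Persist every cached write and mark the corresponding requests committed.
        self.durable.update(self.cache)
        self.cache.clear()
        for req_id in self.pending:
            self.status[req_id] = 1
        self.pending.clear()

    def power_fault(self) -> None:
        # Emulated power loss: unflushed data vanishes and its requests are dropped.
        self.cache.clear()
        for req_id in self.pending:
            self.status[req_id] = 0
        self.pending.clear()

    def read(self, lba: int):
        # Newest cached data first, then durable data; None means never persisted.
        return self.cache.get(lba, self.durable.get(lba))


dev = EmulatedDevice()
for req_id in range(6):
    dev.write(req_id, lba=req_id, data=bytes([req_id]))
    if req_id == 3:
        dev.flush()  # requests 0-3 reach durable storage
dev.power_fault()    # requests 4-5 were only acknowledged from the cache
print([dev.status[r] for r in range(6)])  # [1, 1, 1, 1, 0, 0]
print(dev.read(5))   # None: the data for request 5 was lost
```

Keeping per-request status separate from the stored data lets a tester compare what the host was told (every write was acknowledged) with what actually survived. A higher software layer under test must tolerate exactly that gap.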
[Figure labels: syscall interface (system calls); device interface (SCSI/NVMe); iSCSI, NVMe over Fabrics; DevAgent-SW; DevAgent-Usr]
• Two key functionalities:
  - record host/device interaction for diagnosis
  - emulate device behavior for testing higher layers
• Firmware implementation (minimal intrusion into the kernel)
• Software implementation (good portability)
  - map function calls to device-level commands to reason about high-level semantics
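The record and map functionalities above can be sketched in Python. The sketch assumes a hypothetical interposition point that sees every device-level command (operation, LBA, block count) and knows which high-level call is currently active. It logs each command in order and then groups the trace by the call that caused it.

`CommandRecorder`, `high_level_call`, `submit`, and the opcode strings are illustrative names only, not DevAgent's API. A real recorder would sit at the SCSI/NVMe device interface rather than in Python.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeviceCommand:
    seq: int                # order in which the device interface saw the command
    timestamp: float        # host-side time the command was recorded
    op: str                 # illustrative opcode, e.g. "WRITE", "READ", "FLUSH"
    lba: int                # starting logical block address
    nblocks: int            # transfer length in blocks
    caller: Optional[str]   # high-level call active when the command was issued


class CommandRecorder:
    """Hypothetical recorder: logs host/device commands and tags each with its caller."""

    def __init__(self) -> None:
        self.trace: List[DeviceCommand] = []
        self._active: List[str] = []  # stack of high-level calls currently in progress

    @contextmanager
    def high_level_call(self, name: str):
        # Mark the start and end of a high-level operation (e.g. a system call).
        self._active.append(name)
        try:
            yield
        finally:
            self._active.pop()

    def submit(self, op: str, lba: int, nblocks: int = 1) -> None:
        # Record the command as it crosses the (emulated) device interface.
        caller = self._active[-1] if self._active else None
        self.trace.append(
            DeviceCommand(len(self.trace), time.monotonic(), op, lba, nblocks, caller)
        )

    def commands_by_call(self) -> dict:
        # Map each high-level call to the device-level commands it produced.
        grouped = defaultdict(list)
        for cmd in self.trace:
            grouped[cmd.caller].append((cmd.op, cmd.lba, cmd.nblocks))
        return dict(grouped)


rec = CommandRecorder()
with rec.high_level_call("fsync(fd=3)"):   # hypothetical high-level call
    rec.submit("WRITE", lba=2048, nblocks=8)
    rec.submit("FLUSH", lba=0, nblocks=0)
rec.submit("READ", lba=4096)               # background I/O with no tagged caller
print(rec.commands_by_call())
# {'fsync(fd=3)': [('WRITE', 2048, 8), ('FLUSH', 0, 0)], None: [('READ', 4096, 1)]}
```

Recording at the device interface keeps the raw trace independent of kernel internals, in line with the goal of minimizing dependency on the kernel. The caller tag is what lets the device-level view be related back to high-level semantics.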
• Identifying the device-level impact of kernel bug patches [3]
[Figure panel labels: real vs. emulated]