![Page 1: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/1.jpg)
Failure Data Collection and Analysis
Archana Ganapathi
Peter Bodik
Wei Xu
![Page 2: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/2.jpg)
Motivation (1)
My machine crashes…
Since 3/1/04…
  3 system crashes
  18 application errors
  96 application hangs
Who cares?
  I do!
  People who share similar experiences
  In general, customer uproar
![Page 3: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/3.jpg)
Motivation (2)
An Internet service has failures…
Who cares?
  Internet service users
  Internet service system administrators
  Anyone affected by the Internet service’s loss of revenue
Failures by cause:
  Hardware 26%
  Software 28%
  Unknown 11%
  Operator 35%
Total: 61 user-visible failures in 12 months at Online Service
![Page 4: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/4.jpg)
Motivation (3)
ROC/RADS needs real failure/attack information to
  drive benchmarks
  evaluate our prototypes
  help us select what work to attack
![Page 5: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/5.jpg)
Data Sources
1000s of individual machines
  Cory/Soda Hall, BOINC
Large clusters at real Internet services
  Internet services
Distributed applications on 100s of machines
  PlanetLab
![Page 6: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/6.jpg)
Individual Machines
![Page 7: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/7.jpg)
Data Collection
Collect minidumps that contain…
  The Stop message/parameters/data
  Loaded drivers
  Processor context for the processor that stopped
  Process info/kernel context for the process/thread that stopped
  The kernel-mode call stack for the thread that stopped
Frequency of collection
  Synchronized with application and system crashes on computers
![Page 8: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/8.jpg)
Analysis results
What happened that is immediately responsible for the crash
  Exact error code
  Brief description, primarily for debugging
Bucketing info, e.g. "driver fault"
Details for debugging, e.g. stack contents
Use Microsoft’s publicly available analysis tools
  Caveat: significant variability in results between internal and public versions of the tool!
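The bucketing step can be pictured as a small classification function. The Python sketch below is an assumption-laden illustration, not how Microsoft's analysis tools assign buckets: only the "driver fault" label comes from the slide, while the function name, the rules, and the other labels are hypothetical.

```python
def bucket(faulting_module: str) -> str:
    """Assign a coarse bucket from the faulting module name (hypothetical rules)."""
    name = faulting_module.lower()
    if name.endswith(".sys"):
        return "driver fault"        # label taken from the slide
    if name in ("ntoskrnl.exe", "ntkrnlmp.exe"):
        return "kernel"              # assumed label
    if "corruption" in name:
        return "memory corruption"   # assumed label
    return "other"                   # assumed label
```

For example, `bucket("watchdog.sys")` falls into the "driver fault" bucket under these assumed rules.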
![Page 9: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/9.jpg)
How we collect minidumps (1)
Corporate Error Reporting (CER)
  http://www.microsoft.com/resources/satech/cer/
  Manage error reports/messages generated by WER (Windows Error Reporting) and other programs
  Configure clients to redirect reports to the CER shared directory
![Page 10: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/10.jpg)
Sample Statistics (25 nodes, 5 days)
![Page 11: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/11.jpg)
Sample Statistics (25 nodes, 5 days)
Crashed Program   Version            Problem
BESConsole.exe    4.1.3.33           hungapp
CDCopier.exe      5.3.4.21           hungapp
CreateCD50.exe    5.3.4.21           hungapp
CreateCD50.exe    5.3.4.21           hungapp
explorer.exe      6.0.2800.1106      shlwapi.dll
firefox.exe       0.8.0.0            hungapp
IAMAPP.EXE        5.1.1.309          hungapp
iexplore.exe      6.0.2800.1106      hungapp
iexplore.exe      6.0.2800.1106      mshtml.dll
matlab.exe        1.0.0.1            hungapp
mozilla.exe       1.6.20040.11308    ntdll.dll
msmsgs.exe        4.7.0.2009         msmsgs.exe
OUTLOOK.EXE       10.0.4510.0        hungapp
thunde~1.exe      0.6.0.0            xpc3250.dll
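As a quick way to summarize the table above, the following Python sketch tallies the Problem column. The rows are copied from the table; the variable names are illustrative assumptions.

```python
from collections import Counter

# (program, version, problem) rows copied from the table above
rows = [
    ("BESConsole.exe", "4.1.3.33", "hungapp"),
    ("CDCopier.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("explorer.exe", "6.0.2800.1106", "shlwapi.dll"),
    ("firefox.exe", "0.8.0.0", "hungapp"),
    ("IAMAPP.EXE", "5.1.1.309", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "mshtml.dll"),
    ("matlab.exe", "1.0.0.1", "hungapp"),
    ("mozilla.exe", "1.6.20040.11308", "ntdll.dll"),
    ("msmsgs.exe", "4.7.0.2009", "msmsgs.exe"),
    ("OUTLOOK.EXE", "10.0.4510.0", "hungapp"),
    ("thunde~1.exe", "0.6.0.0", "xpc3250.dll"),
]

# Count how often each entry in the Problem column occurs.
problems = Counter(problem for _, _, problem in rows)
hangs = problems["hungapp"]          # application hangs
faults = len(rows) - hangs           # reports that name a faulting module

print(f"{hangs} hangs, {faults} module faults, {len(rows)} reports total")
```

In this sample, hangs ("hungapp") account for 9 of the 14 reports.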
![Page 12: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/12.jpg)
How we collect minidumps (2)
BOINC
  For SETI@home-esque apps that pool resources
  Provides client API to send/receive data to/from BOINC server
  Write tools to read info in minidump directory and send to us
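A tool of the kind described in the last bullet might start out like the Python sketch below. It is only a sketch under stated assumptions: the function names are hypothetical, minidumps are assumed to be `*.dmp` files in a single directory, and the transfer to the BOINC server is deliberately left out because the slides do not show the client API.

```python
import hashlib
import json
from pathlib import Path


def collect_minidumps(directory):
    """Return one metadata dict per *.dmp file found in `directory`."""
    reports = []
    for path in sorted(Path(directory).glob("*.dmp")):
        data = path.read_bytes()
        reports.append({
            "file": path.name,
            "size_bytes": len(data),
            # The hash lets the receiver detect duplicate uploads.
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return reports


def build_payload(directory):
    """Serialize the reports as JSON; sending them is left to the caller."""
    return json.dumps(collect_minidumps(directory), indent=2)
```

The payload returned by `build_payload` would then be handed to whatever transport the deployment uses.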
![Page 13: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/13.jpg)
Sample Statistics (50 system crashes)
Thread stuck in device driver 12
Page Fault in Non-Paged Area 10
System Thread Exception Not Handled 6
Unexpected Kernel Mode Trap 6
Kernel Mode Exception Not Handled 5
IRQL Not Less or Equal 3
Driver IRQL Not Less or Equal 3
NTFS File System 2
Bad Pool Caller 2
PFN List Corrupt 1
![Page 14: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/14.jpg)
Sample Statistics (50 system crashes)
watchdog.sys         7
ar5211.sys           6
ibmpmdrv.sys         6
ati3duag.dll         5
SYMEVENT.SYS         3
ipsecw2k.sys         3
memory_corruption    3
ialmdev5.DLL         2
PSCRIPT4.DLL         2
ntoskrnl.exe         2
CLASSPNP.SYS         2
win32k.sys           2
SynTP.sys            1
TDI.SYS              1
ino_fltr.sys         1
ks.sys               1
drvnddm.sys          1
ntkrnlmp.exe         1
Pool_Corruption      1
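To see how the 50 crashes split across kinds of components, the counts above can be grouped by file extension. The Python sketch below copies the counts from the slide; grouping by extension and the "(no module)" label are assumptions made for illustration.

```python
import os
from collections import defaultdict

# Crash counts per blamed component, copied from the slide
counts = {
    "watchdog.sys": 7, "ar5211.sys": 6, "ibmpmdrv.sys": 6, "ati3duag.dll": 5,
    "SYMEVENT.SYS": 3, "ipsecw2k.sys": 3, "memory_corruption": 3,
    "ialmdev5.DLL": 2, "PSCRIPT4.DLL": 2, "ntoskrnl.exe": 2,
    "CLASSPNP.SYS": 2, "win32k.sys": 2, "SynTP.sys": 1, "TDI.SYS": 1,
    "ino_fltr.sys": 1, "ks.sys": 1, "drvnddm.sys": 1, "ntkrnlmp.exe": 1,
    "Pool_Corruption": 1,
}

# Sum the crash counts per lower-cased file extension.
by_kind = defaultdict(int)
for name, n in counts.items():
    ext = os.path.splitext(name)[1].lower() or "(no module)"
    by_kind[ext] += n

for ext, n in sorted(by_kind.items(), key=lambda kv: -kv[1]):
    print(f"{ext:<12} {n:>2}")
```

Grouped this way, `.sys` files account for 34 of the 50 crashes in this sample.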
![Page 15: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/15.jpg)
Metrics (Windows & Linux)
Availability
  system uptime, % time BOINC running
CPU(s)
  # processes, processor queue length, % non-idle
Memory
  available physical memory, free swap space
Disk(s)
  free space
Network(s)
  IP address, packets & bytes sent & received/sec, bandwidth to/from SETI@home server, first-hop bandwidth*, network coordinates*
Static
  CPU type, #, and benchmarks; total memory; OS type
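A few of these metrics can be sampled portably with the Python standard library alone, as in the sketch below. The function name and dictionary keys are assumptions; uptime, memory, and network counters need platform-specific calls or a third-party library and are omitted here.

```python
import os
import platform
import shutil
import time


def snapshot(path="."):
    """Collect a small subset of the listed metrics on Windows or Linux."""
    usage = shutil.disk_usage(path)        # totals for the disk holding `path`
    return {
        "timestamp": time.time(),          # when the sample was taken
        "os_type": platform.system(),      # static: OS type
        "cpu_type": platform.machine(),    # static: CPU architecture
        "cpu_count": os.cpu_count(),       # static: number of CPUs
        "disk_free_bytes": usage.free,     # disk: free space
        "disk_total_bytes": usage.total,
    }
```

Calling `snapshot()` on a timer would yield one record per measurement interval.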
![Page 16: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/16.jpg)
Questions
Other metrics?
Frequency with which to measure them?
What research questions can we answer with this data set?
  original goal: workload to evaluate our node discovery service
  evaluate effectiveness of network coordinates
  evaluate potential to run more than just “embarrassingly parallel” apps on this type of infrastructure, depending on machines’
    uptime
    network connectivity
    available disk space
  distributed analysis?
  security uses?
![Page 17: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/17.jpg)
Internet Services
![Page 18: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/18.jpg)
Data characteristics
Real companies
Multitude of users
Voluminous data (several terabytes)
Systems are complex
  Treat as black box
Use SLT algorithms for analysis
  More data => better models
![Page 19: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/19.jpg)
Analysis Results
Study event logs
  Not necessarily failures
  Can derive models of good & bad behavior
Models with varying granularity
  Use different algorithms
  Vary boundary parameters
For more details see poster: “Towards a General Approach for Event Log Analysis”
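To make "models of good behavior" and "boundary parameters" concrete, here is a deliberately simple Python sketch. It is not the approach from the poster: it swaps in a plain per-event z-score threshold, and the function names, the window representation (a list of event names), and the parameter `k` are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def fit(windows):
    """Learn the per-event mean and std. dev. of counts from 'good' log windows."""
    events = {e for w in windows for e in w}
    model = {}
    for e in events:
        counts = [Counter(w)[e] for w in windows]
        model[e] = (mean(counts), pstdev(counts))
    return model


def is_anomalous(model, window, k=3.0):
    """Flag a window if any event count is more than k std. devs. from its mean.

    The std. dev. is floored at 1.0 so that events with constant counts
    in training do not trigger on a single extra occurrence.
    """
    counts = Counter(window)
    for e in set(model) | set(counts):
        mu, sigma = model.get(e, (0.0, 0.0))
        if abs(counts[e] - mu) > k * max(sigma, 1.0):
            return True
    return False
```

Here `k` plays the role of a boundary parameter: raising it makes the model more tolerant, lowering it flags more windows as bad.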
![Page 20: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/20.jpg)
Distributed Apps
![Page 21: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/21.jpg)
PlanetLab
“An open platform for developing, deploying, and accessing planetary-scale services”
392 nodes at 164 sites around the world
Per-site system administration
Applications: OceanStore, PIER
![Page 22: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/22.jpg)
Why?
Platform for injecting faults and testing our algorithms
Applications on RADS-like environment
Research platform
  More accessible
  University-developed apps most likely to be tested on PlanetLab
![Page 23: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/23.jpg)
Applications
1) OceanStore
  Global persistent data store
  In the process of running prototype on PlanetLab
  Good source of failure data
2) PIER
  Distributed query processor
  Currently running on PlanetLab
  Good source of failure data + analysis engine
![Page 24: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/24.jpg)
What do we do with these apps?
Instrument applications to collect any type of information
  Choice of granularity
Open source - no longer black box
  Can modify it as much as necessary
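One way to picture instrumentation with a choice of granularity is a decorator that records each call at a configurable level of detail. The Python sketch below is purely illustrative: OceanStore and PIER are not instrumented this way as far as the slides say, and the names `instrumented`, `EVENTS`, and `lookup` are hypothetical.

```python
import functools
import time

EVENTS = []  # in-memory sink; a real deployment would ship records elsewhere


def instrumented(granularity="coarse"):
    """Record each call: 'coarse' logs name and outcome, 'fine' adds args and timing."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            record = {"call": func.__name__, "ok": True}
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                record["ok"] = False
                record["error"] = type(exc).__name__
                raise
            finally:
                if granularity == "fine":
                    record["args"] = repr(args)
                    record["seconds"] = time.perf_counter() - start
                EVENTS.append(record)
        return wrapper
    return decorate


@instrumented(granularity="fine")
def lookup(key):
    """Hypothetical application operation that fails on a missing key."""
    return {"a": 1}[key]
```

Failed calls are recorded with the exception type and then re-raised, so the failure data is collected without changing application behavior.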
![Page 25: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu](https://reader035.vdocuments.mx/reader035/viewer/2022062516/56649da15503460f94a8d83d/html5/thumbnails/25.jpg)
Questions
What other applications can we use?
What should we measure and model?
What information is useful for industry?
Do you have any failure/attack data you are willing to share with us?