
Page 1: Root Cause Analysis of Failures in Large-Scale Computing Environments

Root Cause Analysis of Failures in Large-Scale Computing Environments

Alex Mirgorodskiy, University of Wisconsin
Naoya Maruyama, Tokyo Institute of Technology
Barton P. Miller, University of Wisconsin

http://www.paradyn.org/

Page 2: Root Cause Analysis of Failures in Large-Scale Computing Environments

Motivation

• Systems are complex and non-transparent
  – Many components, different vendors

• Anomalies are common
  – Intermittent
  – Environment-specific

• Users have little debugging expertise

Finding the causes of bugs and performance problems in production systems is hard

Page 3: Root Cause Analysis of Failures in Large-Scale Computing Environments

Vision

Autonomous, detailed, low-overhead analysis:

• User specifies a perceived problem cause
• The agent finds the actual cause

[Figure: an agent inside process P on host A; processes P, Q, and R run on hosts A and B, connected by a network]

Page 4: Root Cause Analysis of Failures in Large-Scale Computing Environments

Applications

• Diagnostics of E-commerce systems
  – Trace the path each request takes through a system
  – Identify unusual paths
  – Find out why they are different from the norm

• Diagnostics of Cluster and Grid systems
  – Monitor behavior of different nodes in the system
  – Identify nodes with unusual behavior
  – Find out why they are different from the norm
  – Example: found problems in SCore middleware

• Diagnostics of Real-time and Interactive systems
  – Trace words through the phone network
  – Find out why some words were dropped

Page 5: Root Cause Analysis of Failures in Large-Scale Computing Environments

Key Components

• Data collection: self-propelled instrumentation
  – Works for a single process
  – Can cross the user-kernel boundary
  – Can be deployed on multiple nodes at the same time
  – Ongoing work: crossing process and host boundaries

• Data analysis: use repetitiveness to find anomalies
  – Repetitive execution of the same high-level action, OR
  – Repetitiveness among identical processes (e.g., cluster management tools, parallel codes, Web server farms)

Page 6: Root Cause Analysis of Failures in Large-Scale Computing Environments

Focus on Control Flow Anomalies

• Unusual statements executed
  – Corner cases are more likely to have bugs

• Statements executed in unusual order
  – Race conditions

• Function taking unusually long to complete
  – Sporadic performance problems
  – Deadlocks, livelocks

Page 7: Root Cause Analysis of Failures in Large-Scale Computing Environments

Current Framework

1. Traces control flow of all processes
   • Begins at process startup
   • Stops upon a failure or performance degradation

2. Identifies anomalies: unusual traces
   • Problems on a small number of nodes
   • Both fail-stop and not

3. Identifies the causes of the anomalies
   • Function responsible for the problem

[Figure: control-flow traces collected from processes P1 through P4]

Page 8: Root Cause Analysis of Failures in Large-Scale Computing Environments

[Figure: self-propelled instrumentation inside an application (a.out, functions foo and bar) and the OS kernel. The agent (instrumenter.so in user space, /dev/instrumenter in the kernel) rewrites call sites with jumps into patches; each patch instruments the callee, performs the original call, and jumps back to the original code. Steps: Inject, Activate, Propagate; Analyze: build call graph/CFG with Dyninst]

Page 9: Root Cause Analysis of Failures in Large-Scale Computing Environments

Data Collection: Trace Management

[Figure: the tracer inside process P records call and return events for functions such as foo]

• The trace is kept in a fixed-size circular buffer (sketched below)
  – New entries overwrite the oldest entries
  – Retains the most recent events leading to the problem

• The buffer is located in a shared memory segment
  – Does not disappear if the process crashes
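A minimal sketch of this buffer, assuming a Python stand-in for the native tracer: a fixed-size ring of trace records in a named shared-memory segment that an analysis process can read even after the traced process dies. The record layout and names here are illustrative, not the agent's actual format.

```python
# Illustrative sketch only: a fixed-size circular trace buffer in shared memory.
# The real agent is native code; names, sizes, and record layout here are made up.
import struct
from multiprocessing import shared_memory

RECORD_FMT = "<cIQ"                  # event (b"C"all / b"R"et), function id, timestamp
RECORD_SIZE = struct.calcsize(RECORD_FMT)
NUM_RECORDS = 4096                   # ring capacity
HEADER_FMT = "<Q"                    # total number of records ever written
HEADER_SIZE = struct.calcsize(HEADER_FMT)

class TraceRing:
    def __init__(self, name="trace_ring", create=True):
        size = HEADER_SIZE + NUM_RECORDS * RECORD_SIZE
        self.shm = shared_memory.SharedMemory(name=name, create=create, size=size)

    def append(self, event, func_id, timestamp):
        (count,) = struct.unpack_from(HEADER_FMT, self.shm.buf, 0)
        slot = count % NUM_RECORDS   # oldest entry is overwritten once the ring is full
        off = HEADER_SIZE + slot * RECORD_SIZE
        struct.pack_into(RECORD_FMT, self.shm.buf, off, event, func_id, timestamp)
        struct.pack_into(HEADER_FMT, self.shm.buf, 0, count + 1)

    def dump(self):
        """Return the retained records, oldest first (e.g., after a crash)."""
        (count,) = struct.unpack_from(HEADER_FMT, self.shm.buf, 0)
        n = min(count, NUM_RECORDS)
        out = []
        for i in range(count - n, count):
            off = HEADER_SIZE + (i % NUM_RECORDS) * RECORD_SIZE
            out.append(struct.unpack_from(RECORD_FMT, self.shm.buf, off))
        return out
```

In this sketch the tracer would call ring.append(b"C", func_id, ts) on entry and ring.append(b"R", func_id, ts) on return; an analysis process attaches to the same segment with create=False and calls dump() even if the traced process has already crashed.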

Page 10: Root Cause Analysis of Failures in Large-Scale Computing Environments

Data Analysis: Find Anomalous Host

• Check if the anomaly was fail-stop or not (see the sketch below):
• One of the traces ends substantially earlier than the others -> Fail-stop
  – The corresponding host is an anomaly
• Traces end at similar times -> Non-fail-stop
  – Look at differences in behavior across traces

[Figure: traces of P1 through P4 plotted against trace end time]
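A small sketch of this check, assuming each trace is reduced to the timestamp of its last entry; the "substantially earlier" threshold is an illustrative parameter, not one taken from the slides.

```python
# Illustrative sketch: classify a run as fail-stop vs. non-fail-stop from
# per-host trace end times. The gap_factor threshold below is made up.
def find_fail_stop_host(end_times, gap_factor=3.0):
    """end_times: dict host -> timestamp of the last trace entry.
    Returns the host whose trace ends substantially earlier than the rest,
    or None if all traces end at similar times (non-fail-stop)."""
    hosts = sorted(end_times, key=end_times.get)       # earliest-ending first
    earliest, rest = hosts[0], hosts[1:]
    rest_ends = [end_times[h] for h in rest]
    spread = (max(rest_ends) - min(rest_ends)) or 1.0  # natural variation among the rest
    gap = min(rest_ends) - end_times[earliest]         # how much earlier the earliest ends
    return earliest if gap > gap_factor * spread else None

# Example: one host stops recording long before the rest of the cluster.
ends = {"n128": 1000.0, "n129": 400.0, "n130": 1002.0, "n131": 999.0}
print(find_fail_stop_host(ends))                       # -> n129
```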

Page 11: Root Cause Analysis of Failures in Large-Scale Computing Environments

Data Analysis: Non-fail-stop Host

Find outliers (traces different from the rest):

• Define a distance metric between two traces
  – d(g,h) = measure of similarity of traces g and h
• Define a trace suspect score
  – σ(h) = similarity of h to the common behavior
• Report traces with high suspect scores
  – Most distant from the common behavior

Page 12: Root Cause Analysis of Failures in Large-Scale Computing Environments

Defining the Distance Metric

• Compute the time profile for each host h:
  – p(h) = (t_1, …, t_F)
  – t_i = normalized time spent in function f_i on host h
  – Profiles are less sensitive to noise than raw traces

• Delta vector of two profiles: δ(g,h) = p(g) – p(h)

• Distance metric: d(g,h) = Manhattan norm of δ(g,h) (sketched below)

[Figure: profiles p(g) and p(h) in the (t(foo), t(bar)) plane, with the delta vector δ(g,h) between them]
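A short sketch of these definitions, assuming profiles are represented as plain dictionaries mapping function name to normalized time share (the slides do not fix a concrete profile format).

```python
# Illustrative sketch: time profiles, delta vector, and Manhattan distance.
def normalize(profile):
    """profile: dict function -> time spent. Returns times normalized to sum to 1."""
    total = sum(profile.values()) or 1.0
    return {f: t / total for f, t in profile.items()}

def delta(p_g, p_h):
    """Delta vector δ(g,h) = p(g) - p(h), one component per function."""
    funcs = set(p_g) | set(p_h)
    return {f: p_g.get(f, 0.0) - p_h.get(f, 0.0) for f in funcs}

def distance(p_g, p_h):
    """d(g,h) = Manhattan norm of δ(g,h)."""
    return sum(abs(v) for v in delta(p_g, p_h).values())

# Example: two hosts that split their time between foo and bar differently.
p_g = normalize({"foo": 3.0, "bar": 1.0})   # 75% of the time in foo
p_h = normalize({"foo": 1.0, "bar": 3.0})   # 75% of the time in bar
print(distance(p_g, p_h))                    # 1.0
```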

Page 13: Root Cause Analysis of Failures in Large-Scale Computing Environments

Defining the Suspect Score

• Common behavior = normal
• Suspect score: σ(h) = distance to the nearest neighbor
  – Report the host with the highest σ to the analyst
  – h is in the big mass, σ(h) is low, h is normal
  – g is a single outlier, σ(g) is high, g is an anomaly

• What if there is more than one anomaly?

[Figure: h lies in a cluster of hosts, so σ(h) is small; g is a lone outlier, so σ(g) is large]

Page 14: Root Cause Analysis of Failures in Large-Scale Computing Environments

Defining the Suspect Score

• Suspect score: σk(h) = distance to the kth neighbor (sketched below)
  – Exclude the (k-1) closest neighbors
  – Sensitivity study: k = NumHosts/4 works well

• Represents distance to the “big mass”:
  – h is in the big mass, its kth neighbor is close, σk(h) is low
  – g is an outlier, its kth neighbor is far, σk(g) is high

[Figure: computing the score using k = 2; h sits in the big mass while the outlier g has a large σk(g)]
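A sketch of σk over the same assumed dict-of-profiles representation; the default k = NumHosts/4 follows the sensitivity study mentioned above, and the distance is the Manhattan metric from the previous slide.

```python
# Illustrative sketch: suspect score σ_k(h) = distance to the k-th nearest neighbor.
def manhattan(p, q):
    """Manhattan distance between two normalized time profiles (dicts)."""
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in set(p) | set(q))

def suspect_scores(profiles, k=None):
    """profiles: dict host -> normalized time profile (dict function -> time share).
    Returns dict host -> σ_k(host); the analyst examines the highest scores first."""
    hosts = list(profiles)
    if k is None:
        k = max(1, len(hosts) // 4)                # k = NumHosts/4, per the sensitivity study
    scores = {}
    for h in hosts:
        dists = sorted(manhattan(profiles[h], profiles[g]) for g in hosts if g != h)
        scores[h] = dists[k - 1]                   # skip the (k-1) closest neighbors
    return scores

# Example: three hosts behave alike, one spends most of its time elsewhere.
profiles = {
    "n128": {"foo": 0.75, "bar": 0.25},
    "n130": {"foo": 0.75, "bar": 0.25},
    "n131": {"foo": 0.70, "bar": 0.30},
    "n129": {"foo": 0.10, "bar": 0.90},
}
scores = suspect_scores(profiles)                  # k defaults to 4 // 4 = 1 here
print(max(scores, key=scores.get))                 # -> n129
```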

Page 15: Root Cause Analysis of Failures in Large-Scale Computing Environments

Defining the Suspect Score

• Anomalous means unusual, but unusual does not always mean anomalous!
  – E.g., the MPI master is different from all workers
  – Would be reported as an anomaly (false positive)

• Distinguish false positives from true anomalies:
  – With knowledge of system internals – manual effort
  – With previous execution history – can be automated


Page 16: Root Cause Analysis of Failures in Large-Scale Computing Environments

Defining the Suspect Score

• Add traces from a known-normal previous run
  – One-class classification

• Suspect score: σk(h) = distance to the kth trial neighbor or to the 1st known-normal neighbor, whichever is closer (sketched below)

• Distance to the big mass or to known-normal behavior:
  – h is in the big mass, its kth neighbor is close, σk(h) is low
  – g is an outlier, but normal node n is close, so σk(g) is low

[Figure: trial hosts g and h together with a known-normal node n close to g]
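Extending the previous sketch with traces from a known-normal run; taking the minimum over the two kinds of neighbors is my reading of the bullets above, and all names remain illustrative.

```python
# Illustrative sketch: suspect score with known-normal traces from a previous run
# (one-class classification). Assumes the same dict-of-profiles layout as before.
def suspect_scores_with_history(trial, normal, k=None):
    """trial: dict host -> profile from the current (trial) run.
    normal: dict host -> profile from a known-normal previous run."""
    def manhattan(p, q):
        return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in set(p) | set(q))

    hosts = list(trial)
    if k is None:
        k = max(1, len(hosts) // 4)
    scores = {}
    for h in hosts:
        kth_trial = sorted(manhattan(trial[h], trial[g])
                           for g in hosts if g != h)[k - 1]
        nearest_normal = min(manhattan(trial[h], p) for p in normal.values())
        scores[h] = min(kth_trial, nearest_normal)   # close to either -> not suspicious
    return scores
```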

Page 17: Root Cause Analysis of Failures in Large-Scale Computing Environments

Finding Anomalous Function

• Fail-stop problems
  – Failure is in the last function invoked

• Non-fail-stop problems
  – Find why host h was marked as an anomaly
  – Function with the highest contribution to σ(h) (sketched below):
    • σ(h) = |δ(h,g)|, where g is the chosen neighbor
    • anomFn = arg max_i |δ_i|
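A sketch of this step for the non-fail-stop case, again over the assumed dict-of-profiles representation: report the coordinate of the delta vector with the largest magnitude.

```python
# Illustrative sketch: pinpoint the function contributing most to σ(h).
def anomalous_function(p_h, p_g):
    """p_h: profile of the anomalous host h; p_g: profile of its chosen neighbor g.
    Returns the function with the largest |δ_i| and that contribution."""
    funcs = set(p_h) | set(p_g)
    deltas = {f: abs(p_h.get(f, 0.0) - p_g.get(f, 0.0)) for f in funcs}
    anom_fn = max(deltas, key=deltas.get)          # anomFn = arg max_i |δ_i|
    return anom_fn, deltas[anom_fn]

# Example: the anomalous host spends far more of its time in __libc_write.
p_h = {"__libc_write": 0.7, "compute": 0.2, "idle": 0.1}
p_g = {"__libc_write": 0.1, "compute": 0.6, "idle": 0.3}
print(anomalous_function(p_h, p_g)[0])             # reports __libc_write as the suspect
```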

Page 18: Root Cause Analysis of Failures in Large-Scale Computing Environments

Experimental Study: SCore

• SCore: cluster-management framework
  – Job scheduling, checkpointing, migration
  – Supports MPI, PVM, Cluster-enabled OpenMP

• Implemented as a ring of daemons, scored
  – One daemon per host for monitoring jobs
  – Daemons exchange keep-alive patrol messages
  – If no patrol message traverses the ring in 10 minutes, sc_watch kills and restarts all daemons (sketched below)

[Figure: sc_watch overseeing a ring of scored daemons that pass a patrol message around the ring]
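A rough, hypothetical sketch of the patrol/watchdog behavior described above; it is not SCore code, and apart from the 10-minute timeout every name and detail here is an assumption.

```python
# Hypothetical sketch of the ring/watchdog protocol described on this slide.
# Not SCore code: only the 10-minute timeout comes from the slide.
import time

PATROL_TIMEOUT = 10 * 60               # seconds without a completed patrol lap

class ScoredDaemon:
    """Stand-in for one scored daemon in the ring."""
    def __init__(self, host, watchdog):
        self.host = host
        self.next = None               # next daemon in the ring (set when wiring the ring)
        self.watchdog = watchdog

    def start_patrol(self):
        self.next.on_patrol(origin=self.host)

    def on_patrol(self, origin):
        # A real daemon does its monitoring work here; in the failure analyzed
        # later, one daemon blocked at this point and never forwarded the token.
        if self.host == origin:
            self.watchdog.patrol_completed()   # token made it all the way around
        else:
            self.next.on_patrol(origin)

class Watchdog:
    """Stand-in for sc_watch: restarts the daemons if the patrol stops circulating."""
    def __init__(self):
        self.last_lap = time.monotonic()

    def patrol_completed(self):
        self.last_lap = time.monotonic()

    def check(self):
        if time.monotonic() - self.last_lap > PATROL_TIMEOUT:
            print("no patrol message for 10 minutes: killing and restarting all scored daemons")
            # a real watchdog would kill and respawn the daemon processes here
```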

Page 19: Root Cause Analysis of Failures in Large-Scale Computing Environments

Debugging SCore

[Figure: the same sc_watch watchdog and ring of scored daemons exchanging patrol messages]

• Inject tracing agents into all scoreds
• Instrument sc_watch to find when the daemons are being killed
• Identify the anomalous trace
• Identify the anomalous function/call path

Page 20: Root Cause Analysis of Failures in Large-Scale Computing Environments

Finding the Host

• Host n129 is unusual – different from the others
• Host n129 is anomalous – not present in previous known-normal runs
• Host n129 is a new anomaly – not present in previous known-faulty runs

Page 21: Root Cause Analysis of Failures in Large-Scale Computing Environments

Finding the Cause

• Call chain with the highest contribution to the suspect score: (output_job_status -> score_write_short -> score_write -> __libc_write)
  – Tries to output a log message to the scbcast process

• Writes to the scbcast process kept blocking for 10 minutes
  – scbcast stopped reading data from its socket – bug!
  – scored did not handle it well (spun in an infinite loop) – bug!

Page 22: Root Cause Analysis of Failures in Large-Scale Computing Environments

Ongoing work

[Figure: processes P, Q, and R on hosts A and B communicating over the network]

• Cross process and host boundaries
  – Propagate upon communication
• Reconstruct system-wide flows
• Compare flows to identify anomalies

Page 23: Root Cause Analysis of Failures in Large-Scale Computing Environments

Ongoing work

• Propagate upon communication
  – Notice the act of communication
  – Identify the peer
  – Inject the agent into the peer
  – Trace the peer after it receives the data

• Reconstruct system-wide flows
  – Separate concurrent interleaved flows

• Compare flows
  – Identify common flows and anomalies

Page 24: Root Cause Analysis of Failures in Large-Scale Computing Environments

Conclusion

• Data collection: acquire call traces from all nodes
  – Self-propelled instrumentation: autonomous, dynamic, and low-overhead

• Data analysis: identify unusual traces and find what made them unusual
  – Fine-grained: identifies individual suspect functions
  – Highly accurate: reduces the rate of false positives using past history

• Come see the demo!

Page 25: Root Cause Analysis of Failures in Large-Scale Computing Environments

Relevant Publications

• A.V. Mirgorodskiy, N. Maruyama, and B.P. Miller, "Root Cause Analysis of Failures in Large-Scale Computing Environments", submitted for publication, ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy05Root.pdf

• A.V. Mirgorodskiy and B.P. Miller, "Autonomous Analysis of Interactive Systems with Self-Propelled Instrumentation", 12th Multimedia Computing and Networking (MMCN 2005), San Jose, CA, January 2005, ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy04SelfProp.pdf