
Black-Box Problem Diagnosis in Parallel File Systems

Michael P. Kasick¹, Jiaqi Tan², Rajeev Gandhi¹, Priya Narasimhan¹

¹Carnegie Mellon University  ²DSO National Labs, Singapore

February 24, 2010

Problem Diagnosis Goals

To diagnose problems in off-the-shelf parallel file systems:
- Environmental performance problems: disk & network faults
- Target file systems: PVFS & Lustre

To develop methods applicable to existing deployments:
- Application transparency: avoid code-level instrumentation
- Minimal overhead, training, and configuration
- Support for arbitrary workloads: avoid models, SLOs, etc.


Motivation: Real Problem Anecdotes

Problems motivated by PVFS developers’ experiences from Argonne’s Blue Gene/P PVFS cluster

“Limping-but-alive” server problems:
- No errors are reported; the faulty node can’t be identified from logs
- A single faulty server impacts overall system performance

Storage-related problems:
- Accidental launch of rogue processes decreases throughput
- A buggy RAID controller issues patrol reads when not at idle

Network-related problems:
- Faulty switch ports corrupt packets, which then fail CRC checks
- Overloaded switches drop packets but pass diagnostic tests


Outline

1 Introduction

2 Experimental Methods

3 Diagnostic Algorithm

4 Results

5 Conclusion


Target Parallel File Systems

- Aim to support I/O-intensive applications
- Provide high-bandwidth, concurrent access


Parallel File System Architecture

[Figure: clients connect over a network to I/O servers (ios0, ios1, ios2, …, iosN) and metadata servers (mds0, …, mdsM)]

- One or more I/O and metadata servers
- Clients communicate with every server
- No server-server communication


Parallel File System Data Striping

[Figure: logical file blocks 0, 1, 2, 3, 4, 5, … striped across per-server physical files]

Server 1: blocks 0, 3, 6, …
Server 2: blocks 1, 4, 7, …
Server 3: blocks 2, 5, 8, …

- Client stripes the logical file into 64 kB–1 MB chunks
- Writes go to each I/O server in round-robin order (see the sketch below)
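The placement rule is simple enough to state in code. A minimal sketch, assuming a fixed chunk size; `stripe` and its parameters are illustrative, not the actual PVFS/Lustre client logic:

```python
def stripe(data: bytes, num_servers: int, chunk_size: int = 64 * 1024):
    """Split a logical file into fixed-size chunks and assign chunk i
    to I/O server i % num_servers (round-robin placement)."""
    placement = {s: [] for s in range(num_servers)}
    for i in range(0, len(data), chunk_size):
        placement[(i // chunk_size) % num_servers].append(data[i:i + chunk_size])
    return placement  # server index -> ordered chunks of its physical file

# Example: 6 chunks over 3 servers puts chunks 0,3 on server 0; 1,4 on 1; 2,5 on 2.
files = stripe(bytes(6 * 64 * 1024), num_servers=3)
```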


Parallel File Systems: Empirical Insights (I)

Server behavior is similar for most requests:
- Large requests are striped across all servers
- Small requests, in aggregate, load all servers equally

Hypothesis: peer-similarity
- Fault-free servers exhibit similar performance metrics
- Faulty servers exhibit dissimilarities in certain metrics
- Peer-comparison of metrics identifies the faulty node


Example: Disk-Hog Fault

[Figure: Sectors Read (/s) vs. Elapsed Time (s); the faulty server’s throughput diverges sharply from the non-faulty servers’ (peer-asymmetry)]

Strongly motivates the peer-comparison approach.

Parallel File Systems: Empirical Insights (II)

Faults manifest asymmetrically only on some metrics. Example: a disk-busy fault manifests . . .
- Asymmetrically on latency metrics (↑ on the faulty server, ↓ on fault-free servers)
- Symmetrically on throughput metrics (↓ on all servers)

[Figure: I/O Wait Time (ms) vs. Elapsed Time (s); the faulty server diverges from the non-faulty servers (peer-asymmetry)]

[Figure: Sectors Read (/s) vs. Elapsed Time (s); faulty and non-faulty servers remain similar (no asymmetry)]

Faults are distinguishable by which metrics are peer-divergent.


System Model

Fault Model:
- Non-fail-stop problems
- “Limping-but-alive” performance problems
- Problems affecting storage & network resources

Assumptions:
- Hardware is homogeneous and identically configured
- Workloads are non-pathological (balanced requests)
- A majority of servers exhibit fault-free behavior


Instrumentation

Sampling of storage & network performance metrics:
- Sampled from /proc once every second
- Gathered from all server nodes

Storage-related metrics of interest:
- Throughput: bytes read/sec, bytes written/sec
- Latency: I/O wait time

Network-related metrics of interest:
- Throughput: bytes received/sec, bytes transmitted/sec
- Congestion: TCP sending congestion window (see the sketch below)
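A minimal sketch of once-per-second black-box sampling from /proc, assuming a Linux server node; the device and interface names are illustrative, and the TCP congestion window (which lives in per-connection kernel state rather than these files) is omitted here:

```python
import time

def sample_disk(dev="sda"):
    """Cumulative disk counters from /proc/diskstats. Fields (0-indexed,
    after splitting): 5 = sectors read, 9 = sectors written,
    12 = milliseconds spent doing I/O."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return {"sectors_read": int(fields[5]),
                        "sectors_written": int(fields[9]),
                        "io_ms": int(fields[12])}
    raise KeyError(dev)

def sample_net(iface="eth0"):
    """Cumulative byte counters from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return {"rx_bytes": int(fields[0]), "tx_bytes": int(fields[8])}
    raise KeyError(iface)

# Sample every second; per-second rates are differences of cumulative counters.
prev = {**sample_disk(), **sample_net()}
while True:
    time.sleep(1)
    cur = {**sample_disk(), **sample_net()}
    rates = {k: cur[k] - prev[k] for k in cur}
    print(rates)  # ship to the central analysis node in a real deployment
    prev = cur
```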


Workloads

ddw & ddr (dd write & read):
- Use dd to write/read many GB to/from a file
- Large (order-MB) I/O requests; saturating workload

iozonew & iozoner (IOzone write & read):
- Run in either write/rewrite or read/reread mode
- Large I/O requests, workload transitions, fsync

postmark (PostMark):
- Metadata-heavy; small reads/writes (single server)
- Simulates email/news servers


Fault Types

Susceptible resources:
- Storage: access contention
- Network: congestion, packet loss (faulty hardware)

Manifestation mechanism:
- Hog: introduces a new, visible workload (server-monitored)
- Busy/Loss: alters the existing workload (unmonitored)

            Storage      Network
Hog         disk-hog     write-network-hog, read-network-hog
Busy/Loss   disk-busy    receive-packet-loss, send-packet-loss



Experiment Setup

PVFS cluster configurations:
- 10 clients, 10 combined I/O & metadata servers
- 6 clients, 12 combined I/O & metadata servers

Lustre cluster configurations:
- 10 clients, 10 I/O servers, 1 metadata server
- 6 clients, 12 I/O servers, 1 metadata server

Each client runs the same workload for ≈600 s

Faults are injected on a single server for 300 s

All workload & fault combinations are run 10 times



Diagnostic Algorithm

Phase I: Node Indictment
- Histogram-based approach (for most metrics)
- Time-series-based approach (congestion window)
- Both use peer-comparison to indict the faulty node

Phase II: Root-Cause Analysis
- Ascribes the fault to a root cause based on the affected metrics


Phase I: Node Indictment (Histogram-Based)

Peer-compare metric PDFs (histograms) across servers (see the sketch below):
1. Compute the PDF of the metric for each server over a sliding window
2. Compute the Kullback-Leibler divergence for each server pair
3. Flag a pair as anomalous if its divergence exceeds a threshold
4. Flag a server if over half of its server pairs are anomalous
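A minimal sketch of this window-by-window indictment, assuming a fixed bin count and a placeholder threshold (both of which would come from training); a symmetrized KL divergence is used here so the pairwise flag does not depend on pair order:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """Kullback-Leibler divergence D(P || Q) of two normalized histograms."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def indict(samples_by_server, bins=10, threshold=0.3):
    """samples_by_server: {server: np.array of one metric's samples in the
    current sliding window}. Returns the servers flagged as faulty."""
    servers = list(samples_by_server)
    lo = min(float(s.min()) for s in samples_by_server.values())
    hi = max(float(s.max()) for s in samples_by_server.values())
    pdfs = {s: np.histogram(samples_by_server[s], bins=bins,
                            range=(lo, hi))[0].astype(float)
            for s in servers}
    anomalous_pairs = {s: 0 for s in servers}
    for i, a in enumerate(servers):
        for b in servers[i + 1:]:
            # Symmetrized divergence so the flag is independent of pair order.
            d = kl_divergence(pdfs[a], pdfs[b]) + kl_divergence(pdfs[b], pdfs[a])
            if d > threshold:
                anomalous_pairs[a] += 1
                anomalous_pairs[b] += 1
    # Indict a server if over half of its server pairs are anomalous.
    return [s for s in servers if anomalous_pairs[s] > (len(servers) - 1) / 2]
```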



Threshold Selection

Fault-free training session (stress test):
- Run ddw, ddr, & postmark under fault-free conditions
- Find the minimum threshold that eliminates all anomalies (see the sketch below)

Histogram comparison uses per-server thresholds:
- Captures the performance profile of each server
- Important to train on each cluster & file system

Train on performance-stressing workloads only:
- Metrics deviate most when servers are saturated
- Less intense workloads exhibit more tightly coupled behavior
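A minimal sketch of that selection rule under stated assumptions: `training_divergences` (hypothetical data) holds the pairwise divergences each server exhibited across all windows of the fault-free stress test, and the per-server threshold is simply the largest divergence seen, padded by a small safety margin:

```python
def train_threshold(fault_free_divergences, margin=0.05):
    """Smallest threshold that would flag no anomalies on the fault-free
    run: the maximum observed divergence, plus a safety margin."""
    return max(fault_free_divergences) * (1 + margin)

# Hypothetical divergences from a fault-free ddw/ddr/postmark stress test.
training_divergences = {
    "ios0": [0.02, 0.11, 0.07],
    "ios1": [0.04, 0.09, 0.12],
    "ios2": [0.03, 0.08, 0.10],
}
thresholds = {srv: train_threshold(d) for srv, d in training_divergences.items()}
print(thresholds)  # per-server thresholds capture each server's own profile
```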


Example: PVFS Throughput (Disk-Hog Fault)

[Figure: Sectors Read (/s) vs. Elapsed Time (s), PVFS + disk-hog vs. PVFS only; faulty server vs. non-faulty servers]

Throughput diverges due to the disk-hog on the faulty server.

Phase II: Root-Cause Analysis

Build a table of metrics & the faults affecting them:

Metric                Faults affecting it
Storage throughput    disk-hog
Storage latency       disk-hog, disk-busy
Network throughput    network-hog, packet-loss (ACKs only)
Network congestion    network-hog, packet-loss

Derive a checklist that maps divergent metrics to a root cause:
- Infers the resource responsible
- Determines the mechanism by which the resource faulted


Checklist for Root-Cause Analysis

Peer-divergence in . . . (checked in order; see the sketch below)

1. Storage throughput?  Yes: disk-hog fault.  No: next question.
2. Storage latency?  Yes: disk-busy fault.  No: next question.
3. Network throughput?*  Yes: network-hog fault.  No: next question.
4. Network congestion?  Yes: packet-loss fault.  No: no fault discovered.

*Must diverge in both receive & transmit, or in the absence of congestion.
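The checklist is essentially a priority-ordered decision procedure over the set of peer-divergent metrics. A minimal sketch with hypothetical metric names; the two-direction rule for network throughput follows the footnote above:

```python
def root_cause(divergent):
    """Map the set of peer-divergent metrics to a root cause, checking
    the metrics in checklist order. `divergent` is a set of names such
    as {"storage_latency"} (names here are illustrative)."""
    net_tp = divergent & {"net_throughput_rx", "net_throughput_tx"}
    if "storage_throughput" in divergent:
        return "disk-hog fault"
    if "storage_latency" in divergent:
        return "disk-busy fault"
    # Throughput implicates a network-hog only if it diverges in both
    # directions, or diverges without accompanying congestion (footnote).
    if len(net_tp) == 2 or (net_tp and "network_congestion" not in divergent):
        return "network-hog fault"
    if "network_congestion" in divergent:
        return "packet-loss fault"
    return "no fault discovered"

# Example: asymmetric I/O wait time with symmetric throughput metrics.
print(root_cause({"storage_latency"}))                          # disk-busy fault
print(root_cause({"net_throughput_tx", "network_congestion"}))  # packet-loss fault
```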



Results: Single Cluster

[Bar chart: Accuracy Rate (%) per fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted & diagnosed true-positive and false-positive rates]

Results: Aggregate

[Bar chart: Accuracy Rate (%) per cluster (PVFS 10/10, PVFS 6/12, Lustre 10/10, Lustre 6/12), showing indicted & diagnosed true-positive and false-positive rates]

Results Summary

True positives are inconsistent across faults:
- Some faults are not observable for all workloads
- Minimal performance effect where not observable

True & false positives are inconsistent across clusters:
- The algorithm is sensitive to imprecise thresholds
- Mitigation: rank metrics by their degree of dissimilarity

The strategy is promising in general.

Instrumentation overhead:
- < 1% increase in workload runtime; negligible



Future Work

Analysis: Improve diagnosis accuracy rates
- Make analysis robust to imprecise thresholds

Real-world data: Deploy on a production system
- Validate the technique on real workloads, at scale

Coverage: Expand the target problem class
- Other sources of performance & non-performance faults

Instrumentation: Expand instrumentation
- Additional black-box metrics, request sniffing & tracing


Summary

Problem diagnosis in parallel file systems:
- Illustrates the use of OS-level metrics in diagnosis
- Leverages peer-comparison to identify faulty nodes
- Demonstrates root-cause analysis by the metrics affected

The diagnosis method is applicable to existing deployments:
- Instrumentation is minimally invasive, with low overhead
- Fault-free training with stress tests


Peer-Comparison Scalability

Number of comparisons: n(n−1)/2 ⟹ O(n²)

Insight: each node need not be compared against all others

Proposed solution:
- Establish n−k partitions of k servers each
- Perform peer-comparisons among the servers within each partition
- Repartition with a different grouping for each window (see the sketch below)

Solution comparisons: (n−k) · k(k−1)/2 ⟹ O(n) for fixed k
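A minimal sketch of one plausible repartitioning scheme, assuming disjoint random groups reshuffled every window; the slide’s (n−k) · k(k−1)/2 count presumes its own partition construction, so the grouping below is illustrative rather than the authors’ exact scheme:

```python
import random

def partitions(servers, k, window):
    """Shuffle servers with a per-window seed and split them into groups
    of k; peer-comparison then runs only within each group, so the work
    per window is linear in the number of servers for fixed k."""
    rng = random.Random(window)  # new grouping every window
    shuffled = list(servers)
    rng.shuffle(shuffled)
    return [shuffled[i:i + k] for i in range(0, len(shuffled), k)]

servers = [f"ios{i}" for i in range(12)]
for window in range(3):
    for group in partitions(servers, k=4, window=window):
        ...  # run the histogram-based indictment within `group`
```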


Congestion Window Problem

No closely coupled peer behavior:
- cwnd is intentionally noisy under normal conditions
- Synchronized connections can’t fully use link capacity
- Can’t compare histograms; too much variance

Congestion-window packet-loss heuristic (see the sketch below):
- TCP responds to packet loss by halving cwnd
- Exponential decay after multiple loss events
- Log scale: each loss results in a linear cwnd decrease
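A minimal sketch of the log-scale observation, with illustrative numbers: halving is multiplicative, so on a log scale each loss event subtracts a constant (one unit in log₂ space), and a persistently lossy server shows a sustained linear decline in log(cwnd) that its peers lack:

```python
import math

def log_cwnd_trend(cwnd_series):
    """Fit a least-squares line to log2(cwnd) over time; a strongly
    negative slope (~ -1 per halving) suggests persistent packet loss."""
    ys = [math.log2(c) for c in cwnd_series]
    n = len(ys)
    xbar, ybar = (n - 1) / 2, sum(ys) / n
    return (sum((i - xbar) * (y - ybar) for i, y in enumerate(ys))
            / sum((i - xbar) ** 2 for i in range(n)))

# cwnd halved by each of three loss events: 64 -> 32 -> 16 -> 8 segments.
print(log_cwnd_trend([64, 32, 16, 8]))  # -> -1.0 (one halving per sample)
```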


Time Series Comparison Example

[Figures: TCP congestion window (Segments, log scale) vs. Elapsed Time (s), compared across servers]

Heterogeneous Hardware (ddr)

[Figure: I/O Wait Time (ms) vs. Elapsed Time (s) for each server during ddr]

Disks of the same model have different performance profiles.

Load Imbalances (postmark)

[Figure: Bytes Received (B/s) vs. Elapsed Time (s) for each server during postmark]

“/” resides on one metadata server, so all path lookups go there.

Cross-Resource Influence (ddr)

[Figure: congestion window (Segments) vs. Elapsed Time (s); faulty server vs. non-faulty servers during ddr]

The disk-busy fault also affects the server’s cwnd: unintentional synchronization.

Delayed ACKs (ddw)

[Figure: Bytes Transmitted (B/s) vs. Elapsed Time (s); faulty server vs. non-faulty servers during ddw]

A packet-loss fault may also cause network throughput to deviate.

Results: PVFS 10/10 Cluster

[Bar chart: Accuracy Rate (%) per fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted & diagnosed true-positive and false-positive rates]

Results: PVFS 6/12 Cluster

[Bar chart: Accuracy Rate (%) per fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted & diagnosed true-positive and false-positive rates]

Results: Lustre 10/10 Cluster

[Bar chart: Accuracy Rate (%) per fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted & diagnosed true-positive and false-positive rates]

Results: Lustre 6/12 Cluster

[Bar chart: Accuracy Rate (%) per fault (control, diskhog, diskbusy, wnethog, rnethog, recvloss, sendloss), showing indicted & diagnosed true-positive and false-positive rates]

