Stack Trace Analysis for Large Scale Debugging using MRNet


Page 1: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Stack Trace Analysis for Large Scale Debugging using MRNet

Dorian C. Arnold, Barton P. Miller

University of Wisconsin

Dong Ahn, Bronis R. de Supinski, Gregory L. Lee, Martin Schulz

Lawrence Livermore National Laboratory

UCRL-PRES-230290

Page 2: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Stack Trace Analysis for Large Scale Debugging using MRNet, Paradyn Week, 4/30-5/1 2007

Scaling Tools

- Machine sizes are increasing
  - New cluster close to or above 10,000 cores
  - Blue Gene/L: over 131,000 cores
- Not only applications need to scale
  - Support environment
  - Tools
- Challenges
  - Data collection, storage, and analysis
  - Scalable process management and control
  - Visualization

Page 3: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Debugging on BlueGene/L

- Typical debug session includes many interactions

TotalView on BG/L – 4096 Processes

| Operation | Latency |
| --- | --- |
| Single step | ~15-20 secs. |
| Breakpoint insertion | ~30 secs. |
| Stack trace sampling | ~120 secs. |

4096 is only 3% of BG/L!

Page 4: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Scalability Limitations

- Large volumes of debug data
- Single frontend for all node connections
- Centralized data analysis
- Vendor licensing limitations
- Approach: scalable, lightweight debugger
  - Reduce exploration space to small subset
  - Online aggregation using a TBŌN
  - Full-featured debugger for deeper digging

Page 5: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Outline

- Case study: CCSM
- STAT Approach
  - Concept of Stack Traces
  - Identification of Equivalence Classes
- Implementation
  - Using Tree-based Overlay Networks
  - Data and Work Flow in STAT
- Evaluation
- Conclusions

Page 6: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Case Study: CCSM

- Community Climate System Model (CCSM)
  - Used to make climate predictions
  - Coupled models for atmosphere, ocean, sea ice and land surface
- Implementation
  - Multiple Program Multiple Data (MPMD) model
  - MPI-based application
  - Distinct components for each model
- Typically requires significant node count
  - Models executed concurrently
  - Several hundred tasks

Page 7: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Observations

- Intermittently hangs with 472 tasks
  - Non-deterministic
  - Only at large scale
  - Appears at seemingly random code locations
  - Hard to reproduce:
    - 2 hangs over next 10 days (~50 runs)
- Current approach:
  - Attach to job using TotalView
  - Collect stack traces from all 472 tasks
  - Visualize cross-node callgraph

Page 8: Stack Trace Analysis  for Large Scale Debugging  using MRNet

CCSM Callgraph

Page 9: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Lessons Learned

- Some bugs only occur at large scales
- Non-deterministic & hard to reproduce
- Stack traces can provide useful insight
- Many bugs are temporal in nature
- Need tools that:
  - Combine spatial and temporal observations
  - Discover application behavior
  - Run effectively at scale

Page 10: Stack Trace Analysis  for Large Scale Debugging  using MRNet

STAT Approach

- Sample application stack traces
  - Across time and space
  - Through third party interface
  - Using a DynInst based daemon
- Merge/analyze traces:
  - Discover equivalent process behavior
  - Group similar processes
  - Facilitate scalable analysis/data presentation
- Leverage TBŌN model (MRNet)
  - Communicate traces back to a frontend
  - Merge on the fly within MRNet filters
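The idea of grouping processes with equivalent behavior can be illustrated with a minimal Python sketch. It is not STAT's implementation: the function name `equivalence_classes`, the frame names, and the rank numbers are all hypothetical, and the sketch assumes two processes are "equivalent" when their sampled call paths are identical.

```python
from collections import defaultdict

def equivalence_classes(traces):
    """Group process ranks whose sampled stack traces are identical.

    `traces` maps a rank (int) to its call path, outermost frame first.
    Returns a dict: call path (tuple) -> sorted list of ranks.
    """
    classes = defaultdict(list)
    for rank, frames in sorted(traces.items()):
        classes[tuple(frames)].append(rank)
    return dict(classes)

# Hypothetical sample; frame names are illustrative only.
sample = {
    0: ["main", "solve", "MPI_Waitall"],
    1: ["main", "solve", "MPI_Waitall"],
    2: ["main", "io_write", "MPI_Barrier"],
    3: ["main", "solve", "MPI_Waitall"],
}

for path, ranks in equivalence_classes(sample).items():
    print(" > ".join(path), "->", ranks)
```

With classes like these, a full-featured debugger would only need to be attached to one representative rank per class rather than to every process.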

Page 11: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Singleton Stack Trace

[Figure: stack trace of a single application (Appl.) process]

Page 12: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Merging Stack Traces

- Multiple traces over space or time
  - Taken independently
  - Stored in graph representation
- Create call graph prefix tree
  - Only merge nodes with identical stack backtrace
  - Retains context information
- Advantages
  - Compressed representation
  - Scalable visualization
  - Scalable analysis
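A call graph prefix tree of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' data structure: the `Node` class, the helper names, the `/` sentinel root, and the sample traces are all hypothetical. Two frames share a node only when the entire path leading to them is identical, which is how calling context is retained.

```python
class Node:
    """One frame in the call graph prefix tree."""
    def __init__(self, name):
        self.name = name
        self.ranks = set()    # processes whose trace passes through this node
        self.children = {}    # callee name -> Node

def add_trace(root, rank, frames):
    """Merge one trace (outermost frame first) into the tree."""
    node = root
    node.ranks.add(rank)
    for name in frames:
        if name not in node.children:
            node.children[name] = Node(name)
        node = node.children[name]
        node.ranks.add(rank)

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())

def show(node, depth=0):
    print("  " * depth + f"{node.name} {sorted(node.ranks)}")
    for child in node.children.values():
        show(child, depth + 1)

# Hypothetical traces; ranks 2 and 3 both end in MPI_Barrier but reach it
# through different callers, so they stay in separate nodes.
traces = {
    0: ["main", "solve", "MPI_Waitall"],
    1: ["main", "solve", "MPI_Waitall"],
    2: ["main", "io_write", "MPI_Barrier"],
    3: ["main", "solve", "MPI_Barrier"],
}

root = Node("/")
for rank, frames in traces.items():
    add_trace(root, rank, frames)
show(root)
```

The four traces hold twelve frames in total, while the merged tree needs seven nodes including the sentinel root, which is the compression the slide refers to.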

Page 13: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Merging Stack Traces

Page 14: Stack Trace Analysis  for Large Scale Debugging  using MRNet

2D-Trace/Space Analysis

[Figure: stack traces from multiple application (Appl) processes]

Page 15: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Prefix Tree vs. DAG

[Figure: side-by-side comparison labeled STAT and TotalView]

Page 16: Stack Trace Analysis  for Large Scale Debugging  using MRNet

2D-Trace/Time Analysis

[Figure: stack traces of one application (Appl) process over time]

Page 17: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Time & Space Analysis

- Both 2D techniques insufficient
  - Spatial aggregation misses temporal component
  - Temporal aggregation misses parallel aspects
- Multiple samples, multiple processes
  - Track global program behavior over time
  - Merge into single, 3D prefix tree
- Challenges:
  - Scalable data representation
  - Scalable analysis
  - Scalable and useful visualization/results
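One way to picture the combined time and space merge is a prefix tree in which every node records, per process, how many samples passed through it. The Python sketch below assumes that representation; `merge_3d`, the sample tuples, and the frame names are hypothetical and are not taken from STAT.

```python
from collections import Counter

def merge_3d(samples):
    """Merge (time_step, rank, frames) samples into one prefix tree.

    The tree is a dict: call path prefix (tuple) -> Counter mapping
    rank -> number of samples in which that rank was on this path.
    """
    tree = {}
    for _step, rank, frames in samples:
        for depth in range(1, len(frames) + 1):
            prefix = tuple(frames[:depth])
            tree.setdefault(prefix, Counter())[rank] += 1
    return tree

# Hypothetical data: rank 0 sits in MPI_Waitall for all three snapshots,
# rank 1 computes once and then also waits.
samples = [
    (0, 0, ["main", "solve", "MPI_Waitall"]),
    (0, 1, ["main", "solve", "compute"]),
    (1, 0, ["main", "solve", "MPI_Waitall"]),
    (1, 1, ["main", "solve", "MPI_Waitall"]),
    (2, 0, ["main", "solve", "MPI_Waitall"]),
    (2, 1, ["main", "solve", "MPI_Waitall"]),
]

merged = merge_3d(samples)
for path in sorted(merged):
    print(" > ".join(path), dict(merged[path]))
```

The spatial view (which ranks reach a node) and the temporal view (how often each rank was seen there) are then both readable from the same structure.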

Page 18: Stack Trace Analysis  for Large Scale Debugging  using MRNet

3D-Trace/Space/Time Analysis

[Figure: merged traces from multiple application (Appl) processes, 4 Nodes / 10 Snapshots]

Page 19: Stack Trace Analysis  for Large Scale Debugging  using MRNet

3D-Trace/Space/Time Analysis

[Figure: 288 Nodes / 10 Snapshots]

Page 20: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Implementation Details

- Communication through MRNet
  - Single data stream from BE to FE
  - Filters implement tree merge
  - Tree depth can be configured
- Three major components
  - Backend (BE) daemons gathering traces
  - Communication processes merging prefix trees
  - Frontend (FE) tool storing the final graph
- Final result saved as GML or DOT file
  - Node classes color coded
  - External visualization tools
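Writing a merged prefix tree as a DOT file can be sketched as plain string generation in Python. This is an assumption-laden illustration, not STAT's exporter: the function `to_dot`, the path-keyed tree layout, the graph name `stat`, and the colour palette are hypothetical. Nodes that share the same set of ranks (the same process class) receive the same fill colour.

```python
def to_dot(tree):
    """Render a prefix tree as Graphviz DOT text.

    `tree` maps a call path (tuple of frame names) to the set of ranks on
    that path. Every prefix of every path must itself be a key.
    """
    palette = ["lightblue", "lightpink", "palegreen", "lightyellow"]
    color_of = {}  # frozenset of ranks -> fill colour for that class
    ids = {path: f"n{i}" for i, path in enumerate(sorted(tree))}
    lines = ["digraph stat {"]
    for path in sorted(tree):
        ranks = frozenset(tree[path])
        if ranks not in color_of:
            color_of[ranks] = palette[len(color_of) % len(palette)]
        label = f"{path[-1]}\\n{sorted(ranks)}"
        lines.append(
            f'  {ids[path]} [label="{label}", style=filled, '
            f'fillcolor="{color_of[ranks]}"];'
        )
        if len(path) > 1:
            lines.append(f"  {ids[path[:-1]]} -> {ids[path]};")
    lines.append("}")
    return "\n".join(lines)

# Hypothetical merged tree.
example_tree = {
    ("main",): {0, 1, 2},
    ("main", "solve"): {0, 1},
    ("main", "solve", "MPI_Waitall"): {0, 1},
    ("main", "io_write"): {2},
}

dot_text = to_dot(example_tree)
print(dot_text)
```

The resulting text can be saved to a `.dot` file and opened with any external tool that reads the DOT language.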

Page 21: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Work and Data Flow

[Figure: MRNet tree topology. Labels: STAT Frontend (FE) at the root; MRNet Communication Processes (CP) with a Tree Merge filter; STAT Tool Daemons (BE) on Node 1, Node 2, ..., Node N-1, Node N, each with MPI application processes; request label trace( count, freq. ).]
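The level-by-level merge performed on the way from the backends to the frontend can be simulated in ordinary Python. The sketch below does not use the MRNet API; `local_tree`, `merge_filter`, `reduce_through_tree`, and the `fanout` parameter are hypothetical names, and the tree is assumed to be balanced.

```python
def local_tree(rank, frames):
    """Backend step: turn one trace into a small prefix tree (path -> ranks)."""
    return {tuple(frames[:d]): {rank} for d in range(1, len(frames) + 1)}

def merge_filter(trees):
    """Filter step: union the rank sets of identical call paths."""
    merged = {}
    for tree in trees:
        for path, ranks in tree.items():
            merged.setdefault(path, set()).update(ranks)
    return merged

def reduce_through_tree(leaf_trees, fanout):
    """Apply merge_filter level by level until one tree reaches the frontend."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = list(leaf_trees)
    while len(level) > 1:
        level = [merge_filter(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0] if level else {}

# Hypothetical job with 8 ranks: even ranks wait, odd ranks write output.
leaves = [
    local_tree(r, ["main", "solve", "MPI_Waitall"] if r % 2 == 0
               else ["main", "io_write"])
    for r in range(8)
]

deep = reduce_through_tree(leaves, fanout=2)   # three merge levels
flat = reduce_through_tree(leaves, fanout=8)   # frontend merges everything
print(deep == flat)
for path in sorted(deep):
    print(" > ".join(path), sorted(deep[path]))
```

Because the merge is a set union per call path, the result does not depend on the tree shape; the shape only changes how much merging work each process has to do.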

Page 22: Stack Trace Analysis  for Large Scale Debugging  using MRNet

STAT Performance

[Figure: latency (secs, axis 0 to 20) versus number of processes (axis 1000 to 4000) for two configurations, 1-deep Topology and 2-deep MRNet Tree. Platform: 1024x4 cluster, 1.4 GHz Itanium2, Quadrics QsNetII. Annotated data point: 3844 processors, 0.741 seconds.]
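To get a feel for why tree depth matters, the arithmetic below computes the fan-out each process would need in a balanced tree. It is only illustrative: the slides do not state the fan-out or the number of daemons used in the measurements, the helper `balanced_fanout` is hypothetical, and the sketch assumes one leaf per counted process.

```python
def balanced_fanout(num_leaves, depth):
    """Smallest fan-out f such that f**depth >= num_leaves."""
    if num_leaves < 1 or depth < 1:
        raise ValueError("num_leaves and depth must be positive")
    fanout = 1
    while fanout ** depth < num_leaves:
        fanout += 1
    return fanout

for depth in (1, 2):
    print(depth, balanced_fanout(3844, depth))
```

Under these assumptions a 1-deep tree over 3844 leaves leaves the root with 3844 direct children, while a 2-deep tree needs only 62 children per process, since 62 * 62 = 3844.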

Page 23: Stack Trace Analysis  for Large Scale Debugging  using MRNet

Conclusions

- Scaling tools poses challenges
  - Data management and process control
  - New strategies for tools needed
- STAT – Scalable Stacktrace Analysis
  - Lightweight tool to identify process classes
  - Based on merged callgraph prefix trees
  - Aggregation in Time and Space
  - Orthogonal to full featured debuggers
- Implementation based on TBŌNs
  - Scalable data collection and aggregation
  - Enables significant speedup

Page 24: Stack Trace Analysis  for Large Scale Debugging  using MRNet

More Information

- Paper published at IPDPS 2007
  - Stack Trace Analysis for Large Scale Debugging
  - D. Arnold, D.H. Ahn, B.R. de Supinski, G. Lee, B.P. Miller, and M. Schulz
- Project website & Demo tomorrow
  - http://www.paradyn.org/STAT
- TBŌN computing papers & open-source prototype, MRNet, available at
  - http://www.paradyn.org/mrnet