histograms detect imbalances variable sizes capture variance lawrence livermore national laboratory...
TRANSCRIPT
Histograms detect imbalances
Variable sizes capture variance
Lawrence LivermoreNational Laboratory
Center for Applied Scientific Computing
NC STATE UNIVERSITYDepartment of Computer Science
An Open Framework for Scalable, Reconfigurable Performance AnalysisTodd Gamblin‡, Prasun Ratn*†, Bronis R. de Supinski*, Martin Schulz*, Frank Mueller†, Robert J. Fowler‡, Daniel Reed‡
*Lawrence Livermore National Laboratory, †North Carolina State University, ‡Renaissance Computing Institute
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.This work was also supported in part by NSF grants CNS-0410203, CCF-0429653 and CAREER CCR-0237570, as well as the SciDAC Performance Engineering Research Institute, DE-FC02-06ER25764.
UCRL-POST-236200
ScalaTrace: Reconfigurable, Scalable Performance
Analysis
ScalaTrace: Reconfigurable, Scalable Performance
Analysis
Size of machines is rapidly increasing(130,000+ processors)
Tools will be overwhelmed with data
Need scalable, online measurement and analysis
Size of machines is rapidly increasing(130,000+ processors)
Tools will be overwhelmed with data
Need scalable, online measurement and analysis
Future DirectionsFuture Directions
Large Scale MPI Applications Implicit Program Behavior
Interactive User Tools and Visualization Understanding of Program Behavior and Optimizations
Fe
edb
ack
, Ste
erin
g &
Glo
bal C
on
figu
ratio
nD
ata
Co
llect
ion
& A
na
lysi
s
…MPI tasks
User Interaction& Storage
ReductionNodes
Local Configuration
Data Input: from Tracer, Profiler, Child Nodes
Data Output: to Parent Node, Tool
Communication paths
OPS Attr.1 Attr.2 Attr.N…
extensible, self-describing formatAttribute examples: PRSDs, timing, …Reducer
Transformer
Annotater
Analyzer
Compressor PRSDs Timing
Loop Detection
Annotator
MPI PRSD Time Loop
MPI PRSD Time
MPI PRSD Time
MPI PRSD Time
LoopInformation
Time
PRSD
Types of tool components
LOCAL Node INSTANTIATION
GLOBAL REDUCTION
Trace representation using Power Regular Section Descriptors (PRSDs)
Idea: preserve time in compressed traces
• Encode time deltas instead of timestamps
• Create delta histograms automatically
• Dynamically balance histograms
•Time depends on path taken
•Distinguish histograms by path
Description of problemDescription of problem
Bins generated for synthetic input span entire range with similar sample counts
Number of histograms per record depends on number of possible call paths
Sample bimodal distribution from UMT2K collectives
... MPI _Allreduce ( ... ) ; for ( ... ) { for ( ... ) { MPI _Send ( ... ) ; MPI _Recv ( ... ) ; } MPI _Barrier ( ... ) ; } ...
Replay accuracy (NAS Benchmarks and UMT2K)
IS
LU
FT
IS
FT
Trace sizes (NAS Benchmarks and UMT2K)
MG
BT
Near-constant trace size
DT, EP, LU
Sub-linear trace sizeCG, MG, FT
Non-scalable trace sizeBT, IS, UMT2K
Accurate replay: DT, EP, FT, LU, IS, UMT2K
Replay inaccurate in MPI Time: CG, MG
Replay inaccurate in Compute Time: BT
ScalaReplay: Replay Using Histogram Timing Annotations
ScalaReplay: Replay Using Histogram Timing Annotations
Evolutionary Load-Balance Analysis
with Scalable Data Collection
Evolutionary Load-Balance Analysis
with Scalable Data CollectionIdea: Normalize measurements and
models based on application semantics.
Progress loops • Typically outer loops in SPMD codes• Indicate absolute progress towards some
domain-specific goal• Basis for comparison of load over time
Effort Loops• Variable-time loops, represent load• Data-dependent execution
Load Modeling Framework
VisualizationVisualization
Effort Region Analysis
Progress EventAnnotator
MPI PRSD
Compressor
Trace merge
Wavelet Compressio
n
MPI PRSD Progress
LocalEffort
Records
PRSDs
MPI
PRSD
MPI Tracer
Load-annotatedTrace
MPI PRSD Progress
Progress instrumentation• User marks progress loop explicitly
Effort modeled with code regions• Dynamically detected at runtime• Split MPI-op trace at collectives and wait operations• User can further divide code into phases with instrumentation
Models dislocation dynamics in crystals• Dislocations discretized as nodes & arms• Recursive spatial domain decomposition• Balancer subdivides nodes/arms along x, then y, then z
Co
mp
res
s
. . .
. . .
. . .
Scalable Compression Using Wavelet Transform
Distributed per-process effort measurements are consolidated onto fewer processes
Parallel wavelet transform
Transformed data is locally compressed by thresholding, run-length and entropy coding
Compressed representation is merged in parallel
• Load measurements are reconstructed on client• Visualization tool allows intuitive data browsing
Co
mp
res
s
Co
mp
res
s
Near-constant low-overhead MPI traces
Annotate with additional reconfigurable data, e.g.
Time using adaptive histograms (left)
Progress rates, load imbalance (right)
Compression RatioCompression Ratio
Merge TimeMerge Time
ErrorError
Process IdProgress
Process IdProgress
Process IdProgressProcess IdProgress
Process IdProgress
Process IdProgress
Checkpoint
ForceComputation
Effort for 64-node ParaDiS Run
Exact Reconstructed
Re-partition
Load Balance in ParaDiSz
x
y
Flexible framework for application-specific tools
Ability to select compression schemes for different fields, data types
Foster combination, interaction between data collection, analysis mechanisms
Adaptive, self-tuning runtime systems
Near term:
Adaptive wavelet transform topology, equivalence class detection
Full time-sequence compression