8/7/2019 On-line detection
http://slidepdf.com/reader/full/on-line-detection 1/16
On-line detection of large-scale
parallel applications structure
German Llort, Juan Gonzalez, Harald Servat, Judit Gimenez, Jesus Labarta
Barcelona Supercomputing Center
Universitat Politecnica de Catalunya
Introduction (1/2)
• Trace-based performance analysis of large parallel
applications has become a challenging task
- Traces rapidly become unmanageable due to long runs
and many processes
• Saving all traces might be infeasible due to storage limitations
• The vast amount of data degrades the responsiveness of the
analysis tools
• Irrelevant data can distort the results and hinder the
understanding of the application's performance
- Filtering irrelevant (either meaningless or repetitive)
data is a first step towards an efficient analysis
Introduction (2/2)
• This paper proposes an on-line analysis framework
• i) Automatic analysis: users only specify a trace size
• ii) Clustering technique: at runtime, a small region of the
execution that represents the overall behavior of the app is chosen
• iii) Selective collection: only performance data related to that region
is stored in the trace
Framework (1/7)
• System components interaction
- MPItrace intercepts calls and records the values
- MRNet interconnects processes in a tree-like topology, and summarizes data on its way
- CPU bursts are grouped according to their similarity in terms of duration and performance counters
- A fine-grain characterization of the app's structure
Framework (2/7)
• Data acquisition
- MPItrace gathers information from the processors whenever any of the
instrumented events occurs
• e.g., elapsed cycles, completed instructions, cache misses
• Values are stored per task in separate memory buffers, and
every new event overwrites the oldest
• Data for analysis belongs to a time region where all processes
are active simultaneously
• Data transmission
- A back-end thread per process connects to the tool's
front-end through the MRNet network
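The per-task buffering described above can be sketched as a fixed-capacity ring buffer. This is only an illustration, not MPItrace's actual data structure: the names `TaskBuffer` and `common_region` are hypothetical, and "a time region where all processes are active simultaneously" is interpreted here, as an assumption, as the interval covered by every task's buffer.

```python
from collections import deque


class TaskBuffer:
    """Hypothetical per-task buffer: once full, each new event evicts the oldest."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest item automatically when full.
        self.events = deque(maxlen=capacity)

    def record(self, timestamp, counters):
        self.events.append((timestamp, counters))


def common_region(buffers):
    """Return the (start, end) interval covered by every buffer, or None.

    Assumption: a task counts as "active" between its oldest and newest
    buffered event.
    """
    if any(len(b.events) == 0 for b in buffers):
        return None
    start = max(b.events[0][0] for b in buffers)   # latest "oldest event"
    end = min(b.events[-1][0] for b in buffers)    # earliest "newest event"
    return (start, end) if start <= end else None


a = TaskBuffer(capacity=3)
for t in (1, 2, 3, 4, 5):          # events at t=1 and t=2 get overwritten
    a.record(t, {"cycles": 100 * t})
b = TaskBuffer(capacity=3)
for t in (2, 4):
    b.record(t, {"cycles": 100 * t})
region = common_region([a, b])
```

With these inputs, task `a` only retains timestamps 3 to 5, so the region shared with task `b` is the interval from 3 to 4.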
Framework (3/7)
• All communication processes run on a separate set of
processors to lower the burden on the task processes
• Communication processes transfer the buffered performance data
when awakened by a broadcast message from the front-end
• Data analysis
- The main purpose is to detect computing regions (i.e.,
CPU bursts) with similar behavior to identify the
computation structure (i.e., the app's phases)
• Every CPU burst is defined by its duration and a set of
performance metrics at the start and end of the region
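As a rough illustration of the burst definition above, a burst can be derived from two counter samples taken at the start and end of a computing region. The `Burst` type, the `make_burst` helper, the counter names, and the derived IPC metric are all assumptions made for this sketch; the slides do not specify which metrics are used.

```python
from dataclasses import dataclass, field


@dataclass
class Burst:
    """Hypothetical CPU burst record: a duration plus per-counter deltas."""
    duration: int
    metrics: dict = field(default_factory=dict)


def make_burst(start_sample, end_sample):
    """Build a burst from (timestamp, counters) samples at region start and end."""
    t0, c0 = start_sample
    t1, c1 = end_sample
    # Each metric is the counter increase accumulated inside the region.
    deltas = {name: c1[name] - c0[name] for name in c0}
    # Derived metric (assumption): instructions per cycle, if both are present.
    if deltas.get("cycles"):
        deltas["ipc"] = deltas.get("instructions", 0) / deltas["cycles"]
    return Burst(duration=t1 - t0, metrics=deltas)


burst = make_burst(
    (100, {"cycles": 1000, "instructions": 500}),
    (350, {"cycles": 3000, "instructions": 2500}),
)
```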
Framework (4/7)
• The clustering algorithm uses these metrics to characterize
the app
- Only a small subset of the bursts is clustered to speed up the clustering process
- Apps like Gromacs, Specfem3D, NAS BT generated 50,000 bursts in 30
seconds, which can take up to 10 minutes to analyze
- The remaining bursts are classified to their closest cluster using a
nearest neighbor search
- Reduction through selection strategies (varying the sampling time or
sampling processes) dropped the analysis time to 5-10 seconds
• A numerical report with the average values and a scatter plot
are presented
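The sample-then-classify strategy above can be sketched as follows. The slides do not name the clustering algorithm, so a simple "leader" clustering is used here purely as a stand-in; `leader_cluster`, `classify_nearest`, `cluster_with_sampling`, the `radius` parameter, and the random sampling policy are all hypothetical choices for illustration.

```python
import math
import random


def leader_cluster(points, radius):
    """Stand-in clustering: join the first leader within `radius`, else start a new cluster."""
    leaders, labels = [], []
    for p in points:
        for cid, leader in enumerate(leaders):
            if math.dist(p, leader) <= radius:
                labels.append(cid)
                break
        else:
            leaders.append(p)
            labels.append(len(leaders) - 1)
    return labels


def classify_nearest(point, sample, sample_labels):
    """Label `point` with the cluster of its nearest neighbor in the clustered sample."""
    nearest = min(range(len(sample)), key=lambda i: math.dist(point, sample[i]))
    return sample_labels[nearest]


def cluster_with_sampling(bursts, sample_size, radius, seed=0):
    """Cluster only a random subset, then classify every burst by nearest neighbor."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(bursts)), min(sample_size, len(bursts))))
    sample = [bursts[i] for i in idx]
    sample_labels = leader_cluster(sample, radius)
    return [classify_nearest(b, sample, sample_labels) for b in bursts]


# Each burst is a point (duration, instructions) in arbitrary, already-scaled units.
bursts = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1),
          (10.0, 10.0), (10.1, 9.9), (9.9, 10.1)]
labels = cluster_with_sampling(bursts, sample_size=4, radius=1.0)
```

Only the sampled bursts pay the cost of the clustering step; the rest need a single nearest-neighbor lookup each, which is where the reduction in analysis time would come from under this scheme.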
Framework (5/7)
Framework (6/7)
- Tracking the app evolution
• Whenever the app produces a new volume of data of a
given size, a subsequent clustering analysis is triggered
• Once a stable region has been detected, clustering results are
transferred back to the back-end threads, and every CPU burst
is labeled with the cluster to which it belongs
- The app is considered stable when several clusterings in a row are
equivalent
- Two clusterings are considered equivalent if the matching clusters
represent at least 85% of the total computation time
• Along with the cluster distribution, all performance data
within the same time interval is flushed from the tracing
buffers in order to produce a detailed trace of that region
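The stability test above can be sketched as below. The slides give the 85% threshold but not how clusters are matched or how many consecutive clusterings are required, so both are assumptions here: a cluster is treated as matching when its centroid lies within a hypothetical `tolerance` of a centroid in the previous clustering, and `required_in_row=3` is an arbitrary choice.

```python
import math


def equivalent(prev, curr, tolerance, threshold=0.85):
    """Compare two clusterings given as lists of (centroid, computation_time).

    Equivalent when the clusters of `curr` that match some cluster of `prev`
    account for at least `threshold` of the total computation time.
    """
    total = sum(time for _, time in curr)
    if total == 0:
        return False
    matched = sum(
        time for centroid, time in curr
        if any(math.dist(centroid, pc) <= tolerance for pc, _ in prev)
    )
    return matched / total >= threshold


def is_stable(history, tolerance, required_in_row=3):
    """True when the last `required_in_row` clusterings are consecutively equivalent."""
    if len(history) < required_in_row:
        return False
    recent = history[-required_in_row:]
    return all(equivalent(a, b, tolerance) for a, b in zip(recent, recent[1:]))


c1 = [((1.0, 1.0), 60.0), ((5.0, 5.0), 40.0)]
c2 = [((1.05, 1.0), 58.0), ((5.0, 5.1), 30.0), ((9.0, 9.0), 12.0)]
c3 = [((1.0, 1.0), 50.0), ((9.0, 9.0), 50.0)]

stable = is_stable([c1, c2, c2], tolerance=0.5)
unstable = is_stable([c1, c3, c3], tolerance=0.5)
```

In the first history the matching clusters cover 88% of the time in `c2`, so the run is considered stable; in the second, half of the time in `c3` falls in a cluster with no counterpart in `c1`, so it is not.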
Framework (7/7)
Experimental Setup
- MareNostrum supercomputer
• A cluster comprising 10,240 IBM PowerPC 970MP processors
at 2.3 GHz interconnected by a Myrinet network
Gromacs (1/2)
- An engine to perform molecular dynamics simulations
and energy minimization
• 64 MPI tasks with 10 iterations
• Indication of potential load imbalance
Gromacs (2/2)
Zeus-MP
- A computational fluid dynamics code for the simulation
of astrophysical phenomena
• 256 MPI tasks with 4 iterations
SPECFEM3D
Quality of the results