
8/7/2019 On-line detection


On-line detection of large-scale parallel applications structure

German Llort, Juan Gonzalez, Harald Servat, Judit Gimenez, Jesus Labarta

Barcelona Supercomputing Center

Universitat Politècnica de Catalunya



Introduction (1/2)

• Trace-based performance analysis of large parallel applications has become a challenging task

- Traces rapidly become unmanageable due to long runs and many processes

• Saving all traces might be infeasible due to storage limitations

• The vast amount of data degrades the responsiveness of the analysis tools

• Irrelevant data can distort the results and hinder the understanding of the application's performance

- Filtering irrelevant (either meaningless or repetitive) data is a first step towards an efficient analysis



Introduction (2/2)

• This paper proposes an on-line analysis framework

• i) Automatic analysis: users only specify a trace size

• ii) Clustering technique: at runtime, a small region of the execution that represents the overall behavior of the app is chosen

• iii) Selective collection: only region-related performance data is stored in the trace



Framework (1/7)

• System components interaction

MPItrace intercepts calls and records the values

MRNet interconnects processes in a tree-like topology, and summarizes data on its way

CPU bursts are grouped according to their similarity in terms of duration and performance counters, giving a fine-grain characterization of the app's structure



Framework (2/7)

• Data acquisition

- MPItrace gathers information from the processors whenever any of the instrumented events occurs

• e.g., elapsed cycles, completed instructions, cache misses

• Values are stored per task in separate memory buffers, and every new event overwrites the oldest

• Data for analysis belongs to a time region where all processes are active simultaneously

• Data transmission

- A back-end thread per process connects to the tool's front-end through the MRNet network
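The per-task buffering described above can be sketched as a fixed-capacity buffer that drops its oldest event when full. This is only an illustration, not MPItrace's actual implementation: the names `TaskBuffer` and `common_window`, the capacity, and the counter names are hypothetical, and "a time region where all processes are active simultaneously" is read here as the interval covered by every task's buffer.

```python
from collections import deque


class TaskBuffer:
    """Fixed-capacity event buffer for one task (hypothetical name).

    Once the buffer is full, each new event overwrites the oldest one.
    """

    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)

    def record(self, timestamp, counters):
        self.events.append((timestamp, counters))

    def time_span(self):
        # Timestamps of the oldest and newest events still in the buffer.
        return self.events[0][0], self.events[-1][0]


def common_window(buffers):
    """Interval covered by every task's buffer, or None if there is none."""
    if any(not b.events for b in buffers):
        return None
    spans = [b.time_span() for b in buffers]
    start = max(s for s, _ in spans)  # latest "oldest event"
    end = min(e for _, e in spans)    # earliest "newest event"
    return (start, end) if start <= end else None


fast = TaskBuffer(capacity=3)
slow = TaskBuffer(capacity=3)
for t in range(10):                  # this task logs an event at t = 0..9
    fast.record(t, {"cycles": 100 * t})
for t in (5, 7, 9):                  # this task logs fewer events
    slow.record(t, {"cycles": 100 * t})

window = common_window([fast, slow])
```

With a capacity of 3, the first task only retains its last three events, so the window shared with the second task starts where the first task's retained data starts.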



Framework (3/7)

• All communication processes run on a separate set of processors to lower the burden on the task processes

• Communication processes transfer the performance data in the buffer when awakened by a broadcast message from the front-end

• Data analysis

- The main purpose is to detect computing regions (i.e., CPU bursts) with similar behavior to identify the computation structure (i.e., the app's phases)

• Every CPU burst is defined by its duration and a set of performance metrics at the start and end of the region
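As an illustration of how a CPU burst could be derived from counter readings at its start and end, here is a minimal sketch. The record layout, the field names, and the choice of IPC as a derived metric are assumptions made for the example; the slides only state that a burst is defined by its duration and a set of performance metrics.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Counter values read at one instrumented event (hypothetical layout)."""
    time_ns: int
    instructions: int
    cycles: int


@dataclass
class Burst:
    duration_ns: int
    instructions: int
    ipc: float  # instructions per cycle, an assumed derived metric


def make_burst(start, end):
    """Build a burst from the counter readings at its start and end."""
    cycles = end.cycles - start.cycles
    instructions = end.instructions - start.instructions
    ipc = instructions / cycles if cycles else 0.0
    return Burst(end.time_ns - start.time_ns, instructions, ipc)


burst = make_burst(
    CounterSample(time_ns=1_000, instructions=5_000, cycles=4_000),
    CounterSample(time_ns=3_000, instructions=11_000, cycles=8_000),
)
```

Each burst then becomes one point in the metric space that the clustering step works on.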



Framework (4/7)

• The clustering algorithm uses these metrics to characterize the app

- Only a small subset of the bursts is clustered to speed up the clustering process

- Apps like Gromacs, Specfem3D, NAS BT generated 50,000 bursts in 30 seconds, which can take up to 10 minutes to analyze

- The remaining bursts are classified to their closest cluster using a nearest neighbor search

- Reduction through selection strategies, varying the sampling time or sampling a subset of processes, dropped the analysis time to 5-10 seconds

• A numerical report with the average values and a scatter plot are presented
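The "cluster a subset, then classify the rest" step can be sketched as follows. The example assumes the sampled bursts have already been labeled by some clustering algorithm (the slides do not name one), and it uses a brute-force nearest neighbor search over points in an assumed two-dimensional, already normalized metric space. The function name `classify_remaining` is hypothetical.

```python
import math


def classify_remaining(labeled_sample, remaining):
    """Give each unclustered burst the label of its nearest sampled burst.

    labeled_sample: list of (point, cluster_label) pairs from the clustered subset.
    remaining:      list of points that were left out of the clustering.
    """
    labels = []
    for point in remaining:
        _, nearest_label = min(
            labeled_sample, key=lambda item: math.dist(item[0], point)
        )
        labels.append(nearest_label)
    return labels


# Points are (duration, metric) pairs, assumed to be normalized beforehand.
sample = [((1.0, 1.0), 0), ((1.2, 0.9), 0), ((5.0, 4.8), 1)]
remaining = [(0.9, 1.1), (4.7, 5.1)]
labels = classify_remaining(sample, remaining)
```

The brute-force search costs one distance computation per sampled burst for every remaining burst; a spatial index would likely be preferable at the burst counts quoted above, but that is beyond this sketch.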



Framework (5/7)



Framework (6/7)

- Tracking the app evolution

• Whenever the app produces a new volume of data of a given size, a subsequent clustering analysis is triggered

• Once a stable region has been detected, the clustering results are transferred back to the back-end threads, and every CPU burst is labeled with the cluster to which it belongs

- The app is considered stable when several clusterings in a row are equivalent

- Two clusterings are considered equivalent if the matching clusters represent at least 85% of the total computation time

• Along with the cluster distribution, all performance data within the same time interval is flushed from the tracing buffers in order to produce a detailed trace of that region
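The stability test can be sketched as below. Only the 85% threshold comes from the slides. How clusters are matched between two clusterings is not specified there, so this example assumes a greedy match on centroid distance with a hypothetical tolerance `max_dist`, measures time on the newer clustering, and assumes "several in a row" means three.

```python
import math


def equivalent(prev, curr, max_dist=0.1, threshold=0.85):
    """Compare two clusterings, each a list of (centroid, total_time) pairs.

    A cluster in `curr` is matched when an unused cluster in `prev` has its
    centroid within `max_dist` (assumed matching rule). The clusterings are
    equivalent when matched clusters hold at least `threshold` of the time.
    """
    total = sum(time for _, time in curr)
    if total == 0:
        return False
    unused = list(prev)
    matched = 0.0
    for centroid, time in curr:
        if not unused:
            break
        best = min(unused, key=lambda c: math.dist(c[0], centroid))
        if math.dist(best[0], centroid) <= max_dist:
            unused.remove(best)
            matched += time
    return matched / total >= threshold


def is_stable(history, runs=3):
    """True when the last `runs` clusterings are equivalent, each to the next."""
    if len(history) < runs:
        return False
    recent = history[-runs:]
    return all(equivalent(a, b) for a, b in zip(recent, recent[1:]))


a = [((1.0, 1.0), 60.0), ((5.0, 5.0), 30.0), ((9.0, 2.0), 10.0)]
b = [((1.02, 0.98), 58.0), ((5.05, 4.95), 32.0), ((3.0, 7.0), 10.0)]
c = [((1.0, 1.0), 50.0), ((7.0, 7.0), 50.0)]
```

Here `a` and `b` share two clusters that carry 90% of the time in `b`, so they count as equivalent, while only half of the time in `c` falls in a cluster that also appears in `b`.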



Framework (7/7)



Experimental Setup

- MareNostrum supercomputer

• A cluster comprising 10,240 IBM PowerPC 970MP processors at 2.3 GHz interconnected by a Myrinet network



Gromacs (1/2)

- An engine to perform molecular dynamics simulations and energy minimization

• 64 MPI tasks with 10 iterations

Indication of potential load imbalance



Gromacs (2/2)



Zeus-MP

- A computational fluid dynamics code for the simulation of astrophysical phenomena

• 256 MPI tasks with 4 iterations



SPECFEM3D



Quality of the results
