tau performance systemtools.zih.tu-dresden.de/2011/downloads/spear_tau.pdf · • measurement:...
TRANSCRIPT
![Page 1: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/1.jpg)
ADVANCES IN THE TAU
PERFORMANCE SYSTEM
Sameer Shende Chee Wai Lee, Wyatt Spear, Scott Biersdorff, Suzanne Millstein
Performance Research Lab
Allen D. Malony, Nick Chaimov, William Voorhees Department of Computer and Information Science
University of Oregon
![Page 2: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/2.jpg)
Outline
2
• TAU Overview
• Performance Analysis of GPU Accelerated Codes
• Event Based Sampling (EBS)
• 3D Visualization of Performance Data
• TAU/Eclipse Integration
![Page 3: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/3.jpg)
TAU
3
A Brief Introduction To TAU
![Page 4: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/4.jpg)
TAU Performance System®
• Tuning and Analysis Utilities (18+ year project)
• Comprehensive performance profiling and tracing
– Integrated, scalable, flexible, portable
– Targets all parallel programming/execution paradigms
• Integrated performance toolkit
– Instrumentation, measurement, analysis, visualization
– Widely-ported performance profiling / tracing system
– Performance data management and data mining
– Open source (BSD-style license)
• Easy to integrate in application frameworks
http://tau.uoregon.edu
4
![Page 5: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/5.jpg)
How does TAU work?
• Instrumentation: Adds probes to perform measurements
– Source code instrumentation using pre-processors and compiler scripts
– Wrapping external libraries (I/O, MPI, Memory, CUDA, OpenCL, pthread)
– Rewriting the binary executable
• Measurement: Profiling or Tracing using wallclock time or hardware counters
– Direct instrumentation (Interval events measure exclusive or inclusive duration)
– Indirect instrumentation (Sampling measures statement level contribution)
– Throttling and runtime control of low-level events that execute frequently
– Per-thread storage of performance data
– Interface with external packages (Scalasca, VampirTrace, Score-P, PAPI)
• Analysis: Visualization of profiles and traces
– 3D visualization of profile data in paraprof, perfexplorer tools
– Trace conversion & display in external visualizers (Vampir, Jumpshot, ParaVer)
5
![Page 6: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/6.jpg)
Parallel Profile Visualization: ParaProf
![Page 7: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/7.jpg)
Using TAU: A Brief Introduction
• TAU supports several measurement and thread options
– Phase profiling, profiling with hardware counters, trace with Score-P…
• Each measurement configuration of TAU corresponds to a unique stub makefile and library that is generated when you configure it
• To instrument source code automatically using PDT
– Choose an appropriate TAU stub makefile in <arch>/lib:
% export TAU_MAKEFILE=$TAU/Makefile.tau-mpi-pdt
% export TAU_OPTIONS=„-optVerbose …‟ (see tau_compiler.sh )
Use tau_f90.sh, tau_cxx.sh or tau_cc.sh as F90, C++ or C compilers:
% mpif90 foo.f90 changes to
% tau_f90.sh foo.f90
• Set runtime environment variables, execute application and analyze performance data:
% pprof (for text based profile display)
% paraprof (for GUI)
7
![Page 8: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/8.jpg)
TAU
8
Instrumentation of GPU
Accelerated Code
![Page 9: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/9.jpg)
Heterogeneous Parallel Systems and
Performance
Heterogeneous parallel systems are highly relevant
today
Multi-CPU, multicore
shared memory nodes
Manycore (throughput)
accelerators with
high-BW I/O
Cluster interconnection network
Performance is the main driving concern
Heterogeneity is an important (the?) path to extreme scale
Heterogeneous software technology to get performance
More sophisticated parallel programming environments
Integrated parallel performance tools
support heterogeneous performance model and perspectives
![Page 10: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/10.jpg)
Heterogeneous Performance Views
• Want to create performance views that capture heterogeneous
concurrency and execution behavior
– Reflect interactions between heterogeneous components
– Capture performance semantics relative to computation model
– Assimilate performance for all execution paths for shared view
• Existing parallel performance tools are CPU (host)-centric
– Event-based sampling (not appropriate for accelerators)
– Direct measurement (through instrumentation of events)
• What perspective does the host have of other components?
– Determines the semantics of the measurement data
– Determines assumptions about behavior and interactions
• Performance views may have to work with reduced data
![Page 11: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/11.jpg)
Heterogeneous Performance Measurement
• Multi-level heterogeneous performance perspectives
• Inter-node communication
– Message communication, overhead, synchronization
• Intra-node execution
– Multicore thread execution and interactions
• Host-GPU interactions (general CPU – “special” device)
– Kernel setup, memory transfer, concurrency overlap,
synchronization
• GPU kernel execution
– Use of GPU compute and memory resources
![Page 12: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/12.jpg)
Host-GPU Measurement – Synchronous
Method
• Consider three measurement approaches:
– Synchronous, Event queue, Callback
• Synchronous approach treats Host-GPU interactions as
synchronous events that are measured on the CPU
• Approximates measurement of actual kernel start/stop
actions
![Page 13: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/13.jpg)
Host-GPU Measurement – Event Queue
Method
• Event queue methods inserts events in GPU stream
• Events are measured by GPU
• Performance information read at sync points on CPU
• Support an asynchronous performance view
• Events must be placed around kernel launch !!!
![Page 14: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/14.jpg)
Host-GPU Measurement – Callback Method
• Callback method is based on GPU driver and runtime
support for exposing certain routines and runtime actions
• Measurement tool registers the callbacks
– Application code is not modified !!!
• Where measurements occur depends on implementation
– Measurement might be made on CPU or GPU
– Measurements are accessed at callback point (CPU)
– mea
![Page 15: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/15.jpg)
Method Support and Implementation
Synchronous method
Place instrumentation appropriately around GPU calls
Wrap (synchronous) library with performance tool
Event queue method
Utilize CUDA and OpenCL event support
Need instrumentation to 1) create and insert events in the
streams with kernel launch and 2) process events
Can be implemented with driver library wrapping
Callback method
Utilize language-level callback support in OpenCL
Utilize NVIDIA CUDA Performance Tool Interface (CUPTI)
Need to appropriately register callbacks in program
![Page 16: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/16.jpg)
TAU for Heterogeneous Measurement
• TAU Performance System® (http://tau.uoregon.edu)
– Instrumentation, measurement, analysis for parallel systems
– Extend to support heterogeneous performance analysis
• Integrate Host-GPU support in TAU measurement
– Enable host-GPU measurement approach
• CUDA and OpenCL
• utilize PAPI CUDA and CUPTI
– Provide both heterogeneous profiling and tracing support
• contextualization of asynchronous kernel invocation
• Additional support
– TAU wrapping of libraries (tau_gen_wrapper)
– Work with binaries using library preloading (tau_exec)
![Page 17: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/17.jpg)
PAPI CUDA
• Performance API (PAPI) (http://icl.cs.utk.edu/papi)
• PAPI CUDA component
– PAPI component to support measurement of GPU counters
– Based on CUPTI (works with NVIDIA GPUs and CUDA)
– Device-level access to GPU counters (different devices)
![Page 18: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/18.jpg)
Vampir / VampirTrace for GPU
• Vampir / VampirTrace (http://www.vampir.eu/)
– Trace measurement and analysis of parallels application
– Extend to support GPU performance measurement
• Integrate Host-GPU measurement in trace measurement
– Based on the event queue method
– Library wrapping for CUDA and OpenCL
– Per kernel thread recording asynchronous events
– Use of CUPTI to capture performance counters
– Translation of GPU trace information to valid Vampir form
• Visualization of heterogeneous performance traces
– Presentation of memory transfer and kernel launches
– Includes calculation of counter statistics and rates
![Page 19: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/19.jpg)
NVIDIA CUPTI
• NVIDIA is developing CUPTI to enable the creation of
profiling and tracing tools
– CUPTI support was released with CUDA 4.0
• Two APIs:
– Callback API
• interject tool callback code at the entry and exist to each CUDA
runtime and driver API call
• registered tools are invoked for selected events
– Counter API
• query, configure, start, stop, read counters on CUDA devices
• device-level counter access
• CUPTI is delivered as a dynamic library
![Page 20: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/20.jpg)
GPU Performance Tool Interoperability
![Page 21: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/21.jpg)
CUDA Linpack Profile (4 processes, 4 GPUs)
• GPU-accelerated Linpack benchmark (NVIDIA)
![Page 22: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/22.jpg)
CUDA Linpack Trace
MPI communication (yellow) CUDA memory transfer (white)
CP
U t
races
GP
U t
race
s
![Page 23: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/23.jpg)
TAU
23
Event Based Sampling (EBS)
![Page 24: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/24.jpg)
Integrated EBS Profiles: Motivation
24
• TAU Event-Based Sampling (TAUebs) combines three
approaches:
– TAU for probe-based instrumentation and measurement
– Timer-based sampling measurement
– Call stack unwinding for sampled callpath structure
• A flexible hybrid approach exploits advantages of probe-
based instrumentation with advantages of sampling.
![Page 25: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/25.jpg)
Integrated EBS Profile Design
25
Sampling Timeline
Instrumentation Timeline
Probe-
Based
Profile
Sampling-
Based
Profile
Integration
Sampling-
Based
Trace
PC Histogram
0x3820
0x4894
0x2110
Samples
since
last
flush
TAUkey 0 TAUkey 1 TAUkey 2
PC
addr
range
0x2110
0x3820
0x4894
Active
event
stack =
TAUkey
1
1
1
![Page 26: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/26.jpg)
Integration of Sampling and TAU Profile
Output
• Previously, to visualize profiles, TAUebs:
– Captures a trace of EBS samples
– Post-processes the trace to recover symbol information
– Relies on ParaProf to merge sample traces with generated
profiles for profile visualization output.
• Sampled information now integrated into TAU data
structures at runtime.
• TAU Profile output now incorporates sampled data.
26
![Page 27: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/27.jpg)
Integrated EBS Profile Design (cont)
• Instance of new sample contextualized by TAUkey and
integrated into TAU data structures at runtime.
27
New Sample:
Function a(), address: 0x21098
loop
foo()
…
main
Active TAU
Event Stack
TAUkey:
“main-> … -> foo() -> loop”
PC Histogram:
0x21098 == 15
…
0x45362 == 23
![Page 28: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/28.jpg)
Implementation
• Uses existing timer-interrupt framework to trigger
samples at regular intervals.
• At each sample, a histogram of PC addresses is updated
for the active TAU event context represented by its
TAUkey.
• Current implementation of integrated profiles does not
unwind the callstack at each sample point. This implies a
flat sample profile for each TAU event context.
• Addresses are resolved to meaningful symbol
information via BFD at the end of the run.
28
![Page 29: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/29.jpg)
Examples: Simple Benchmark (cont)
• Mix of Sampling and Probed Instrumentation
29
Sampling discovers
un-instrumented libraries
Samples in the context of
probed events.
![Page 30: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/30.jpg)
Examples: FLASH4
30
Follow the “breadcrumbs” of
significantly lengthy probed events
to a probed loop …
Samples highlight the most significant
code in the context of the loop.
![Page 31: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/31.jpg)
3d Scatterplot correlation of EBS data (cont)
31
Sample: (Under MPI_Allreduce)
File:
bgp/comm/sys/dcmf/messager/prod/bgp/msgr.h
Line: 234
Function:
_ZN4DCMF11BGPMessager7advanceEv Sampled event
MPI_Allreduce (total)
![Page 32: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/32.jpg)
Implementation Issues: Signal Safety
• Signal safety
– Strictly, signal handlers cannot attempt to acquire locks.
– To ensure lock safety, EBS conservatively drops a sample if a
sample occurs inside TAU operations.
– EBS sample handler must avoid memory allocation. We use our
own C++ allocator that minimizes actual calls to malloc as a
workaround.
32
Conservative signal safety handling will drop samples
that occur while inside TAU.
Rules can be loosened if we refine the handler to drop
samples only if it is unsafe to update TAU structures.
![Page 33: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/33.jpg)
EBS Future Work
• Basic callstack unwinding for samples.
• Selective callstack unwinding for probed event sets:
– Not every sample block for a TAU event context needs to have
the callstack unwound for each sample.
– Mix of flat and structured sample blocks enclosed by the TAU
event context.
• Considerations for interpretation of code structures
discovered to callstack unwinding for samples in non-
function TAU contexts.
• Sampling with other sources of metric information (eg.
GPU instruction count).
33
![Page 34: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/34.jpg)
EBS Future Work (cont)
• Sampling GPU metrics prototype visualization:
34
![Page 35: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/35.jpg)
TAU
35
3D Visualization of Performance
Data
![Page 36: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/36.jpg)
Motivation
Large-scale parallel application performance analysis
Requires scalable performance measurement
Ability to understand multi-dimensional performance data
Automated analysis and diagnosis is still a challenge
Performance interpretation continues to involve the user
How to build better tool capabilities to support this goal?
Need to be able to handle data dimensionality
Presentation of performance information is important
Performance visualization is an opportunity
Convey visual characteristics and traits in the data
Aids in data exploration and pattern analysis
3D visualization helps in identifying relations and scale
![Page 37: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/37.jpg)
Using Process Topology Metadata
• Inspired by the Scalasca CUBE topology display
• Each point represents a thread of execution (MPI process)
– Positioned according to the Cartesian (x,y,z,t) coordinates
• Color is determined by selected event/metric value
• Topology information can be recorded in TAU metadata
• ParaProf reads metadata to determine topology and create layout
• Example: Sweep3D 3D neutron transport application
– 16K run on BG/P
– Color is exclusive time in the “sweep” function
![Page 38: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/38.jpg)
Viewing Internal Structure
• Dense topologies
can hide internal
structure in the
visualization
• Restrict visibility
by color value to
expose performance
patterns
• ParaProf visualization UI now allows for range filtering
– Mid-level values can be excluded
– Include high outliers (hotspots)
– Include low outliers (under-utilized ranks)
![Page 39: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/39.jpg)
Slicing to Reduce Dimensionality
• Restrict visibility to
sub-dimensions
– Slices along spatial axes
• ParaProf visualization UI
provides dimensionality
reduction control
– Multiple axis controls
allow selection of planes, lines, or an individual point
• Several alternatives for value reduction
– Show only value of points
– Averaging the color value for all points in the selected area
![Page 40: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/40.jpg)
Custom Topologies
• Certain views may hide deeper
inter-process behavior
• Spatially dependent performance issues
may be revealed by manipulating topology
• Sweep3D profile with alternative
Cartesian mapping exposes distribution
of computational effort
• Topology has direct effect on communication
• Visualization mapped to hardware topologies
can suggest better node/rank mapping
• MPI_AllReduce() values for
Sweep3D highlights waiting distribution
from rank 0 (lower left) to the most distant
rank (upper right)
![Page 41: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/41.jpg)
Topology Control UI
• Layout tab allows
customization of the
position and visibility
of data points
• Performance
event/metric data
used to define color
and position is selected
in the Event tab
• Additional rendering
options, such as color
scale and point size are
available
• 4k-core S3D run
on IBM BG/P
Hide points with
values outside of
the range
Move both range
sliders at once
Show only points
along the selected
slice of an axis
The average value
of visible points.
Select preset or
custom layouts
Set X-, Y-, Z-axis
dimensions
Select the plot type. The
'Topolgy' plot allows
custom definition of
performance data layouts
![Page 42: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/42.jpg)
Cray XE6 Topology Visualization
• Example: GCRM global cloud resolving mode
• Visualization of 10K execution on Cray XE6
– Topological view shows core distribution (24 cores per node)
– Custom layout makes decision on node/core blocking
MPI_Init MPI_Allreduce
24-core
node
![Page 43: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/43.jpg)
Visual Layout Specification
• Want to allow creation of explicit layouts by the user
• IDEA: Define a specification “language” that allows mathematical expressions to describe features of performance display – Equations define X, Y, Z coordinates and color per process
– Event and metrics are seen as variables
• eventX.val : value for Xth specified event and metric
• eventX.{min,max,mean} : global aggregate values
• atomicY : Yth atomic event value
– Intermediate variables can be used in the calculation
– Defined global variables (e.g., max rank) are provided
• Specifications are loaded and processed by ParaProf – Use the MESP expression parser
![Page 44: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/44.jpg)
ParaProf Events Panel
Events / metrics get bound in
ParaProf UI
• Example:
– event0 is the FLOP count for function foo
– event1 is the time value for function foo
– To set the X coordinate for each process
point to the FLOPS for event foo:
x = event0.val / event1.val
– To set the Y coordinate for each process
point to the global average FLOPS for
event foo:
y = event0.mean / event1.mean
![Page 45: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/45.jpg)
Sphere Layout Specification
• Spatially mediated performance behavior may not be
represented directly in topology metadata
– Applications allocate resources with respect to a data-
driven model
• Position of each point can be defined by custom equations in
terms of event/metric, aggregate, atomic event and metadata
• Sweep3D profile mapped to a sphere
BEGIN_VIZ=Sphere
rootRanks=sqrt(maxRank)
theta=2*pi()/rootRanks*mod(rank,rootRanks)
phi=pi()/rootRanks*(ceil(rank/rootRanks))
x=cos(theta)*sin(phi)*100
y=sin(theta)*sin(phi)*100
z=cos(phi)*100
END_VIZ
![Page 46: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/46.jpg)
Adding Dimensionality
• Topologies can involve more than three dimensions (e.g., intranode)
• Mirror actual machine layout to capture communication structure and cores
• Custom layouts allow specification of multiple points from a single process/rank
• 4K-core S3D run on BG/P
• Default topology only covers X, Y, Z coordinates
• A custom topology divides each nth core into its own block
BEGIN_VIZ=4K_8x8x16Block xdim=8 ydim=8 zdim=16 x=mod(rank,xdim)+16*floor(rank/1024) y=mod(floor(rank/xdim),ydim) z=mod(floor(rank/xdim/ydim),zdim) END_VIZ
![Page 47: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/47.jpg)
Conclusion and Next Steps
• Have extended TAU‟s ParaProf tool for visualization
– Working with topology information
– Provide custom topology views through UI
– User-defined layouts
• Demonstrate on relatively large applications
• Future work
– Expand UI for more general access to performance model
– Allow independent manipulation of unconnected segments
– Improve presentation of data values, ranks, and metrics
– Better functionality for automatic higher-dimensional layouts
– Build library of user-defined layouts for reuse
![Page 48: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/48.jpg)
TAU
48
TAU/Eclipse Integration
![Page 49: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/49.jpg)
2-49
Parallel Tools Platform (PTP)
• The Parallel Tools Platform aims to provide a highly integrated environment specifically designed for parallel application development
• Features include:
– An integrated development environment (IDE) that supports a wide range of parallel architectures and runtime systems
– A scalable parallel debugger
– Parallel programming tools (MPI, OpenMP, UPC, etc.)
– Support for the integration of parallel tools
– An environment that simplifies the end-user interaction with parallel systems
• http://www.eclipse.org/ptp
![Page 50: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/50.jpg)
Eclipse PTP Family of Tools
Coding & Analysis (C, C++, Fortran)
Parallel Debugging
Launching & Monitoring
Performance Tuning (TAU, PPW, …)
2-50
![Page 51: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/51.jpg)
PTP/External Tools Framework formerly “Performance Tools Framework”
Goal:
Reduce the “eclipse plumbing” necessary to integrate tools
Provide integration for instrumentation, measurement, and analysis for a variety of performance tools
Dynamic Tool Definitions: Workflows & UI
Tools and tool workflows are specified in an XML file
Tools are selected and configured in the launch configuration window
Output is generated, managed and analyzed as specified in the workflow
5-51
![Page 52: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/52.jpg)
PTP TAU plug-ins http://www.cs.uoregon.edu/research/tau
• TAU (Tuning and Analysis Utilities)
• First implementation of External Tools Framework (ETFw)
• Eclipse plug-ins wrap TAU functions, make them available from Eclipse
• Compatible with Photran and CDT projects and with PTP parallel application launching
• Other plug-ins launch Paraprof from Eclipse too
5-52
![Page 53: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/53.jpg)
TAU Integration with PTP
• TAU: Tuning and
Analysis Utilities
– Performance data
collection and analysis for
HPC codes
– Numerous features
– Command line interface
• The TAU Workflow:
– Instrumentation
– Execution
– Analysis
7-53
![Page 54: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/54.jpg)
TAU/ETFw Hands-On
• Performance analysis/External tools use the same launch
configurations and resource managers as
debugging/launching
• The relevant tools must be in the $PATH on the remote
machine
• Select the Profile button‟s “Profile Configurations…” option
to begin:
7-54
![Page 55: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/55.jpg)
TAU/ETFw Hands-On (2)
• Select an existing launch configuration or create a new
one
• Select the Performance Analysis tab and choose the
TAU tool set
7-55
![Page 56: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/56.jpg)
TAU/ETFw Hands-On (3)
• Select the TAU tab and
choose a Makefile with MPI,
PDT and PAPI options
• Select PAPI counters
• Other options are available
but not needed here
• Hit „Profile‟
7-56
![Page 57: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/57.jpg)
TAU/ETFW Hands-On (4)
• The application will rebuild and launch.
• Performance data will appear in the Performance
Data Management view
• Double click the new entry to view in ParaProf
• Right click on a function bar and select Show Source
Code for source callback to Eclipse
7-57
![Page 58: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/58.jpg)
Performance Tools: Summary
• Performance Tools integrated with PTP
– Performance Tools integrated with PTP help tune parallel
applications
– External Tools Framework (ETFw) eases integration of existing
(command-line, etc.) tools
• TAU Performance Tuning uses ETFw
• PPW (Parallel Perf. Wizard) uses ETFw for UPC analysis
– MPI Analysis: GEM
– Now Supporting remote analysis
7-58
![Page 59: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/59.jpg)
3-59
Synchronized Projects
• Projects types can be:
File Service Index Service
Launch Service
Build Service
Debug Service
Local source code
Source code copy
Local Remote
Compute
Edit Search/Index Navigation
Synchronize
Executable
![Page 60: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/60.jpg)
60
Support Acknowledgements
• US Department of Energy (DOE)
– Office of Science contracts
– SciDAC, LBL contracts
– LLNL-LANL-SNL ASC/NNSA contract
– Battelle, PNNL contract
– ANL, ORNL contract
• Department of Defense (DoD)
– PETTT, HPTi
• National Science Foundation (NSF)
– SDCI, SI-2
• University of Oregon
• ParaTools, Inc.
• University of Tennessee, Knoxville
– Dr. Shirley Moore
• T.U. Dresden, GWT
– Dr. Wolfgang Nagel and Dr. Andreas Knupfer
• Juelich Supercomputing Center
– Dr. Bernd Mohr, Dr. Felix Wolf Dr. Markus Geimer, Dr. Brian Wylie
![Page 61: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation](https://reader033.vdocuments.mx/reader033/viewer/2022050216/5f61bfaadf23203b8c250c04/html5/thumbnails/61.jpg)
For more information
• TAU Website:
http://tau.uoregon.edu
– Software
– Release notes
– Documentation
61