tau performance systemtools.zih.tu-dresden.de/2011/downloads/spear_tau.pdf · • measurement:...

61
ADVANCES IN THE TAU PERFORMANCE SYSTEM Sameer Shende Chee Wai Lee, Wyatt Spear, Scott Biersdorff, Suzanne Millstein Performance Research Lab Allen D. Malony, Nick Chaimov, William Voorhees Department of Computer and Information Science University of Oregon

Upload: others

Post on 21-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

ADVANCES IN THE TAU

PERFORMANCE SYSTEM

Sameer Shende Chee Wai Lee, Wyatt Spear, Scott Biersdorff, Suzanne Millstein

Performance Research Lab

Allen D. Malony, Nick Chaimov, William Voorhees Department of Computer and Information Science

University of Oregon

Page 2: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Outline

2

• TAU Overview

• Performance Analysis of GPU Accelerated Codes

• Event Based Sampling (EBS)

• 3D Visualization of Performance Data

• TAU/Eclipse Integration

Page 3: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU

3

A Brief Introduction To TAU

Page 4: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU Performance System®

• Tuning and Analysis Utilities (18+ year project)

• Comprehensive performance profiling and tracing

– Integrated, scalable, flexible, portable

– Targets all parallel programming/execution paradigms

• Integrated performance toolkit

– Instrumentation, measurement, analysis, visualization

– Widely-ported performance profiling / tracing system

– Performance data management and data mining

– Open source (BSD-style license)

• Easy to integrate in application frameworks

http://tau.uoregon.edu

4

Page 5: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

How does TAU work?

• Instrumentation: Adds probes to perform measurements

– Source code instrumentation using pre-processors and compiler scripts

– Wrapping external libraries (I/O, MPI, Memory, CUDA, OpenCL, pthread)

– Rewriting the binary executable

• Measurement: Profiling or Tracing using wallclock time or hardware counters

– Direct instrumentation (Interval events measure exclusive or inclusive duration)

– Indirect instrumentation (Sampling measures statement level contribution)

– Throttling and runtime control of low-level events that execute frequently

– Per-thread storage of performance data

– Interface with external packages (Scalasca, VampirTrace, Score-P, PAPI)

• Analysis: Visualization of profiles and traces

– 3D visualization of profile data in paraprof, perfexplorer tools

– Trace conversion & display in external visualizers (Vampir, Jumpshot, ParaVer)

5

Page 6: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Parallel Profile Visualization: ParaProf

Page 7: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Using TAU: A Brief Introduction

• TAU supports several measurement and thread options

– Phase profiling, profiling with hardware counters, trace with Score-P…

• Each measurement configuration of TAU corresponds to a unique stub makefile and library that is generated when you configure it

• To instrument source code automatically using PDT

– Choose an appropriate TAU stub makefile in <arch>/lib:

% export TAU_MAKEFILE=$TAU/Makefile.tau-mpi-pdt

% export TAU_OPTIONS=„-optVerbose …‟ (see tau_compiler.sh )

Use tau_f90.sh, tau_cxx.sh or tau_cc.sh as F90, C++ or C compilers:

% mpif90 foo.f90 changes to

% tau_f90.sh foo.f90

• Set runtime environment variables, execute application and analyze performance data:

% pprof (for text based profile display)

% paraprof (for GUI)

7

Page 8: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU

8

Instrumentation of GPU

Accelerated Code

Page 9: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Heterogeneous Parallel Systems and

Performance

Heterogeneous parallel systems are highly relevant

today

Multi-CPU, multicore

shared memory nodes

Manycore (throughput)

accelerators with

high-BW I/O

Cluster interconnection network

Performance is the main driving concern

Heterogeneity is an important (the?) path to extreme scale

Heterogeneous software technology to get performance

More sophisticated parallel programming environments

Integrated parallel performance tools

support heterogeneous performance model and perspectives

Page 10: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Heterogeneous Performance Views

• Want to create performance views that capture heterogeneous

concurrency and execution behavior

– Reflect interactions between heterogeneous components

– Capture performance semantics relative to computation model

– Assimilate performance for all execution paths for shared view

• Existing parallel performance tools are CPU (host)-centric

– Event-based sampling (not appropriate for accelerators)

– Direct measurement (through instrumentation of events)

• What perspective does the host have of other components?

– Determines the semantics of the measurement data

– Determines assumptions about behavior and interactions

• Performance views may have to work with reduced data

Page 11: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Heterogeneous Performance Measurement

• Multi-level heterogeneous performance perspectives

• Inter-node communication

– Message communication, overhead, synchronization

• Intra-node execution

– Multicore thread execution and interactions

• Host-GPU interactions (general CPU – “special” device)

– Kernel setup, memory transfer, concurrency overlap,

synchronization

• GPU kernel execution

– Use of GPU compute and memory resources

Page 12: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Host-GPU Measurement – Synchronous

Method

• Consider three measurement approaches:

– Synchronous, Event queue, Callback

• Synchronous approach treats Host-GPU interactions as

synchronous events that are measured on the CPU

• Approximates measurement of actual kernel start/stop

actions

Page 13: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Host-GPU Measurement – Event Queue

Method

• Event queue methods inserts events in GPU stream

• Events are measured by GPU

• Performance information read at sync points on CPU

• Support an asynchronous performance view

• Events must be placed around kernel launch !!!

Page 14: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Host-GPU Measurement – Callback Method

• Callback method is based on GPU driver and runtime

support for exposing certain routines and runtime actions

• Measurement tool registers the callbacks

– Application code is not modified !!!

• Where measurements occur depends on implementation

– Measurement might be made on CPU or GPU

– Measurements are accessed at callback point (CPU)

– mea

Page 15: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Method Support and Implementation

Synchronous method

Place instrumentation appropriately around GPU calls

Wrap (synchronous) library with performance tool

Event queue method

Utilize CUDA and OpenCL event support

Need instrumentation to 1) create and insert events in the

streams with kernel launch and 2) process events

Can be implemented with driver library wrapping

Callback method

Utilize language-level callback support in OpenCL

Utilize NVIDIA CUDA Performance Tool Interface (CUPTI)

Need to appropriately register callbacks in program

Page 16: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU for Heterogeneous Measurement

• TAU Performance System® (http://tau.uoregon.edu)

– Instrumentation, measurement, analysis for parallel systems

– Extend to support heterogeneous performance analysis

• Integrate Host-GPU support in TAU measurement

– Enable host-GPU measurement approach

• CUDA and OpenCL

• utilize PAPI CUDA and CUPTI

– Provide both heterogeneous profiling and tracing support

• contextualization of asynchronous kernel invocation

• Additional support

– TAU wrapping of libraries (tau_gen_wrapper)

– Work with binaries using library preloading (tau_exec)

Page 17: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

PAPI CUDA

• Performance API (PAPI) (http://icl.cs.utk.edu/papi)

• PAPI CUDA component

– PAPI component to support measurement of GPU counters

– Based on CUPTI (works with NVIDIA GPUs and CUDA)

– Device-level access to GPU counters (different devices)

Page 18: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Vampir / VampirTrace for GPU

• Vampir / VampirTrace (http://www.vampir.eu/)

– Trace measurement and analysis of parallels application

– Extend to support GPU performance measurement

• Integrate Host-GPU measurement in trace measurement

– Based on the event queue method

– Library wrapping for CUDA and OpenCL

– Per kernel thread recording asynchronous events

– Use of CUPTI to capture performance counters

– Translation of GPU trace information to valid Vampir form

• Visualization of heterogeneous performance traces

– Presentation of memory transfer and kernel launches

– Includes calculation of counter statistics and rates

Page 19: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

NVIDIA CUPTI

• NVIDIA is developing CUPTI to enable the creation of

profiling and tracing tools

– CUPTI support was released with CUDA 4.0

• Two APIs:

– Callback API

• interject tool callback code at the entry and exist to each CUDA

runtime and driver API call

• registered tools are invoked for selected events

– Counter API

• query, configure, start, stop, read counters on CUDA devices

• device-level counter access

• CUPTI is delivered as a dynamic library

Page 20: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

GPU Performance Tool Interoperability

Page 21: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

CUDA Linpack Profile (4 processes, 4 GPUs)

• GPU-accelerated Linpack benchmark (NVIDIA)

Page 22: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

CUDA Linpack Trace

MPI communication (yellow) CUDA memory transfer (white)

CP

U t

races

GP

U t

race

s

Page 23: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU

23

Event Based Sampling (EBS)

Page 24: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Integrated EBS Profiles: Motivation

24

• TAU Event-Based Sampling (TAUebs) combines three

approaches:

– TAU for probe-based instrumentation and measurement

– Timer-based sampling measurement

– Call stack unwinding for sampled callpath structure

• A flexible hybrid approach exploits advantages of probe-

based instrumentation with advantages of sampling.

Page 25: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Integrated EBS Profile Design

25

Sampling Timeline

Instrumentation Timeline

Probe-

Based

Profile

Sampling-

Based

Profile

Integration

Sampling-

Based

Trace

PC Histogram

0x3820

0x4894

0x2110

Samples

since

last

flush

TAUkey 0 TAUkey 1 TAUkey 2

PC

addr

range

0x2110

0x3820

0x4894

Active

event

stack =

TAUkey

1

1

1

Page 26: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Integration of Sampling and TAU Profile

Output

• Previously, to visualize profiles, TAUebs:

– Captures a trace of EBS samples

– Post-processes the trace to recover symbol information

– Relies on ParaProf to merge sample traces with generated

profiles for profile visualization output.

• Sampled information now integrated into TAU data

structures at runtime.

• TAU Profile output now incorporates sampled data.

26

Page 27: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Integrated EBS Profile Design (cont)

• Instance of new sample contextualized by TAUkey and

integrated into TAU data structures at runtime.

27

New Sample:

Function a(), address: 0x21098

loop

foo()

main

Active TAU

Event Stack

TAUkey:

“main-> … -> foo() -> loop”

PC Histogram:

0x21098 == 15

0x45362 == 23

Page 28: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Implementation

• Uses existing timer-interrupt framework to trigger

samples at regular intervals.

• At each sample, a histogram of PC addresses is updated

for the active TAU event context represented by its

TAUkey.

• Current implementation of integrated profiles does not

unwind the callstack at each sample point. This implies a

flat sample profile for each TAU event context.

• Addresses are resolved to meaningful symbol

information via BFD at the end of the run.

28

Page 29: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Examples: Simple Benchmark (cont)

• Mix of Sampling and Probed Instrumentation

29

Sampling discovers

un-instrumented libraries

Samples in the context of

probed events.

Page 30: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Examples: FLASH4

30

Follow the “breadcrumbs” of

significantly lengthy probed events

to a probed loop …

Samples highlight the most significant

code in the context of the loop.

Page 31: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

3d Scatterplot correlation of EBS data (cont)

31

Sample: (Under MPI_Allreduce)

File:

bgp/comm/sys/dcmf/messager/prod/bgp/msgr.h

Line: 234

Function:

_ZN4DCMF11BGPMessager7advanceEv Sampled event

MPI_Allreduce (total)

Page 32: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Implementation Issues: Signal Safety

• Signal safety

– Strictly, signal handlers cannot attempt to acquire locks.

– To ensure lock safety, EBS conservatively drops a sample if a

sample occurs inside TAU operations.

– EBS sample handler must avoid memory allocation. We use our

own C++ allocator that minimizes actual calls to malloc as a

workaround.

32

Conservative signal safety handling will drop samples

that occur while inside TAU.

Rules can be loosened if we refine the handler to drop

samples only if it is unsafe to update TAU structures.

Page 33: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

EBS Future Work

• Basic callstack unwinding for samples.

• Selective callstack unwinding for probed event sets:

– Not every sample block for a TAU event context needs to have

the callstack unwound for each sample.

– Mix of flat and structured sample blocks enclosed by the TAU

event context.

• Considerations for interpretation of code structures

discovered to callstack unwinding for samples in non-

function TAU contexts.

• Sampling with other sources of metric information (eg.

GPU instruction count).

33

Page 34: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

EBS Future Work (cont)

• Sampling GPU metrics prototype visualization:

34

Page 35: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU

35

3D Visualization of Performance

Data

Page 36: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Motivation

Large-scale parallel application performance analysis

Requires scalable performance measurement

Ability to understand multi-dimensional performance data

Automated analysis and diagnosis is still a challenge

Performance interpretation continues to involve the user

How to build better tool capabilities to support this goal?

Need to be able to handle data dimensionality

Presentation of performance information is important

Performance visualization is an opportunity

Convey visual characteristics and traits in the data

Aids in data exploration and pattern analysis

3D visualization helps in identifying relations and scale

Page 37: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Using Process Topology Metadata

• Inspired by the Scalasca CUBE topology display

• Each point represents a thread of execution (MPI process)

– Positioned according to the Cartesian (x,y,z,t) coordinates

• Color is determined by selected event/metric value

• Topology information can be recorded in TAU metadata

• ParaProf reads metadata to determine topology and create layout

• Example: Sweep3D 3D neutron transport application

– 16K run on BG/P

– Color is exclusive time in the “sweep” function

Page 38: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Viewing Internal Structure

• Dense topologies

can hide internal

structure in the

visualization

• Restrict visibility

by color value to

expose performance

patterns

• ParaProf visualization UI now allows for range filtering

– Mid-level values can be excluded

– Include high outliers (hotspots)

– Include low outliers (under-utilized ranks)

Page 39: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Slicing to Reduce Dimensionality

• Restrict visibility to

sub-dimensions

– Slices along spatial axes

• ParaProf visualization UI

provides dimensionality

reduction control

– Multiple axis controls

allow selection of planes, lines, or an individual point

• Several alternatives for value reduction

– Show only value of points

– Averaging the color value for all points in the selected area

Page 40: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Custom Topologies

• Certain views may hide deeper

inter-process behavior

• Spatially dependent performance issues

may be revealed by manipulating topology

• Sweep3D profile with alternative

Cartesian mapping exposes distribution

of computational effort

• Topology has direct effect on communication

• Visualization mapped to hardware topologies

can suggest better node/rank mapping

• MPI_AllReduce() values for

Sweep3D highlights waiting distribution

from rank 0 (lower left) to the most distant

rank (upper right)

Page 41: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Topology Control UI

• Layout tab allows

customization of the

position and visibility

of data points

• Performance

event/metric data

used to define color

and position is selected

in the Event tab

• Additional rendering

options, such as color

scale and point size are

available

• 4k-core S3D run

on IBM BG/P

Hide points with

values outside of

the range

Move both range

sliders at once

Show only points

along the selected

slice of an axis

The average value

of visible points.

Select preset or

custom layouts

Set X-, Y-, Z-axis

dimensions

Select the plot type. The

'Topolgy' plot allows

custom definition of

performance data layouts

Page 42: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Cray XE6 Topology Visualization

• Example: GCRM global cloud resolving mode

• Visualization of 10K execution on Cray XE6

– Topological view shows core distribution (24 cores per node)

– Custom layout makes decision on node/core blocking

MPI_Init MPI_Allreduce

24-core

node

Page 43: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Visual Layout Specification

• Want to allow creation of explicit layouts by the user

• IDEA: Define a specification “language” that allows mathematical expressions to describe features of performance display – Equations define X, Y, Z coordinates and color per process

– Event and metrics are seen as variables

• eventX.val : value for Xth specified event and metric

• eventX.{min,max,mean} : global aggregate values

• atomicY : Yth atomic event value

– Intermediate variables can be used in the calculation

– Defined global variables (e.g., max rank) are provided

• Specifications are loaded and processed by ParaProf – Use the MESP expression parser

Page 44: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

ParaProf Events Panel

Events / metrics get bound in

ParaProf UI

• Example:

– event0 is the FLOP count for function foo

– event1 is the time value for function foo

– To set the X coordinate for each process

point to the FLOPS for event foo:

x = event0.val / event1.val

– To set the Y coordinate for each process

point to the global average FLOPS for

event foo:

y = event0.mean / event1.mean

Page 45: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Sphere Layout Specification

• Spatially mediated performance behavior may not be

represented directly in topology metadata

– Applications allocate resources with respect to a data-

driven model

• Position of each point can be defined by custom equations in

terms of event/metric, aggregate, atomic event and metadata

• Sweep3D profile mapped to a sphere

BEGIN_VIZ=Sphere

rootRanks=sqrt(maxRank)

theta=2*pi()/rootRanks*mod(rank,rootRanks)

phi=pi()/rootRanks*(ceil(rank/rootRanks))

x=cos(theta)*sin(phi)*100

y=sin(theta)*sin(phi)*100

z=cos(phi)*100

END_VIZ

Page 46: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Adding Dimensionality

• Topologies can involve more than three dimensions (e.g., intranode)

• Mirror actual machine layout to capture communication structure and cores

• Custom layouts allow specification of multiple points from a single process/rank

• 4K-core S3D run on BG/P

• Default topology only covers X, Y, Z coordinates

• A custom topology divides each nth core into its own block

BEGIN_VIZ=4K_8x8x16Block xdim=8 ydim=8 zdim=16 x=mod(rank,xdim)+16*floor(rank/1024) y=mod(floor(rank/xdim),ydim) z=mod(floor(rank/xdim/ydim),zdim) END_VIZ

Page 47: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Conclusion and Next Steps

• Have extended TAU‟s ParaProf tool for visualization

– Working with topology information

– Provide custom topology views through UI

– User-defined layouts

• Demonstrate on relatively large applications

• Future work

– Expand UI for more general access to performance model

– Allow independent manipulation of unconnected segments

– Improve presentation of data values, ranks, and metrics

– Better functionality for automatic higher-dimensional layouts

– Build library of user-defined layouts for reuse

Page 48: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU

48

TAU/Eclipse Integration

Page 49: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

2-49

Parallel Tools Platform (PTP)

• The Parallel Tools Platform aims to provide a highly integrated environment specifically designed for parallel application development

• Features include:

– An integrated development environment (IDE) that supports a wide range of parallel architectures and runtime systems

– A scalable parallel debugger

– Parallel programming tools (MPI, OpenMP, UPC, etc.)

– Support for the integration of parallel tools

– An environment that simplifies the end-user interaction with parallel systems

• http://www.eclipse.org/ptp

Page 50: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Eclipse PTP Family of Tools

Coding & Analysis (C, C++, Fortran)

Parallel Debugging

Launching & Monitoring

Performance Tuning (TAU, PPW, …)

2-50

Page 51: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

PTP/External Tools Framework formerly “Performance Tools Framework”

Goal:

Reduce the “eclipse plumbing” necessary to integrate tools

Provide integration for instrumentation, measurement, and analysis for a variety of performance tools

Dynamic Tool Definitions: Workflows & UI

Tools and tool workflows are specified in an XML file

Tools are selected and configured in the launch configuration window

Output is generated, managed and analyzed as specified in the workflow

5-51

Page 52: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

PTP TAU plug-ins http://www.cs.uoregon.edu/research/tau

• TAU (Tuning and Analysis Utilities)

• First implementation of External Tools Framework (ETFw)

• Eclipse plug-ins wrap TAU functions, make them available from Eclipse

• Compatible with Photran and CDT projects and with PTP parallel application launching

• Other plug-ins launch Paraprof from Eclipse too

5-52

Page 53: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU Integration with PTP

• TAU: Tuning and

Analysis Utilities

– Performance data

collection and analysis for

HPC codes

– Numerous features

– Command line interface

• The TAU Workflow:

– Instrumentation

– Execution

– Analysis

7-53

Page 54: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU/ETFw Hands-On

• Performance analysis/External tools use the same launch

configurations and resource managers as

debugging/launching

• The relevant tools must be in the $PATH on the remote

machine

• Select the Profile button‟s “Profile Configurations…” option

to begin:

7-54

Page 55: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU/ETFw Hands-On (2)

• Select an existing launch configuration or create a new

one

• Select the Performance Analysis tab and choose the

TAU tool set

7-55

Page 56: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU/ETFw Hands-On (3)

• Select the TAU tab and

choose a Makefile with MPI,

PDT and PAPI options

• Select PAPI counters

• Other options are available

but not needed here

• Hit „Profile‟

7-56

Page 57: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

TAU/ETFW Hands-On (4)

• The application will rebuild and launch.

• Performance data will appear in the Performance

Data Management view

• Double click the new entry to view in ParaProf

• Right click on a function bar and select Show Source

Code for source callback to Eclipse

7-57

Page 58: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

Performance Tools: Summary

• Performance Tools integrated with PTP

– Performance Tools integrated with PTP help tune parallel

applications

– External Tools Framework (ETFw) eases integration of existing

(command-line, etc.) tools

• TAU Performance Tuning uses ETFw

• PPW (Parallel Perf. Wizard) uses ETFw for UPC analysis

– MPI Analysis: GEM

– Now Supporting remote analysis

7-58

Page 59: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

3-59

Synchronized Projects

• Projects types can be:

File Service Index Service

Launch Service

Build Service

Debug Service

Local source code

Source code copy

Local Remote

Compute

Edit Search/Index Navigation

Synchronize

Executable

Page 60: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

60

Support Acknowledgements

• US Department of Energy (DOE)

– Office of Science contracts

– SciDAC, LBL contracts

– LLNL-LANL-SNL ASC/NNSA contract

– Battelle, PNNL contract

– ANL, ORNL contract

• Department of Defense (DoD)

– PETTT, HPTi

• National Science Foundation (NSF)

– SDCI, SI-2

• University of Oregon

• ParaTools, Inc.

• University of Tennessee, Knoxville

– Dr. Shirley Moore

• T.U. Dresden, GWT

– Dr. Wolfgang Nagel and Dr. Andreas Knupfer

• Juelich Supercomputing Center

– Dr. Bernd Mohr, Dr. Felix Wolf Dr. Markus Geimer, Dr. Brian Wylie

Page 61: TAU PERFORMANCE SYSTEMtools.zih.tu-dresden.de/2011/downloads/spear_TAU.pdf · • Measurement: Profiling or Tracing using wallclock time or hardware counters – Direct instrumentation

For more information

• TAU Website:

http://tau.uoregon.edu

– Software

– Release notes

– Documentation

61