tau evaluation report adam leko, hung-hsun su upc group hcs research laboratory university of...

TAU Evaluation Report

Adam Leko, Hung-Hsun Su

UPC Group

HCS Research Laboratory
University of Florida

Color encoding key:

Blue: Information

Red: Negative note

Green: Positive note

Basic Information

Name: Tuning and Analysis Utilities (TAU)

Developer: University of Oregon

Current version:
  - TAU 2.14.4
  - Program Database Toolkit (PDToolkit) 3.3.1

Website: http://www.cs.uoregon.edu/research/paracomp/tau/tautools/

Contact: Sameer Shende: [email protected]

TAU Overview

- Performance tool suite that offers profiling and tracing of programs
- Available instrumentation methods: source (manual), source (automatic), binary (DynInst)
- Supported languages: C, C++, Fortran, Python, Java, SHMEM (TurboSHMEM and Cray SHMEM), OpenMP, MPI, Charm
- Hardware counter support
- Relies on existing toolkits and libraries for some functionality
  - PDToolkit and Opari for automatic source instrumentation
  - DynInst for runtime binary instrumentation
  - PCL and PAPI for hardware counter information
  - libvtf3, slog2sdk, and EPILOG for exporting trace files

TAU Architecture

Configuring & Installing TAU

- TAU relies on several existing toolkits for efficient usage, but some of these toolkits (PDToolkit, PAPI, etc.) are time-consuming to install
- Users must choose between modes at compile time using the ./configure script
  - Profiling via -PROFILE, tracing via -TRACE
  - TAU must also be told the location of supported languages and compilers, e.g. -mpilib=/path/to/mpi/lib, -dyninst=/path/to/dyninst, -pdt=/path/to/pdt
  - Other supported languages/libraries are handled in a similar manner
- This results in a very flexible installation process
  - Users can easily install different configurations of TAU in their home directory
  - However, several configuration options are mutually exclusive, such as
    - Profiling and tracing
    - Using PAPI counters vs. gettimeofday or TSC counters
    - Profiling with call paths vs. profiling with extra statistics
- Unfortunately, the mutually exclusive options prove to be annoying
  - It would be nice if TAU supported (for instance) tracing and profiling without compiling & installing twice
  - Luckily, the software compiles quickly on modern machines, so this is not fatal
  - However, TAU relies on several environment variables, which makes switching between installations cumbersome
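Since profiling and tracing cannot be combined in one build, covering both means configuring two installations. The following is a minimal Python sketch of assembling the two ./configure command lines from the flags named on this slide. The helper name configure_command and the placeholder paths are illustrative assumptions, not part of TAU, and the script only builds the argument lists without running anything.

```python
# Hypothetical helper: builds (but does not run) ./configure argument lists
# for two mutually exclusive TAU builds. Only flags named in this report are used.

def configure_command(mode, mpilib, pdt):
    """Return the argument list for one TAU build; mode is 'profile' or 'trace'."""
    mode_flags = {"profile": "-PROFILE", "trace": "-TRACE"}
    if mode not in mode_flags:
        raise ValueError("mode must be 'profile' or 'trace'")
    return ["./configure", mode_flags[mode], f"-mpilib={mpilib}", f"-pdt={pdt}"]

# Placeholder paths, as on the slide; substitute real locations before use.
builds = {
    mode: configure_command(mode, "/path/to/mpi/lib", "/path/to/pdt")
    for mode in ("profile", "trace")
}

for mode, argv in builds.items():
    print(mode, "->", " ".join(argv))
```

Each resulting build would still need its own environment variable settings, which is the switching cost noted above.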

The Many Faces of TAU

- Two main methods of operation: profiling and tracing
- Profile mode
  - Reports aggregate time spent in each function for each node/thread
  - Several profile recording options
    - Report min/max/std. dev. of times using the -TRACESTATS configure option
    - Attempt to compensate for profiling overhead (-COMPENSATE)
    - Record memory stats while profiling (-PROFILEMEMORY, -PROFILEHEADROOM)
    - Stop profiling after a certain function depth (-DEPTHLIMIT)
    - Record call trees in profile (-PROFILECALLPATH)
    - Record phase of program in profiles (-PROFILEPHASE, requires manual instrumentation of phases)
  - If instrumented code uses the TAU_INIT macros, arguments can also be passed to the compiled, instrumented program to restrict what is recorded at runtime
    - --profile main+func2
  - Metrics that can be recorded: wall clock time (via gettimeofday or several hardware-specific timers) or hardware counter metrics (via PAPI or PCL)
  - Data visualized using pprof (text-based) or paraprof (Java-based GUI)
  - Profile data can be exported to KOJAK’s cube viewer
  - Profile data can be imported from Vampir VTF traces

The Many Faces of TAU (2)

- Trace mode
  - Records timestamps for function entry/exit points
    - Or arbitrary code section points via manual instrumentation
  - Also records messages sent/received for MPI programs
  - No trace visualizer, but can export to
    - ALOG: Upshot/nupshot
    - Paraver’s trace format
    - SLOG-2: Jumpshot
    - VTF: Vampir/Intel Trace Analyzer 5
    - SDDF: Format used by Pablo/SvPablo
    - EPILOG: KOJAK’s trace format
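The difference between the two modes can be illustrated with a small Python sketch. It is not TAU's implementation: the decorator name instrumented and the data structures are assumptions made for illustration. The trace keeps one timestamped record per function entry and exit, while the profile keeps only aggregate inclusive time and call counts per function.

```python
import time
from collections import defaultdict

trace_events = []                    # trace mode: one record per entry/exit
profile_totals = defaultdict(float)  # profile mode: aggregate inclusive seconds
profile_calls = defaultdict(int)     # profile mode: number of calls

def instrumented(func):
    """Wrap func so each call updates both the trace and the profile."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        trace_events.append((start, "enter", func.__name__))
        try:
            return func(*args, **kwargs)
        finally:
            end = time.perf_counter()
            trace_events.append((end, "exit", func.__name__))
            profile_totals[func.__name__] += end - start
            profile_calls[func.__name__] += 1
    return wrapper

@instrumented
def work(n):
    return sum(range(n))

@instrumented
def main():
    return [work(1000) for _ in range(3)]

main()
print(len(trace_events), "trace events;", dict(profile_calls))
```

The trace grows with every call, whereas the profile stays the same size no matter how often a function runs. This is one reason trace overhead is dominated by trace file size in the measurements later in this report.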

TAU Instrumentation: Profile Mode

- Source-level instrumentation
  - tau_instrumentor (which requires PDToolkit) is used to produce instrumented source code for C, C++, and Fortran files
    - For OpenMP code, TAU can use OPARI (from KOJAK)
  - Users may insert instrumentation using TAU’s simple API (TAU_PROFILE_START, TAU_PROFILE_STOP)
  - When compiling, must use stub Makefiles which define compilation macros like CFLAGS, LDFLAGS, etc.
    - This can complicate the compile & link cycle greatly, especially if fully automatic source instrumentation is desired
  - Selective instrumentation is supported through a flag to tau_instrumentor
    - Give a file listing which functions to include in or exclude from instrumentation
    - Can use tau_reduce in conjunction with existing profiles to exclude functions matching certain criteria, like numcalls > 10000 & usecs/call < 2
- Binary-level instrumentation
  - Based on DynInst, considered “experimental” according to documentation
  - Use tau_run wrapper script with instrumentation file in same format as selective instrumentation file
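The example rule quoted above (numcalls > 10000 & usecs/call < 2) can be read as a filter over an existing profile. The Python sketch below applies that rule to made-up per-function data. The function names, the numbers, and the helper should_exclude are hypothetical, and the sketch does not reproduce tau_reduce's actual rule syntax or file formats.

```python
# Hypothetical profile data: name -> (number of calls, microseconds attributed).
profile = {
    "main": (1, 5_000_000),
    "tiny_helper": (250_000, 300_000),  # 1.2 usecs/call, called very often
    "solve": (40, 4_000_000),
    "get_index": (20_000, 10_000),      # 0.5 usecs/call
}

def should_exclude(numcalls, usecs, min_calls=10000, max_usecs_per_call=2):
    """Apply the rule: numcalls > 10000 & usecs/call < 2."""
    return numcalls > min_calls and usecs / numcalls < max_usecs_per_call

# Functions that are called often but do little work per call are the ones
# whose instrumentation cost dominates, so they go on the exclude list.
exclude = sorted(name for name, (calls, usecs) in profile.items()
                 if should_exclude(calls, usecs))
print(exclude)
```

The resulting list is the kind of input a selective instrumentation file would carry on a second, lower-overhead run.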

TAU Instrumentation: Trace Mode

- Source-level instrumentation
  - Same procedure as in profile mode
- Binary instrumentation
  - Can link against MPI wrapper library (only re-linking necessary)
  - Runtime instrumentation for trace mode is not supported using DynInst

Source Instrumentation Process

[Figure: flow diagram. Source C file -> cparse -> pdb file -> tau_instrumentor -> instrumented C file -> native C compiler -> executable -> execute -> profiling/tracing files 1..N -> GUI (paraprof)]

Instrumentation Test Suite: Problems

- Problem with using selective instrumentation + MPI wrapper library + PAPI metrics
  - Only instrumenting main in CAMEL caused several floating point instructions to be attributed to MPI_Send and MPI_Recv instead of main
- For timing measurements and overhead measurements, used wallclock time with the low-overhead -LINUXTIMERS option
- Some code had to be modified before feeding it through PDToolkit’s cparse
  - cparse uses the Edison Design Group’s parser, which is stricter about some things than other compilers
  - ANSI C/standard Fortran code poses no problems, though
- NAS NPB LU benchmark (NPBv3.1-MPI) would not run with TAU libraries
  - Segfaults (“signal 11s”) when using either LAM or MPICH with only MPI wrapper libraries (profiling & tracing)
  - Modified, updated version (3.2) of LU comes with TAU
    - Had problems compiling and running this
  - Gave TAU the benefit of the doubt for the rest of the evaluations
    - Guessed at what TAU profile would tell us had it been working with LU for bottleneck tests
    - LU timing overheads omitted from overhead measurements

Instrumentation Overhead: Notes

- Performed automatic instrumentation of CAMEL using tau_instrumentor
  - Like KOJAK, program execution time was several orders of magnitude slower
  - This is likely due to the use of very small functions which normally get inlined by the compiler
  - For profile measurements on the following slides, only main was instrumented
    - Under this scenario, profiling and tracing overhead was almost nonexistent (<1%)
- Instrumentation points chosen for overhead measurements
  - Profiling
    - CAMEL: all MPI calls, main enter + exit
    - PPerfMark suite: all MPI calls, all function calls
    - Used -PROFILECALLPATH configuration option
      - Using other profile flavors (without call paths, with extra stats) made a negligible difference on overall profile overhead
  - Tracing
    - CAMEL: all MPI calls
    - PPerfMark suite: all MPI calls
  - Similar to what we have done for other tools
- Benchmarks marked with * had high variability in runtimes

Instrumentation Overhead: Notes (2)

- Used LAM for all measurements
  - Some benchmarks with high overhead (small-messages, wrong-way, ping-pong) had smaller tracing overhead using MPICH (figures below are MPICH vs. LAM)
    - Small messages: 54.2% vs. 483.316%
    - Wrong way: 24.5% vs. 28.573%
    - Ping-pong: 51.5% vs. 56.259%
  - Probably due to LAM running faster (especially on small-messages) and execution time being limited by I/O time for writing the trace file
    - Same I/O time, smaller execution time -> higher % overhead
- In general, overhead for profiling and tracing was extremely low except for a few cases
  - High profile overhead for programs with small functions that get called a lot
    - small-messages, wrong-way, ping-pong, CAMEL with everything instrumented
  - High trace overhead for programs with large traces generated very quickly
    - small-messages, wrong-way, ping-pong
- tau_reduce provides a nice way to help reduce instrumentation overhead, although an initial profile must first be gathered
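The "same I/O time, smaller execution time" argument from the LAM vs. MPICH comparison can be checked with a little arithmetic. The Python sketch below assumes overhead is reported as the extra runtime relative to the uninstrumented run, which is an assumption about how the percentages were computed. All runtimes and the trace I/O cost are made-up numbers chosen only to show the effect.

```python
def overhead_percent(uninstrumented_s, instrumented_s):
    """Overhead as extra runtime relative to the uninstrumented run, in percent."""
    return 100.0 * (instrumented_s - uninstrumented_s) / uninstrumented_s

trace_io_s = 5.0   # hypothetical fixed cost of writing the trace file

# Hypothetical uninstrumented runtimes of one benchmark under two MPI libraries.
fast_mpi_s = 1.0   # the faster MPI implementation
slow_mpi_s = 10.0  # the slower MPI implementation

fast_overhead = overhead_percent(fast_mpi_s, fast_mpi_s + trace_io_s)
slow_overhead = overhead_percent(slow_mpi_s, slow_mpi_s + trace_io_s)
print(f"faster MPI: {fast_overhead:.1f}%  slower MPI: {slow_overhead:.1f}%")
```

With the same absolute tracing cost, the faster baseline shows ten times the relative overhead. A high percentage under one MPI implementation therefore does not by itself mean the tool costs more there.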

Instrumentation Overhead: Profiles

TAU overhead, profiling (bar chart; axis: overhead, instrumented / uninstrumented)

Benchmark                 Overhead
CAMEL                       0.003%
PP: Big message             0.065%
PP: Diffuse procedure       5.264%
PP: Hot procedure           5.270%
PP: Intensive server        0.059%
PP: Ping pong              17.588%
PP: Random barrier          0.013%
PP: Small messages*       131.274%
PP: System time             0.069%
PP: Wrong way              29.380%

Instrumentation Overhead: Traces

TAU overhead, tracing (bar chart; axis: overhead, instrumented / uninstrumented)

Benchmark                 Overhead
CAMEL                       0.064%
PP: Big message             0.043%
PP: Diffuse procedure       0.000%
PP: Hot procedure           0.000%
PP: Intensive server        0.043%
PP: Ping pong              56.259%
PP: Random barrier          0.000%
PP: Small messages*       483.316%
PP: System time             0.000%
PP: Wrong way              28.573%

Visualizations: pprof

- Gives text-based dump of profile files, similar to gprof/prof output
- Example (partial) output:

    …
    USER EVENTS Profile :NODE 7, CONTEXT 0, THREAD 0
    ---------------------------------------------------------------------------------------
    NumSamples   MaxValue   MinValue  MeanValue  Std. Dev.  Event Name
    ---------------------------------------------------------------------------------------
             4       2016          4        516      866.2  Message size received from all nodes
            86         28         28         28          0  Message size sent to all nodes
    ---------------------------------------------------------------------------------------

    FUNCTION SUMMARY (total):
    ---------------------------------------------------------------------------------------
    %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
                  msec   total msec                          usec/call
    ---------------------------------------------------------------------------------------
    100.0     5:44.295     5:45.407           8        1440   43175935 main()
      0.2          541          553           8         232      69161 MPI_Init()
      0.2          541          553           8         232      69161 main() => MPI_Init()
      0.2          543          543         704           0        772 MPI_Recv()
      0.2          543          543         704           0        772 main() => MPI_Recv()
      0.0            9            9           8           0       1136 MPI_Barrier()

…
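The columns of the function summary are related by simple arithmetic: inclusive usec/call is approximately the inclusive total time divided by the number of calls. The Python sketch below recomputes that column for three rows of the example output. It assumes that a field such as 5:45.407 means minutes:seconds.milliseconds, and the helper to_msec is hypothetical. Because the displayed millisecond fields are rounded, the recomputed values only approximate the reported ones.

```python
def to_msec(field):
    """Convert a pprof time field ('553' or 'm:ss.mmm') to milliseconds."""
    if ":" in field:
        minutes, seconds = field.split(":")
        return round((int(minutes) * 60 + float(seconds)) * 1000)
    return int(field)

# (name, inclusive total msec field, #Call, reported inclusive usec/call),
# copied from the example output above.
rows = [
    ("main()", "5:45.407", 8, 43175935),
    ("MPI_Init()", "553", 8, 69161),
    ("MPI_Recv()", "543", 704, 772),
]

for name, inclusive, calls, reported in rows:
    usec_per_call = to_msec(inclusive) * 1000 / calls
    print(f"{name:12s} computed {usec_per_call:12.0f}  reported {reported:10d}")
```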

Visualizations: paraprof

- paraprof provides visual representations of same data given by pprof
  - Used to be a Tcl/Tk application known as “racy”
  - Racy has been deprecated, but is still included with TAU for historical reasons
- Java application with three main views
  - Main profile view
  - Histograms (next slides)
  - Three-dimensional visualization (next slides)
- Main profile view (right top)
  - “Function ledger” maps colors to function names (right, bottom left)
  - Overall time for each function displayed as a stacked bar chart
  - Can click on each function to get detailed information (right, bottom right)
- No line-level source code correlation
  - Can infer this information indirectly if call paths are used

[Figure: screenshots of the main profile view, function ledger, and function details view]

Visualizations: paraprof (2)

- paraprof can also show histogram views for each function of the main profile view (right)
  - Simply shows a histogram of aggregate time for a function across all threads
  - Histogram to right shows that most threads spent around 75.8 seconds (midpoint between min and max) in MPI_Barrier

Visualizations: paraprof (3)

- paraprof can also display three-dimensional views of profile data
  - Bar and triangle mesh axes
    - Time spent in each function (height)
    - Which function (width)
    - Which node (depth)
  - Scatter plot lets you pick axes
  - Plots support transparency, rotation, and highlighting a particular function or node
- Surprisingly responsive for a Java application!

Bottleneck Identification Test Suite

- Testing metric: what did pprof/paraprof tell us from wallclock time profiles?
  - Since there is no built-in trace visualizer, we ignored what could be done with other trace tools
- Program correctness not affected by instrumentation
  - Except for our version of LU
- CAMEL: PASSED
  - Showed work evenly distributed among nodes
  - When full tracing used, can easily show which functions take the most wall clock time
- LU: FAILED
  - Could not run, got segfaults using MPICH or LAM
  - Even if it worked, it would be very difficult/impossible to garner communication patterns from profile views
- Big messages: PASSED
  - Profile showed most of application time dominated by MPI calls to send and receive
- Diffuse procedure: TOSS-UP
  - Profile showed most time taken by MPI_Barrier calls
  - However, profile also showed the bottleneck procedure (which is dispersed across all nodes) taking up a negligible amount of overall time
  - Really need a trace view to see diffuse behavior of program
- Hot procedure: PASSED
  - Profile clearly shows that one function is responsible for most execution time

Bottleneck Identification Test Suite (2)

- Intensive server: PASSED
  - Profile showed most time spent in MPI_Recv for all nodes except first node
  - Profile also illustrated most time for first node spent in waste_time
- Ping-pong: PASSED
  - Easy to see from profile that most time is being spent in MPI_Send and MPI_Recv
  - pprof and paraprof also showed a large number of MPI calls
- Random barrier: TOSS-UP
  - Profile showed most time being spent in MPI_Barrier
  - However, random nature of barrier not shown by profile
  - Trace view is necessary to see random barrier behavior
- Small messages: PASSED
  - Profile showed one process spending most time in MPI_Send and the other process in MPI_Recv
- System time: FAILED
  - No built-in way to separate wall clock time into system time vs. user time
  - PAPI metrics can’t record system time vs. user time either
- Wrong way: FAILED
  - Impossible to see communication behavior without a trace

TAU General Comments

- Good things
  - Supports profiling & tracing
  - Very portable
  - Wide range of software support
    - Several programming models & libraries supported
  - Visualization tools seem very stable
  - Good support for exporting data to other tools
- Things that could use improvement
  - Dependence on other software for basic functionality (instrumentation via PDToolkit or DynInst) makes installation difficult
  - Source code correlation could be better
    - Only at the function or function call level (with call paths)
  - Export is nice, but lots of things are easier to do directly in other tools
    - For example, mpicc -mpilog to get a trace for Jumpshot instead of cparse, tau_instrumentor, wrapper Makefiles, …
    - TAU does add automatic instrumentation for profiling functions, which is an added benefit
  - Three-dimensional visualizations are nice, but the “Cube” viewer from KOJAK is easier to use and displays data in a very concise manner
    - Text is also hard to read on three-dimensional views for function names
  - Some interoperability features (export to SLOG-2 and ALOG) do not work well in the version we tested
- TAU could potentially serve as a base for our UPC and SHMEM performance tool

TAU: Adding UPC & SHMEM

- SHMEM
  - Not much extra work needed
  - Have already created weak binding patches for GPSHMEM & created a wrapper library that calls the appropriate TAU functions
- UPC
  - If we have source code instrumentation, then just put in TAU* instrumentation calls in the appropriate places
  - If we do binary instrumentation, we’ll probably have to make major modifications to DynInst
  - In any case, once the UPC instrumentation problem is solved, adding support for UPC into TAU will not be too hard
    - However, how to instrument UPC programs while retaining low overhead?
    - Also, how to extend TAU to support more advanced analyses?
- Support for profiles and traces a nice bonus

Evaluation (1)

- Available metrics: 4/5
  - Supports recording execution time (broken down into call trees)
  - Supports several methods of gathering profile data
  - Supports all PAPI metrics for profiles
- Cost: 5/5
  - Free!
- Documentation quality: 3.5/5
  - User’s manual very good, but out of date
  - For example, three-dimensional visualizations not covered in manual
- Extensibility: 4/5
  - Open source, uses documented APIs
  - Can add support for new languages using source instrumentation
- Filtering and aggregation: 2.5/5
  - Filtering & aggregation available through profile view
  - No advanced filter or custom aggregation methods built in for traces

Evaluation (2)

- Hardware support: 5/5
  - Many platforms supported: 64-bit Linux (Opteron, Itanium, Alpha, SPARC); IBM SP2 (AIX); IBM BlueGene/L; AlphaServer (Tru64); SPARC-based clusters (Solaris); SGI (IRIX 6.x) systems, including Indy, Power Challenge, Onyx, Onyx2, Origin 200, 2000, 3000 series; NEC SX-5; Cray X1, T3E; Apple OS X; HP RISC systems (HP-UX)
- Heterogeneity support: 0/5 (not supported)
- Installation: 2.5/5
  - As simple as ./configure with options, then make install
  - However, dependence on other software for source or binary instrumentation makes installation time-consuming
- Interoperability: 5/5
  - Profile files use simple ASCII format; trace files use documented binary format
  - Can export to VAMPIR, Jumpshot/upshot (ALOG & SLOG-2), CUBE, SDDF, Paraver
- Learning curve: 2.5/5
  - Learning how to use the different Makefile wrappers and command-line programs takes a while
  - After a short period, instrumentation & tool usage relatively easy

Evaluation (3)

- Manual overhead: 4/5
  - Automatic instrumentation of MPI calls on all platforms
  - Automatic instrumentation of all functions or a selected group of functions
  - Call path support gives almost the same information as instrumenting call sites
  - MPI and OpenMP instrumentation support
- Measurement accuracy: 5/5
  - CAMEL overhead < 1% for profiling and tracing when a few functions were instrumented
  - Overall, accuracy pretty good except for a few cases
- Multiple executions: 3/5
  - Can relate profile metrics between runs in paraprof
  - Can store performance data in DBMS (PerfDB)
    - Seems like PerfDB is in a preliminary state, though
- Multiple analyses & views: 4/5
  - Both profiling and tracing are supported (although no built-in trace viewer)
  - Profile view has stacked bar charts, “regular” views, three-dimensional views, and histograms

Evaluation (4)

- Performance bottleneck identification: 3.5/5
  - No automatic bottleneck identification
  - Profile viewer helpful for identifying methods that take most time
  - Lack of built-in trace viewer makes identification of some bottlenecks impossible, but trace export means TAU could be combined with several other viewers to cover just about anything
- Profiling/tracing support: 4/5
  - Tracing & profiling supported
  - Default trace file format size reasonable but not most compact
- Response time: 3/5
  - Loading profiles after run almost instantaneous using paraprof viewer
  - Exporting traces to other tools time-consuming (have to run tau_merge, tau_convert, etc.; a few extra disk I/Os)
- Software support: 5/5
  - Supports OpenMP, MPI, and several other programming models
  - A wide range of compilers are supported
  - Can support linking against any library, but does not instrument library functions
- Source code correlation: 2/5
  - Supported down to the function and function call site level (when collecting call paths is enabled)
- Searching: 0/5 (not supported)

Evaluation (5)

- System stability: 3/5
  - Software is generally stable
  - Bugs encountered:
    - Segfaults on instrumented version of our LU code
    - SLOG-2 export seems to give Jumpshot-4 some trouble (several “unsupported event” messages on a few exported traces)
    - Exporting to ALOG format puts stray “: %d” lines in ALOG file
- Technical support: 5/5
  - Good response from our contact (Sameer), most emails answered within 48 hours with useful information
