TAU Evaluation Report
Adam Leko, Hung-Hsun Su
UPC Group
HCS Research Laboratory
University of Florida
Color encoding key:
Blue: Information
Red: Negative note
Green: Positive note
2
Basic Information
Name: Tuning and Analysis Utilities (TAU)
Developer: University of Oregon
Current version:
  TAU 2.14.4
  Program Database Toolkit 3.3.1
Website: http://www.cs.uoregon.edu/research/paracomp/tau/tautools/
Contact: Sameer Shende: [email protected]
3
TAU Overview
Performance tool suite that offers profiling and tracing of programs
Available instrumentation methods: source (manual), source (automatic), binary (DynInst)
Supported languages: C, C++, Fortran, Python, Java, SHMEM (TurboSHMEM and Cray SHMEM), OpenMP, MPI, Charm
Hardware counter support
Relies on existing toolkits and libraries for some functionality
  PDToolkit and OPARI for automatic source instrumentation
  DynInst for runtime binary instrumentation
  PCL and PAPI for hardware counter information
  libvtf3, slog2sdk, and EPILOG for exporting trace files
4
TAU Architecture
5
Configuring & Installing TAU
TAU relies on several existing toolkits for efficient usage, but some of these toolkits are time-consuming to install
  PDToolkit, PAPI, etc.
Users must choose between modes at compile time using the ./configure script
  Profiling via -PROFILE, tracing via -TRACE
TAU must also be told the location of supported languages and compilers
  -mpilib=/path/to/mpi/lib -dyninst=/path/to/dyninst -pdt=/path/to/pdt
  Other supported languages/libraries are handled in a similar manner
This results in a very flexible installation process
  Users can easily install different configurations of TAU in their home directory
However, several configuration options are mutually exclusive, such as
  Profiling and tracing
  Using PAPI counters vs. gettimeofday or TSC counters
  Profiling with call paths vs. profiling with extra statistics
Unfortunately, these mutually exclusive options prove annoying
  It would be nice if TAU supported (for instance) tracing and profiling without compiling & installing twice!
  Luckily, the software compiles quickly on modern machines, so this is not fatal
  However, TAU relies on several environment variables, which makes switching between installations cumbersome
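Because profiling and tracing cannot be enabled in one build, a site that wants both ends up running ./configure once per mode. The sketch below only assembles the argument lists for those runs; it does not invoke anything. The flags (-PROFILE, -TRACE, -mpilib, -pdt) are the ones named on this slide, the paths are the slide's placeholders, and the helper names `configure_command` and `plan_builds` are hypothetical.

```python
from typing import Dict, List

# Mode flags named in the report; each build gets exactly one of them.
MODE_FLAGS = {"profile": "-PROFILE", "trace": "-TRACE"}


def configure_command(mode: str, locations: Dict[str, str]) -> List[str]:
    """Build the argument list for one TAU ./configure run.

    `locations` maps an option name from the report (mpilib, dyninst, pdt)
    to a path. The paths used below are placeholders, not real defaults.
    """
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_FLAGS)}")
    args = ["./configure", MODE_FLAGS[mode]]
    for name in sorted(locations):
        args.append(f"-{name}={locations[name]}")
    return args


def plan_builds(modes: List[str], locations: Dict[str, str]) -> List[List[str]]:
    """Return one configure command per requested mode (one build each)."""
    return [configure_command(mode, locations) for mode in modes]


if __name__ == "__main__":
    paths = {"mpilib": "/path/to/mpi/lib", "pdt": "/path/to/pdt"}
    for command in plan_builds(["profile", "trace"], paths):
        print(" ".join(command))
```

Each printed line corresponds to a separate configure, compile, and install cycle, which is the duplication the slide complains about.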
6
The Many Faces of TAU
Two main methods of operation: profiling and tracing
Profile mode
  Reports aggregate time spent in each function for each node/thread
  Several profile recording options
    Report min/max/std. dev. of times using the -PROFILESTATS configure option
    Attempt to compensate for profiling overhead (-COMPENSATE)
    Record memory stats while profiling (-PROFILEMEMORY, -PROFILEHEADROOM)
    Stop profiling after a certain function depth (-DEPTHLIMIT)
    Record call trees in profile (-PROFILECALLPATH)
    Record phase of program in profiles (-PROFILEPHASE, requires manual instrumentation of phases)
  If instrumented code uses the TAU_INIT macros, can also pass arguments to the compiled, instrumented program to restrict what is recorded at runtime
    --profile main+func2
  Metrics that can be recorded: wall clock time (via gettimeofday or several hardware-specific timers) or hardware counter metrics (via PAPI or PCL)
  Data visualized using pprof (text-based) or paraprof (Java-based GUI)
  Profile data can be exported to KOJAK’s cube viewer
  Profile data can be imported from Vampir VTF traces
7
The Many Faces of TAU (2)
Trace mode
  Records timestamps for function entry/exit points
    Or arbitrary code section points via manual instrumentation
  Also records messages sent/received for MPI programs
  No trace visualizer, but can export to
    ALOG: Upshot/nupshot
    Paraver’s trace format
    SLOG-2: Jumpshot
    VTF: Vampir/Intel Trace Analyzer 5
    SDDF: Format used by Pablo/SvPablo
    EPILOG: KOJAK’s trace format
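The difference between the two modes can be made concrete: a trace keeps every timestamped entry/exit event, while a profile keeps only per-function aggregates. The sketch below is an illustration of that relationship, not TAU's implementation. It folds a single-thread event list into call counts plus inclusive and exclusive times. The event tuple layout and the sample trace are made up for the example, and recursive calls would be double-counted in the inclusive total.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (timestamp in seconds, "enter" or "exit", function name); layout is hypothetical.
Event = Tuple[float, str, str]


def profile_from_trace(events: List[Event]) -> Dict[str, Dict[str, float]]:
    """Aggregate one thread's entry/exit events into per-function totals."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"calls": 0, "inclusive": 0.0, "exclusive": 0.0}
    )
    stack: List[list] = []  # each frame: [name, entry time, time spent in callees]
    for timestamp, kind, name in events:
        if kind == "enter":
            stack.append([name, timestamp, 0.0])
        elif kind == "exit":
            if not stack:
                raise ValueError(f"exit of {name} without a matching enter")
            frame_name, start, child_time = stack.pop()
            if frame_name != name:
                raise ValueError(f"exit of {name} does not match enter of {frame_name}")
            elapsed = timestamp - start
            entry = totals[name]
            entry["calls"] += 1
            entry["inclusive"] += elapsed
            entry["exclusive"] += elapsed - child_time
            if stack:
                stack[-1][2] += elapsed  # charge this call to the caller's callee time
        else:
            raise ValueError(f"unknown event kind {kind!r}")
    if stack:
        raise ValueError("trace ended with functions still open")
    return dict(totals)


if __name__ == "__main__":
    trace = [
        (0.0, "enter", "main"),
        (1.0, "enter", "MPI_Send"),
        (3.0, "exit", "MPI_Send"),
        (4.0, "enter", "MPI_Recv"),
        (4.5, "exit", "MPI_Recv"),
        (6.0, "exit", "main"),
    ]
    for name, stats in profile_from_trace(trace).items():
        print(name, stats)
```

The ordering of events is lost in the aggregated result, which is why several of the bottleneck tests later in the report need a trace view rather than a profile.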
8
TAU Instrumentation: Profile Mode
Source-level instrumentation
  tau_instrumentor (which requires PDToolkit) is used to produce instrumented source code for C, C++, and Fortran files
  For OpenMP code, TAU can use OPARI (from KOJAK)
  Users may insert instrumentation manually using TAU’s simple API (TAU_PROFILE_START, TAU_PROFILE_STOP)
  When compiling, must use stub Makefiles which define compilation macros like CFLAGS, LDFLAGS, etc.
    This can complicate the compile & link cycle greatly, especially if fully automatic source instrumentation is desired
  Selective instrumentation is supported through a flag to tau_instrumentor
    Give a file listing which functions to include or exclude from instrumentation
    Can use tau_reduce in conjunction with existing profiles to exclude functions matching certain criteria, like numcalls > 10000 & usecs/call < 2
Binary-level instrumentation
  Based on DynInst, considered “experimental” according to documentation
  Use tau_run wrapper script with instrumentation file in same format as selective instrumentation file
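The tau_reduce rule quoted above (numcalls > 10000 & usecs/call < 2) amounts to a filter over an existing profile. The sketch below applies that rule to hand-written records and prints an exclude list. It does not call tau_reduce or read real profile files. The record type, the sample numbers, and the function names are hypothetical. The BEGIN_EXCLUDE_LIST/END_EXCLUDE_LIST layout is our understanding of TAU's selective instrumentation file and should be checked against the TAU documentation. Real entries may need full function signatures rather than bare names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FunctionProfile:
    name: str
    numcalls: int
    usecs_per_call: float  # mean microseconds per call


def functions_to_exclude(
    profile: Iterable[FunctionProfile],
    min_calls: int = 10000,
    max_usecs_per_call: float = 2.0,
) -> List[str]:
    """Apply the rule `numcalls > min_calls & usecs/call < max_usecs_per_call`."""
    return [
        f.name
        for f in profile
        if f.numcalls > min_calls and f.usecs_per_call < max_usecs_per_call
    ]


def exclude_list_text(names: Iterable[str]) -> str:
    """Format names as an exclude list (file layout is an assumption)."""
    lines = ["BEGIN_EXCLUDE_LIST", *names, "END_EXCLUDE_LIST"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical numbers: one tiny, frequently called function and two others.
    sample = [
        FunctionProfile("vec_add", 2_000_000, 0.4),
        FunctionProfile("main", 1, 5_000_000.0),
        FunctionProfile("MPI_Send", 704, 772.0),
    ]
    print(exclude_list_text(functions_to_exclude(sample)), end="")
```

Only functions that are both called very often and very short are dropped, which targets the small, normally inlined functions that dominate instrumentation overhead.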
9
TAU Instrumentation: Trace Mode
Source-level instrumentation
  Same procedure as in profile mode
Binary instrumentation
  Can link against MPI wrapper library (only re-linking necessary)
  Runtime instrumentation for trace mode is not supported using DynInst
10
Source Instrumentation Process
[Flow diagram] Source C file -> cparse -> pdb file -> tau_instrumentor -> Instrumented C file -> native C compiler -> Executable -> execute -> Profiling/tracing files 1 ... N -> paraprof (GUI)
11
Instrumentation Test Suite: Problems
Problem with using selective instrumentation + MPI wrapper library + PAPI metrics
  Only instrumenting main in CAMEL caused several floating point instructions to be attributed to MPI_Send and MPI_Recv instead of main
  For timing measurements and overhead measurements, used wallclock time with the low-overhead -LINUXTIMERS option
Some code had to be modified before feeding it through PDToolkit’s cparse
  cparse uses the Edison Design Group’s parser, which is stricter about some things than other compilers
  ANSI C/standard Fortran code poses no problems, though
NAS NPB LU benchmark (NPBv3.1-MPI) would not run with TAU libraries
  Segfaults, “signal 11s” when using either LAM or MPICH with only MPI wrapper libraries (profiling & tracing)
  Modified, updated version (3.2) of LU comes with TAU
    Had problems compiling and running this
  Gave TAU the benefit of the doubt for the rest of the evaluations
    Guessed at what TAU profile would tell us had it been working with LU for bottleneck tests
    LU timing overheads omitted from overhead measurements
12
Instrumentation Overhead: Notes
Performed automatic instrumentation of CAMEL using tau_instrumentor
  As with KOJAK, program execution time was several orders of magnitude slower
  This is likely due to the use of very small functions which normally get inlined by the compiler
  For profile measurements on the following slides, only main was instrumented
    Under this scenario, profiling and tracing overhead was almost nonexistent (<1%)
Instrumentation points chosen for overhead measurements
  Profiling
    CAMEL: all MPI calls, main enter + exit
    PPerfMark suite: all MPI calls, all function calls
    Used -PROFILECALLPATH configuration option
      Using other profile flavors (without call paths, with extra stats) made a negligible difference on overall profile overhead
  Tracing
    CAMEL: all MPI calls
    PPerfMark suite: all MPI calls
  Similar to what we have done for other tools
Benchmarks marked with * had high variability in runtimes
13
Instrumentation Overhead: Notes (2)
Used LAM for all measurements
  Some benchmarks with high tracing overhead (small-messages, wrong-way, ping-pong) had smaller overhead using MPICH (figures below are MPICH vs. LAM)
    Small messages: 54.2% vs. 483.316%
    Wrong way: 24.5% vs. 28.573%
    Ping-pong: 51.5% vs. 56.259%
  Probably due to LAM running faster (especially on small-messages) and execution time being limited by I/O time for writing trace file
    Same I/O time, smaller execution time -> higher % overhead
In general, overhead for profiling and tracing extremely low except for a few cases
  High profile overhead for programs with small functions that get called a lot
    small-messages, wrong-way, ping-pong, CAMEL with everything instrumented
  High trace overhead for programs with large traces generated very quickly
    small-messages, wrong-way, ping-pong
tau_reduce provides a nice way to help reduce instrumentation overhead, although an initial profile must first be gathered
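The "same I/O time, smaller execution time -> higher % overhead" argument is simple arithmetic. The sketch below assumes overhead is computed as (instrumented - uninstrumented) / uninstrumented, which is our reading of the chart axis label and is not spelled out in the slides. The run times and the fixed trace-writing cost are hypothetical numbers chosen only to show the effect.

```python
def overhead_percent(uninstrumented_s: float, instrumented_s: float) -> float:
    """Extra run time of the instrumented program, as a percentage of the baseline."""
    if uninstrumented_s <= 0:
        raise ValueError("baseline run time must be positive")
    return (instrumented_s - uninstrumented_s) / uninstrumented_s * 100.0


def overhead_with_fixed_io(baseline_s: float, trace_io_s: float) -> float:
    """Overhead when tracing adds a fixed I/O cost regardless of baseline speed."""
    return overhead_percent(baseline_s, baseline_s + trace_io_s)


if __name__ == "__main__":
    trace_io_s = 5.0  # hypothetical time spent writing the trace file
    for label, baseline_s in [("slower MPI library", 10.0), ("faster MPI library", 2.0)]:
        pct = overhead_with_fixed_io(baseline_s, trace_io_s)
        print(f"{label}: baseline {baseline_s:.1f} s -> {pct:.1f}% overhead")
```

With the same 5-second I/O cost, the 10-second baseline shows 50% overhead and the 2-second baseline shows 250%, so a faster MPI implementation can look worse on this metric without the tool doing any more work.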
14
Instrumentation Overhead: Profiles
[Bar chart: TAU overhead, profiling; overhead (instrumented / uninstrumented) per benchmark]
Benchmark                 Overhead
CAMEL                       0.003%
PP: Big message             0.065%
PP: Diffuse procedure       5.264%
PP: Hot procedure           5.270%
PP: Intensive server        0.059%
PP: Ping pong              17.588%
PP: Random barrier          0.013%
PP: Small messages*       131.274%
PP: System time             0.069%
PP: Wrong way              29.380%
15
Instrumentation Overhead: Traces
[Bar chart: TAU overhead, tracing; overhead (instrumented / uninstrumented) per benchmark]
Benchmark                 Overhead
CAMEL                       0.064%
PP: Big message             0.043%
PP: Diffuse procedure       0.000%
PP: Hot procedure           0.000%
PP: Intensive server        0.043%
PP: Ping pong              56.259%
PP: Random barrier          0.000%
PP: Small messages*       483.316%
PP: System time             0.000%
PP: Wrong way              28.573%
16
Visualizations: pprof
Gives text-based dump of profile files, similar to gprof/prof output
Example (partial) output:
…
USER EVENTS Profile :NODE 7, CONTEXT 0, THREAD 0
---------------------------------------------------------------------------------------
NumSamples   MaxValue   MinValue  MeanValue  Std. Dev.  Event Name
---------------------------------------------------------------------------------------
         4       2016          4        516      866.2  Message size received from all nodes
        86         28         28         28          0  Message size sent to all nodes
---------------------------------------------------------------------------------------

FUNCTION SUMMARY (total):
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0     5:44.295     5:45.407           8        1440   43175935 main()
  0.2          541          553           8         232      69161 MPI_Init()
  0.2          541          553           8         232      69161 main() => MPI_Init()
  0.2          543          543         704           0        772 MPI_Recv()
  0.2          543          543         704           0        772 main() => MPI_Recv()
  0.0            9            9           8           0       1136 MPI_Barrier()
…
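Since pprof output is plain text, the FUNCTION SUMMARY rows can be post-processed with a few lines of scripting. The sketch below parses one data row into typed fields. It is written against the sample rows in this report only, not against a specification of pprof's format. Reading a value such as 5:44.295 as minutes:seconds is our interpretation. It is roughly consistent with the usec/call column, since 5:45.407 over 8 calls is about 43.18 s per call. The names `SummaryRow`, `to_msec`, and `parse_summary_row` are hypothetical.

```python
from typing import NamedTuple


class SummaryRow(NamedTuple):
    percent_time: float
    exclusive_msec: float
    inclusive_msec: float
    calls: int
    subroutines: int
    inclusive_usec_per_call: float
    name: str


def to_msec(field: str) -> float:
    """Convert a time field to milliseconds.

    Plain numbers are taken as msec; colon-separated values such as
    "5:44.295" are read as minutes:seconds (an assumption, see lead-in).
    """
    if ":" not in field:
        return float(field)
    seconds = 0.0
    for part in field.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds * 1000.0


def parse_summary_row(line: str) -> SummaryRow:
    """Split one FUNCTION SUMMARY data row into typed fields."""
    # Split at most 6 times so names with spaces ("main() => MPI_Init()") stay whole.
    fields = line.split(None, 6)
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
    pct, excl, incl, calls, subrs, per_call, name = fields
    return SummaryRow(
        float(pct), to_msec(excl), to_msec(incl),
        int(calls), int(subrs), float(per_call), name.strip(),
    )


if __name__ == "__main__":
    rows = [
        "100.0 5:44.295 5:45.407 8 1440 43175935 main()",
        "  0.2 541 553 8 232 69161 main() => MPI_Init()",
    ]
    for row in map(parse_summary_row, rows):
        print(row.name, row.inclusive_msec, row.calls)
```

Rows parsed this way can be sorted or filtered (for example by exclusive time) without opening paraprof.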
17
Visualizations: paraprof
paraprof provides visual representations of same data given by pprof
  Used to be a Tcl/Tk application known as “racy”
  Racy has been deprecated, but is still included with TAU for historical reasons
Java application with three main views
  Main profile view
  Histograms (next slides)
  Three-dimensional visualization (next slides)
Main profile view (right top)
  “Function ledger” maps colors to function names (right, bottom left)
  Overall time for each function displayed as a stacked bar chart
  Can click on each function to get detailed information (right, bottom right)
No line-level source code correlation
  Can infer this information indirectly if call paths are used
[Figures: Main profile view; Function ledger; Function details view]
18
Visualizations: paraprof (2)
paraprof can also show histogram views for each function of the main profile view (right)
  Simply shows a histogram of aggregate time for a function across all threads
  Histogram to right shows that most threads spent around 75.8 seconds (midpoint between min and max) in MPI_Barrier
19
Visualizations: paraprof (3)
paraprof can also show three-dimensional displays of profile data
  Bar and triangle mesh plots, with axes
    Time spent in each function (height)
    Which function (width)
    Which node (depth)
  Scatter plot lets you pick axes
  Plots support transparency, rotation, and highlighting a particular function or node
  Surprisingly responsive for a Java application!
20
Bottleneck Identification Test Suite
Testing metric: what did pprof/paraprof tell us from wallclock time profiles?
  Since there is no built-in trace visualizer, we ignored what could be done with other trace tools
Program correctness not affected by instrumentation
  Except for our version of LU
CAMEL: PASSED
  Showed work evenly distributed among nodes
  When full tracing used, can easily show which functions take the most wall clock time
LU: FAILED
  Could not run, got segfaults using MPICH or LAM
  Even if it worked, it would be very difficult/impossible to garner communication patterns from profile views
Big messages: PASSED
  Profile showed most of application time dominated by MPI calls to send and receive
Diffuse procedure: TOSS-UP
  Profile showed most time taken by MPI_Barrier calls
  However, profile also showed the bottleneck procedure (which is dispersed across all nodes) taking up a negligible amount of overall time
  Really need a trace view to see diffuse behavior of program
Hot procedure: PASSED
  Profile clearly shows that one function is responsible for most execution time
21
Bottleneck Identification Test Suite (2)
Intensive server: PASSED
  Profile showed most time spent in MPI_Recv for all nodes except first node
  Profile also illustrated most time for first node spent in waste_time
Ping-pong: PASSED
  Easy to see from profile that most time is being spent in MPI_Send and MPI_Recv
  pprof and paraprof also showed a large number of MPI calls
Random barrier: TOSS-UP
  Profile showed most time being spent in MPI_Barrier
  However, random nature of barrier not shown by profile
  Trace view is necessary to see random barrier behavior
Small messages: PASSED
  Profile showed one process spending most time in MPI_Send and the other process in MPI_Recv
System time: FAILED
  No built-in way to separate wall clock time into system time vs. user time
  PAPI metrics can’t record system time vs. user time either
Wrong way: FAILED
  Impossible to see communication behavior without a trace
22
TAU General Comments
Good things
  Supports profiling & tracing
  Very portable
  Wide range of software support
    Several programming models & libraries supported
  Visualization tools seem very stable
  Good support for exporting data to other tools
Things that could use improvement
  Dependence on other software for basic functionality (instrumentation via PDToolkit or DynInst) makes installation difficult
  Source code correlation could be better
    Only at the function or function call level (with call paths)
  Export is nice, but lots of things are easier to do directly in other tools
    For example, mpicc -mpilog to get a trace for Jumpshot instead of cparse, tau_instrumentor, wrapper Makefiles, …
    TAU does add automatic instrumentation for profiling functions, which is an added benefit
  Three-dimensional visualizations are nice, but “Cube” viewer from KOJAK is easier to use and displays data in a very concise manner
    Text is also hard to read on three-dimensional views for function names
  Some interoperability features (export to SLOG-2 and ALOG) do not work well in the version we tested
TAU could potentially serve as a base for our UPC and SHMEM performance tool
23
TAU: Adding UPC & SHMEM
SHMEM
  Not much extra work needed
  Have already created weak binding patches for GPSHMEM & created a wrapper library that calls the appropriate TAU functions
UPC
  If we have source code instrumentation, then just put in TAU* instrumentation calls in the appropriate places
  If we do binary instrumentation, we’ll probably have to make major modifications to DynInst
  In any case, once the UPC instrumentation problem is solved, adding support for UPC into TAU will not be too hard
  However, how to instrument UPC programs while retaining low overhead?
  Also, how to extend TAU to support more advanced analyses?
Support for profiles and traces a nice bonus
24
Evaluation (1)
Available metrics: 4/5
  Supports recording execution time (broken down into call trees)
  Supports several methods of gathering profile data
  Supports all PAPI metrics for profiles
Cost: 5/5
  Free!
Documentation quality: 3.5/5
  User’s manual very good, but out of date
    For example, three-dimensional visualizations not covered in manual
Extensibility: 4/5
  Open source, uses documented APIs
  Can add support for new languages using source instrumentation
Filtering and aggregation: 2.5/5
  Filtering & aggregation available through profile view
  No advanced filter or custom aggregation methods built in for traces
25
Evaluation (2)
Hardware support: 5/5
  Many platforms supported: 64-bit Linux (Opteron, Itanium, Alpha, SPARC); IBM SP2 (AIX); IBM BlueGene/L; AlphaServer (Tru64); SPARC-based clusters (Solaris); SGI (IRIX 6.x) systems, including Indy, Power Challenge, Onyx, Onyx2, Origin 200, 2000, 3000 series; NEC SX-5; Cray X1, T3E; Apple OS X; HP RISC systems (HP-UX)
Heterogeneity support: 0/5 (not supported)
Installation: 2.5/5
  As simple as ./configure with options, then make install
  However, dependence on other software for source or binary instrumentation makes installation time-consuming
Interoperability: 5/5
  Profile files use simple ASCII format; trace files use documented binary format
  Can export to VAMPIR, Jumpshot/upshot (ALOG & SLOG-2), CUBE, SDDF, Paraver
Learning curve: 2.5/5
  Learning how to use the different Makefile wrappers and command-line programs takes a while
  After a short period, instrumentation & tool usage relatively easy
26
Evaluation (3)
Manual overhead: 4/5
  Automatic instrumentation of MPI calls on all platforms
  Automatic instrumentation of all functions or a selected group of functions
  Call path support gives almost the same information as instrumenting call sites
  MPI and OpenMP instrumentation support
Measurement accuracy: 5/5
  CAMEL overhead < 1% for profiling and tracing when a few functions were instrumented
  Overall, accuracy pretty good except for a few cases
Multiple executions: 3/5
  Can relate profile metrics between runs in paraprof
  Can store performance data in DBMS (PerfDB)
    Seems like PerfDB is in a preliminary state, though
Multiple analyses & views: 4/5
  Both profiling and tracing are supported (although no built-in trace viewer)
  Profile view has stacked bar charts, “regular” views, three-dimensional views, and histograms
27
Evaluation (4)
Performance bottleneck identification: 3.5/5
  No automatic bottleneck identification
  Profile viewer helpful for identifying methods that take most time
  Lack of built-in trace viewer makes identification of some bottlenecks impossible, but trace export means TAU could be combined with several other viewers to cover just about anything
Profiling/tracing support: 4/5
  Tracing & profiling supported
  Default trace file format size reasonable but not most compact
Response time: 3/5
  Loading profiles after run almost instantaneous using paraprof viewer
  Exporting traces to other tools is time-consuming (have to run tau_merge, tau_convert, etc.; a few extra disk I/Os)
Software support: 5/5
  Supports OpenMP, MPI, and several other programming models
  A wide range of compilers are supported
  Can support linking against any library, but does not instrument library functions
Source code correlation: 2/5
  Supported down to the function and function call site level (when collecting call paths is enabled)
Searching: 0/5 (not supported)
28
Evaluation (5)
System stability: 3/5
  Software is generally stable
  Bugs encountered:
    Segfaults on instrumented version of our LU code
    SLOG-2 export seems to give Jumpshot-4 some trouble (several “unsupported event” messages on a few exported traces)
    Exporting to ALOG format puts stray “: %d” lines in ALOG file
Technical support: 5/5
  Good response from our contact (Sameer), most emails answered within 48 hours with useful information