Intel Trace Collector and Trace Analyzer Evaluation Report

Hans Sherburne, Adam Leko
UPC Group

HCS Research Laboratory
University of Florida

Color encoding key:

Blue: Information

Red: Negative note

Green: Positive note

Basic Information
- Name: Intel Trace Collector, Intel Trace Analyzer
- Developer: Intel
- Current versions:
  - Intel Trace Collector 5.0.1.0
  - Intel Trace Analyzer 4.0.3.1
- Website: http://www.intel.com/software/products/cluster
- Contact: http://premier.intel.com

Intel Cluster Tools Overview
- A toolkit for creating high-performance applications on Intel’s architectures (x86, IA64)
- Intel MPI Library
  - Intel’s implementation of MPI
- Intel Cluster Math Kernel Library
  - Contains several Intel-optimized math routines
  - Also has a version of ScaLAPACK
- Intel Trace Collector & Trace Analyzer
  - Represent the performance analysis portion of Intel Cluster Tools
  - The two are used in conjunction to analyze performance of parallel applications (mostly MPI):
    - Trace Collector: Provides a method for instrumenting programs and recording performance data
    - Trace Analyzer: Provides graphical representation of trace data from an STF trace file
  - Formerly known as Vampirtrace & Vampir

Trace Collector Overview
- What can be traced:
  - MPI applications can be traced automatically by linking against the profiling library
    - Records of MPI routine calls
    - Data describing communication (point-to-point and collective)
    - Hardware counter data if available
    - Statistics: function calls, sent messages, collective operations (count, duration, bytes)
  - User-level code can be traced through manual instrumentation using the ITC API
    - User defined states
    - User defined counters
  - Non-MPI (distributed) applications can be traced
    - Use same API calls as instrumenting user code in MPI apps
  - Binary instrumentation without recompilation is possible
    - Use itcinstrument tool
    - Must use MPI or must explicitly initialize/finalize Trace Collector
  - Java programs

Trace Collector Libraries
- ITC offers four different libraries for creating trace files. Each offers different operation characteristics
- libVT
  - Contains wrapper functions for automatic logging of MPI calls
  - Offers extended functionality through an API for logging of user defined data
- libVTnull
  - Contains dummy versions of API calls
- libVTfs
  - Same functionality as libVT
  - Trace file writing is done via TCP sockets
  - In case of failure, trace data is not lost
- libVTcs
  - Similar to libVTfs in that it uses TCP sockets to write tracefiles
  - Does not automatically log MPI calls
  - Requires that a process be explicitly designated as server for trace file creation coordination

Structured Trace File Format (STF)
- Structured Trace File format is the default format for traces
- Data is divided into logical frames, which helps to partition data for large-scale programs with large traces (possibly GBs)
  - Time axis
  - Location axis
  - Type of data (state, collective operations, point-to-point messages, counter values, MPI-IO)
- Indexing allows for quick random access
- Uses multiple files
  - File division does not necessarily reflect frame division
  - Allows for parallelism in reading and writing
  - Documentation does not detail the inner workings
- Can be converted to single-file STF for ease of file handling and transmission
- No documentation provided on how to actually construct STF trace files without using Trace Collector

STF Utilities
- STF files can be manipulated using stftool and xstftool:
  - Extract various data
  - Manipulate frames and groupings
  - Convert STF files into AVT or XVT
- AVT
  - Format used by previous versions of Vampir
  - Should be understood by Trace Analyzer
  - Created by other existing tools
- XVT
  - Similar to AVT in syntax
  - Replaces integer descriptors with more easily understood titles
  - Combines all data in one file
- Alternatives are human and script readable
- No means is provided to facilitate importing the data into another tool

Trace Collector API
- Intel Trace Collector offers an API to
  - Trace user code in detail
  - Trace non-MPI distributed apps
- Functions are defined to:
  - Record user defined states in trace
  - Record user defined communication events in the trace
  - Record source code locations for correlation in Intel Trace Analyzer
  - Record user defined counters in trace
  - Define process groupings used in Trace Analyzer
  - Define frames (recommended to use config options instead)
  - Turn tracing on and off during execution
  - Enable tracing of multithreaded applications
  - Initialize and finalize Intel Trace Collector (needed for non-MPI applications)

Trace Collector Overhead
- All programs executed correctly when instrumented
- Benchmarks marked with a star had high variability in execution time
  - Readings with stars are probably not accurate
- In most cases overhead was less than 8%
- Wasn’t able to test overhead of hardware counter instrumentation
- However, trace file writing for class B LU with 32 processes took almost 20 minutes!

Figure: mpiP Profiling Overhead
- Vertical axis: Benchmark
- Horizontal axis: Overhead (instrumented/uninstrumented), 0% to 8%
- Benchmarks: CAMEL; NAS LU (8p, W); NAS LU (32p, B); PP: Big message; PP: Diffuse procedure; PP: Hot procedure; PP: Intensive server; PP: Ping pong; PP: Random barrier; PP: Small messages; PP: System time; PP: Wrong way
- Bar labels: 5%, 3%, 2%, 3%, 1%, 1%, 0%, 7%, 5%, 0%, 1%, 0%
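
The overhead axis is labeled as a ratio of instrumented to uninstrumented runs but is reported in percent. The slides do not state the formula, so the following is only a sketch under the assumption that overhead means the relative increase in wall-clock time. The function name and the timings are hypothetical and are not measurements from this evaluation.

```python
def overhead_percent(instrumented_s, uninstrumented_s):
    """Relative slowdown of an instrumented run, in percent.

    Assumption: overhead = (instrumented / uninstrumented - 1) * 100.
    """
    if uninstrumented_s <= 0:
        raise ValueError("uninstrumented time must be positive")
    return (instrumented_s / uninstrumented_s - 1.0) * 100.0


# Hypothetical timings in seconds, for illustration only.
example = overhead_percent(instrumented_s=105.0, uninstrumented_s=100.0)
print(f"overhead: {example:.1f}%")
```

Averaging several runs per benchmark before applying the formula would help with the high-variability benchmarks noted on this slide.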

Trace Analyzer
- Intel Trace Analyzer (ITA) is a visualization program
  - Reads STF tracefiles
  - Tracefiles from previous versions should also work
- ITA can display:
  - Event based data (including messages)
  - Statistical data
  - Counter data if it is contained in the trace
- Displays may represent view of:
  - Multiple processes
    - Individual processes
    - Group of processes (depending on selected filtering options)
  - Single process
- Possible to configure the views in various ways
  - Activities / Symbols
  - Absolute Time / Scaled (percentage of total) Time
  - Number of processes displayed at once
  - Colors used for activities

Trace Analyzer (2)
- Data from a large trace file can be viewed in increments
  - Select the appropriate frames from the STF file
- Views may be linked to visible portion of zoomed timeline
- Pre-computed statistical data can be viewed without loading trace data
- General Notes on ITA Interface
  - Uses X-windows
  - Is quite stable
  - Provides good interface responsiveness
  - Interface is intuitive (for the most part)
- ITA is not capable of automatic analysis of trace data.

Trace Analyzer Views
- Summary Chart Display
  - Allows the user to see how much time is spent in MPI calls
- Timeline Display
  - Zoomable, scrollable timeline representation of program execution

Screenshots: Summary Chart, Timeline Display

Trace Analyzer Views (2)
- Summary Timeline
  - Timeline/histogram representation showing the number of processes in each activity per time bin
- Counter Timeline
  - Value over time representation (behavior depends on counter definition in trace)

Screenshots: Summary Timeline, Counter Timeline
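
To make the Summary Timeline concrete, here is a minimal sketch of counting how many processes are in each activity per time bin. It is not how Trace Analyzer computes the view; the interval tuple layout, the function name, and the choice to sample each bin at its midpoint are all assumptions made for illustration.

```python
from collections import Counter


def summary_timeline(intervals, t_start, t_end, n_bins):
    """Count processes per activity in each time bin.

    intervals: (process, activity, begin, end) tuples (hypothetical layout).
    Assumption: a process counts toward the activity it is in at the
    midpoint of the bin.
    """
    width = (t_end - t_start) / n_bins
    bins = []
    for i in range(n_bins):
        mid = t_start + (i + 0.5) * width
        counts = Counter()
        for _process, activity, begin, end in intervals:
            if begin <= mid < end:
                counts[activity] += 1
        bins.append(dict(counts))
    return bins


# Two processes, each switching from application code to MPI.
trace = [
    (0, "Application", 0.0, 2.0),
    (0, "MPI", 2.0, 4.0),
    (1, "Application", 0.0, 3.0),
    (1, "MPI", 3.0, 4.0),
]
timeline = summary_timeline(trace, 0.0, 4.0, 4)
for index, counts in enumerate(timeline):
    print(index, counts)
```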

Trace Analyzer Views (3)
- Message Statistics Display
  - Message data to/from each process (count, length, rate, duration)
- Process Profile Display
  - Per process data regarding activities

Screenshots: Message Statistics, Process Profile Display
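
The four quantities listed for the Message Statistics Display (count, length, rate, duration) can be sketched per sender/receiver pair as follows. The record layout and function name are hypothetical, and defining rate as total bytes divided by total transfer time is an assumption; the tool's own definitions may differ.

```python
from collections import defaultdict


def message_statistics(messages):
    """Aggregate count, bytes, duration, and rate per (sender, receiver).

    messages: (sender, receiver, nbytes, send_time, recv_time) tuples
    (hypothetical layout). Rate is bytes per second over the summed
    durations, or None when the duration is zero.
    """
    totals = defaultdict(lambda: {"count": 0, "bytes": 0, "duration": 0.0})
    for sender, receiver, nbytes, send_time, recv_time in messages:
        entry = totals[(sender, receiver)]
        entry["count"] += 1
        entry["bytes"] += nbytes
        entry["duration"] += recv_time - send_time
    result = {}
    for pair, entry in totals.items():
        duration = entry["duration"]
        rate = entry["bytes"] / duration if duration > 0 else None
        result[pair] = {**entry, "rate": rate}
    return result


# Illustrative messages only, not data from the evaluation.
sample = [
    (0, 1, 1000, 0.0, 0.5),
    (0, 1, 3000, 1.0, 1.5),
    (1, 0, 8, 2.0, 2.25),
]
stats = message_statistics(sample)
print(stats[(0, 1)])
```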

Trace Analyzer Views (4)
- Statistics Display
  - Various statistics regarding activities in histogram, table, or text format
- Call Tree Display

Screenshots: Statistics Display, Call Tree Display

Trace Analyzer Views (5)
- Source View
  - Source code correlation with events in Timeline
- Activity Chart
  - Per-process histograms of Application and MPI activity

Screenshots: Source View, Activity Chart

Trace Analyzer Views (6)
- Process Timeline
  - Activity timeline and counter timeline for a single process
- Process Activity Chart
  - Same type of information as Global Summary Chart
- Process Call Tree
  - Same type of information as Global Call Tree

Screenshots: Process Timeline, Process Activity Chart & Call Tree

Bottleneck Identification Test Suite
- Testing metric: what did trace visualization tell us (automatic instrumentation)?
- CAMEL: PASSED
  - Identified large number of small messages at beginning of program execution
  - Easy to see that MPI calls take up a small portion of run time (<3%)
- NAS LU: PASSED
  - Showed communication bottlenecks very clearly
    - Large(!) number of small messages
    - Shows sensitivity to latency for processors waiting on data from other processors
  - “W” Class: 18 MB trace file
    - Loads quickly
  - “B” Class: 240 MB trace file
    - Loads slowly (2-3 min.), responsiveness of program is diminished
    - However, can be loaded in small pieces that load much faster
    - Some information is available without loading any frames
    - Took nearly 20 minutes to write trace after program completion!

Bottleneck Identification Test Suite (2)
- Big message: PASSED
  - Traces illustrated large amount of time spent in send and receive
- Diffuse procedure: PASSED
  - Traces illustrated a lot of synchronization with each process executing user code in an exclusive, alternating manner
- Hot procedure: TOSS-UP
  - Assuming hardware counters work, would be easy to see extra CPU utilization
  - Manually instrumenting code would improve accuracy of source code correlation
- Intensive server: PASSED
  - Trace clearly shows that all processes communicate with a single process whose response time is delayed by user code
- Ping pong: PASSED
  - Traces illustrated that most time is spent in MPI code sending and receiving messages, with little time spent in user code
- Random barrier: PASSED
  - Traces show that there are many barriers, with each one held up by a random processor in user code
- Small messages: PASSED
  - Traces illustrated a large number of messages being sent to node 0
- System time: TOSS-UP
  - Hardware counter timeline might be able to indicate the bottleneck if the counters were working
- Wrong way: PASSED
  - Trace shows that first receive takes a long time, but the rest of the messages sent during this time period are received quickly

General Comments
- Intel Trace Collector/Analyzer are very popular and effective tools for creating and displaying trace files.
- These tools are proprietary and closed source.
- Analyzing performance of MPI applications is the primary intended use.
- Support for analyzing non-MPI applications is provided via an API and a special library (libVTcs, which allows for coordination of tracefile creation without MPI).
- Performance analysis requires the user to have a good understanding of the types of problems likely to affect performance.
  - No automatic detection of bottlenecks

Evaluation (1)
- Available metrics: 4.5/5
  - Can use PAPI
  - Many metrics (event-based and counter-based) are available, but it is not possible to create custom metrics as in Paraver
- Cost: 3/5
  - A single-user license costs ~$500
  - Multiple user licenses are for a single cluster only
    - A 20-user license costs ~$5,000
    - A 100-user license costs ~$15,000
    - An unlimited user license costs ~$30,000
- Documentation quality: 4/5
  - Documentation covers most of the features in a clear and consistent fashion
  - Trace Analyzer documentation includes a section that walks a user through the process of analyzing a trace file for bottlenecks through a sample scenario
  - However, some parts of the documentation are confusing if the document is not read in its entirety
  - Doesn’t describe inner workings of trace collection/display

*Note: evaluated IA-32 MPICH Linux version

Evaluation (2)
- Extensibility: 0/5
  - Commercial (no source)
  - Trace file format is not documented
  - However, could possibly use distributed application tracing features to create traces
- Filtering and aggregation: 4/5
  - Much of what is recorded in trace files can be controlled through a configuration file (or command line arguments)
  - Some post-mortem filtering and aggregation can be controlled from within Trace Analyzer, but it is not as customizable as other tools
- Hardware support: 1/5
  - Supports only systems using Intel IA-32, Itanium 2, or Intel Extended Memory 64
- Heterogeneity support: 5/5
  - Through the use of libVTcs one may manually instrument the code of distributed applications across heterogeneous platforms
  - No automatic event capturing for heterogeneous applications, however

Evaluation (3)
- Installation: 4.5/5
  - Install was very simple, and worked immediately
  - However, I was never able to get hardware counters to function due to incompatibilities with installed PAPI and getrusage
- Interoperability: 1/5
  - Trace Analyzer is capable of reading files in the older Vampirtrace trace file format, which can be output by some other tools
  - Tracefiles can be output in (or converted to) the older ASCII-based Vampirtrace trace file format
- Learning curve: 4.5/5
  - The most important and useful views and features are intuitive and easy to understand
  - Some features seem a bit redundant or oddly named
- Manual overhead: 3/5
  - MPI call tracing is done automatically by linking against the profiling library
  - Can also instrument all functions or a handful of functions using binary instrumentation
  - More detailed tracing information requires manually inserting API function calls
  - A null library is included so that binaries utilizing API function calls need not be altered

Evaluation (4)
- Measurement accuracy: 4/5
  - CAMEL overhead ~5%
  - Tracing overhead is negligible
  - However, sometimes Trace Analyzer finds reversed messages that shouldn’t be there
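
The slides do not define "reversed messages". One plausible reading, assumed here, is a message whose recorded receive time is earlier than its send time, as can happen when clocks on different nodes disagree. The sketch below flags such records in a list of messages; the record fields and the function name are hypothetical and unrelated to the STF format.

```python
def find_reversed_messages(messages):
    """Return messages whose receive timestamp precedes the send timestamp.

    Assumption: this is what "reversed" means; each message is a dict
    with hypothetical keys "src", "dst", "send_time", "recv_time".
    """
    return [m for m in messages if m["recv_time"] < m["send_time"]]


# Illustrative records only.
records = [
    {"src": 0, "dst": 1, "send_time": 1.00, "recv_time": 1.20},
    {"src": 1, "dst": 0, "send_time": 2.00, "recv_time": 1.95},
]
reversed_messages = find_reversed_messages(records)
print(len(reversed_messages), "reversed message(s)")
```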

- Multiple executions: 1/5
  - Multiple instances of Trace Analyzer can be opened at once, but comparing views must be done manually
  - Some support is offered for comparing statistics between two different tracefiles but it is greatly limited (difference or quotient of histograms between two runs)
- Multiple analyses & views: 4/5
  - A number of common, useful views are available
  - However, the values displayed are not as customizable as other tools
  - No automatic analysis is offered
  - Analysis can be performed by examining timelines, histograms, or textual representations
- Performance bottleneck identification: 4.5/5
  - No automatic detection
  - Views provided should allow for manual detection of most common bottlenecks
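
The two-run comparison mentioned under "Multiple executions" (difference or quotient of histograms) can be illustrated with a small sketch. It operates on plain dictionaries of per-activity totals, not on tracefiles. The function name, the run_b minus run_a convention, and the use of None where the quotient is undefined are assumptions.

```python
def compare_histograms(run_a, run_b):
    """Per-activity difference (b - a) and quotient (b / a) of two runs.

    run_a, run_b: dicts mapping an activity name to a total (for
    example, seconds). Activities missing from a run count as 0.0;
    the quotient is None when run_a has no time for that activity.
    """
    keys = sorted(set(run_a) | set(run_b))
    difference = {k: run_b.get(k, 0.0) - run_a.get(k, 0.0) for k in keys}
    quotient = {
        k: (run_b.get(k, 0.0) / run_a[k]) if run_a.get(k) else None
        for k in keys
    }
    return difference, quotient


# Illustrative totals only, not measurements from the evaluation.
first = {"MPI_Send": 2.0, "MPI_Recv": 4.0}
second = {"MPI_Send": 3.0, "MPI_Recv": 2.0, "MPI_Barrier": 1.0}
difference, quotient = compare_histograms(first, second)
print(difference)
print(quotient)
```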

Evaluation (5)
- Profiling/tracing support: 5/5
  - Both tracing (recording events and messages) and profiling (recording statistics) are supported and can be used independently of each other
- Response time: 2/5
  - No data at all until after the run has completed and the tracefile has been opened
  - Some information available without fully loading tracefile
  - Large trace files can take a long time to write out and read back in
- Searching: 0/5 (not supported)
- Software support: 4.5/5
  - MPI profiling interface should permit use with many MPI implementations (support of Intel, LAM, and MPICH is explicitly offered)
  - Full support is available for C/C++ and Fortran, with some support for Java and OpenMP

Evaluation (6)
- Source code correlation: 4/5
  - All MPI calls on the timeline offer source code correlation with a click
  - User code correlation requires more manual effort
- System stability: 4.5/5
  - Trace Analyzer crashed (segmentation fault) only once throughout evaluation
  - Trace Collector never caused an application to fail
- Technical support: 4/5
  - Quick initial response through support webpage (a few hours)
  - Subsequent responses required a few days

References
- Intel Trace Analyzer 4.0 User’s Guide, 4.0.3.0
- Intel Trace Collector IA32-LIN-MPICH User’s Guide, 5.0.1.0
