
    Application Performance Tuning

    M. D. Jones, Ph.D.

Center for Computational Research, University at Buffalo

    State University of New York

    High Performance Computing I, 2012


Introduction to Performance Engineering in HPC: Performance Fundamentals

    Performance Foundations

    Three pillars of performance optimization:

Algorithmic - choose the most effective algorithm that you can for the problem of interest

Serial Efficiency - optimize the code to run efficiently in a non-parallel environment

Parallel Efficiency - effectively use multiple processors to achieve a reduction in execution time, or equivalently, to solve a proportionately larger problem


    Algorithmic Efficiency

Choose the best algorithm before you start coding (recall that good planning is an essential part of writing good software):

Running on a large number of processors? Choose an algorithm that scales well with increasing processor count

Running a large system (mesh points, particle count, etc.)? Choose an algorithm that scales well with system size

If you are going to run on a massively parallel machine, plan from the beginning on how you intend to decompose the problem (it may save you a lot of time later)


Introduction to Performance Engineering in HPC: Performance Baseline

    Serial Efficiency

Getting efficient code in parallel is made much more difficult if you have not optimized the sequential code, and in fact can lead to a misleading picture of parallel performance. Recall that our definition of parallel speedup,

    S(N_p) = t_S / t_p(N_p),

involves the time, t_S, for an optimal sequential implementation (not just t_p(1)!)
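For example, if the best sequential implementation takes t_S = 100 s and the parallel code takes t_p(16) = 10 s on 16 processors, then S(16) = 10; quoting speedup against the parallel code's own one-processor time t_p(1) would overstate the benefit whenever that code is slower than the optimal sequential one.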


    Establish a Performance Baseline

Steps to establishing a baseline for your own performance expectations:

Choose a representative problem (or better still, a suite of problems) that can be run under identical circumstances on a multitude of platforms/compilers (requires portable code!)

How fast is fast? You can utilize hardware performance counters to measure actual code performance

Profile, profile, and then profile some more ... to find bottlenecks and spend your time more effectively in optimizing code


Introduction to Performance Engineering in HPC: Pitfalls in Parallel Performance

    Parallel Performance Trap

    Pitfalls when measuring the performance of parallel codes:

    For many, speedup or linear scalability is the ultimate goal.

This goal is incomplete - a terribly inefficient code can scale well, but actually deliver poor efficiency. For example, consider a simple Monte Carlo based code that uses the most rudimentary uniform sampling (i.e. no importance sampling) - this can be made to scale perfectly in parallel, but the algorithmic efficiency (measured perhaps by the delivered variance per cpu-hour) is quite low.


Simple Profiling Tools: Timers

    time Command

Note that this is not the time built-in function in many shells (bash and tcsh included), but instead the one located in /usr/bin. This command is quite useful for getting an overall picture of code performance. The default output format:

    %Uuser %Ssystem %Eelapsed %PCPU (%Xtext+%Ddata %Mmax)k
    %Iinputs+%Ooutputs (%Fmajor+%Rminor)pagefaults %Wswaps

    and using the -p option:

    real %e
    user %U
    sys %S


    time Example

[bono:~/d_laplace]$ /usr/bin/time ./laplace_s
 Max value in sol:   0.999992327961218
 Min value in sol:  8.742278000372475E-008
75.82user 0.00system 1:17.72elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+913minor)pagefaults 0swaps
[bono:~/d_laplace]$ /usr/bin/time -p ./laplace_s
 Max value in sol:   0.999992327961218
 Min value in sol:  8.742278000372475E-008
real 75.73
user 74.68
sys 0.00


    time MPI Example

[bono:~/d_laplace]$ /usr/bin/time mpirun -np 2 ./laplace_mpi
 Max value in sol:   0.999992327961218
 Min value in sol:  8.742278000372475E-008
Writing logfile....
Finished writing logfile.
28.43user 1.54system 0:31.95elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+14920minor)pagefaults 0swaps
[bono:~/d_laplace]$ qsub -q debug -l nodes=2:ppn=2,walltime=00:30:00 -I
qsub: waiting for job 577255.bono.ccr.buffalo.edu to start
############################# PBS Prologue ##############################
PBS prologue script run on host c15n32 at Fri Sep 28 15:03:05 EDT 2007
PBSTMPDIR is /scratch/577255.bono.ccr.buffalo.edu
/usr/bin/time mpiexec ./laplace_mpi
 Max value in sol:   0.999992327961218
 Min value in sol:  8.742278000372475E-008
0.01user 0.01system 0:30.82elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1416minor)pagefaults 0swaps

OSC's mpiexec will report accurate timings in the (optional) email report, as it does not rely on rsh/ssh to launch tasks (but Intel MPI does, so in that case you will see the result of timing the mpiexec shell script, not the MPI code).


Simple Profiling Tools: Code Section Timing (Calipers)

    Code Section Timing (Calipers)

Timing sections of code requires a bit more work on the part of the programmer, but there are reasonably portable means of doing so:

    Routine              Type      Resolution
    times                user/sys  ms
    gettimeofday         wall      µs
    clock_gettime        wall      ns
    system_clock (f90)   wall      system-dependent
    cpu_time (f95)       cpu       compiler-dependent
    MPI_Wtime*           wall      system-dependent
    OMP_GET_WTIME*       wall      system-dependent

Generally I prefer the MPI and OpenMP timing calls whenever I can use them (*the MPI and OpenMP specifications call for their intrinsic timers to be high precision).
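As an illustration (not from the handout), a minimal caliper sketch in C using the MPI timer might look like this; MPI_Wtick() reports the resolution of the underlying clock:

    /* Minimal caliper sketch, assuming an MPI code (compile with mpicc). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        double t0 = MPI_Wtime();            /* open the caliper */
        /* ... section of code to be timed ... */
        double t1 = MPI_Wtime();            /* close the caliper */

        printf("section took %f s (timer resolution %g s)\n",
               t1 - t0, MPI_Wtick());

        MPI_Finalize();
        return 0;
    }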


    More information on code section timers (and code for doing so):

LLNL Performance Tools: https://computing.llnl.gov/tutorials/performance_tools/#gettimeofday

Stopwatch (nice F90 module, but you need to supply a low-level function for accessing a timer): http://math.nist.gov/StopWatch


Simple Profiling Tools: gprof

    GNU Tools: gprof

Tool that we used briefly before:

Generic GNU profiler

    Requires recompiling code with -pg option

Running subsequent instrumented code produces gmon.out to be read by gprof

Use the environment variable GMON_OUT_PREFIX to specify a new gmon.out prefix to which the process ID will be appended (e.g. GMON_OUT_PREFIX=gmon.laplace yields gmon.laplace.<pid>; especially useful for parallel runs) - this is a largely undocumented feature ...

Line-level profiling is possible, as we will see in the following example


    gprof Shortcomings

Shortcomings of gprof (which apply also to any statistical profiling tool):

Need to recompile to instrument the code

Instrumentation can affect the statistics in the profile

    Overhead can significantly increase the running time

    Compiler optimization can be affected by instrumentation


    Types of gprof Profiles

gprof profiles come in three types:

1. Flat Profile: shows how much time your program spent in each function, and how many times that function was called

2. Call Graph: for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function

3. Basic-block: Requires compilation with the -a flag (supported only by GNU?) - enables gprof to construct an annotated source code listing showing how many times each line of code was executed


    gprof example

g77 -I. -O3 -ffast-math -g -pg -o rp rp_read.o initial.o en_gde.o \
  adwfns.o rpqmc.o evol.o dbxgde.o dbxtri.o gen_etg.o gen_rtg.o \
  gdewfn.o lib/*.o

[jonesm@bono ~/d_bench]$ qsub -q debug -l nodes=1:ppn=2,walltime=00:30:00 -I
[jonesm@c16n30 ~/d_bench]$ ./rp
 Enter runid: (<= 9 chars)
short
...
... skip copious amount of standard output ...
...

[jonesm@bono ~/d_bench]$ gprof rp gmon.out >& out.gprof
[jonesm@bono ~/d_bench]$ less out.gprof
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
89.74     123.23   123.23   204008     0.00     0.00  triwfns_
 6.96     132.79     9.56        1     9.56   137.22  MAIN__
 1.18     134.41     1.62   200004     0.00     0.00  en_gde__
 1.05     135.86     1.44   204002     0.00     0.00  evol_
 0.71     136.83     0.97 14790551     0.00     0.00  ranf_
 0.27     137.20     0.37   204008     0.00     0.00  gdewfn_
...
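Reading the flat profile: triwfns_ alone accounts for nearly 90% of the execution time, so that single routine is where optimization effort should be concentrated.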


[jonesm@bono ~/d_bench]$ gprof -line rp gmon.out >& out.gprof.line
[jonesm@bono ~/d_bench]$ less out.gprof.line
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
17.45     23.96    23.96                             triwfns_ (adwfns.f:129 @ 403c94)
14.46     43.82    19.86                             triwfns_ (adwfns.f:130 @ 403ce6)
12.87     61.50    17.68                             triwfns_ (adwfns.f:171 @ 404755)
12.31     78.41    16.91                             triwfns_ (adwfns.f:172 @ 4047e2)
 0.67     79.33     0.92                             triwfns_ (adwfns.f:130 @ 403cd6)
 0.59     80.14     0.82                             MAIN__ (cc4WTuQH.f:308 @ 4070c6)
 0.51     80.84     0.70                             MAIN__ (cc4WTuQH.f:304 @ 4072da)


    More gprof Information

    More gprof documentation:

gprof GNU Manual: http://www.gnu.org/software/binutils/manual/gprof-2.9.1/gprof.html

gprof man page: man gprof

Simple Profiling Tools: pgprof

    PGI Tools: pgprof

PGI tools also have profiling capabilities (cf. man pgf95)

    Graphical profiler, pgprof


N.B. you can also use the -text option to pgprof to make it behave more like gprof

See the PGI Tools Guide for more information (should be a PDF copy in $PGI/doc)

Parallel Profiling Tools: Statistical MPI Profiling With mpiP

    mpiP: Statistical MPI Profiling

    http://mpip.sourceforge.net

Not a tracing tool, but a lightweight interface to accumulate statistics using the MPI profiling interface

Quite useful in conjunction with a tracefile analysis (e.g. using jumpshot)

Installed on CCR systems - see module avail for mpiP availability and location


    mpiP Compilation

    To use mpiP you need to:

Add a -g flag to add symbols (this will allow mpiP to access the source code symbols and line numbers)

Link in the necessary mpiP profiling library and the binary utility libraries for actually decoding symbols (there is a trick, shown below in the runtime example, that you can use most of the time to avoid having to link with mpiP).

    Compilation examples (from U2) follow ...


    mpiP Runtime Flags

You can set various mpiP runtime flags (e.g. export MPIP="-t 10.0 -k 2"):

    Option   Description                                              Default
    -c       Generate concise version of report, omitting callsite
             process-specific detail
    -e       Print report data using floating-point format
    -f dir   Record output file in directory <dir>                    .
    -g       Enable mpiP debug mode                                   disabled
    -k n     Sets callsite stack traceback depth to <n>               1
    -n       Do not truncate full pathname of filename in callsites
    -o       Disable profiling at initialization; application
             must enable profiling with MPI_Pcontrol()
    -s n     Set hash table size to <n>                               256
    -t x     Set print threshold for report, where <x> is the
             MPI percentage of time for each callsite                 0.0
    -v       Generates both concise and verbose report output
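For example, with the -o flag set the application itself brackets the region to be profiled. A minimal sketch (not from the handout) using the standard MPI_Pcontrol call that mpiP keys on:

    /* Minimal sketch, assuming MPIP="-o": profiling is off at MPI_Init and
       the application switches it on around the region of interest. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Pcontrol(1);   /* enable mpiP profiling from here on */
        /* ... communication-heavy section to be profiled ... */
        MPI_Pcontrol(0);   /* disable profiling again */

        MPI_Finalize();
        return 0;
    }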


    mpiP Example Output

From an older version of mpiP, but still almost entirely the same - this one links directly with mpiP; first compile:

mpicc -g -o atlas aij2_basis.o analyze.o atlas.o barrier.o byteflip.o \
  chordgn2.o cstrings.o io2.o map.o mutils.o numrec.o paramods.o proj.o \
  projAtlas.o sym2.o util.o -lm -L/Projects/CCR/jonesm/mpiP-2.8.2/gnu/ch_gm/lib \
  -lmpiP -lbfd -liberty -lm

    then run (in this case using 16 processors) and examine the output file:

[jonesm@joplin d_derenzo]$ ls *.mpiP
atlas.gcc3mpipapimpiP.16.20578.1.mpiP
[jonesm@joplin d_derenzo]$ less atlas.gcc3mpipapimpiP.16.20578.1.mpiP
:
:


@ mpiP
@ Command : ./atlas.gcc3mpipapimpiP study.ini 0
@ Version : 2.8.2
@ MPIP Build date : Jun 29 2005, 14:53:41
@ Start time : 2005 06 29 15:18:52
@ Stop time : 2005 06 29 15:28:34
@ Timer Used : gettimeofday
@ MPIP env var : [null]
@ Collector Rank : 0
@ Collector PID : 20578
@ Final Output Dir : .
@ MPI Task Assignment : 0 bb18n17.ccr.buffalo.edu
@ MPI Task Assignment : 1 bb18n17.ccr.buffalo.edu
@ MPI Task Assignment : 2 bb18n16.ccr.buffalo.edu
@ MPI Task Assignment : 3 bb18n16.ccr.buffalo.edu
@ MPI Task Assignment : 4 bb18n15.ccr.buffalo.edu
@ MPI Task Assignment : 5 bb18n15.ccr.buffalo.edu
@ MPI Task Assignment : 6 bb18n14.ccr.buffalo.edu
@ MPI Task Assignment : 7 bb18n14.ccr.buffalo.edu
@ MPI Task Assignment : 8 bb18n13.ccr.buffalo.edu
@ MPI Task Assignment : 9 bb18n13.ccr.buffalo.edu
@ MPI Task Assignment : 10 bb18n12.ccr.buffalo.edu
@ MPI Task Assignment : 11 bb18n12.ccr.buffalo.edu
@ MPI Task Assignment : 12 bb18n11.ccr.buffalo.edu
@ MPI Task Assignment : 13 bb18n11.ccr.buffalo.edu
@ MPI Task Assignment : 14 bb18n10.ccr.buffalo.edu
@ MPI Task Assignment : 15 bb18n10.ccr.buffalo.edu


---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime    MPI%
   0        582       44.7    7.69
   1        579       41.9    7.24
   2        579       40.7    7.03
   3        579       36.9    6.37
   4        579       22.3    3.84
   5        579       16.6    2.87
   6        579       32      5.53
   7        579       35.9    6.20
   8        579       28.6    4.93
   9        579       25.9    4.48
  10        579       39.2    6.76
  11        579       33.8    5.84
  12        579       35.3    6.10
  13        579       41      7.07
  14        579       29.9    5.16
  15        579       41.4    7.16
   *   9.27e+03        546    5.89


---------------------------------------------------------------------------
@--- Callsites: 13 --------------------------------------------------------
---------------------------------------------------------------------------
ID  Lev  File/Address   Line  Parent_Funct         MPI_Call
 1    0  util.c          833  gsync                Barrier
 2    0  atlas.c        1531  readProjData         Allreduce
 3    0  projAtlas.c     745  backProjAtlas        Allreduce
 4    0  atlas.c        1545  readProjData         Allreduce
 5    0  atlas.c        1525  readProjData         Allreduce
 6    0  atlas.c        1541  readProjData         Allreduce
 7    0  atlas.c        1589  readProjData         Allreduce
 8    0  atlas.c        1519  readProjData         Allreduce
 9    0  util.c          789  mygcast              Bcast
10    0  projAtlas.c    1100  computeLoglikeAtlas  Allreduce
11    0  atlas.c        1514  readProjData         Allreduce
12    0  atlas.c        1537  readProjData         Allreduce
13    0  projAtlas.c     425  fwdBackProjAtlas2    Allreduce


---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call       Site      Time   App%   MPI%    COV
Allreduce    13  3.09e+05   3.33  56.50   0.46
Barrier       1  2.13e+05   2.30  38.97   0.35
Bcast         9  1.69e+04   0.18   3.10   0.37
Allreduce     3  7.78e+03   0.08   1.42   0.11
Allreduce    10      62.7   0.00   0.01   0.20
Allreduce    11      2.42   0.00   0.00   0.09
Allreduce     7      2.17   0.00   0.00   0.26
Allreduce    12      1.15   0.00   0.00   0.20
Allreduce     6      1.14   0.00   0.00   0.19
Allreduce     5      1.13   0.00   0.00   0.15
Allreduce     8      1.12   0.00   0.00   0.18
Allreduce     2       1.1   0.00   0.00   0.13
Allreduce     4       1.1   0.00   0.00   0.12


---------------------------------------------------------------------------
@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
---------------------------------------------------------------------------
Call       Site   Count      Total      Avrg  Sent%
Allreduce    13   65536   2.28e+09  3.48e+04  83.69
Allreduce     3    8192   2.85e+08  3.48e+04  10.46
Bcast         9  490784   1.59e+08       325   5.85
Allreduce    11      16   2.07e+04   1.3e+03   0.00
Allreduce    10     512    4.1e+03         8   0.00
Allreduce     7      16        256        16   0.00
Allreduce     2      16         64         4   0.00
Allreduce     6      16         64         4   0.00
Allreduce     5      16         64         4   0.00
Allreduce     4      16         64         4   0.00
Allreduce     8      16         64         4   0.00
Allreduce    12      16         64         4   0.00
...
...


    Using mpiP at Runtime

Now let's examine an example using mpiP at runtime.

This example solves a simple Laplace equation with Dirichlet boundary conditions using finite differences.


    #PBS -S /bin/bash

    #PBS -q debug

    #PBS -l walltime=00:20:00

    #PBS -l nodes=2:MEM24GB:ppn=8

    #PBS -M [email protected]

    #PBS -m e

    #PBS -N test

    #PBS -o subMPIP.out

    #PBS -j oe

    module load intel

    module load intel-mpi

    module load mpip

    module list

    cd $PBS_O_WORKDIR

    which mpiexec

    NNODES=`cat $PBS_NODEFILE | uniq | wc -l`

    NPROCS=`cat $PBS_NODEFILE | wc -l`

    export I_MPI_DEBUG=5

    # Use LD_PRELOAD trick to load mpiP wrappers at runtime

    export LD_PRELOAD=$MPIPDIR/lib/libmpiP.so

    mpdboot -n $NNODES -f $PBS_NODEFILE -v

    mpiexec -np $NPROCS -envall ./laplace_mpi


    @ mpiP

    @ Command : ./laplace_mpi

    @ Version : 3.3.0

    @ MPIP Build date : Oct 14 2011, 16:16:34

    @ Start time : 2011 10 14 16:26:15

@ Stop time : 2011 10 14 16:28:11
@ Timer Used : PMPI_Wtime

    @ MPIP env var : [null]

    @ Collector Rank : 0

    @ Collector PID : 8618

    @ Final Output Dir : .

    @ Report generation : Single collector task

    @ MPI Task Assignment : 0 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 1 d15n33.ccr.buffalo.edu

@ MPI Task Assignment : 2 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 3 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 4 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 5 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 6 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 7 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 8 d15n23.ccr.buffalo.edu

    @ MPI Task Assignment : 9 d15n23.ccr.buffalo.edu

    @ MPI Task Assignment : 10 d15n23.ccr.buffalo.edu

@ MPI Task Assignment : 11 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 12 d15n23.ccr.buffalo.edu

    @ MPI Task Assignment : 13 d15n23.ccr.buffalo.edu

    @ MPI Task Assignment : 14 d15n23.ccr.buffalo.edu

    @ MPI Task Assignment : 15 d15n23.ccr.buffalo.edu


    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    @ - - MPI Time (seconds) - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Task AppTime MPITime MPI%

    0 116 16.9 14.57

    1 115 16.4 14.18

    2 115 16.8 14.53

    3 115 17.7 15.34

    4 115 16.6 14.39

    5 115 14.3 12.37

    6 115 13.7 11.90

7 115 11.1 9.65
8 115 12.5 10.87

    9 115 13.8 12.00

    10 115 14.5 12.60

    11 115 16.9 14.67

    12 115 17.5 15.15

    13 115 19.4 16.82

    14 115 16.4 14.22

    15 115 17.5 15.13

    * 1.85e+03 252 13.65


    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    @ - - Callsites: 6 - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ID Lev File/Address Line Parent_Funct MPI_Call

    1 0 laplace_mpi.f90 118 MAIN__ Allreduce

    2 0 laplace_mpi.f90 143 MAIN__ Recv

    3 0 laplace_mpi.f90 48 __paramod_MOD_xchange Sendrecv

    4 0 laplace_mpi.f90 80 MAIN__ Bcast

    5 0 laplace_mpi.f90 46 __paramod_MOD_xchange Sendrecv

    6 0 laplace_mpi.f90 138 MAIN__ Send

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

@ - - Aggregate Time (top twenty, descending, milliseconds) - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Call Site Time App% MPI% COV

    Allreduce 1 2.34e+05 12.69 92.94 0.15

    Sendrecv 3 8.88e+03 0.48 3.52 0.19

    Sendrecv 5 8.47e+03 0.46 3.36 0.16

    Send 6 321 0.02 0.13 0.51

    Bcast 4 97.7 0.01 0.04 0.26

    Recv 2 14.4 0.00 0.01 0.00

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
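Reading these tables: the Allreduce at callsite 1 (laplace_mpi.f90:118) dominates, accounting for roughly 93% of all MPI time (though only about 13% of total application time), so it is the natural first target for communication tuning.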


    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    @ - - Callsite Time statistics (all, milliseconds): 80 - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name Site Rank Count Max Mean Min App% MPI%

    Allreduce 1 0 23845 28.5 0.666 0.036 13.69 94.00

    Allreduce 1 1 23845 28.6 0.644 0.035 13.30 93.79

    Allreduce 1 2 23845 28.6 0.661 0.037 13.66 93.98

    Allreduce 1 3 23845 28.6 0.695 0.038 14.36 93.63

    Allreduce 1 4 23845 29.2 0.651 0.038 13.46 93.54

    Allreduce 1 5 23845 29.2 0.555 0.035 11.46 92.71

    Allreduce 1 6 23845 29 0.528 0.037 10.91 91.64

Allreduce 1 7 23845 26.3 0.405 0.038 8.37 86.74
Allreduce 1 8 23845 26.8 0.468 0.034 9.67 88.92

    Allreduce 1 9 23845 28.4 0.538 0.033 11.11 92.59

    Allreduce 1 10 23845 29.7 0.566 0.033 11.70 92.90

    Allreduce 1 11 23845 29.7 0.661 0.033 13.65 93.10

    Allreduce 1 12 23845 28.6 0.686 0.039 14.17 93.49

    Allreduce 1 13 23845 28.8 0.768 0.041 15.87 94.35

    Allreduce 1 14 23845 28.6 0.641 0.036 13.25 93.19

    Allreduce 1 15 23845 28.7 0.694 0.038 14.33 94.76

    Allreduce 1 * 381520 29.7 0.614 0.033 12.69 92.94


Parallel Profiling Tools: Intel Trace Analyzer/Collector

    ITAC Example


Note that you do not have to recompile your application to use ITAC (unless you are building it statically); you can just build it as usual, using Intel MPI:

[bono:~/d_laplace/d_itac]$ module load intel-mpi
[bono:~/d_laplace/d_itac]$ make laplace_mpi
mpiifort -ipo -O3 -Vaxlib -g -c laplace_mpi.f90
mpiifort -ipo -O3 -Vaxlib -g -o laplace_mpi laplace_mpi.o
ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_ifortjOzYAn.o
[bono:~/d_laplace/d_itac]$

    You can turn trace collection on at run-time ...
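(With Intel MPI this is typically done at launch time, e.g. something like mpiexec -trace -np 16 ./laplace_mpi, which preloads the trace collector library without relinking; the exact flag is version-dependent, so check the ITAC documentation referenced below.)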


Running the preceding batch job on U2 produces a bunch (many!) of profiling output files, the most important of which is the one named after your binary with a .stf suffix, in this case laplace_mpi.stf, which we feed to the Intel Trace Analyzer using the traceanalyzer command ... and we should see a profile that looks very much like what you can see using jumpshot.


    More ITAC Documentation


Some helpful pointers to more ITAC documentation:

[k07n14:~]$ which traceanalyzer
/util/intel/itac/8.0.3.007/bin/traceanalyzer
[k07n14:~]$ ls -l /util/intel/itac/8.0.3.007/doc/
total 4211
-r--r--r-- 1 root root   91050 Aug 25 06:02 FAQ.pdf
-r--r--r-- 1 root root   15566 Aug 25 06:02 Getting_Started.html
drwxr-xr-x 3 root root      57 Oct  4 08:02 html
-r--r--r-- 1 root root   61598 Aug 25 06:02 INSTALL.html
-r--r--r-- 1 root root 2051029 Aug 25 06:02 ITA_Reference_Guide.pdf
-r--r--r-- 1 root root 1050276 Aug 25 06:02 ITC_Reference_Guide.pdf
-r--r--r-- 1 root root   20681 Aug 25 06:02 Release_Notes.txt

Generally a good idea to refer to the documentation for the same version that you are using (you can check with module show intel-mpi).

Performance API (PAPI): Introduction to PAPI

Introduction

    Performance Application Programming Interface

Implement a portable(!) and efficient API to access existing hardware performance counters

Ease the optimization of code by providing base infrastructure for cross-platform tools


Performance API (PAPI): Reminder: U2 Hardware

    Nehalem Xeon Block Diagram


    Block diagram for Nehalem architecture:

[Figure: Intel Nehalem microarchitecture block diagram. It shows the front end (32 KByte 4-way associative L1 instruction cache with 128-entry TLB, predecode and instruction length decoder, instruction queue, one complex and three simple decoders with macro-op and micro-op fusion, loop stream decoder), the out-of-order engine (2x register allocation table, 128-entry reorder buffer, 128-entry reservation station feeding six issue ports), the execution units (integer/MMX ALUs, SSE add and multiply/divide units, FP add and FP multiply, load/store address AGUs), the memory subsystem (memory order buffer, 32 KByte 8-way L1 data cache, 256 KByte private 8-way L2 cache, 512-entry L2 TLB), and the uncore (8 MByte shared L3 cache, integrated DDR3 memory controller at 3x 64 bit and 1.33 GT/s, QuickPath Interconnect at 4x 20 bit and 6.4 GT/s).]


    Superscalar - EM64T Irwindale


Characteristics of the Irwindale Xeons that form (part of) CCR's U2 cluster:

    Clock Cycle             3.2 GHz
    TPP                     6.4 GFlop/s
    Pipeline                31 stages
    L2 Cache Size           2 MByte
    L2 Bandwidth            102.4 GByte/s
    CPU-Memory Bandwidth    6.4 GByte/s (shared)


Performance API (PAPI): High- vs. Low-level PAPI

    High-level PAPI


    Intended for coarse-grained measurements

    Requires little (or no) setup code

Allows only PAPI preset events

    Allows only aggregate counting (no statistical profiling)


    Low-level PAPI


    More efficient (and functional) than high-level

    About 60 functions

    Thread-safe

    Supports presets and native events
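A minimal low-level sketch (not from the handout, and assuming the PAPI_TOT_INS and PAPI_TOT_CYC presets are available on the platform; compile with -lpapi):

    #include <stdio.h>
    #include <stdlib.h>
    #include "papi.h"

    int main(void)
    {
        int eventset = PAPI_NULL;
        long_long counts[2];

        /* PAPI_library_init returns PAPI_VER_CURRENT on success */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);

        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_TOT_INS);   /* instructions completed */
        PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles */

        PAPI_start(eventset);
        /* ... code to be measured ... */
        PAPI_stop(eventset, counts);              /* reads and stops the counters */

        printf("instructions: %lld  cycles: %lld\n", counts[0], counts[1]);
        return 0;
    }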


    Preset vs. Native Events


preset or pre-defined events are those which have been considered useful by the PAPI community and developers: http://icl.cs.utk.edu/projects/papi/presets.html

native events are those countable by the CPU's hardware. These events are highly platform specific, and you would need to consult the processor architecture manuals for the relevant native event lists


    How to Access PAPI at CCR


Consider a simple example code to measure Flop/s using the high-level PAPI API:

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main()
{
    float real_time, proc_time, mflops;
    long_long flpops;
    float ireal_time, iproc_time, imflops;
    long_long iflpops;
    int retval;

    if ((retval = PAPI_flops(&ireal_time, &iproc_time, &iflpops, &imflops)) < PAPI_OK)
    {
        printf("Could not initialise PAPI flops\n");


        printf("Your platform may not support floating point operation event.\n");
        printf("retval: %d\n", retval);
        exit(1);
    }

    your_slow_code();

    if ((retval = PAPI_flops(&real_time, &proc_time, &flpops, &mflops)) < PAPI_OK)
    {
        printf("retval: %d\n", retval);
        exit(1);
    }

    printf("Real_time: %f Proc_time: %f Total flpops: %lld MFLOPS: %f\n",
           real_time, proc_time, flpops, mflops);
    exit(0);
}

int your_slow_code()
{
    int i;
    double tmp = 1.1;

    for (i = 1; i < 200000000; i++)   /* loop body truncated in the transcript; */
        tmp = (tmp + 100.0) / i;      /* a plausible busy-work reconstruction   */
    return 0;
}


On U2, you access the papi module and compile accordingly:

[k07n14:~/d_papi]$ qsub -q debug -l nodes=1:MEM48GB:ppn=12,walltime=01:00:00 -I
qsub: waiting for job 1144485.d15n41.ccr.buffalo.edu to start
qsub: job 1144485.d15n41.ccr.buffalo.edu ready

Job 1144485.d15n41.ccr.buffalo.edu has requested 12 cores/processors per node.
PBSTMPDIR is /scratch/1144485.d15n41.ccr.buffalo.edu
[k16n01b:~]$ cd $PBS_O_WORKDIR
[k16n01b:~/d_papi]$ module load papi
'papi/v4.1.4' load complete.
[k16n01b:~/d_papi]$ gcc -I$PAPI/include -o PAPI_flops PAPI_flops.c -L$PAPI/lib -lpapi
[k16n01b:~/d_papi]$ ./PAPI_flops
Real_time: 0.000042 Proc_time: 0.000029 Total flpops: 7983 MFLOPS: 276.572449

N.B., PAPI needs to support the underlying hardware, and this version does not support the 32-core Intel nodes (including the front end).


    ... and on Lennon (SGI Altix):

[jonesm@lennon ~/d_papi]$ module load papi
'papi/v3.2.1' load complete.
[jonesm@lennon ~/d_papi]$ echo $PAPI
/util/perftools/papi-3.2.1
[jonesm@lennon ~/d_papi]$ gcc -I$PAPI/include -o PAPI_flops PAPI_flops.c \
  -L$PAPI/lib -lpapi
[jonesm@lennon ~/d_papi]$ ldd PAPI_flops
    libpapi.so => /util/perftools/papi-3.2.1/lib/libpapi.so (0x2000000000040000)
    libc.so.6.1 => /lib/tls/libc.so.6.1 (0x20000000000ac000)
    libpfm.so.2 => /usr/lib/libpfm.so.2 (0x2000000000318000)
    /lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)
[jonesm@lennon ~/d_papi]$ ./PAPI_flops
Real_time: 0.000143 Proc_time: 0.000132 Total flpops: 48056 MFLOPS: 363.964111

N.B., the Altix is dead, but this is a good example of the cross-platform portability of PAPI accessing the hardware performance counters.


    papi_avail Command


You can use papi_avail to check event availability (different CPUs support various events):

    [k16n01b:~/d_papi]$ papi_avail

    Available events and hardware information.

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    PAPI Version : 4.1.4.0

    Vendor string and code : GenuineIntel (1)

    Model string and code : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (44)

    CPU Revision : 2.000000

    CPUID Info : Family: 6 Model: 44 Stepping: 2

    CPU Megahertz : 2400.391113

    CPU Clock Megahertz : 2400

    Hdw Threads per core : 1

    Cores per Socket : 6

    NUMA Nodes : 2

    CPU's per Node : 6

    Total CPU's : 12

    Number Hardware Counters : 16

    Max Multiplex Counters : 512

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    The following correspond to fields in the PAPI_event_info_t structure.

    Name Code Avail Deriv Description (Note)

    PAPI_L1_DCM 0x80000000 Yes No Level 1 data cache misses

    PAPI_L1_ICM 0x80000001 Yes No Level 1 instruction cache misses


    PAPI_L2_DCM 0x80000002 Yes Yes Level 2 data cache misses

    PAPI_L2_ICM 0x80000003 Yes No Level 2 instruction cache misses

    PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses


    PAPI_L3_ICM 0x80000005 No No Level 3 instruction cache misses

    PAPI_L1_TCM 0x80000006 Yes Yes Level 1 cache misses

    PAPI_L2_TCM 0x80000007 Yes No Level 2 cache misses

    PAPI_L3_TCM 0x80000008 Yes No Level 3 cache misses

PAPI_CA_SNP 0x80000009 No No Requests for a snoop
PAPI_CA_SHR 0x8000000a No No Requests for exclusive access to shared cache line

    PAPI_CA_CLN 0x8000000b No No Requests for exclusive access to clean cache line

    PAPI_CA_INV 0x8000000c No No Requests for cache line invalidation

    PAPI_CA_ITV 0x8000000d No No Requests for cache line intervention

    PAPI_L3_LDM 0x8000000e Yes No Level 3 load misses

    PAPI_L3_STM 0x8000000f No No Level 3 store misses

    PAPI_BRU_IDL 0x80000010 No No Cycles branch units are idle

    PAPI_FXU_IDL 0x80000011 No No Cycles integer units are idle

PAPI_FPU_IDL 0x80000012 No No Cycles floating point units are idle
PAPI_LSU_IDL 0x80000013 No No Cycles load/store units are idle

    PAPI_TLB_DM 0x80000014 Yes No Data translation lookaside buffer misses

    PAPI_TLB_IM 0x80000015 Yes No Instruction translation lookaside buffer misses

    PAPI_TLB_TL 0x80000016 Yes Yes Total translation lookaside buffer misses

    PAPI_L1_LDM 0x80000017 Yes No Level 1 load misses

    PAPI_L1_STM 0x80000018 Yes No Level 1 store misses

    PAPI_L2_LDM 0x80000019 Yes No Level 2 load misses

    PAPI_L2_STM 0x8000001a Yes No Level 2 store misses

    PAPI_BTAC_M 0x8000001b No No Branch target address cache misses

    PAPI_PRF_DM 0x8000001c No No Data prefetch cache misses

    PAPI_L3_DCH 0x8000001d No No Level 3 data cache hits

    PAPI_TLB_SD 0x8000001e No No Translation lookaside buffer shootdowns

    PAPI_CSR_FAL 0x8000001f No No Failed store conditional instructions

    PAPI_CSR_SUC 0x80000020 No No Successful store conditional instructions

    PAPI_CSR_TOT 0x80000021 No No Total store conditional instructions


    PAPI_FML_INS 0x80000061 No No Floating point multiply instructions

PAPI_FAD_INS 0x80000062 No No Floating point add instructions
PAPI_FDV_INS 0x80000063 No No Floating point divide instructions

    PAPI_FSQ_INS 0x80000064 No No Floating point square root instructions

    PAPI_FNV_INS 0x80000065 No No Floating point inverse instructions

    PAPI_FP_OPS 0x80000066 Yes Yes Floating point operations

    PAPI_SP_OPS 0x80000067 Yes Yes Floating point operations; optimized to count

    scaled single precision vector operations

    PAPI_DP_OPS 0x80000068 Yes Yes Floating point operations; optimized to count

    scaled double precision vector operations

PAPI_VEC_SP 0x80000069 Yes No Single precision vector/SIMD instructions
PAPI_VEC_DP 0x8000006a Yes No Double precision vector/SIMD instructions

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Of 107 possible events, 57 are available, of which 14 are derived.

    avail.c PASSED


    PAPI_L1_TCM 0x80000006 Yes Level 1 cache misses

    PAPI_L2_TCM 0x80000007 No Level 2 cache misses

    PAPI_L3_TCM 0x80000008 No Level 3 cache misses

    PAPI_L3_LDM 0x8000000e No Level 3 load misses


    PAPI_TLB_DM 0x80000014 No Data translation lookaside buffer misses

    PAPI_TLB_IM 0x80000015 No Instruction translation lookaside buffer misses

    PAPI_TLB_TL 0x80000016 Yes Total translation lookaside buffer misses

    PAPI_L1_LDM 0x80000017 No Level 1 load misses

    PAPI_L1_STM 0x80000018 No Level 1 store misses

    PAPI_L2_LDM 0x80000019 No Level 2 load misses

    PAPI_L2_STM 0x8000001a No Level 2 store misses

    PAPI_BR_UCN 0x8000002a No Unconditional branch instructions

    PAPI_BR_CN 0x8000002b No Conditional branch instructions

    PAPI_BR_TKN 0x8000002c No Conditional branch instructions taken

    PAPI_BR_NTK 0x8000002d Yes Conditional branch instructions not taken

    PAPI_BR_MSP 0x8000002e No Conditional branch instructions mispredicted

    PAPI_BR_PRC 0x8000002f Yes Conditional branch instructions correctly predicted

PAPI_TOT_IIS 0x80000031 No Instructions issued
PAPI_TOT_INS 0x80000032 No Instructions completed

    PAPI_FP_INS 0x80000034 No Floating point instructions

    PAPI_LD_INS 0x80000035 No Load instructions

    PAPI_SR_INS 0x80000036 No Store instructions

    PAPI_BR_INS 0x80000037 No Branch instructions

    PAPI_RES_STL 0x80000039 No Cycles stalled on any resource

    PAPI_TOT_CYC 0x8000003b No Total cycles

    PAPI_LST_INS 0x8000003c Yes Load/store instructions completed

PAPI_L2_DCH 0x8000003f Yes Level 2 data cache hits
PAPI_L2_DCA 0x80000041 No Level 2 data cache accesses

    PAPI_L3_DCA 0x80000042 Yes Level 3 data cache accesses

    PAPI_L2_DCR 0x80000044 No Level 2 data cache reads

    PAPI_L3_DCR 0x80000045 No Level 3 data cache reads

    PAPI_L2_DCW 0x80000047 No Level 2 data cache writes

    PAPI_L3_DCW 0x80000048 No Level 3 data cache writes


    PAPI API Examples

    PAPI Naming Scheme


The C interface:

    (return type) PAPI_function_name(arg1, arg2, ...)

and the Fortran interface:

    PAPIF_function_name(arg1, arg2, ..., check)

Note that the check parameter is the same type and value as the C return value.


    High-level API Example in C

Let's consider the following example code for using the high-level API in C.

#include "papi.h"
#include <stdio.h>
#define NUM_EVENTS 2
#define THRESHOLD 10000
#define ERROR_RETURN(retval) { fprintf(stderr, "Error %d %s: line %d: \n", \
    retval, __FILE__, __LINE__); exit(retval); }

void computation_mult()   /* stupid codes to be monitored */
{
    double tmp = 1.0;
    int i = 1;
    for (i = 1; i < THRESHOLD; i++) {
        tmp = tmp * i;
    }
}

void computation_add()    /* stupid codes to be monitored */
{
    int tmp = 0;
    int i = 0;
    for (i = 0; i < THRESHOLD; i++) {
        tmp = tmp + i;
    }
}



/**************************************************************************
 * PAPI_num_counters returns the number of hardware counters the platform *
 * has or a negative number if there is an error                          *
 **************************************************************************/
if ((num_hwcntrs = PAPI_num_counters()) < PAPI_OK) {
    printf("There are no counters available. \n");
    exit(1);
}

printf("There are %d counters in this system\n", num_hwcntrs);

/**************************************************************************
 * PAPI_start_counters initializes the PAPI library (if necessary) and    *
 * starts counting the events named in the events array. This function    *
 * implicitly stops and initializes any counters running as a result of   *
 * a previous call to PAPI_start_counters.                                 *
 **************************************************************************/
if ((retval = PAPI_start_counters(Events, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

printf("\nCounter Started: \n");

/* Your code goes here */
computation_add();


/**************************************************************************
 * PAPI_read_counters reads the counter values into the values array      *
 **************************************************************************/
if ((retval = PAPI_read_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

printf("Read successfully\n");
printf("The total instructions executed for addition are %lld \n", values[0]);
printf("The total cycles used are %lld \n", values[1]);

printf("\nNow we try to use PAPI_accum to accumulate values\n");

/* Do some computation here */
computation_add();

/**************************************************************************
 * What PAPI_accum_counters does is it adds the running counter values    *
 * to what is in the values array. The hardware counters are reset and    *
 * left running after the call.                                           *
 **************************************************************************/
if ((retval = PAPI_accum_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

printf("We did an additional %d times addition!\n", THRESHOLD);
printf("The total instructions executed for addition are %lld \n", values[0]);
printf("The total cycles used are %lld \n", values[1]);


printf("\nNow we try to do some multiplications\n");
computation_mult();

/**************************************************************************
 * Stop counting events (this reads the counters as well as stops them)   *
 **************************************************************************/
if ((retval = PAPI_stop_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

printf("The total instruction executed for multiplication are %lld \n",
       values[0]);
printf("The total cycles used are %lld \n", values[1]);
exit(0);
}


    Running on E5645 U2 nodes:


    [k11n30b:~/d_papi]$ module load papi

'papi/v4.1.4' load complete.
[k11n30b:~/d_papi]$ gcc -I$PAPI/include -o highlev highlev.c -L$PAPI/lib -lpapi

    [k11n30b:~/d_papi]$ ./highlev

    There are 16 counters in this system

    Counter Started:

    Read successfully

    The total instructions executed for addition are 54977

    The total cycles used are 73894

    Now we try to use PAPI_accum to accumulate values

    We did an additional 10000 times addition!

    The total instructions executed for addition are 112352

    The total cycles used are 147814

    Now we try to do some multiplications

    The total instruction executed for multiplication are 77459

    The total cycles used are 126981


    PAPI Initialization


The preceding example used PAPI_library_init to initialize PAPI, which is also required for the low-level API. With the high-level API, however, the library can also be initialized implicitly by the first call to PAPI_num_counters, PAPI_start_counters, or one of the rate calls: PAPI_flips, PAPI_flops, or PAPI_ipc.
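For instance, here is a minimal sketch (not taken from the handout; the work loop is just filler to have something to count) of counting a single event with implicit initialization:

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int Events[1] = { PAPI_TOT_INS };
    long long values[1];
    volatile double s = 0.0;
    int i;

    /* First high-level call: implicitly initializes the PAPI library */
    if (PAPI_num_counters() < 1) {
        fprintf(stderr, "No hardware counters available\n");
        exit(1);
    }

    if (PAPI_start_counters(Events, 1) != PAPI_OK) exit(1);

    for (i = 0; i < 1000000; i++)   /* some work to measure */
        s += 0.5 * i;

    if (PAPI_stop_counters(values, 1) != PAPI_OK) exit(1);

    printf("Total instructions executed: %lld\n", values[0]);
    return 0;
}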

Events are counted, as we saw in the example, using PAPI_accum_counters, PAPI_read_counters, and PAPI_stop_counters.

Let's look at an even simpler example using just one of the rate counters.


    High-level Example in F90


    For something a little different we can look at our old friend, matrixmultiplication, this time in Fortran:

! A simple example for the use of PAPI, the number of flops you should
! get is about INDEX^3 on machines that consider add and multiply one flop,
! such as SGI, and 2*(INDEX^3) on those that don't consider it 1 flop,
! such as INTEL
! Kevin London
program flops
  implicit none
  include "f90papi.h"

  integer, parameter :: i8 = SELECTED_INT_KIND(16)   ! integer*8
  integer, parameter :: index = 1000
  real :: matrixa(index,index), matrixb(index,index), mres(index,index)
  real :: proc_time, mflops, real_time
  integer(kind=i8) :: flpins
  integer :: i, j, k, retval


  ! Matrix-matrix multiply
  do i = 1, index
    do j = 1, index
      do k = 1, index
        mres(i,j) = mres(i,j) + matrixa(i,k)*matrixb(k,j)
      end do
    end do
  end do

  ! Collect the data into the variables passed in
  call PAPIf_flops(real_time, proc_time, flpins, mflops, retval)
  if (retval .NE. PAPI_OK) then
    print *, 'Failure on PAPIf_flops: ', retval
  end if
  print *, 'Real_time: ', real_time
  print *, 'Proc_time: ', proc_time
  print *, 'Total flpins: ', flpins
  print *, 'MFLOPS: ', mflops

end program flops


    PAPI API Examples Low-level API

    PAPI Low-level Example

    A simple example using the low-level API:


#include <papi.h>
#include <stdio.h>
#include <stdlib.h>   /* for exit() */

#define NUM_FLOPS 10000

int main()
{
    int retval, EventSet = PAPI_NULL;
    long_long values[1];

    /* Initialize the PAPI library */
    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI library init error!\n");
        exit(1);
    }

    /* Create the Event Set */
    if (PAPI_create_eventset(&EventSet) != PAPI_OK) handle_error(1);

    /* Add Total Instructions Executed to our Event Set */
    if (PAPI_add_event(EventSet, PAPI_TOT_INS) != PAPI_OK) handle_error(1);


    /* Start counting events in the Event Set */
    if (PAPI_start(EventSet) != PAPI_OK) handle_error(1);

    /* do_flops is defined in tests/do_loops.c in the PAPI source distribution */


    do_flops(NUM_FLOPS);

    /* Read the counting events in the Event Set */
    if (PAPI_read(EventSet, values) != PAPI_OK) handle_error(1);

    printf("After reading the counters: %lld\n", values[0]);

    /* Reset the counting events in the Event Set */
    if (PAPI_reset(EventSet) != PAPI_OK) handle_error(1);

    do_flops(NUM_FLOPS);

    /* Add the counters in the Event Set */
    if (PAPI_accum(EventSet, values) != PAPI_OK) handle_error(1);
    printf("After adding the counters: %lld\n", values[0]);

    do_flops(NUM_FLOPS);

    /* Stop the counting of events in the Event Set */
    if (PAPI_stop(EventSet, values) != PAPI_OK) handle_error(1);

    printf("After stopping the counters: %lld\n", values[0]);
    return 0;
}


    PAPI in Parallel


For threads, PAPI_thread_init enables PAPI's thread support; it should be called immediately after PAPI_library_init.

MPI codes are treated very simply: each process has its own address space, and potentially its own hardware counters.
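A minimal sketch of the threaded case (assuming a Pthreads code; the cast adapts pthread_self to the id-function signature PAPI expects):

#include <papi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI library init error!\n");
        exit(1);
    }

    /* Enable thread support immediately after library initialization;
       PAPI wants a function returning a unique id for the calling thread */
    if (PAPI_thread_init((unsigned long (*)(void)) pthread_self) != PAPI_OK) {
        fprintf(stderr, "PAPI thread init error!\n");
        exit(1);
    }

    /* ... spawn threads; each thread may now create and run its own
       event set with the low-level calls shown earlier ... */
    return 0;
}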


    High-level Tools Popular Examples

    Tool Examples


Some examples of such high-level tools (not an exhaustive list):

TAU, Tuning and Analysis Utilities, http://www.cs.uoregon.edu/Research/tau/home.php

Open|SpeedShop, funded by the U.S. DOE, http://www.openspeedshop.org

IPM, Integrated Performance Monitoring, http://ipm-hpc.sourceforge.net

High-level Tools Specific Example: IPM

    Example: IPM


IPM is relatively simple to install and use, so we can easily walk through our favorite example. Note that IPM covers:

    MPI

    PAPI

    I/O profiling

    Memory

    Timings: wall, user, and system


    High-level Tools Specific Example: IPM

    Script to Generate HTML from XML Results


#!/bin/bash

if [ $# -ne 1 ]; then
  echo "Usage: $0 xml_filename"
  exit
fi
XMLFILE=$1

export IPM_KEYFILE=/projects/jonesm/ipm/src/ipm/ipm_key
export PATH=${PATH}:/projects/jonesm/ipm/src/ipm/bin
/projects/jonesm/ipm/src/ipm/bin/ipm_parse -html $XMLFILE


    High-level Tools Specific Example: IPM

    Visualize Results in Browser

Transfer the compressed tar file to your local machine, unpack it, and browse the results.


    High-level Tools Specific Example: IPM

    Summary


    Summary of high-level tools

IPM is pretty easy to use and provides some good functionality.

TAU and Open|SpeedShop have steeper learning curves, but much more functionality.
