class16_profiling-handout.pdf
TRANSCRIPT
Application Performance Tuning
M. D. Jones, Ph.D.
Center for Computational Research, University at Buffalo
State University of New York
High Performance Computing I, 2012
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 1 / 108
Introduction to Performance Engineering in HPC Performance Fundamentals
Performance Foundations
Three pillars of performance optimization:
Algorithmic - choose the most effective algorithm that you can for the problem of interest
Serial Efficiency - optimize the code to run efficiently in a non-parallel environment
Parallel Efficiency - effectively use multiple processors to achieve a reduction in execution time, or equivalently, to solve a proportionately larger problem
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 3 / 108
Introduction to Performance Engineering in HPC Performance Fundamentals
Algorithmic Efficiency
Choose the best algorithm before you start coding (recall that good planning is an essential part of writing good software):
Running on a large number of processors? Choose an algorithm that scales well with increasing processor count
Running a large system (mesh points, particle count, etc.)? Choose an algorithm that scales well with system size
If you are going to run on a massively parallel machine, plan from the beginning on how you intend to decompose the problem (it may save you a lot of time later)
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 4 / 108
Introduction to Performance Engineering in HPC Performance Baseline
Serial Efficiency
Getting efficient code in parallel is made much more difficult if you have not optimized the sequential code, and in fact can lead to a misleading picture of parallel performance. Recall that our definition of parallel speedup,

S(N_p) = T_s / T_p(N_p),

involves the time, T_s, for an optimal sequential implementation (not just T_p(1)!). For example, if the best serial code runs in T_s = 100 s and the parallel code takes T_p(8) = 20 s on 8 processors, the speedup is S(8) = 5, even if T_p(1)/T_p(8) would suggest a larger number.
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 5 / 108
Introduction to Performance Engineering in HPC Performance Baseline
Establish a Performance Baseline
Steps to establishing a baseline for your own performance expectations:
Choose a representative problem (or better still a suite of problems) that can be run under identical circumstances on a multitude of platforms/compilers (requires portable code!)
How fast is fast? You can utilize hardware performance counters to measure actual code performance
Profile, profile, and then profile some more ... to find bottlenecks and spend your time more effectively in optimizing code
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 6 / 108
Introduction to Performance Engineering in HPC Pitfalls in Parallel Performance
Parallel Performance Trap
Pitfalls when measuring the performance of parallel codes:
For many, speedup or linear scalability is the ultimate goal.
This goal is incomplete - a terribly inefficient code can scale well, but actually deliver poor efficiency.
For example, consider a simple Monte Carlo based code that uses the most rudimentary uniform sampling (i.e. no importance sampling) - this can be made to scale perfectly in parallel, but the algorithmic efficiency (measured perhaps by the delivered variance per cpu-hour) is quite low.
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 7 / 108
Simple Profiling Tools Timers
time Command
Note that this is not the time built-in function in many shells (bash and tcsh included), but instead the one located in /usr/bin. This command is quite useful for getting an overall picture of code performance. The default output format:

%Uuser %Ssystem %Eelapsed %PCPU (%Xtext+%Ddata %Mmax)k
%Iinputs+%Ooutputs (%Fmajor+%Rminor)pagefaults %Wswaps

and using the -p option:

real %e
user %U
sys %S
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 9 / 108
Simple Profiling Tools Timers
time Example
[bono:~/d_laplace]$ /usr/bin/time ./laplace_s
:
Max value in sol: 0.999992327961218
Min value in sol: 8.742278000372475E-008
75.82user 0.00system 1:17.72elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+913minor)pagefaults 0swaps
[bono:~/d_laplace]$ /usr/bin/time -p ./laplace_s
:
Max value in sol: 0.999992327961218
Min value in sol: 8.742278000372475E-008
real 75.73
user 74.68
sys 0.00
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 10 / 108
Simple Profiling Tools Timers
time MPI Example
[bono:~/d_laplace]$ /usr/bin/time mpirun -np 2 ./laplace_mpi
:
Max value in sol: 0.999992327961218
Min value in sol: 8.742278000372475E-008
Writing logfile....
Finished writing logfile.
28.43user 1.54system 0:31.95elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+14920minor)pagefaults 0swaps
[bono:~/d_laplace]$ qsub -q debug -l nodes=2:ppn=2,walltime=00:30:00 -I
qsub: waiting for job 577255.bono.ccr.buffalo.edu to start
##############################PBS Prologue##############################
PBS prologue script run on host c15n32 at Fri Sep 28 15:03:05 EDT 2007
PBSTMPDIR is /scratch/577255.bono.ccr.buffalo.edu
/usr/bin/time mpiexec ./laplace_mpi
:
Max value in sol: 0.999992327961218
Min value in sol: 8.742278000372475E-008
0.01user 0.01system 0:30.82elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1416minor)pagefaults 0swaps

OSC's mpiexec will report accurate timings in the (optional) email report, as it does not rely on rsh/ssh to launch tasks (but Intel MPI does, so in that case you will see the result of timing the mpiexec shell script, not the MPI code).
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 11 / 108
Simple Profiling Tools Code Section Timing (Calipers)
Code Section Timing (Calipers)
Timing sections of code requires a bit more work on the part of the
programmer, but there are reasonably portable means of doing so:
Routine              Type      Resolution
times                user/sys  ms
gettimeofday         wall      μs
clock_gettime        wall      ns
system_clock (f90)   wall      system-dependent
cpu_time (f95)       cpu       compiler-dependent
MPI_Wtime*           wall      system-dependent
OMP_GET_WTIME*       wall      system-dependent
Generally I prefer the MPI and OpenMP timing calls whenever I can use them (*the MPI and OpenMP specifications call for their intrinsic timers to be high precision).
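A minimal caliper sketch in C, assuming a hypothetical compute() routine standing in for the section to be timed (this example is mine, not from the handout):

#include <stdio.h>
#include <mpi.h>

/* hypothetical stand-in for the code section being measured */
static void compute(void)
{
    volatile double s = 0.0;
    int i;
    for (i = 1; i <= 10000000; i++)
        s += 1.0 / i;
}

int main(int argc, char **argv)
{
    double t0, t1;
    MPI_Init(&argc, &argv);

    t0 = MPI_Wtime();            /* open the caliper */
    compute();
    t1 = MPI_Wtime();            /* close the caliper */

    printf("section took %.6f s (timer resolution %.2e s)\n",
           t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}

MPI_Wtick reports the resolution of MPI_Wtime, which is a quick sanity check that the timer is fine-grained enough for the section being measured.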
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 12 / 108
Simple Profiling Tools Code Section Timing (Calipers)
More information on code section timers (and code for doing so):
LLNL Performance Tools: https://computing.llnl.gov/tutorials/performance_tools/#gettimeofday
Stopwatch (nice F90 module, but you need to supply a low-level function for accessing a timer): http://math.nist.gov/StopWatch
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 13 / 108
Simple Profiling Tools gprof
GNU Tools: gprof
Tool that we used briefly before:
Generic GNU profiler
Requires recompiling code with -pg option
Running subsequent instrumented code produces gmon.out to be read by gprof
Use the environment variable GMON_OUT_PREFIX to specify a new gmon.out prefix to which the process ID will be appended (especially useful for parallel runs) - this is a largely undocumented feature ...
Line-level profiling is possible, as we will see in the following example
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 14 / 108
gprof Shortcomings
Shortcomings of gprof (which apply also to any statistical profiling tool):
Need to recompile to instrument the code
Instrumentation can affect the statistics in the profile
Overhead can significantly increase the running time
Compiler optimization can be affected by instrumentation
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 15 / 108
Types of gprof Profiles
gprof profiles come in three types:
1 Flat Profile: shows how much time your program spent in each function, and how many times that function was called
2 Call Graph: for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function
3 Basic-block: Requires compilation with the -a flag (supported only by GNU?) - enables gprof to construct an annotated source code listing showing how many times each line of code was executed
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 16 / 108
gprof example
g77 -I. -O3 -ffast-math -g -pg -o rp rp_read.o initial.o en_gde.o \
  adwfns.o rpqmc.o evol.o dbxgde.o dbxtri.o gen_etg.o gen_rtg.o \
  gdewfn.o lib/*.o
[jonesm@bono ~/d_bench]$ qsub -q debug -l nodes=1:ppn=2,walltime=00:30:00 -I
[jonesm@c16n30 ~/d_bench]$ ./rp
Enter runid: (<= 9 chars)
short
...
... skip copious amount of standard output ...
...
[jonesm@bono ~/d_bench]$ gprof rp gmon.out >& out.gprof
[jonesm@bono ~/d_bench]$ less out.gprof
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative    self               self     total
 time    seconds   seconds     calls  s/call   s/call  name
89.74     123.23    123.23    204008    0.00     0.00  triwfns_
 6.96     132.79      9.56         1    9.56   137.22  MAIN__
 1.18     134.41      1.62    200004    0.00     0.00  en_gde__
 1.05     135.86      1.44    204002    0.00     0.00  evol_
 0.71     136.83      0.97  14790551    0.00     0.00  ranf_
 0.27     137.20      0.37    204008    0.00     0.00  gdewfn_
...
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 17 / 108
[jonesm@bono ~/d_bench]$ gprof --line rp gmon.out >& out.gprof.line
[jonesm@bono ~/d_bench]$ less out.gprof.line
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative    self              self    total
 time    seconds   seconds  calls  ns/call  ns/call  name
17.45      23.96     23.96                           triwfns_ (adwfns.f:129 @ 403c94)
14.46      43.82     19.86                           triwfns_ (adwfns.f:130 @ 403ce6)
12.87      61.50     17.68                           triwfns_ (adwfns.f:171 @ 404755)
12.31      78.41     16.91                           triwfns_ (adwfns.f:172 @ 4047e2)
 0.67      79.33      0.92                           triwfns_ (adwfns.f:130 @ 403cd6)
 0.59      80.14      0.82                           MAIN__ (cc4WTuQH.f:308 @ 4070c6)
 0.51      80.84      0.70                           MAIN__ (cc4WTuQH.f:304 @ 4072da)
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 18 / 108
More gprof Information
More gprof documentation:
gprof GNU Manual: http://www.gnu.org/software/binutils/manual/gprof-2.9.1/gprof.html
gprof man page: man gprof
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 19 / 108
Simple Profiling Tools pgprof
PGI Tools: pgprof
PGI tools also have profiling capabilities (c.f. man pgf95)
Graphical profiler, pgprof
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 20 / 108
N.B. you can also use the -text option to pgprof to make it behave more like gprof
See the PGI Tools Guide for more information (should be a PDF copy in $PGI/doc)
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 23 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
mpiP: Statistical MPI Profiling
http://mpip.sourceforge.net
Not a tracing tool, but a lightweight interface to accumulate
statistics using the MPI profiling interface
Quite useful in conjunction with a tracefile analysis (e.g. using
jumpshot)
Installed on CCR systems - see module avail for mpiP
availability and location
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 25 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
mpiP Compilation
To use mpiP you need to:
Add a -g flag to add symbols (this will allow mpiP to access the source code symbols and line numbers)
Link in the necessary mpiP profiling library and the binary utility libraries for actually decoding symbols (there is a trick that you can use most of the time to avoid having to link with mpiP, though).
Compilation examples (from U2) follow ...
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 26 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
mpiP Runtime Flags
You can set various mpiP runtime flags (e.g. export MPIP="-t 10.0 -k 2"):

Option  Description                                              Default
-c      Generate concise version of report, omitting callsite
        process-specific detail.
-e      Print report data using floating-point format
-f dir  Record output file in directory dir.
-g      Enable mpiP debug mode                                   disabled
-k n    Sets callsite stack traceback depth to n                 1
-n      Do not truncate full pathname of filename in callsites
-o      Disable profiling at initialization. Application must
        enable profiling with MPI_Pcontrol()
-s n    Set hash table size to n                                 256
-t x    Set print threshold for report, where x is the MPI
        percentage of time for each callsite                     0.0
-v      Generates both concise and verbose report output
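To illustrate the -o flag, here is a minimal sketch of an application that enables mpiP profiling only around the region of interest via MPI_Pcontrol() (the solve() routine is a hypothetical stand-in, not part of the handout):

#include <mpi.h>

/* hypothetical region of interest; with the -o flag, mpiP starts
   with profiling disabled and only this bracketed region is profiled */
static void solve(void)
{
    MPI_Barrier(MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Pcontrol(1);   /* turn mpiP profiling on */
    solve();
    MPI_Pcontrol(0);   /* turn mpiP profiling off */

    MPI_Finalize();
    return 0;
}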
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 27 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
mpiP Example Output
From an older version of mpiP, but still almost entirely the same - this one links directly with mpiP; first compile:

mpicc -g -o atlas aij2_basis.o analyze.o atlas.o barrier.o byteflip.o \
  chordgn2.o cstrings.o io2.o map.o mutils.o numrec.o paramods.o proj.o \
  projAtlas.o sym2.o util.o -lm -L/Projects/CCR/jonesm/mpiP-2.8.2/gnu/ch_gm/lib \
  -lmpiP -lbfd -liberty -lm

then run (in this case using 16 processors) and examine the output file:

[jonesm@joplin d_derenzo]$ ls *.mpiP
atlas.gcc3-mpip-papi-mpiP.16.20578.1.mpiP
[jonesm@joplin d_derenzo]$ less atlas.gcc3-mpip-papi-mpiP.16.20578.1.mpiP
:
:
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 28 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
@ mpiP
@ Command : ./atlas.gcc3-mpip-papi-mpiP study.ini 0
@ Version : 2.8.2
@ MPIP Build date : Jun 29 2005, 14:53:41
@ Start time : 2005 06 29 15:18:52
@ Stop time : 2005 06 29 15:28:34
@ Timer Used : gettimeofday
@ MPIP env var : [null]
@ Collector Rank : 0
@ Collector PID : 20578
@ Final Output Dir : .
@ MPI Task Assignment : 0 bb18n17.ccr.buffalo.edu
@ MPI Task Assignment : 1 bb18n17.ccr.buffalo.edu
@ MPI Task Assignment : 2 bb18n16.ccr.buffalo.edu
@ MPI Task Assignment : 3 bb18n16.ccr.buffalo.edu
@ MPI Task Assignment : 4 bb18n15.ccr.buffalo.edu
@ MPI Task Assignment : 5 bb18n15.ccr.buffalo.edu
@ MPI Task Assignment : 6 bb18n14.ccr.buffalo.edu
@ MPI Task Assignment : 7 bb18n14.ccr.buffalo.edu
@ MPI Task Assignment : 8 bb18n13.ccr.buffalo.edu
@ MPI Task Assignment : 9 bb18n13.ccr.buffalo.edu
@ MPI Task Assignment : 10 bb18n12.ccr.buffalo.edu
@ MPI Task Assignment : 11 bb18n12.ccr.buffalo.edu
@ MPI Task Assignment : 12 bb18n11.ccr.buffalo.edu
@ MPI Task Assignment : 13 bb18n11.ccr.buffalo.edu
@ MPI Task Assignment : 14 bb18n10.ccr.buffalo.edu
@ MPI Task Assignment : 15 bb18n10.ccr.buffalo.edu
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 29 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
---------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------
---------------------------------------------------------------------
Task    AppTime    MPITime    MPI%
   0        582       44.7    7.69
   1        579       41.9    7.24
   2        579       40.7    7.03
   3        579       36.9    6.37
   4        579       22.3    3.84
   5        579       16.6    2.87
   6        579         32    5.53
   7        579       35.9    6.20
   8        579       28.6    4.93
   9        579       25.9    4.48
  10        579       39.2    6.76
  11        579       33.8    5.84
  12        579       35.3    6.10
  13        579         41    7.07
  14        579       29.9    5.16
  15        579       41.4    7.16
   *   9.27e+03        546    5.89
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 30 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
---------------------------------------------------------------------
@--- Callsites: 13 --------------------------------------------------
---------------------------------------------------------------------
ID  Lev  File/Address   Line  Parent_Funct          MPI_Call
 1    0  util.c          833  gsync                 Barrier
 2    0  atlas.c        1531  readProjData          Allreduce
 3    0  projAtlas.c     745  backProjAtlas         Allreduce
 4    0  atlas.c        1545  readProjData          Allreduce
 5    0  atlas.c        1525  readProjData          Allreduce
 6    0  atlas.c        1541  readProjData          Allreduce
 7    0  atlas.c        1589  readProjData          Allreduce
 8    0  atlas.c        1519  readProjData          Allreduce
 9    0  util.c          789  mygcast               Bcast
10    0  projAtlas.c    1100  computeLoglikeAtlas   Allreduce
11    0  atlas.c        1514  readProjData          Allreduce
12    0  atlas.c        1537  readProjData          Allreduce
13    0  projAtlas.c     425  fwdBackProjAtlas2     Allreduce
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 31 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
---------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------
---------------------------------------------------------------------
Call        Site       Time    App%    MPI%     COV
Allreduce     13   3.09e+05    3.33   56.50    0.46
Barrier        1   2.13e+05    2.30   38.97    0.35
Bcast          9   1.69e+04    0.18    3.10    0.37
Allreduce      3   7.78e+03    0.08    1.42    0.11
Allreduce     10       62.7    0.00    0.01    0.20
Allreduce     11       2.42    0.00    0.00    0.09
Allreduce      7       2.17    0.00    0.00    0.26
Allreduce     12       1.15    0.00    0.00    0.20
Allreduce      6       1.14    0.00    0.00    0.19
Allreduce      5       1.13    0.00    0.00    0.15
Allreduce      8       1.12    0.00    0.00    0.18
Allreduce      2        1.1    0.00    0.00    0.13
Allreduce      4        1.1    0.00    0.00    0.12
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 32 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
---------------------------------------------------------------------
@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----
---------------------------------------------------------------------
Call        Site    Count      Total       Avrg   Sent%
Allreduce     13    65536   2.28e+09   3.48e+04   83.69
Allreduce      3     8192   2.85e+08   3.48e+04   10.46
Bcast          9   490784   1.59e+08        325    5.85
Allreduce     11       16   2.07e+04    1.3e+03    0.00
Allreduce     10      512    4.1e+03          8    0.00
Allreduce      7       16        256         16    0.00
Allreduce      2       16         64          4    0.00
Allreduce      6       16         64          4    0.00
Allreduce      5       16         64          4    0.00
Allreduce      4       16         64          4    0.00
Allreduce      8       16         64          4    0.00
Allreduce     12       16         64          4    0.00
...
...
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 33 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
Using mpiP at Runtime
Now let's examine an example using mpiP at runtime.
This example solves a simple Laplace equation with Dirichlet boundary conditions using finite differences.
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 34 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
#PBS -S /bin/bash
#PBS -q debug
#PBS -l walltime=00:20:00
#PBS -l nodes=2:MEM24GB:ppn=8
#PBS -M [email protected]
#PBS -m e
#PBS -N test
#PBS -o subMPIP.out
#PBS -j oe
module load intel
module load intel-mpi
module load mpip
module list
cd $PBS_O_WORKDIR
which mpiexec
NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
NPROCS=`cat $PBS_NODEFILE | wc -l`
export I_MPI_DEBUG=5
# Use LD_PRELOAD trick to load mpiP wrappers at runtime
export LD_PRELOAD=$MPIPDIR/lib/libmpiP.so
mpdboot -n $NNODES -f $PBS_NODEFILE -v
mpiexec -np $NPROCS -envall ./laplace_mpi
Parallel Profiling Tools Statistical MPI Profiling With mpiP
@ mpiP
@ Command : ./laplace_mpi
@ Version : 3.3.0
@ MPIP Build date : Oct 14 2011, 16:16:34
@ Start time : 2011 10 14 16:26:15
@ Stop time : 2011 10 14 16:28:11
@ Timer Used : PMPI_Wtime
@ MPIP env var : [null]
@ Collector Rank : 0
@ Collector PID : 8618
@ Final Output Dir : .
@ Report generation : Single collector task
@ MPI Task Assignment : 0 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 1 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 2 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 3 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 4 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 5 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 6 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 7 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 8 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 9 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 10 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 11 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 12 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 13 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 14 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 15 d15n23.ccr.buffalo.edu
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 37 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
@ - - MPI Time (seconds) - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Task AppTime MPITime MPI%
0 116 16.9 14.57
1 115 16.4 14.18
2 115 16.8 14.53
3 115 17.7 15.34
4 115 16.6 14.39
5 115 14.3 12.37
6 115 13.7 11.90
7 115 11.1 9.65
8 115 12.5 10.87
9 115 13.8 12.00
10 115 14.5 12.60
11 115 16.9 14.67
12 115 17.5 15.15
13 115 19.4 16.82
14 115 16.4 14.22
15 115 17.5 15.13
* 1.85e+03 252 13.65
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 38 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
@ - - Callsites: 6 - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ID Lev File/Address Line Parent_Funct MPI_Call
1 0 laplace_mpi.f90 118 MAIN__ Allreduce
2 0 laplace_mpi.f90 143 MAIN__ Recv
3 0 laplace_mpi.f90 48 __paramod_MOD_xchange Sendrecv
4 0 laplace_mpi.f90 80 MAIN__ Bcast
5 0 laplace_mpi.f90 46 __paramod_MOD_xchange Sendrecv
6 0 laplace_mpi.f90 138 MAIN__ Send
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
@ - - Aggregate Time (top twenty, descending, milliseconds) - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Call Site Time App% MPI% COV
Allreduce 1 2.34e+05 12.69 92.94 0.15
Sendrecv 3 8.88e+03 0.48 3.52 0.19
Sendrecv 5 8.47e+03 0.46 3.36 0.16
Send 6 321 0.02 0.13 0.51
Bcast 4 97.7 0.01 0.04 0.26
Recv 2 14.4 0.00 0.01 0.00
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 39 / 108
Parallel Profiling Tools Statistical MPI Profiling With mpiP
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
@ - - Callsite Time statistics (all, milliseconds): 80 - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name Site Rank Count Max Mean Min App% MPI%
Allreduce 1 0 23845 28.5 0.666 0.036 13.69 94.00
Allreduce 1 1 23845 28.6 0.644 0.035 13.30 93.79
Allreduce 1 2 23845 28.6 0.661 0.037 13.66 93.98
Allreduce 1 3 23845 28.6 0.695 0.038 14.36 93.63
Allreduce 1 4 23845 29.2 0.651 0.038 13.46 93.54
Allreduce 1 5 23845 29.2 0.555 0.035 11.46 92.71
Allreduce 1 6 23845 29 0.528 0.037 10.91 91.64
Allreduce 1 7 23845 26.3 0.405 0.038 8.37 86.74
Allreduce 1 8 23845 26.8 0.468 0.034 9.67 88.92
Allreduce 1 9 23845 28.4 0.538 0.033 11.11 92.59
Allreduce 1 10 23845 29.7 0.566 0.033 11.70 92.90
Allreduce 1 11 23845 29.7 0.661 0.033 13.65 93.10
Allreduce 1 12 23845 28.6 0.686 0.039 14.17 93.49
Allreduce 1 13 23845 28.8 0.768 0.041 15.87 94.35
Allreduce 1 14 23845 28.6 0.641 0.036 13.25 93.19
Allreduce 1 15 23845 28.7 0.694 0.038 14.33 94.76
Allreduce 1 * 381520 29.7 0.614 0.033 12.69 92.94
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 40 / 108
Parallel Profiling Tools Intel Trace Analyzer/Collector
ITAC Example
Note that you do not have to recompile your application to use ITAC (unless you are building it statically); you can just build it as usual, using Intel MPI:
[bono:~/d_laplace/d_itac]$ module load intel-mpi
[bono:~/d_laplace/d_itac]$ make laplace_mpi
mpiifort -ipo -O3 -Vaxlib -g -c laplace_mpi.f90
mpiifort -ipo -O3 -Vaxlib -g -o laplace_mpi laplace_mpi.o
ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_ifortjOzYAn.o
[bono:~/d_laplace/d_itac]$
You can turn trace collection on at run-time ...
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 43 / 108
Parallel Profiling Tools Intel Trace Analyzer/Collector
Running the preceding batch job on U2 produces a bunch (many!) of profiling output files, the most important of which is named after your binary with a .stf suffix, in this case laplace_mpi.stf. We feed that to the Intel Trace Analyzer using the traceanalyzer command ... and we should see a profile that looks very much like what you can see using jumpshot.
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 45 / 108
Parallel Profiling Tools Intel Trace Analyzer/Collector
More ITAC Documentation
Some helpful pointers to more ITAC documentation:

[k07n14:~]$ which traceanalyzer
/util/intel/itac/8.0.3.007/bin/traceanalyzer
[k07n14:~]$ ls -l /util/intel/itac/8.0.3.007/doc/
total 4211
-r--r--r-- 1 root root   91050 Aug 25 06:02 FAQ.pdf
-r--r--r-- 1 root root   15566 Aug 25 06:02 Getting_Started.html
drwxr-xr-x 3 root root      57 Oct  4 08:02 html
-r--r--r-- 1 root root   61598 Aug 25 06:02 INSTALL.html
-r--r--r-- 1 root root 2051029 Aug 25 06:02 ITA_Reference_Guide.pdf
-r--r--r-- 1 root root 1050276 Aug 25 06:02 ITC_Reference_Guide.pdf
-r--r--r-- 1 root root   20681 Aug 25 06:02 Release_Notes.txt
Generally a good idea to refer to the documentation for the same version that you are using (you can check with module show intel-mpi).
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 47 / 108
Performance API (PAPI) Introduction to PAPI
Introduction
Performance Application Programming Interface
Implement a portable(!) and efficient API to access existing hardware performance counters
Ease the optimization of code by providing base infrastructure for cross-platform tools
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 49 / 108
Performance API (PAPI) Reminder: U2 Hardware
Nehalem Xeon Block Diagram
Block diagram for Nehalem architecture:
[Figure: Intel Nehalem microarchitecture block diagram - front end (32 KB instruction cache, branch prediction, decoders, loop stream decoder, decoded instruction queue), out-of-order engine (register allocation tables, 128-entry reorder buffer, 128-entry reservation station feeding integer/SSE/FP units on ports 0-5), 32 KB L1 data cache, 256 KB private per-core L2 cache, shared 8 MB L3 cache in the uncore, QuickPath Interconnect, and integrated DDR3 memory controller.]
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 55 / 108
Performance API (PAPI) Reminder: U2 Hardware
Superscalar - EM64T Irwindale
Characteristics of the Irwindale Xeons that form (part of) CCR's U2 cluster:

Clock Cycle             3.2 GHz
TPP                     6.4 GFlop/s
Pipeline                31 stages
L2 Cache Size           2 MByte
L2 Bandwidth            102.4 GByte/s
CPU-Memory Bandwidth    6.4 GByte/s (shared)
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 56 / 108
Performance API (PAPI) High vs. Low-level PAPI
High-level PAPI
Intended for coarse-grained measurements
Requires little (or no) setup code
Allows only PAPI preset events
Allows only aggregate counting (no statistical profiling)
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 61 / 108
Performance API (PAPI) High vs. Low-level PAPI
Low-level PAPI
More efficient (and functional) than high-level
About 60 functions
Thread-safe
Supports presets and native events
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 62 / 108
Performance API (PAPI) High vs. Low-level PAPI
Preset vs. Native Events
preset or pre-defined events are those which have been considered useful by the PAPI community and developers: http://icl.cs.utk.edu/projects/papi/presets.html
native events are those countable by the CPU's hardware. These events are highly platform-specific, and you would need to consult the processor architecture manuals for the relevant native event lists
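As a minimal sketch (my example, not from the handout) of checking at run time whether a preset event is actually countable on the current CPU:

#include <stdio.h>
#include "papi.h"

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI library init error!\n");
        return 1;
    }
    /* query a preset event; a native event would use the same call with
       a code obtained from PAPI_event_name_to_code() */
    if (PAPI_query_event(PAPI_FP_OPS) == PAPI_OK)
        printf("PAPI_FP_OPS is available on this platform\n");
    else
        printf("PAPI_FP_OPS is not available on this platform\n");
    return 0;
}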
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 63 / 108
Performance API (PAPI) High vs. Low-level PAPI
How to Access PAPI at CCR
Consider a simple example code to measure Flop/s using the high-level PAPI API:

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int your_slow_code();

int main()
{
  float real_time, proc_time, mflops;
  long_long flpops;
  float ireal_time, iproc_time, imflops;
  long_long iflpops;
  int retval;

  if ((retval = PAPI_flops(&ireal_time, &iproc_time, &iflpops, &imflops)) < PAPI_OK)
  {
    printf("Could not initialise PAPI_flops\n");
    printf("Your platform may not support floating point operation event.\n");
    printf("retval: %d\n", retval);
    exit(1);
  }

  your_slow_code();

  if ((retval = PAPI_flops(&real_time, &proc_time, &flpops, &mflops)) < PAPI_OK)
  {
    printf("retval: %d\n", retval);
    exit(1);
  }

  printf("Real_time: %f Proc_time: %f Total flpops: %lld MFLOPS: %f\n",
         real_time, proc_time, flpops, mflops);
  exit(0);
}

int your_slow_code()
{
  int i;
  double tmp = 1.1;

  /* loop body truncated in the transcript; reconstructed here as simple
     floating-point busy work */
  for (i = 1; i < 2000; i++)
    tmp = (tmp + 100) / i;
  return 0;
}
On U2, you access the papi module and compile accordingly:

[k07n14:~/d_papi]$ qsub -q debug -l nodes=1:MEM48GB:ppn=12,walltime=01:00:00 -I
qsub: waiting for job 1144485.d15n41.ccr.buffalo.edu to start
qsub: job 1144485.d15n41.ccr.buffalo.edu ready

Job 1144485.d15n41.ccr.buffalo.edu has requested 12 cores/processors per node.
PBSTMPDIR is /scratch/1144485.d15n41.ccr.buffalo.edu
[k16n01b:~]$ cd $PBS_O_WORKDIR
[k16n01b:~/d_papi]$ module load papi
'papi/v4.1.4' load complete.
[k16n01b:~/d_papi]$ gcc -I$PAPI/include -o PAPI_flops PAPI_flops.c -L$PAPI/lib -lpapi
[k16n01b:~/d_papi]$ ./PAPI_flops
Real_time: 0.000042 Proc_time: 0.000029 Total flpops: 7983 MFLOPS: 276.572449

N.B., PAPI needs to support the underlying hardware, and this version does not support the 32-core Intel nodes (including the front end).
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 68 / 108
Performance API (PAPI) High vs. Low-level PAPI
... and on Lennon (SGI Altix):

[jonesm@lennon ~/d_papi]$ module load papi
'papi/v3.2.1' load complete.
[jonesm@lennon ~/d_papi]$ echo $PAPI
/util/perftools/papi-3.2.1
[jonesm@lennon ~/d_papi]$ gcc -I$PAPI/include -o PAPI_flops PAPI_flops.c \
  -L$PAPI/lib -lpapi
[jonesm@lennon ~/d_papi]$ ldd PAPI_flops
  libpapi.so => /util/perftools/papi-3.2.1/lib/libpapi.so (0x2000000000040000)
  libc.so.6.1 => /lib/tls/libc.so.6.1 (0x20000000000ac000)
  libpfm.so.2 => /usr/lib/libpfm.so.2 (0x2000000000318000)
  /lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)
[jonesm@lennon ~/d_papi]$ ./PAPI_flops
Real_time: 0.000143 Proc_time: 0.000132 Total flpops: 48056 MFLOPS: 363.964111

N.B., the Altix is dead, but this is a good example of the cross-platform portability of PAPI accessing the hardware performance counters.
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 69 / 108
Performance API (PAPI) High vs. Low-level PAPI
papi_avail Command
You can use papi_avail to check event availability (different CPUs
support various events):
[k16n01b:~/d_papi]$ papi_avail
Available events and hardware information.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PAPI Version : 4.1.4.0
Vendor string and code : GenuineIntel (1)
Model string and code : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (44)
CPU Revision : 2.000000
CPUID Info : Family: 6 Model: 44 Stepping: 2
CPU Megahertz : 2400.391113
CPU Clock Megahertz : 2400
Hdw Threads per core : 1
Cores per Socket : 6
NUMA Nodes : 2
CPU's per Node : 6
Total CPU's : 12
Number Hardware Counters : 16
Max Multiplex Counters : 512
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The following correspond to fields in the PAPI_event_info_t structure.
Name Code Avail Deriv Description (Note)
PAPI_L1_DCM 0x80000000 Yes No Level 1 data cache misses
PAPI_L1_ICM 0x80000001 Yes No Level 1 instruction cache misses
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 70 / 108
Performance API (PAPI) High vs. Low-level PAPI
PAPI_L2_DCM 0x80000002 Yes Yes Level 2 data cache misses
PAPI_L2_ICM 0x80000003 Yes No Level 2 instruction cache misses
PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses
PAPI_L3_ICM 0x80000005 No No Level 3 instruction cache misses
PAPI_L1_TCM 0x80000006 Yes Yes Level 1 cache misses
PAPI_L2_TCM 0x80000007 Yes No Level 2 cache misses
PAPI_L3_TCM 0x80000008 Yes No Level 3 cache misses
PAPI_CA_SNP 0x80000009 No No Requests for a snoop
PAPI_CA_SHR 0x8000000a No No Requests for exclusive access to shared cache line
PAPI_CA_CLN 0x8000000b No No Requests for exclusive access to clean cache line
PAPI_CA_INV 0x8000000c No No Requests for cache line invalidation
PAPI_CA_ITV 0x8000000d No No Requests for cache line intervention
PAPI_L3_LDM 0x8000000e Yes No Level 3 load misses
PAPI_L3_STM 0x8000000f No No Level 3 store misses
PAPI_BRU_IDL 0x80000010 No No Cycles branch units are idle
PAPI_FXU_IDL 0x80000011 No No Cycles integer units are idle
PAPI_FPU_IDL 0x80000012 No No Cycles floating point units are idle
PAPI_LSU_IDL 0x80000013 No No Cycles load/store units are idle
PAPI_TLB_DM 0x80000014 Yes No Data translation lookaside buffer misses
PAPI_TLB_IM 0x80000015 Yes No Instruction translation lookaside buffer misses
PAPI_TLB_TL 0x80000016 Yes Yes Total translation lookaside buffer misses
PAPI_L1_LDM 0x80000017 Yes No Level 1 load misses
PAPI_L1_STM 0x80000018 Yes No Level 1 store misses
PAPI_L2_LDM 0x80000019 Yes No Level 2 load misses
PAPI_L2_STM 0x8000001a Yes No Level 2 store misses
PAPI_BTAC_M 0x8000001b No No Branch target address cache misses
PAPI_PRF_DM 0x8000001c No No Data prefetch cache misses
PAPI_L3_DCH 0x8000001d No No Level 3 data cache hits
PAPI_TLB_SD 0x8000001e No No Translation lookaside buffer shootdowns
PAPI_CSR_FAL 0x8000001f No No Failed store conditional instructions
PAPI_CSR_SUC 0x80000020 No No Successful store conditional instructions
PAPI_CSR_TOT 0x80000021 No No Total store conditional instructions
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 71 / 108
Performance API (PAPI) High vs. Low-level PAPI
PAPI_FML_INS 0x80000061 No No Floating point multiply instructions
PAPI_FAD_INS 0x80000062 No No Floating point add instructions
PAPI_FDV_INS 0x80000063 No No Floating point divide instructions
PAPI_FSQ_INS 0x80000064 No No Floating point square root instructions
PAPI_FNV_INS 0x80000065 No No Floating point inverse instructions
PAPI_FP_OPS 0x80000066 Yes Yes Floating point operations
PAPI_SP_OPS 0x80000067 Yes Yes Floating point operations; optimized to count
scaled single precision vector operations
PAPI_DP_OPS 0x80000068 Yes Yes Floating point operations; optimized to count
scaled double precision vector operations
PAPI_VEC_SP 0x80000069 Yes No Single precision vector/SIMD instructions
PAPI_VEC_DP 0x8000006a Yes No Double precision vector/SIMD instructions
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Of 107 possible events, 57 are available, of which 14 are derived.
avail.c PASSED
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 74 / 108
Performance API (PAPI) High vs. Low-level PAPI
PAPI_L1_TCM 0x80000006 Yes Level 1 cache misses
PAPI_L2_TCM 0x80000007 No Level 2 cache misses
PAPI_L3_TCM 0x80000008 No Level 3 cache misses
PAPI_L3_LDM 0x8000000e No Level 3 load misses
PAPI_TLB_DM 0x80000014 No Data translation lookaside buffer misses
PAPI_TLB_IM 0x80000015 No Instruction translation lookaside buffer misses
PAPI_TLB_TL 0x80000016 Yes Total translation lookaside buffer misses
PAPI_L1_LDM 0x80000017 No Level 1 load misses
PAPI_L1_STM 0x80000018 No Level 1 store misses
PAPI_L2_LDM 0x80000019 No Level 2 load misses
PAPI_L2_STM 0x8000001a No Level 2 store misses
PAPI_BR_UCN 0x8000002a No Unconditional branch instructions
PAPI_BR_CN 0x8000002b No Conditional branch instructions
PAPI_BR_TKN 0x8000002c No Conditional branch instructions taken
PAPI_BR_NTK 0x8000002d Yes Conditional branch instructions not taken
PAPI_BR_MSP 0x8000002e No Conditional branch instructions mispredicted
PAPI_BR_PRC 0x8000002f Yes Conditional branch instructions correctly predicted
PAPI_TOT_IIS 0x80000031 No Instructions issued
PAPI_TOT_INS 0x80000032 No Instructions completed
PAPI_FP_INS 0x80000034 No Floating point instructions
PAPI_LD_INS 0x80000035 No Load instructions
PAPI_SR_INS 0x80000036 No Store instructions
PAPI_BR_INS 0x80000037 No Branch instructions
PAPI_RES_STL 0x80000039 No Cycles stalled on any resource
PAPI_TOT_CYC 0x8000003b No Total cycles
PAPI_LST_INS 0x8000003c Yes Load/store instructions completed
PAPI_L2_DCH 0x8000003f Yes Level 2 data cache hits
PAPI_L2_DCA 0x80000041 No Level 2 data cache accesses
PAPI_L3_DCA 0x80000042 Yes Level 3 data cache accesses
PAPI_L2_DCR 0x80000044 No Level 2 data cache reads
PAPI_L3_DCR 0x80000045 No Level 3 data cache reads
PAPI_L2_DCW 0x80000047 No Level 2 data cache writes
PAPI_L3_DCW 0x80000048 No Level 3 data cache writes
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 76 / 108
PAPI API Examples
PAPI Naming Scheme
The C interface:

(return type) PAPI_function_name(arg1, arg2, ...)

and the Fortran interface:

PAPIF_function_name(arg1, arg2, ..., check)

Note that the check parameter is the same type and value as the C return value.
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 81 / 108
PAPI API Examples High-level API Example in C
High-level API Example in C
Let's consider the following example code for using the high-level API in C.
#include " p a p i .h "#include < s t d i o . h># d e fi n e NUM_EVENTS 2# d e fi n e THRESHOLD 10 000# d e fi n e ERROR_RETURN ( r e t v a l ) { f p r i n t f ( s t d e r r , " E r r o r %d %s : l i n e %d : \ n " , \
r e t v a l , _ _FILE__ , __LINE__ ) ; e x i t ( r e t v a l ) ; }void c o m p u ta t i o n_ m u l t ( ) { /* s t u p i d c ode s t o be m o ni t or e d */
double tmp =1 .0 ;
i n t i =1;f o r ( i = 1 ; i < THRESHOLD ; i + + ) {
tmp = tmp * i ;}
}void co mp uta t i on _ a dd ( ) { /* s t u p i d c od es t o be m o ni t or e d */
i n t tmp = 0 ;i n t i =0;
f o r ( i = 0 ; i < THRESHOLD ; i + + ) {tmp = tmp + i ;
}}
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 83 / 108
PAPI API Examples High-level API Example in C
/* (declarations of Events[], values[], num_hwcntrs and retval appear
   earlier in the full listing, which the handout omits) */

/*****************************************************************************
 * PAPI_num_counters returns the number of hardware counters the platform   *
 * has or a negative number if there is an error                            *
 *****************************************************************************/
if ((num_hwcntrs = PAPI_num_counters()) < PAPI_OK) {
  printf("There are no counters available. \n");
  exit(1);
}

printf("There are %d counters in this system\n", num_hwcntrs);

/*****************************************************************************
 * PAPI_start_counters initializes the PAPI library (if necessary) and      *
 * starts counting the events named in the events array. This function      *
 * implicitly stops and initializes any counters running as a result of     *
 * a previous call to PAPI_start_counters.                                  *
 *****************************************************************************/
if ((retval = PAPI_start_counters(Events, NUM_EVENTS)) != PAPI_OK)
  ERROR_RETURN(retval);

printf("\nCounter Started: \n");

/* Your code goes here */
computation_add();
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 85 / 108
PAPI API Examples High-level API Example in C
/**************************************************************************
 * PAPI_read_counters reads the counter values into values array         *
 **************************************************************************/
if ((retval = PAPI_read_counters(values, NUM_EVENTS)) != PAPI_OK)
  ERROR_RETURN(retval);

printf("Read successfully\n");
printf("The total instructions executed for addition are %lld \n", values[0]);
printf("The total cycles used are %lld \n", values[1]);

printf("\nNow we try to use PAPI_accum to accumulate values\n");

/* Do some computation here */
computation_add();

/**************************************************************************
 * What PAPI_accum_counters does is it adds the running counter values   *
 * to what is in the values array. The hardware counters are reset and   *
 * left running after the call.                                          *
 **************************************************************************/
if ((retval = PAPI_accum_counters(values, NUM_EVENTS)) != PAPI_OK)
  ERROR_RETURN(retval);

printf("We did an additional %d times addition!\n", THRESHOLD);
printf("The total instructions executed for addition are %lld \n", values[0]);
printf("The total cycles used are %lld \n", values[1]);
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 86 / 108
PAPI API Examples High-level API Example in C
/**************************************************************************
 * Stop counting events (this reads the counters as well as stops them)  *
 **************************************************************************/
printf("\nNow we try to do some multiplications\n");
computation_mult();

/******************** PAPI_stop_counters *********************************/
if ((retval = PAPI_stop_counters(values, NUM_EVENTS)) != PAPI_OK)
  ERROR_RETURN(retval);

printf("The total instruction executed for multiplication are %lld \n", values[0]);
printf("The total cycles used are %lld \n", values[1]);
exit(0);
}
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 87 / 108
PAPI API Examples High-level API Example in C
Running on E5645 U2 nodes:
[k11n30b:~/d_papi]$ module load papi
'papi/v4.1.4' load complete.
[k11n30b:~/d_papi]$ gcc -I$PAPI/include -o highlev highlev.c -L$PAPI/lib -lpapi
[k11n30b:~/d_papi]$ ./highlev
There are 16 counters in this system
Counter Started:
Read successfully
The total instructions executed for addition are 54977
The total cycles used are 73894
Now we try to use PAPI_accum to accumulate values
We did an additional 10000 times addition!
The total instructions executed for addition are 112352
The total cycles used are 147814
Now we try to do some multiplications
The total instruction executed for multiplication are 77459
The total cycles used are 126981
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 88 / 108
PAPI API Examples High-level API Example in C
PAPI Initialization
The preceding example used PAPI_library_init to initialize PAPI, which is also used for the low-level API, but you can also use PAPI_num_counters, PAPI_start_counters, or one of the rate calls, PAPI_flips, PAPI_flops, or PAPI_ipc.

Events are counted, as we saw in the example, using PAPI_accum_counters, PAPI_read_counters, and PAPI_stop_counters.

Let's look at an even simpler example just using one of the rate counters.
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 89 / 108
PAPI API Examples High-level Example in F90
High-level Example in F90
For something a little different we can look at our old friend, matrix multiplication, this time in Fortran:
! A simple example for the use of PAPI, the number of flops you should
! get is about INDEX^3 on machines that consider add and multiply one flop
! such as SGI, and 2*(INDEX^3) that don't consider it 1 flop such as INTEL
! Kevin London
program flops
  implicit none
  include "f90papi.h"

  integer, parameter :: i8=SELECTED_INT_KIND(16)  ! integer*8
  integer, parameter :: index=1000
  real :: matrixa(index,index), matrixb(index,index), mres(index,index)
  real :: proc_time, mflops, real_time
  integer(kind=i8) :: flpins
  integer :: i, j, k, retval
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 90 / 108
PAPI API Examples High-level Example in F90
! Matrix-Matrix Multiply
do i=1,index
  do j=1,index
    do k=1,index
      mres(i,j) = mres(i,j) + matrixa(i,k)*matrixb(k,j)
    end do
  end do
end do

! Collect the data into the variables passed in
call PAPIf_flops(real_time, proc_time, flpins, mflops, retval)
if (retval .NE. PAPI_OK) then
  print *, 'Failure on PAPIf_flops: ', retval
end if
print *, 'Real_time: ', real_time
print *, 'Proc_time: ', proc_time
print *, 'Total flpins: ', flpins
print *, 'MFLOPS: ', mflops

end program flops
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 92 / 108
PAPI API Examples Low-level API
PAPI Low-level Example
A simple example using the low-level API:
#include "papi.h"
#include <stdio.h>
#include <stdlib.h>

#define NUM_FLOPS 10000

main()
{
  int retval, EventSet = PAPI_NULL;
  long_long values[1];

  /* Initialize the PAPI library */
  retval = PAPI_library_init(PAPI_VER_CURRENT);
  if (retval != PAPI_VER_CURRENT) {
    fprintf(stderr, "PAPI library init error!\n");
    exit(1);
  }

  /* Create the Event Set */
  if (PAPI_create_eventset(&EventSet) != PAPI_OK) handle_error(1);

  /* Add Total Instructions Executed to our Event Set */
  if (PAPI_add_event(EventSet, PAPI_TOT_INS) != PAPI_OK) handle_error(1);
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 95 / 108
PAPI API Examples Low-level API
/* Start counting events in the Event Set */
if (PAPI_start(EventSet) != PAPI_OK) handle_error(1);

/* Defined in tests/do_loops.c in the PAPI source distribution */
do_flops(NUM_FLOPS);

/* Read the counting events in the Event Set */
if (PAPI_read(EventSet, values) != PAPI_OK) handle_error(1);
printf("After reading the counters: %lld\n", values[0]);

/* Reset the counting events in the Event Set */
if (PAPI_reset(EventSet) != PAPI_OK) handle_error(1);

do_flops(NUM_FLOPS);

/* Add the counters in the Event Set */
if (PAPI_accum(EventSet, values) != PAPI_OK) handle_error(1);
printf("After adding the counters: %lld\n", values[0]);

do_flops(NUM_FLOPS);

/* Stop the counting of events in the Event Set */
if (PAPI_stop(EventSet, values) != PAPI_OK) handle_error(1);
printf("After stopping the counters: %lld\n", values[0]);
}
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 96 / 108
PAPI API Examples PAPI in Parallel
PAPI in Parallel
threads: PAPI_thread_init enables PAPI's thread support, and should be called immediately after PAPI_library_init (see the sketch below).
MPI codes are treated very simply - each process has its own address space, and potentially its own hardware counters.
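For OpenMP threads this amounts to registering a thread-id function right after library initialization; a minimal sketch (my example, not from the handout), assuming a PAPI build with thread support:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include "papi.h"

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI library init error!\n");
        exit(1);
    }
    /* must come immediately after PAPI_library_init: tell PAPI how to
       identify the calling thread */
    if (PAPI_thread_init((unsigned long (*)(void)) omp_get_thread_num) != PAPI_OK) {
        fprintf(stderr, "PAPI thread init error!\n");
        exit(1);
    }

    #pragma omp parallel
    {
        /* each thread can now use PAPI independently, e.g. read a
           per-thread wall-clock timestamp */
        printf("thread %d at %lld us\n",
               omp_get_thread_num(), PAPI_get_real_usec());
    }
    return 0;
}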
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 97 / 108
High-level Tools Popular Examples
Tool Examples
A list of such high-level tool examples (not exhaustive):
TAU, Tuning and Analysis Utility: http://www.cs.uoregon.edu/Research/tau/home.php
Open|SpeedShop, funded by U.S. DOE: http://www.openspeedshop.org
IPM, Integrated Performance Management: http://ipm-hpc.sourceforge.net
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 100 / 108
High-level Tools Specific Example: IPM
Example: IPM
IPM is relatively simple to install and use, so we can easily walk through our favorite example. Note that IPM does:
MPI
PAPI
I/O profiling
Memory
Timings: wall, user, and system
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 101 / 108
High-level Tools Specific Example: IPM
Script to Generate HTML from XML Results
#!/bin/bash

if [ $# -ne 1 ]; then
  echo "Usage: $0 xml_filename"
  exit
fi
XMLFILE=$1

export IPM_KEYFILE=/projects/jonesm/ipm/src/ipm/ipm_key
export PATH=${PATH}:/projects/jonesm/ipm/src/ipm/bin
/projects/jonesm/ipm/src/ipm/bin/ipm_parse -html $XMLFILE
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 105 / 108
High-level Tools Specific Example: IPM
Visualize Results in Browser
Transfer compressed tar file to your local machine, unpack, and
browse the results:
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 107 / 108
High-level Tools Specific Example: IPM
Summary
Summary of high-level tools
IPM is pretty easy to use, provides some good functionality
TAU and Open|SpeedShop have steeper learning curves, much more functionality
M. D. Jones, Ph.D. (CCR/UB) Application Performance Tuning HPC-I Fall 2012 108 / 108